Progress (EOSC Life)

From ECRIN-MDR Wiki
Revision as of 17:53, 11 December 2020 by Admin (talk | contribs)

The table below summarises the progress on the project in terms of the goals listed within the H2020 Project EOSC Life.

Last updated 11/12/2020

# Task Status Comments
1 Re-examination and refinement of the current metadata model, to enhance searching and filtering capability. Completed:
Model now at version 5 for both studies and data objects. A variety of changes introduced, as listed in comments.
Recent changes:
a) Provenance information introduced for both study and data object data (required to support terms of use of source registries)
b) EOSC risk and DUO consent classification incorporated (see 2 below)
c) DUO-like scheme introduced for de-identification categories.
d) Topic data re-organised to better support MESH coding
e) Field names made more consistent.
f) Simplification by removal of some unused fields.
2 Exploration of the inter-relationships and possible alignments with other ontologies and approaches to discoverability (e.g. OmicsDI, data tags, BioSchema). Done:
a) DUO consent classification examined and incorporated.
b) EOSC data tag for risk examined and incorporated.
To do:
a) Explore and promote further compatibility with PID developments (e.g. from RDA PID forum), in particular Organisation ids (ROR) for contextual data.
b) Investigate possible mappings to schema.org and DCAT schemas.
3 Obtaining and integrating metadata from more (at least 6) major study registries. Completed
a) All 18 WHO registries now serve as sources, though 15 of these are via WHO data. Creates about 580,000 study records.
b) WHO data set processing now much improved – data from different registries split on initial download and then processed separately.
To do:
a) German, Dutch and Australian registries to be interrogated directly rather than through WHO data.
b) EUPAS dataset to be added for additional observational studies.
4 Extending extraction to other data repositories (at least 10). Ongoing
a) Only BioLINCC and Yoda being targeted at present, through web scraping.
b) Vivli data downloaded and analysed but appears too incomplete at the moment for use.
To do:
a) Examine suitability of DataDryad, Zenodo, CrossRef as potential data sources, and add each of these if possible.
b) Examine the possibility of using other NIH-sponsored repositories (i.e. similar to BioLINCC) as data sources.
c) Examine the possibility of using designated protocol documents, as published in Trials.
d) Examine possible contributions of one or two institutional repositories.
5 Establishing algorithms for identifying the links between new data and already extracted studies and data objects Completed
a) Mechanisms introduced into the system based around md5 hashes, to identify data objects and studies without PID.
b) A revised procedure for identifying links between studies now in place, based upon 'other study identifiers' listed in registries.
c) Management of cross-source study-study one-to-many relationships now added to the system.
To do:
a) Study linkage based on title should be explored.
b) The possible use of text mining and ML techniques for establishing links and duplications needs to be explored.
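As an illustration of point 5 a) above, the sketch below derives a stable identifier for a data object that has no PID by hashing a few of its descriptive fields with md5. The function name and the choice of fields (`source_id`, `study_sd_id`, `object_title`) are assumptions made for this example only; the fields actually used by the MDR are not specified here.

```python
import hashlib


def make_object_hash_id(source_id: int, study_sd_id: str, object_title: str) -> str:
    """Return a deterministic md5-based identifier for an object without a PID.

    The three input fields are hypothetical; any combination of fields that
    uniquely describes the object within its source could be used instead.
    """
    # Normalise case and surrounding whitespace so that trivial formatting
    # differences between downloads do not change the identifier.
    key = "|".join([
        str(source_id),
        study_sd_id.strip().lower(),
        object_title.strip().lower(),
    ])
    return hashlib.md5(key.encode("utf-8")).hexdigest()
```

Because the identifier depends only on the normalised field values, re-processing the same source record yields the same value, which allows a newly extracted object to be matched against one already held. md5 is used here purely as a fingerprint, not for security.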
6 Modification of data extraction to better handle periodic interrogation of the same source (i.e. only handle new or revised data). Almost completed
Data extraction and processing mechanisms now brought within a generic framework, for better control and monitoring, and scheduled operation introduced. Scheduling being checked; logging of 'unsupervised' operations being improved.
Recent changes:
a) Download and processing mechanisms brought within a generic framework, for better control and monitoring.
b) Local data stores established for all sources.
c) Logging and tracking mechanisms introduced to identify correct candidate studies / objects for each process
d) Introduction of data download, processing and aggregation tasks as scheduled tasks (weekly at present).
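The idea of handling only new or revised data on each periodic run can be sketched as follows. This is a minimal illustration, assuming that each source record exposes a last-revised date and that the date of the previous download of each record is logged locally; the names `SourceRecord` and `select_for_download` are hypothetical and not taken from the MDR code.

```python
from datetime import datetime
from typing import Dict, Iterable, List, NamedTuple


class SourceRecord(NamedTuple):
    sd_id: str              # identifier of the study or object in the source
    last_revised: datetime  # revision date reported by the source


def select_for_download(
    records: Iterable[SourceRecord],
    last_downloaded: Dict[str, datetime],
) -> List[str]:
    """Return the ids of records that are new or revised since the last download."""
    to_fetch = []
    for rec in records:
        previous = last_downloaded.get(rec.sd_id)
        # A record with no logged download is new; one revised after its
        # logged download date needs to be fetched again.
        if previous is None or rec.last_revised > previous:
            to_fetch.append(rec.sd_id)
    return to_fetch
```

A scheduled task could call such a function once per source and pass only the returned ids on to the download and processing stages.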
7 Modularising data extraction architecture where possible, with a view to providing interchangeable components for uptake by other RIs. Completed
a) Different stages of extraction process now separated in order to better support modularisation and independent functioning. Not possible to assess possible usefulness to other RIs until the full range of systems has been developed.
To do:
Documentation of the systems still required - should be done shortly.
8 Developing ways of rationalising topics / keywords against a common UMLS based schema, to reduce duplication and enhance searchability. Ongoing
MESH codes selected as the best interim method of rationalising topic terms, and applied to the system. (Much source data is already MESH coded)
To do:
Further exploration of UMLS systems and related services. Need to find as comprehensive a solution as possible.
9 Developing ways of processing names (of research, organisations, people) to better support matching and searching. Ongoing
Algorithms introduced for applying standardised versions of names during the import process, but the standardisation is not yet complete.
To do:
a) Explore the text indexing capabilities of Postgres.
b) Explore if developments in PID management (e.g. from PID forum) can be applied to entities in the MDR.
10 Maintaining comprehensive documentation of all aspects of the system, including each extraction routine, within a project Wiki. Ongoing
a) Wiki re-organised but a lot of new material required
By its nature this task is always 'ongoing'.
a) Wiki currently being re-organised and rewritten, both in general and for specific sources.
b) Issue tracking system still missing, though relatively little impact at present.
11 Maintaining all extraction and data processing code in GitHub. Ongoing
a) Source code made more uniform and GitHub repository tidied up.
b) Revised Readme files created for all 4 main data collection / extraction systems.
By its nature this task is always 'ongoing'.
12 Development of tests (including test data) for regular testing of extraction accuracy. Ongoing
A strategy now outlined using a sample of studies from each source.
To do:
a) Initial selection of relevant material required for each source (or source type).
b) An automated system for comparing actual versus expected values required.
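A comparison of actual versus expected values could take a form like the sketch below, which checks one extracted record against a hand-verified expected record field by field. The function name and the representation of records as plain dictionaries are assumptions for illustration; the planned system may well be structured differently.

```python
from typing import Any, Dict, List, Tuple


def compare_records(
    expected: Dict[str, Any],
    actual: Dict[str, Any],
) -> List[Tuple[str, Any, Any]]:
    """Return (field, expected value, actual value) for every mismatching field.

    A field missing from either record is reported with None on that side.
    """
    mismatches = []
    # Check the union of field names so that missing and unexpected
    # fields are both reported.
    for field in sorted(set(expected) | set(actual)):
        exp = expected.get(field)
        act = actual.get(field)
        if exp != act:
            mismatches.append((field, exp, act))
    return mismatches
```

An empty result means the extracted record matches the expected one; running the comparison over the selected sample for each source would give a simple measure of extraction accuracy.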
13 Creation of a co-ordinating system for scheduling, triggering, monitoring and logging extraction activity. Not yet begun
Now ready for introduction (see 6 and 7). An urgent task for the next few weeks.
14 Preparation and testing of a RESTful (or possibly GraphQL) API for supporting data access. Not yet begun
a) Data demands from the portal interface need to be characterised.
b) To explore usefulness of GraphQL instead of or in addition to a RESTful API.
15 Developing a web-based support tool to help data generators more easily apply the metadata at source. Not yet begun
a) Low priority compared with most other activities; likely to be 2021.
16 Integration with AAI, developed and provided by EOSC-Life. Although the data itself will be public, access to development and data management systems will need to be controlled. Not yet begun
a) Needs further details on how the portal will be integrated within EOSC hub and how development / production versions will be managed.
17 Contributing to an overall strategy around discoverability of data sources, within EOSC as a whole and within life science RIs in particular. Not yet begun
a) Not clear how this can be progressed at present.
18 Exploration of how data sharing can be improved by the MDR (i.e. demonstrations of usefulness). Not yet begun
a) The system needs to reach a certain degree of maturity before it can be properly evaluated. Once that is done a dialogue can be begun with users, both in general and with a designated test group.
b) Dialogue should include mechanisms for feedback, user requests and issue tracking.
19 Publication of journal papers around the MDR. Not yet begun
a) Scope and number of papers to be clarified.
b) A general description, however, is required relatively urgently.