Progress (EOSC Life)


The table below summarises the progress on the project in terms of the goals listed within the H2020 Project EOSC Life.

Last updated 07/10/2021

Each numbered entry below gives the task, its current status, and comments.
1 Re-examination and refinement of the current metadata model, to enhance searching and filtering capability. Completed:
Model now at version 6 for both studies and data objects. A variety of changes introduced, as listed in comments.
Recent changes:
a) Provenance information introduced for both study and data object data (required to support the terms of use of source registries).
b) EOSC risk and DUO consent classification incorporated (see 2 below).
c) DUO-like scheme introduced for de-identification categories.
d) Topic data re-organised to better support MESH coding.
e) Field names made more consistent.
f) Simplification by removal of some unused fields.
2 Exploration of the inter-relationships and possible alignments with other ontologies and approaches to discoverability (e.g. OmicsDI, data tags, BioSchema). Ongoing:
a) DUO consent classification examined and incorporated.
b) EOSC data tag for risk examined and incorporated (a sketch of how these classifications attach to a record follows at the end of this entry).
c) Organisation IDs (ROR) added to contextual data.
To do:
a) Investigation of additional possible mappings to the schema.org and DCAT schemas.
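To make points a) and b) above concrete, the following is a minimal sketch, in Python, of how DUO consent codes and an EOSC-style risk tag might be attached to a data object record during import. The field names, the risk scale and the small DUO mapping table are illustrative assumptions, not the MDR's actual schema or curated mappings.

```python
# Minimal sketch (hypothetical field names, illustrative DUO mapping table):
# attaching a DUO consent code and an EOSC-style risk tag to an object record.

DUO_CODES = {
    "general research use": "DUO:0000042",
    "no restriction": "DUO:0000004",
    "disease specific research": "DUO:0000007",
}

def classify_consent(consent_text: str) -> str | None:
    """Map a free-text consent statement to a DUO code, if one is recognised."""
    text = consent_text.lower()
    for phrase, code in DUO_CODES.items():
        if phrase in text:
            return code
    return None

def annotate_object(data_object: dict, consent_text: str, risk_level: int) -> dict:
    """Add consent and risk classifications to an object record (hypothetical keys)."""
    data_object["consent_duo_code"] = classify_consent(consent_text)
    data_object["eosc_risk_tag"] = risk_level   # assumed scale, e.g. 1 = open
    return data_object

print(annotate_object({"id": 101}, "Consent given for general research use", 1))
```
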
3 Obtaining and integrating metadata from more (at least 6) major study registries. Completed:
a) All 18 WHO registries now serve as sources, though 15 of these are accessed via WHO data. This creates about 580,000 study records.
b) WHO data set processing now much improved – data from different registries is split on initial download and then processed separately (see the sketch at the end of this entry).
Further work:
a) German, Dutch and Australian registries to be interrogated directly rather than through WHO data.
b) EUPAS dataset to be added for additional observational studies.
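The sketch below illustrates the splitting step mentioned in b) above: partitioning a WHO ICTRP download into per-registry files so each registry's records can be processed separately. The single-CSV file layout, the TrialID column name and the subset of registry ID prefixes shown are assumptions for illustration only.

```python
# Minimal sketch, with assumed file layout and ID prefixes: splitting a WHO
# download into per-registry files for separate downstream processing.
import csv
from collections import defaultdict

PREFIXES = {          # illustrative subset of registry ID prefixes
    "DRKS": "german", "ACTRN": "anzctr", "NL": "dutch",
    "ISRCTN": "isrctn", "NCT": "clinicaltrials_gov",
}

def registry_of(trial_id: str) -> str:
    for prefix, name in PREFIXES.items():
        if trial_id.startswith(prefix):
            return name
    return "other"

def split_who_file(path: str) -> None:
    groups = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            groups[registry_of(row["TrialID"])].append(row)   # assumed column name
    for name, rows in groups.items():
        with open(f"who_{name}.csv", "w", newline="", encoding="utf-8") as out:
            writer = csv.DictWriter(out, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)
```
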
4 Extending extraction to other data repositories (at least 10). Ongoing:
a) Only BioLINCC and Yoda are being targeted at present, through web scraping (see the sketch at the end of this entry).
b) Vivli data downloaded and analysed but appears too incomplete at the moment for use.
To do:
a) Examine suitability of DataDryad, Zenodo, CrossRef as potential data sources, and add each of these if possible.
b) Examine the possibility of using other NIH-sponsored repositories (i.e. similar to BioLINCC) as data sources.
c) Examine the possibility of using designated protocol documents, as published in Trials.
d) Examine possible contributions of one or two institutional repositories.
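As a concrete illustration of the scraping approach in a), the following is a minimal Python sketch that collects study links from a repository listing page. The URL, page structure and CSS selector are hypothetical; the real BioLINCC and Yoda harvesters will differ in detail.

```python
# Minimal scraping sketch (hypothetical URL and CSS selector): collecting study
# titles and links from a repository listing page.
import requests
from bs4 import BeautifulSoup

def list_study_links(listing_url: str) -> list[dict]:
    resp = requests.get(listing_url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    studies = []
    for row in soup.select("table.studies tr"):      # selector is an assumption
        link = row.find("a")
        if link and link.get("href"):
            studies.append({"title": link.get_text(strip=True), "url": link["href"]})
    return studies
```
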
5 Establishing algorithms for identifying the links between new data and already extracted studies and data objects Completed:
a) Mechanisms based around MD5 hashes introduced into the system, to identify data objects and studies without a PID (see the sketch at the end of this entry).
b) A revised procedure for identifying links between studies now in place, based upon 'other study identifiers' listed in registries.
c) Management of cross-source study-study one-to-many relationships now added to the system.
Further work:
a) Study linkage based on title should be explored.
b) The possible use of text mining and ML techniques for establishing links and duplications needs to be explored.
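The sketch below illustrates the hashing idea in a): deriving a stable local identifier for a study or data object that lacks a PID from a normalised combination of its attributes. The choice of attributes and the normalisation rules shown are illustrative, not the MDR's actual recipe.

```python
# Minimal sketch: a stable MD5-based local identifier built from normalised attributes.
import hashlib

def local_hash_id(*attributes: str) -> str:
    normalised = "|".join(a.strip().lower() for a in attributes if a)
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()

# Two harvests of the same unregistered object should produce the same hash:
h1 = local_hash_id("Results summary", "Example Study Title", "2021-06-01")
h2 = local_hash_id("results summary ", "Example Study Title", "2021-06-01")
assert h1 == h2
```
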
6 Modification of data extraction to better handle periodic interrogation of the same source (i.e. only handle new or revised data). Completed:
Data extraction and processing mechanisms brought within a generic framework; scheduled operation introduced, and scheduling and logging improved (see the sketch at the end of this entry).
Supporting changes:
a) Download and processing mechanisms brought within a generic framework, for better control and monitoring.
b) Local data stores established for all sources.
c) Logging and tracking mechanisms introduced to identify the correct candidate studies / objects for each process.
d) Introduction of data download, processing and aggregation tasks as scheduled tasks (weekly at present).
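The following is a minimal Python sketch of the scheduled-pipeline idea: the stages run in a fixed order within a generic framework, with logging throughout, while the weekly trigger itself would come from an external scheduler (e.g. cron). The stage functions and source names are placeholders, not the MDR's actual modules.

```python
# Minimal sketch of a generic, logged pipeline; stage bodies are placeholders.
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

def download(source): logging.info("downloading %s", source)
def harvest(source):  logging.info("harvesting %s", source)
def import_(source):  logging.info("importing %s", source)
def aggregate():      logging.info("aggregating all sources")

def run_pipeline(sources: list[str]) -> None:
    for stage in (download, harvest, import_):
        for source in sources:
            stage(source)
    aggregate()

if __name__ == "__main__":
    run_pipeline(["ClinicalTrials.gov", "WHO", "BioLINCC", "Yoda"])
```
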
7 Modularising data extraction architecture where possible, with a view to providing interchangeable components for uptake by other RIs. Completed:
a) Different stages of the extraction process now separated into modules, to better support modularisation and independent functioning.
b) Greater use of dependency injection within the systems (see the sketch at the end of this entry).
To do:
a) Documentation of the systems needs bringing up to date in the MDR wiki.
b) Assessment of possible usefulness to other RIs; this is difficult until the full range of systems, including APIs, has been developed.
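To illustrate the dependency-injection point in b), the sketch below shows the general idea in Python: a generic controller depends only on an abstract downloader interface, so source-specific implementations can be swapped in by other RIs. The class names are illustrative and the actual systems may use a different language and DI framework.

```python
# Minimal sketch of constructor injection behind a generic controller.
from abc import ABC, abstractmethod

class Downloader(ABC):
    @abstractmethod
    def fetch(self) -> list[dict]: ...

class CtgDownloader(Downloader):
    def fetch(self) -> list[dict]:
        return [{"id": "NCT00000000", "title": "Example study"}]   # placeholder data

class DownloadController:
    def __init__(self, downloader: Downloader):    # dependency injected here
        self.downloader = downloader

    def run(self) -> int:
        records = self.downloader.fetch()
        return len(records)

print(DownloadController(CtgDownloader()).run())
```
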
8 Developing ways of rationalising topics / keywords against a common UMLS based schema, to reduce duplication and enhance searchability. Ongoing:
a) MESH codes selected as the best interim method of rationalising topic terms, and applied to the system (much source data is already MESH coded).
b) MESH coding of several hundred of the most common non-MESH coded terms carried out (see the sketch at the end of this entry).
To do:
a) MESH coding, where possible, of further uncoded terms against their MESH equivalents.
b) Further exploration of UMLS systems and related services. The aim is to find as comprehensive a solution as possible.
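The sketch below illustrates the interim approach in b): a curated lookup of common uncoded topic terms to MESH descriptor IDs, applied during import. The example entries are illustrative only; the real curated list is much larger.

```python
# Minimal sketch: mapping common non-MESH topic terms to MESH descriptor IDs
# via a curated lookup (entries below are illustrative).
MESH_LOOKUP = {
    "covid-19": "D000086382",
    "breast cancer": "D001943",   # maps to "Breast Neoplasms"
    "type 2 diabetes": "D003924", # maps to "Diabetes Mellitus, Type 2"
}

def code_topic(term: str) -> tuple[str, str | None]:
    """Return the topic term together with its MESH code, if a mapping is known."""
    return term, MESH_LOOKUP.get(term.strip().lower())

print(code_topic("Breast cancer"))   # ('Breast cancer', 'D001943')
```
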
9 Developing ways of processing names (of research, organisations, people) to better support matching and searching. Ongoing:
a) Algorithms introduced for applying standardised versions of names during the import process, though coverage is not yet complete (see the sketch at the end of this entry).
b) Links established to ROR data for organisations.
To do:
a) Explore the text indexing capabilities of Postgres.
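As an illustration of name standardisation before matching, the sketch below lower-cases an organisation name, strips punctuation and common corporate suffixes, and collapses whitespace, so that variants of the same name compare equal. The rules and suffix list are illustrative, not the MDR's actual algorithms; Postgres text indexing (the to-do item above) would complement this for searching.

```python
# Minimal sketch of organisation-name standardisation (rules are illustrative).
import re

SUFFIXES = {"ltd", "inc", "gmbh", "llc"}   # illustrative list

def standardise_org_name(name: str) -> str:
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    tokens = [t for t in cleaned.split() if t not in SUFFIXES]
    return " ".join(tokens)

assert standardise_org_name("Novartis Pharmaceuticals, Inc.") == \
       standardise_org_name("novartis pharmaceuticals")
```
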
10 Maintaining comprehensive documentation of all aspects of the system, including each extraction routine, within a project Wiki. Ongoing:
a) Wiki re-organised and new material introduced for metadata and data extraction sections
To do:
a) Portal documentation needs bringing up to date
b) Latest system developments need to be reflected in the documentation
Comment:
By its nature this task is always 'ongoing'.
11 Maintaining all extraction and data processing code in GitHub. Ongoing:
a) Source code made more uniform and the GitHub repository tidied up.
b) Revised README files created for all four main data collection / extraction systems.
Comment:
By its nature this task is always 'ongoing'.
12 Development of tests (including test data) for regular testing of extraction accuracy. Ongoing:
a) A strategy has been developed; different types of tests are required for different parts of the system.
b) Initial selection of relevant test material made (e.g. sample studies) for each source (or source type) – 60 studies in total.
c) Automated systems for comparing actual versus expected values developed for the harvest and import processes (see the sketch at the end of this entry).
To do:
Testing for the aggregation mechanism still to be put in place.
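The sketch below illustrates the expected-versus-actual checking idea in c): for each sample study, selected imported field values are compared against a stored set of expected values and any differences reported. The study IDs, field names and values are hypothetical.

```python
# Minimal sketch: comparing actual imported values against expected values
# for a sample study (identifiers and fields are hypothetical).
EXPECTED = {
    "NCT00000001": {"study_type": "Interventional", "enrolment": 120},
}

def compare_study(study_id: str, actual: dict) -> list[str]:
    differences = []
    for field, expected_value in EXPECTED.get(study_id, {}).items():
        if actual.get(field) != expected_value:
            differences.append(f"{study_id}.{field}: expected {expected_value!r}, "
                               f"got {actual.get(field)!r}")
    return differences

print(compare_study("NCT00000001", {"study_type": "Interventional", "enrolment": 118}))
```
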
13 Publication of journal papers around the MDR. Begun:
Outline of initial paper circulated and agreed
To do:
Text to be written in the near future.
14 Preparation and testing of a Restful (or possibly GraphQL) API for supporting data access. Not yet begun:
Other tasks have had to take priority up to now.
Comment:
a) The characterisation of data demands from the portal interface needs to be clarified.
b) The usefulness of GraphQL, instead of or in addition to a RESTful API, is to be explored.
15 Developing a web-based support tool to help data generators more easily apply the metadata at source. Begun:
a) Initial design work is being carried out (to support metadata capture in EOSC Life WP14).
b) Initial version of forms and code developed within the COVID-19 repository management system.
Comment:
Version 1 expected in spring 2022.
16 Integration with AAI, developed and provided by EOSC-Life. Although the data itself will be public, access to development and data management systems will need to be controlled. Not yet begun:
Not needed at the moment.
Comment:
Needs further details on how the portal will be integrated within EOSC hub and how development / production versions will be managed.
17 Contributing to an overall strategy around discoverability of data sources, within EOSC as a whole and within life science RIs in particular. Not yet begun.
Comment:
Not clear how this can best be progressed at present, but developments in EOSC strategy and related systems monitored.
18 Exploration of how data sharing can be improved by the MDR (i.e. demonstrations of usefulness). Not yet begun.
Comment:
The system needs to reach a certain degree of maturity before it can be properly evaluated. Once that is done, a dialogue can begin with users, both in general and with a designated test group.