Progress (EOSC Life)

From ECRIN-MDR Wiki
Revision as of 15:36, 28 October 2020 by Admin (talk | contribs)

The table below summarises the progress on the project in terms of the goals listed within the H2020 Project EOSC Life.

Last updated 27/10/2020

# Task Status Comments
1 Re-examination and refinement of the current metadata model, to enhance searching and filtering capability. Completed
Model now at version 5 for both studies and data objects. A variety of changes introduced to enhance filtering, improve alignment with other systems and simplify the model by dropping unused data points.
Changes include:
a) Provenance information introduced for both study and data object data (required to support terms of use of source registries)
b) Consent classification aligned to DUO categories and similar scheme introduced for de-identification.
c) EOSC risk categorisation added to data objects.
d) Topic data re-organised to better support MeSH coding.
e) Field names made more consistent.
f) Distribution of 'contains html' fields clarified.
g) Some unused fields relating to people dropped.
2 Exploration of the inter-relationships and possible alignments with other ontologies and approaches to discoverability (e.g. OmicsDI, data tags, Bioschemas). Ongoing
Done:
a) DUO consent classification examined and incorporated.
b) EOSC data tag for risk examined and incorporated.
c) Bioschemas examined but at this stage does not appear very relevant, as it fundamentally has a different purpose.
To do:
a) Explore and promote further compatibility with PID developments (e.g. from RDA PID forum)
b) In particular, examine developing organisation ID schemas for contextual data.
3 Obtaining and integrating metadata from more (at least 6) major study registries. All 18 WHO registries now serve as sources, though 15 of these are via WHO data. This yields about 560,000 study records.
a) Three registries interrogated individually (ClinicalTrials.gov, EU CTR and ISRCTN).
b) German, Dutch and Australian registries should be the next to be interrogated in this way; they can then be removed from the WHO dataset.
c) Processing of the WHO dataset to be improved: data from different registries to be split on initial download and then processed separately.
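The split described in 3 c) could be sketched roughly as below. This is only an illustration: the field name `source_register`, the registry labels and the record identifiers are hypothetical, and the real WHO download may label its source registries differently.

```python
from collections import defaultdict


def split_by_registry(records, registry_field="source_register"):
    """Group downloaded records by their source registry.

    `registry_field` is a hypothetical field name; records with no value
    in that field are collected under "unknown".
    """
    groups = defaultdict(list)
    for record in records:
        registry = (record.get(registry_field) or "").strip() or "unknown"
        groups[registry].append(record)
    return dict(groups)


def process_separately(groups, handlers, default_handler):
    """Apply a registry-specific handler to every record in each group."""
    return {
        registry: [handlers.get(registry, default_handler)(r) for r in recs]
        for registry, recs in groups.items()
    }


# Made-up records, for illustration only
downloaded = [
    {"id": "rec-1", "source_register": "registry-a"},
    {"id": "rec-2", "source_register": "registry-b"},
    {"id": "rec-3", "source_register": "registry-a"},
    {"id": "rec-4", "source_register": None},
]

groups = split_by_registry(downloaded)
processed = process_separately(
    groups,
    handlers={"registry-a": lambda r: r["id"].upper()},
    default_handler=lambda r: r["id"],
)
```

Once the data is split like this, a registry that is later interrogated directly can simply have its group dropped from the WHO-derived data.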
4 Extending extraction to other data repositories (at least 10). Only BioLINCC and Yoda are being targeted at present, through web scraping.
a) Examine suitability of DataDryad, Zenodo and CrossRef as potential data sources, and add each of these if possible.
b) Examine the possibility of using other NIH-sponsored repositories (i.e. similar to BioLINCC) as data sources.
c) Examine how data from Vivli can be incorporated.
d) Examine the possibility of using designated protocol documents, as published in Trials.
e) Explore if data can be obtained from IDDO.
f) Examine possible contributions of one or two institutional repositories.
5 Establishing algorithms for identifying the links between new data and already extracted studies and data objects. Mechanisms recently introduced into the system, based around using hashes to identify data objects and studies without a PID.
A revised procedure for identifying links between studies now in place, based upon 'other study identifiers' listed in registries.
a) The hash-based mechanisms need further deployment to other data sources, and testing.
b) Hash-based mechanisms need documentation and incorporation into the wiki.
c) Study linkage based on title to be explored.
d) Better management of cross-source study-study one-to-many relationships needs to be added to the system.
e) The possible use of text mining and ML techniques for establishing links and identifying duplicates needs to be explored.
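As a rough illustration of the hash-based idea in item 5, the sketch below derives a surrogate identifier for a record without a PID by hashing a few normalised fields. The choice of SHA-256, the field names and the example values are all assumptions made for illustration; the wiki does not state which fields or hash function the system actually uses.

```python
import hashlib


def normalise(value):
    """Lower-case and collapse whitespace so trivial differences do not change the hash."""
    text = "" if value is None else str(value)
    return " ".join(text.lower().split())


def record_hash(record, key_fields):
    """Return a hex digest built from the chosen fields, taken in a fixed order."""
    # A unit-separator character keeps adjacent field values from running together.
    joined = "\x1f".join(normalise(record.get(field)) for field in key_fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Hypothetical key fields for a data object without a PID
OBJECT_KEY_FIELDS = ("source", "parent_study", "object_type", "title")

# Made-up records, for illustration only
stored = {"source": "repo-a", "parent_study": "study-001",
          "object_type": "Data dictionary", "title": "Example  Data Dictionary"}
incoming = {"source": "repo-a", "parent_study": "study-001",
            "object_type": "data dictionary", "title": "example data dictionary "}
different = {**incoming, "title": "example protocol"}

known_hashes = {record_hash(stored, OBJECT_KEY_FIELDS)}
incoming_is_known = record_hash(incoming, OBJECT_KEY_FIELDS) in known_hashes
different_is_known = record_hash(different, OBJECT_KEY_FIELDS) in known_hashes
```

Here `incoming` matches the stored record despite differences in case and spacing, while `different` does not. Which fields go into the key is the main design decision: too few produce false matches, too many make the hash sensitive to minor edits at the source.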
6 Modification of data extraction to better handle periodic interrogation of the same source (i.e. only handle new or revised data). Download and processing mechanisms increasingly being brought within a generic framework, for better control and monitoring.
Local data stores being established for all sources. Coding system introduced on records to indicate different levels of change since the last import.
a) Currently separate extraction and processing systems need to be migrated gradually into a more generic framework: a collection of generic systems that can be controlled and logged more easily.
b) UI needs to be placed around generic systems once they are developed.
c) Gradual introduction of data download tasks as scheduled tasks (frequency will depend on data source).
d) Gradual introduction of data processing and aggregation tasks as scheduled tasks, ultimately aiming for a nightly frequency, though initially less frequent.
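The coding of records by level of change mentioned in item 6 might look something like the following sketch. The status codes, their names and the use of a content hash to detect revisions are assumptions for illustration, not the system's actual scheme.

```python
import hashlib
import json
from enum import IntEnum


class ChangeStatus(IntEnum):
    """Hypothetical codes for the level of change since the last import."""
    UNCHANGED = 0
    NEW = 1
    REVISED = 2
    MISSING = 3  # present at the last import, absent from the current download


def content_hash(record):
    """Hash the whole record in a key-order-independent way."""
    payload = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def classify_changes(previous_hashes, current_records):
    """Compare current records against the hashes stored at the last import.

    previous_hashes: dict of record id -> content hash from the last import
    current_records: dict of record id -> record (a dict) from this download
    """
    statuses = {}
    for rec_id, record in current_records.items():
        old_hash = previous_hashes.get(rec_id)
        if old_hash is None:
            statuses[rec_id] = ChangeStatus.NEW
        elif old_hash == content_hash(record):
            statuses[rec_id] = ChangeStatus.UNCHANGED
        else:
            statuses[rec_id] = ChangeStatus.REVISED
    for rec_id in previous_hashes.keys() - current_records.keys():
        statuses[rec_id] = ChangeStatus.MISSING
    return statuses


def ids_to_process(statuses):
    """Only new or revised records need to go through further processing."""
    wanted = (ChangeStatus.NEW, ChangeStatus.REVISED)
    return sorted(rec_id for rec_id, status in statuses.items() if status in wanted)
```

With the previous hashes held in a local data store, a scheduled task would only need to pass the identifiers returned by `ids_to_process` on to the later processing stages.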
7 Modularising data extraction architecture where possible, with a view to providing interchangeable components for uptake by other RIs. Different stages of the extraction process now separated in order to better support modularisation. Not possible to assess potential usefulness to other RIs until the full range of systems has been developed.
a) Separate systems required for data download, data harvesting (producing sd data), data importing (to ad data), data aggregation (to central single system) and JSON file generation. Developing these will be a major task.
8 Developing ways of rationalising topics / keywords against a common UMLS-based schema, to reduce duplication and enhance searchability. Not yet begun.
a) Explore nature of UMLS system and its relationship to MeSH and other systems.
b) Design a system that can translate topic text into a standardised scheme wherever possible (major work).
9 Developing ways of processing names (of research, organisations, people) to better support matching and searching. Only initial exploration work done.
a) Introduce algorithms for generating standardised versions of text during the import process.
b) In parallel explore the text indexing capabilities of Postgres – the two features may work well together, or one may reduce the need for the other.
c) Explore if developments in PID management (e.g. from PID forum) can be applied to entities in the MDR.
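A standardised version of a name, as envisaged in 9 a), could be produced along the lines sketched below. The particular rules (stripping accents, case-folding, removing punctuation and a leading "the") are assumptions chosen for illustration, and the example names are made up; the rules actually needed would depend on the sources.

```python
import re
import unicodedata


def standardise_name(name):
    """Return a simplified form of a name for matching purposes."""
    # Decompose accented characters, then drop the combining marks
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.casefold()
    # Remove apostrophes outright, turn other punctuation into spaces
    text = re.sub(r"['’]", "", text)
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace
    text = " ".join(text.split())
    # Drop a leading article
    if text.startswith("the "):
        text = text[4:]
    return text


def same_entity(name_a, name_b):
    """Treat two names as matching if their standardised forms are identical."""
    return standardise_name(name_a) == standardise_name(name_b)
```

For example, `same_entity("The Université de Exemple,", "universite de exemple")` is true under these rules. The standardised form would be stored alongside the original text rather than replacing it, so the displayed name stays as the source supplied it.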
10 Maintaining comprehensive documentation of all aspects of the system, including each extraction routine, within a project Wiki. Ongoing.
a) Wiki currently needs a considerable amount of catching up, both in general and for specific sources.
b) Issue tracking needs to be improved, with the introduction of a suitable tool.
11 Maintaining all extraction and data processing code in GitHub. Ongoing.
a) Update readme files and licences for newer systems.
12 Development of tests (including test data) for regular testing of extraction accuracy. Not yet begun.
a) A strategy for this major work is required.
b) Initial selection of relevant material required for each source (or source type).
13 Creation of a co-ordinating system for scheduling, triggering, monitoring and logging extraction activity, with a GUI for ease of use. Ongoing (see 6 and 7).
a) See 6 and 7.
14 Preparation and testing of a RESTful (or possibly GraphQL) API for supporting data access. Not yet begun.
a) Requires characterisation of data demands from the portal interface.
b) Full implementation dependent on finalising the database structure.
c) Any API will also need full documentation.
d) Explore the usefulness of GraphQL instead of, or in addition to, a RESTful API.
15 Developing a web-based support tool to help data generators more easily apply the metadata at source. Not yet begun.
a) Low priority compared with most other activities – likely to be 2021.
16 Integration with AAI, developed and provided by EOSC-Life. Although the data itself will be public, access to development and data management systems will need to be controlled. Not yet begun.
a) Further details needed on how the portal will be integrated within EOSC hub (if it is).
17 Contributing to an overall strategy around discoverability of data sources, within EOSC as a whole and within life science RIs in particular. Not yet begun.
a) Not clear how this could be progressed at present.
18 Exploration of how data sharing can be improved by the MDR (i.e. demonstrations of usefulness). Not yet begun.
a) The system needs to reach a certain degree of maturity before it can be properly evaluated. Once that is done, a dialogue can begin with users, both in general and with a designated test group.
b) Dialogue should include mechanisms for feedback, user requests and issue tracking.
19 Publication of journal papers around the MDR. Not yet begun.
a) Scope and number of papers to be clarified.
b) A general description, however, is required relatively urgently.