Progress (EOSC Life)
The table below summarises the progress on the project in terms of the goals listed within the H2020 Project EOSC Life.
Last updated 11/12/2020
# | Task | Status | Comments |
---|---|---|---|
1 | Re-examination and refinement of the current metadata model, to enhance searching and filtering capability. | Completed: Model now at version 5 for both studies and data objects. A variety of changes introduced, as listed in comments. | Recent changes: a) Provenance information introduced for both study and data object data (required to support terms of use of source registries). b) EOSC risk and DUO consent classification incorporated (see 2 below). c) DUO-like scheme introduced for de-identification categories. d) Topic data re-organised to better support MeSH coding. e) Field names made more consistent. f) Simplification by removal of some unused fields. |
2 | Exploration of the inter-relationships and possible alignments with other ontologies and approaches to discoverability (e.g. OmicsDI, data tags, BioSchema). | Ongoing: a) DUO consent classification examined and incorporated. b) EOSC data tag for risk examined and incorporated. | To do: a) Explore and promote further compatibility with PID developments (e.g. from RDA PID forum), in particular organisation IDs (ROR) for contextual data. b) Investigate possible mappings to schema.org and DCAT schemas. |
3 | Obtaining and integrating metadata from more (at least 6) major study registries. | Completed: a) All 18 WHO registries now serve as sources, though 15 of these are via WHO data. Creates about 580,000 study records. b) WHO data set processing now much improved – data from different registries split on initial download and then processed separately. | To do: a) German, Dutch and Australian registries to be interrogated directly rather than through WHO data. b) EUPAS dataset to be added for additional observational studies. |
4 | Extending extraction to other data repositories (at least 10). | Ongoing: a) Only BioLINCC and Yoda being targeted at present, through web scraping. b) Vivli data downloaded and analysed but appears too incomplete at the moment for use. | To do: a) Examine suitability of DataDryad, Zenodo, CrossRef as potential data sources, and add each of these if possible. b) Examine the possibility of using other NIH-sponsored repositories (i.e. similar to BioLINCC) as data sources. c) Examine the possibility of using designated protocol documents, as published in Trials. d) Examine possible contributions of one or two institutional repositories. |
5 | Establishing algorithms for identifying the links between new data and already extracted studies and data objects | Completed: a) Mechanisms introduced into the system based around md5 hashes, to identify data objects and studies without PID. b) A revised procedure for identifying links between studies now in place, based upon 'other study identifiers' listed in registries. c) Management of cross-source study-study one-to-many relationships now added to the system. | To do: a) Study linkage based on title should be explored. b) The possible use of text mining and ML techniques for establishing links and duplications needs to be explored. |
6 | Modification of data extraction to better handle periodic interrogation of the same source (i.e. only handle new or revised data). | Almost completed: Data extraction and processing mechanisms now brought within a generic framework and scheduled operation introduced. Scheduling now being checked and logging of 'unsupervised' operations being improved. | Recent changes: a) Download and processing mechanisms brought within a generic framework, for better control and monitoring. b) Local data stores established for all sources. c) Logging and tracking mechanisms introduced to identify correct candidate studies / objects for each process. d) Data download, processing and aggregation tasks introduced as scheduled tasks (weekly at present). |
7 | Modularising data extraction architecture where possible, with a view to providing interchangeable components for uptake by other RIs. | Completed: a) Different stages of extraction process now separated into modules, to better support independent functioning. b) Documentation of the systems brought up to date in MDR wiki. | To do: Not possible to assess usefulness to other RIs until the full range of systems, including APIs, has been developed. |
8 | Developing ways of rationalising topics / keywords against a common UMLS based schema, to reduce duplication and enhance searchability. | Ongoing: MeSH codes selected as the best interim method of rationalising topic terms, and applied to the system. (Much source data is already MeSH coded.) | To do: a) MeSH coding, where possible, of uncoded terms against their MeSH equivalents. b) Further exploration of UMLS systems and related services. Need to find as comprehensive a solution as possible. |
9 | Developing ways of processing names (of research, organisations, people) to better support matching and searching. | Ongoing: Algorithms introduced for applying standardised versions of names during the import process, though these do not yet handle 100% of cases. | To do: a) Explore the text indexing capabilities of Postgres. b) Explore how developments in PID management (e.g. from PID forum) can be applied to entities in the MDR; in particular, try to integrate with ROR organisation PIDs. |
10 | Maintaining comprehensive documentation of all aspects of the system, including each extraction routine, within a project Wiki. | Ongoing: a) Wiki re-organised and new material introduced for metadata and data extraction sections. | To do: a) Portal documentation needs bringing up to date. b) Shared 'to do' and issue tracking system needs to be introduced. Comment: By its nature this task is always 'ongoing'. |
11 | Maintaining all extraction and data processing code in GitHub. | Ongoing: a) Source code made more uniform and GitHub repository tidied up. b) Revised Readme files created for all 4 main data collection / extraction systems. | Comment: By its nature this task is always 'ongoing'. |
12 | Development of tests (including test data) for regular testing of extraction accuracy. | Ongoing: A strategy now outlined. Different types of tests required for different parts of the system. | To do: a) Initial selection of relevant test material (e.g. sample studies) for each source (or source type). b) Automated systems for comparing actual versus expected values required. |
13 | Preparation and testing of a RESTful (or possibly GraphQL) API for supporting data access. | Not yet begun: Other tasks have had to take priority up to now. | Comment: a) Characterisation of data demands from the portal interface needs to be clarified. b) Usefulness of GraphQL instead of, or in addition to, a RESTful API to be explored. |
14 | Developing a web-based support tool to help data generators more easily apply the metadata at source. | Begun: Initial design work being carried out (to support metadata capture in EOSC Life WP14). | Comment: Version 1 expected Spring 2021. |
15 | Integration with AAI, developed and provided by EOSC-Life. Although the data itself will be public, access to development and data management systems will need to be controlled. | Not yet begun: Not needed at the moment. | Comment: Needs further details on how the portal will be integrated within EOSC hub and how development / production versions will be managed. |
16 | Contributing to an overall strategy around discoverability of data sources, within EOSC as a whole and within life science RIs in particular. | Not yet begun | Comment: Not clear how this can be progressed at present. |
17 | Exploration of how data sharing can be improved by the MDR (i.e. demonstrations of usefulness). | Not yet begun | Comment: The system needs to reach a certain degree of maturity before it can be properly evaluated. Once that is done a dialogue can be begun with users, both in general and with a designated test group. |
18 | Publication of journal papers around the MDR. | Begun: Outline of paper circulated and agreed. | To do: Text to be written in the near future. |
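
As an illustration of the md5-based mechanism mentioned under task 5, the following is a minimal sketch of how a stable identifier might be derived for a data object that has no PID. It is an assumption, not the MDR's actual implementation: the function name `hashed_id`, the choice of input fields (a study registry ID and an object title) and the normalisation rules are all hypothetical, as is the example registry ID.

```python
import hashlib


def hashed_id(*parts: str) -> str:
    """Return an md5 hex digest computed over normalised input fields.

    Hypothetical scheme: each field is trimmed and lower-cased, then the
    fields are joined with '|' so that field boundaries stay unambiguous.
    """
    normalised = "|".join(p.strip().lower() for p in parts)
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()


# Two renderings of the same (made-up) record give the same identifier.
first = hashed_id("NCT00000000", "Study Protocol")
second = hashed_id(" nct00000000 ", "study protocol")
print(first == second)
```

Because the inputs are normalised before hashing, repeated downloads of the same record would yield the same identifier, which is what would allow new data to be matched against already extracted objects. md5 is used here only as a fingerprint, not for security.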