Background and History

From ECRIN-MDR Wiki

Initial Planning, 2016-2017

Beginning in 2016, ECRIN's work within the H2020 CORBEL project, in particular its leadership of a group looking at 'data sharing' issues within clinical research, highlighted the need to improve the FAIRness of clinical research data. It became clear that if researchers made more and more data objects available to others, as they were being encouraged to do, those objects would often be held in a wide variety of places and made available under a wide range of conditions. Even discovering where the various data objects associated with a study were located might become difficult and time-consuming, and therefore costly. Once they were found, there would be the additional problem of understanding how to access them, because many such objects would only be available under controlled access. The concept of a 'metadata repository' that could bring all this discoverability, access and provenance (DAP) metadata together evolved out of these concerns.
The initial task was seen as the creation of a metadata schema that focused on the required discoverability, access and provenance data points. The first version of such a schema [1] was published in late 2016. In fact that metadata schema (now at version 5) has evolved into a combination of two schemas, one for studies and the other for the associated data objects. The first is based on a subset of the data points within the ClinicalTrials.gov trial registry (by far the largest trial registry in the world) and the second is based on DataCite. Two separate schemas are necessary because the relationship between studies and data objects in clinical research is many-to-many. It is therefore necessary to store study details and data object details separately, with a separate 'link' table indicating which data objects are associated with which study.
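The many-to-many arrangement described above can be illustrated with a minimal sketch. It is not the ECRIN schema: the table names (studies, data_objects, study_object_links), the column names and the sample rows are all hypothetical, chosen only to show how a separate link table lets one study have several data objects and one data object belong to several studies.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical study/object/link tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE studies (
            study_id INTEGER PRIMARY KEY,
            title    TEXT NOT NULL
        );
        CREATE TABLE data_objects (
            object_id INTEGER PRIMARY KEY,
            title     TEXT NOT NULL
        );
        -- One row per (study, data object) association.
        CREATE TABLE study_object_links (
            study_id  INTEGER NOT NULL REFERENCES studies(study_id),
            object_id INTEGER NOT NULL REFERENCES data_objects(object_id),
            PRIMARY KEY (study_id, object_id)
        );
        """
    )
    # Illustrative rows only; these are not real registry entries.
    conn.executemany(
        "INSERT INTO studies VALUES (?, ?)",
        [(1, "Study A"), (2, "Study B")],
    )
    conn.executemany(
        "INSERT INTO data_objects VALUES (?, ?)",
        [(10, "Protocol"), (11, "Pooled dataset")],
    )
    # Study A has two objects; the pooled dataset belongs to both studies.
    conn.executemany(
        "INSERT INTO study_object_links VALUES (?, ?)",
        [(1, 10), (1, 11), (2, 11)],
    )
    return conn


def objects_for_study(conn, study_id):
    """Return the titles of all data objects linked to one study."""
    rows = conn.execute(
        """
        SELECT o.title
        FROM data_objects AS o
        JOIN study_object_links AS l ON l.object_id = o.object_id
        WHERE l.study_id = ?
        ORDER BY o.object_id
        """,
        (study_id,),
    ).fetchall()
    return [row[0] for row in rows]


def studies_for_object(conn, object_id):
    """Return the titles of all studies linked to one data object."""
    rows = conn.execute(
        """
        SELECT s.title
        FROM studies AS s
        JOIN study_object_links AS l ON l.study_id = s.study_id
        WHERE l.object_id = ?
        ORDER BY s.study_id
        """,
        (object_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

With these sample rows, looking up the objects for the first study returns both the protocol and the pooled dataset, while looking up the studies for the pooled dataset returns both studies. Neither lookup would be possible with a single foreign key on either table.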

The XDC project and the pilot MDR, 2017-2020

The opportunity to actually build a demonstrator MDR came in 2017, when the H2020 project Extreme Data Cloud (XDC) was developed, with the MDR as one of the proposed use cases. This project focused on developing services for very large or very heterogeneous data sets. Clinical research data is not large in volume (not compared to the huge volumes generated by, for example, high energy or particle physics research) but considered as a whole it is extremely heterogeneous in nature, with many hundreds of thousands of small files, in different formats, located in many different places. ECRIN therefore set about specifying the MDR portal, as well as developing systems to collect and extract data from different sources.
The system was to be developed with two partners in XDC: OneData, based in Poland, and INFN (Istituto Nazionale di Fisica Nucleare) at Bologna. INFN would provide the IT infrastructure and carry out indexing of the collected data using Elasticsearch, whereas OneData would provide the file storage system and also the web portal (to ECRIN's specification).
Unfortunately, ECRIN suffered from severe staff shortages in 2018 and early 2019, and progress was not as fast as we would have liked. Nevertheless, by the end of 2019 a functional MDR demonstrator had been built, which was further enhanced in early 2020. This was able to take a full global set of study data, augment it with data object data (both from study sources like trial registries and from a few data object repositories), and make that metadata searchable using the web portal.
The demonstrator was successful and well received but there were some issues with the XDC infrastructure. In particular, while the focus and expectation of most XDC systems was on large-scale file management, with the MDR we had no files at all - just a large collection of metadata. This sometimes made data ingestion into the OneData infrastructure an awkward and slow process. Although the indexing support provided by INFN and the portal developed by OneData were both of very high quality, it seemed that further development of the MDR would be better served by using a more conventional, direct approach to data access, and the development of specialist APIs around the data.

The European Open Science Cloud and the MDR, 2020 onwards

The development of the European Open Science Cloud's H2020 project EOSC-Life provided the opportunity for ECRIN to take most of the development 'in-house', though working in collaboration with other EOSC-Life participants. To support this development ECRIN has purchased a cluster of servers hosted by the French cloud supplier OVH.
There are now two parallel tracks of development. The expansion of data sources, the automation of the system, the further refinement of the schema, and the development of tools to aid metadata production are funded from EOSC-Life, while the development of an ECRIN-designed web portal, and the associated indexing, has been funded internally by ECRIN itself. API development will be funded by both. It is the intention to integrate the completed portal with other services in the European Open Science Cloud's EOSC-hub.
The current progress of the project within EOSC-Life, against objectives, is tabulated on the Progress (EOSC Life) web page.

Notes

  1. Canham, S., Ohmann, C. A metadata schema for data objects in clinical research. Trials 17, 557 (2016). https://doi.org/10.1186/s13063-016-1686-5