The ECRIN Metadata Schemas

From ECRIN-MDR Wiki
Revision as of 21:00, 27 October 2020 by Admin

Introduction

A metadata schema for clinical research data objects was first developed by ECRIN in 2016 [1]. Its purpose was to make the wide range of data objects generated by clinical research activity, which are scattered across many different repositories, easier to discover, and in particular to support the development of a proposed metadata repository (MDR) for clinical study data objects.

There are in fact two schemas, one for studies and one for data objects.
The study schema is based on the main data points used by ClinicalTrials.gov, by far the largest trial registry in the world, with about 350,000 existing study entries. Those data points are themselves based around the core dataset required by the WHO and so, in broad terms, are also supported by the other 18 globally recognised trial registries. Trial registry data, and ClinicalTrials.gov data in particular, therefore represent the de facto standard data model for describing clinical research studies.
The data object schema is based on the DataCite standard (version 3.1), extended to cover the needs of clinical researchers, specifically to provide additional data points covering:

  • Location, ownership and access arrangements for data objects, many of which would not be immediately or publicly available, and instead require an application process, usually to the study investigator or sponsor, for access to be granted.
  • Links to the generating studies. Apart from journal articles, most of the data objects generated by clinical research are closely linked to the study or studies that generated them, and are usually discovered using the study’s name or identifiers.

Discoverability – Access – Provenance

Taken together, the study and data object schemas provide a Discoverability – Access – Provenance (DAP) metadata schema. The ECRIN schema does not attempt to cover descriptive metadata, e.g. the detailed data dictionaries describing the structure of a dataset. (Such descriptive metadata files are of course themselves data objects, and ECRIN schema data could and should be used to structure their DAP metadata, as well as that of the data to which they refer.)

Note that the relationship between studies and data objects is many-to-many rather than one-to-one. Any system dealing with this information needs to take this into account by maintaining the data for studies and data objects separately, linking them as appropriate. Each element has to have a reference to the other element type – a study record has one or more references to linked data object records, whilst a data object includes one or more references to ‘parent’ studies.
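The two-way referencing described above can be sketched in Python. This is only a minimal illustration, not part of the ECRIN schema: the class names, the field names (`study_id`, `linked_object_ids`, `parent_study_ids`) and the `link` helper are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StudyRecord:
    # Hypothetical field names; the real schema defines many more data points.
    study_id: str
    title: str
    linked_object_ids: List[str] = field(default_factory=list)


@dataclass
class DataObjectRecord:
    # Hypothetical field names, chosen only for this sketch.
    object_id: str
    title: str
    parent_study_ids: List[str] = field(default_factory=list)


def link(study: StudyRecord, data_object: DataObjectRecord) -> None:
    """Record the relationship on both sides, skipping duplicates."""
    if data_object.object_id not in study.linked_object_ids:
        study.linked_object_ids.append(data_object.object_id)
    if study.study_id not in data_object.parent_study_ids:
        data_object.parent_study_ids.append(study.study_id)


if __name__ == "__main__":
    trial_a = StudyRecord("S1", "Example trial A")
    trial_b = StudyRecord("S2", "Example trial B")
    pooled_dataset = DataObjectRecord("O1", "Pooled analysis dataset")

    # One data object generated by two studies: a many-to-many case.
    link(trial_a, pooled_dataset)
    link(trial_b, pooled_dataset)
    print(pooled_dataset.parent_study_ids)
```

Studies and data objects are held as separate records, and each side carries only the identifiers of the other, so a data object shared by several studies is stored once.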

The schemas have 42 main data points (though many of these are composite), split into six sections, A – F. Section A has 15 data points relating to study objects, while sections B – F have 27 data points relating to the data objects themselves.

Please note that this paper presents summaries of the metadata schemas and does not fully describe how the data would be stored, e.g. within databases or JSON files. In those contexts additional identifiers would be used to provide record keys and to link the data points. For example, in a database some form of join table would be used to link study and data object records, rather than the reference lists used in the schema. Appendices 2 and 3 provide more details on the data points required in any practical implementation of the schema, as both a data dictionary and as JSON definition files.
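As an illustration of the join-table approach mentioned above, the following sketch uses Python's built-in `sqlite3` module with an in-memory database. The table names, column names and sample rows are hypothetical and are not taken from the ECRIN data dictionary; they only show how a link table can replace the reference lists.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a study/data-object join table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE studies (
            study_id TEXT PRIMARY KEY,
            title    TEXT NOT NULL
        );
        CREATE TABLE data_objects (
            object_id TEXT PRIMARY KEY,
            title     TEXT NOT NULL
        );
        -- Join table: one row per study / data object pair.
        CREATE TABLE study_object_links (
            study_id  TEXT NOT NULL REFERENCES studies(study_id),
            object_id TEXT NOT NULL REFERENCES data_objects(object_id),
            PRIMARY KEY (study_id, object_id)
        );
        """
    )
    # Sample rows, invented for this example.
    conn.executemany(
        "INSERT INTO studies VALUES (?, ?)",
        [("S1", "Example trial A"), ("S2", "Example trial B")],
    )
    conn.executemany(
        "INSERT INTO data_objects VALUES (?, ?)",
        [("O1", "Pooled analysis dataset"), ("O2", "Trial A protocol")],
    )
    conn.executemany(
        "INSERT INTO study_object_links VALUES (?, ?)",
        [("S1", "O1"), ("S1", "O2"), ("S2", "O1")],
    )
    conn.commit()
    return conn


def objects_for_study(conn: sqlite3.Connection, study_id: str) -> list:
    """Return the ids of the data objects linked to one study."""
    rows = conn.execute(
        "SELECT object_id FROM study_object_links "
        "WHERE study_id = ? ORDER BY object_id",
        (study_id,),
    ).fetchall()
    return [row[0] for row in rows]


def studies_for_object(conn: sqlite3.Connection, object_id: str) -> list:
    """Return the ids of the 'parent' studies of one data object."""
    rows = conn.execute(
        "SELECT study_id FROM study_object_links "
        "WHERE object_id = ? ORDER BY study_id",
        (object_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

The composite primary key on the link table prevents the same study and data object pair from being recorded twice, and the relationship can be queried from either direction without duplicating any study or data object data.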

History

In April 2018 the metadata schema was updated as version 2 [2], and a further version (2.2) followed in February 2019 [3]. Version 3.0 was developed in November 2019 [4], after extensive work with different data sources had revealed some deficiencies in the original schema, and a slightly revised version 4.0 was created in September 2020. The current version (5.0) dates from October 2020.