The ECRIN Metadata Schemas
A metadata schema for clinical research data objects was first developed by ECRIN in 2016. It was intended as a mechanism for supporting increased discovery of the wide range of data objects generated by clinical research activity, which are scattered across many different repositories, and in particular to support the development of a proposed metadata repository (MDR) for clinical study data objects.
There are in fact two schemas: one for studies and one for data objects.
The study schema is based on the main data points used by ClinicalTrials.gov, by far the largest trial registry in the world, with about 380,000 existing study entries. Those data points are themselves based around the core dataset required by the WHO and so, in broad terms, are also supported by the other 18 globally recognised trial registries. Trial registry data, and that from ClinicalTrials.gov in particular, represents the de facto standard data model for describing clinical research studies.
The data object schema is based on the DataCite standard (version 3.1), extended to cover the needs of clinical researchers, specifically to provide additional data points covering:
- Location, ownership and access arrangements for data objects, many of which would not be immediately or publicly available, and instead require an application process, usually to the study investigator or sponsor, for access to be granted.
- Links to the generating studies. Apart from journal articles, most of the data objects generated by clinical research are closely linked to the study or studies that generated them, and are usually discovered using the study’s name or identifiers.
Note that the relationship between studies and data objects is many-to-many rather than one-to-one. Any system dealing with this information needs to take this into account by maintaining the data for studies and data objects separately, linking them as appropriate. Each element has to have a reference to the other element type – a study record has one or more references to linked data object records, whilst a data object includes one or more references to ‘parent’ studies.
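The many-to-many arrangement described above can be sketched in a few lines of Python. This is a minimal illustration only, not the ECRIN schema or the MDR implementation: the class names (Study, DataObject, Registry), the field names and the build_example helper are all hypothetical, chosen just to show studies and data objects being held separately with references in both directions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Study:
    # Hypothetical minimal study record; the field names are illustrative only.
    study_id: str
    title: str
    linked_object_ids: set[str] = field(default_factory=set)


@dataclass
class DataObject:
    # Hypothetical minimal data object record with references to its parent studies.
    object_id: str
    title: str
    parent_study_ids: set[str] = field(default_factory=set)


class Registry:
    """Holds studies and data objects separately and links them by identifier."""

    def __init__(self) -> None:
        self.studies: dict[str, Study] = {}
        self.objects: dict[str, DataObject] = {}

    def add_study(self, study: Study) -> None:
        self.studies[study.study_id] = study

    def add_object(self, obj: DataObject) -> None:
        self.objects[obj.object_id] = obj

    def link(self, study_id: str, object_id: str) -> None:
        # Record the reference on both sides so either record leads to the other.
        study = self.studies[study_id]
        obj = self.objects[object_id]
        study.linked_object_ids.add(object_id)
        obj.parent_study_ids.add(study_id)


def build_example() -> Registry:
    # Two studies share one pooled dataset; one study also has its own protocol.
    reg = Registry()
    reg.add_study(Study("S1", "Trial A"))
    reg.add_study(Study("S2", "Trial B"))
    reg.add_object(DataObject("O1", "Pooled dataset"))
    reg.add_object(DataObject("O2", "Trial A protocol"))
    reg.link("S1", "O1")
    reg.link("S2", "O1")
    reg.link("S1", "O2")
    return reg
```

In this sketch the pooled dataset O1 ends up with two parent studies, while study S1 references two data objects, which is the situation a one-to-one model could not represent.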
Discoverability – Access – Provenance
Taken together, the study and data object schemas provide a Discoverability – Access – Provenance (DAP) metadata schema. The ECRIN schemas do not attempt to cover descriptive metadata, e.g. the detailed data dictionaries describing the structure of a dataset. (Such descriptive metadata files are of course themselves data objects, and ECRIN schema data could and should be used to structure their DAP metadata as well as that of the data to which they refer.)
The schemas have 42 main data points (though many of these are composite), split into six sections, A – F. Section A has 15 data points relating to study objects, while sections B – F have 27 data points relating to the data objects themselves.
A summary of these data points, arranged in these sections and categorised as mandatory, recommended or optional (following DataCite), is available at Summary Tables, while a more detailed description of each data point is at Schema Description.
In April 2018, the metadata schema was updated to version 2, followed by version 2.2 in February 2019. Version 3.0 was developed in November 2019, after extensive work with different data sources had revealed some deficiencies in the original schema, and slightly revised versions 4.0 and 5.0 were created in September and October 2020 respectively. The current version (6.0) is from August 2021.
A full history of the various versions of the schema (as JSON definition files) and summaries of the changes between each version can be found at Metadata Change History.