The ECRIN Metadata Schemas

From ECRIN-MDR Wiki

Revision as of 22:49, 27 October 2020

Introduction

A metadata schema for clinical research data objects was first developed by ECRIN in 2016 [1]. Its purpose was to make it easier to discover the wide range of data objects generated by clinical research activity, which are scattered across many different repositories, and in particular to support the development of a proposed metadata repository (MDR) for clinical study data objects.

There are in fact two schemas, one for studies and one for data objects.
The study schema is based on the main data points used by ClinicalTrials.gov, by far the largest trial registry in the world, with about 350,000 existing study entries. Those data points are themselves based around the core dataset required by the WHO and so, in broad terms, are also supported by the other 18 globally recognised trial registries. Trial registry data, and that from ClinicalTrials.gov in particular, represents the de facto standard data model for describing clinical research studies.
The data object schema is based on the DataCite standard (version 3.1), extended to cover the needs of clinical researchers, specifically to provide additional data points covering:

  • Location, ownership and access arrangements for data objects, many of which would not be immediately or publicly available, and instead require an application process, usually to the study investigator or sponsor, for access to be granted.
  • Links to the generating studies. Apart from journal articles, most of the data objects generated by clinical research are closely linked to the study or studies that generated them, and are usually discovered using the study’s name or identifiers.

Note that the relationship between studies and data objects is many-to-many rather than one-to-one. Any system dealing with this information needs to take this into account by maintaining the data for studies and data objects separately, linking them as appropriate. Each element has to have a reference to the other element type – a study record has one or more references to linked data object records, whilst a data object includes one or more references to ‘parent’ studies.
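The many-to-many arrangement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ECRIN data model: the class names (`Study`, `DataObject`, `Registry`), the field names and the example identifiers are all hypothetical. It shows studies and data objects stored separately, with each record holding references to records of the other type.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Study:
    # Hypothetical minimal study record; real schema fields are not shown here.
    study_id: str
    title: str
    linked_object_ids: set[str] = field(default_factory=set)


@dataclass
class DataObject:
    # Hypothetical minimal data object record.
    object_id: str
    title: str
    parent_study_ids: set[str] = field(default_factory=set)


class Registry:
    """Keeps studies and data objects in separate stores and links them."""

    def __init__(self) -> None:
        self.studies: dict[str, Study] = {}
        self.data_objects: dict[str, DataObject] = {}

    def add_study(self, study: Study) -> None:
        self.studies[study.study_id] = study

    def add_data_object(self, obj: DataObject) -> None:
        self.data_objects[obj.object_id] = obj

    def link(self, study_id: str, object_id: str) -> None:
        # Record the reference on both sides so either record can be
        # used as the starting point for a lookup.
        study = self.studies[study_id]
        obj = self.data_objects[object_id]
        study.linked_object_ids.add(object_id)
        obj.parent_study_ids.add(study_id)

    def objects_for_study(self, study_id: str) -> list[str]:
        return sorted(self.studies[study_id].linked_object_ids)

    def studies_for_object(self, object_id: str) -> list[str]:
        return sorted(self.data_objects[object_id].parent_study_ids)


if __name__ == "__main__":
    registry = Registry()
    registry.add_study(Study("S1", "Example trial A"))
    registry.add_study(Study("S2", "Example trial B"))
    registry.add_data_object(DataObject("O1", "Pooled analysis dataset"))
    registry.add_data_object(DataObject("O2", "Protocol for trial A"))

    # One data object with two parent studies, one study with two objects.
    registry.link("S1", "O1")
    registry.link("S2", "O1")
    registry.link("S1", "O2")

    print(registry.objects_for_study("S1"))
    print(registry.studies_for_object("O1"))
```

Storing the reference on both sides mirrors the requirement that each element type points to the other. A relational implementation would more likely use a separate link table, which is an equally valid design.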

Discoverability – Access – Provenance

Taken together, the study and data object schemas provide a Discoverability – Access – Provenance (DAP) metadata schema. The ECRIN schema does not attempt to cover descriptive metadata, e.g. the detailed data dictionaries describing the structure of a dataset. (Such descriptive metadata files are of course themselves data objects, and ECRIN schema data could and should be used to structure their DAP metadata as well as that of the data to which they refer.)

The schemas have 42 main data points (though many of these are composite), split into six sections, A – F. Section A has 15 data points relating to study objects, while sections B – F have 27 data points relating to the data objects themselves.

A summary of these data points, arranged in these sections and categorised as mandatory, recommended or optional (following DataCite), is available at Summary Tables, while a more detailed description of each data point is at Schema Description.

History

In April 2018, the metadata schema was updated to version 2, and a further version followed in February 2019 (version 2.2). Version 3.0 was developed in November 2019, after extensive work with different data sources had revealed some deficiencies with the original schema, and a slightly revised version 4.0 was created in September 2020. The current version (5.0) is from October 2020.
A full history of the various versions of the schema (as JSON definition files) and summaries of the changes between each version can be found at Metadata Change History.