The ECRIN Metadata Schemas
Latest revision as of 10:50, 11 November 2022
Last updated: 20/09/2022
Introduction
A metadata schema for clinical research data objects was first developed by ECRIN in 2016 [1], as a mechanism for supporting increased discovery of the wide range of data objects, scattered across many different repositories, that are generated by clinical research activity, and in particular to support the development of a proposed metadata repository, or MDR, for clinical study data objects.
There are in fact two schemas, one for studies and one for data objects.
The study schema is based on the main data points used by ClinicalTrials.gov, by far the largest trial registry in the world, with about 440,000 existing study entries. Those data points are themselves based around the core dataset required by the WHO and so – in broad terms – are also supported by the other 18 globally recognised trial registries. Trial registry data, and that from ClinicalTrials.gov in particular, represents the de facto standard data model for describing clinical research studies.
The data object schema is based on the DataCite standard (version 3.1, https://schema.datacite.org/meta/kernel-3.1/), extended to cover the needs of clinical researchers, specifically to provide additional data points covering:
- Location, ownership and access arrangements for data objects, many of which would not be immediately or publicly available, and instead require an application process, usually to the study investigator or sponsor, for access to be granted.
- Links to the generating studies. Apart from journal articles, most of the data objects generated by clinical research are closely linked to the study or studies that generated them, and are usually discovered using the study’s name or identifiers.
Note that the relationship between studies and data objects is many-to-many rather than one-to-many. Any system managing this information needs to take this into account by maintaining the data for studies and data objects separately, linking them as appropriate. Each element has to have a reference to the other element type – a study record has one or more references to linked data object records, whilst a data object includes one or more references to ‘parent’ studies.
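The many-to-many arrangement described above can be illustrated with a minimal Python sketch. It is not the ECRIN implementation: the class names (`Study`, `DataObject`, `MetadataStore`), the field names and the identifiers used are all hypothetical, chosen only to show studies and data objects held separately, with each record carrying references to records of the other type.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Study:
    # Hypothetical fields; the real study schema has many more data points.
    study_id: str
    title: str
    linked_object_ids: Set[str] = field(default_factory=set)


@dataclass
class DataObject:
    # Hypothetical fields; the real data object schema extends DataCite.
    object_id: str
    title: str
    parent_study_ids: Set[str] = field(default_factory=set)


class MetadataStore:
    """Keeps studies and data objects in separate collections and links them."""

    def __init__(self) -> None:
        self.studies: Dict[str, Study] = {}
        self.objects: Dict[str, DataObject] = {}

    def add_study(self, study: Study) -> None:
        self.studies[study.study_id] = study

    def add_object(self, obj: DataObject) -> None:
        self.objects[obj.object_id] = obj

    def link(self, study_id: str, object_id: str) -> None:
        # Record the reference on both sides so either record can be
        # used as the starting point for discovery.
        study = self.studies[study_id]
        obj = self.objects[object_id]
        study.linked_object_ids.add(object_id)
        obj.parent_study_ids.add(study_id)

    def objects_for_study(self, study_id: str) -> List[str]:
        return sorted(self.studies[study_id].linked_object_ids)

    def studies_for_object(self, object_id: str) -> List[str]:
        return sorted(self.objects[object_id].parent_study_ids)


# Illustrative data only: one study with two objects, and one object
# (a pooled dataset) shared by two studies.
store = MetadataStore()
store.add_study(Study("study-1", "First example study"))
store.add_study(Study("study-2", "Second example study"))
store.add_object(DataObject("obj-A", "Protocol for study 1"))
store.add_object(DataObject("obj-B", "Pooled dataset from both studies"))
store.link("study-1", "obj-A")
store.link("study-1", "obj-B")
store.link("study-2", "obj-B")
```

Storing the reference on both records mirrors the description above; a separate link table would be an equally reasonable design in a relational database.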
Discoverability – Access – Provenance
Taken together, the study and data object schemas provide a Discoverability – Access – Provenance (DAP) metadata schema. The ECRIN schema does not attempt to cover descriptive metadata, e.g. the detailed data dictionaries describing the structure of a dataset. Such descriptive metadata files are of course themselves data objects, and ECRIN schema data could and should be used to structure their DAP metadata as well as that of the data to which they refer.
The schemas have 46 main data points (though many of these are composite), split into six sections, A – F. Section A has 19 data points relating to study objects, while sections B – F have 27 data points relating to the data objects themselves.
A summary of these data points, arranged in these sections, and categorised as mandatory, recommended or optional (following DataCite) is available at Summary Tables, while a more detailed description of each of the data points is at Schema Description.
History
In April 2018, the metadata schema was updated to version 2, and a further version followed in February 2019 (version 2.2). Version 3.0 was developed in November 2019, after extensive work with different data sources had revealed some deficiencies with the original schema, and slightly revised versions 4.0 and 5.0 were created in September and October 2020 respectively. Version 6.0 appeared in August 2021 after some minor simplifications of the schema and integration with ROR organisation data, while the current version (7.1), including geographical data on study location for the first time, and clarifying the use of study and object contributor data, dates from November 2022.
A full history of the various versions of the schema (as JSON definition files) and summaries of the changes between each version can be found at Metadata Change History.
[1] Canham S & Ohmann C. A metadata schema for data objects in clinical research. Trials, volume 17, Article number: 557 (2016)