Difference between revisions of "Metadata standards"

From ECRIN-MDR Wiki
(Summary tables)
 
(42 intermediate revisions by the same user not shown)
Line 1: Line 1:
== Introduction ==
 
A metadata schema for clinical research data objects was first developed by ECRIN in 2016 <ref>Canham, S., Ohmann, C. A metadata schema for data objects in clinical research. Trials 17, 557 (2016). https://doi.org/10.1186/s13063-016-1686-5</ref>, as a mechanism for supporting increased discovery of the wide range of data objects, scattered across many different repositories, that are generated by clinical research activity, and in particular to support the development of a proposed metadata repository, or MDR, for clinical study data objects.
 
<br/><br/>
 
It was based on the DataCite standard (version 3.1)<ref>https://schema.datacite.org/meta/kernel-3.1/</ref>, extended to cover the needs of clinical researchers, specifically to provide additional data covering:
 
* Research study identifiers and characteristics, including links to clinical trial registries. These were added because – apart from journal articles – most of the data objects in clinical research are closely linked to the study that generated them, and are usually discovered using the study’s name or identifiers.
 
* Location, ownership and access arrangements for data objects, many of which would not be immediately or publicly available, and instead require an application process, usually to the study investigator or sponsor, for access to be granted.
 
<br/>
 
In April 2018, this metadata schema was updated as version 2, and a further version followed in February 2019 (version 2.2)<ref>https://zenodo.org/record/3534313</ref>. Version 3.0<ref>https://zenodo.org/record/3562911</ref> was developed in November 2019, after extensive work with different data sources had revealed some deficiencies with the original schema. Version 4 was created in September 2020 and brought a major revision to the dataset-specific properties as well as minor changes elsewhere, including the introduction of provenance strings for both studies and data objects. Version 5 followed in October 2020, bringing changes to the topic-related data and simplifying some aspects of the schema. (Links to details of changes between versions can be found on the JSON schema pages.)
 
<br/><br/>
 
There are in fact two related schemas, one for studies and one for data objects. This is because the relationship between studies and data objects is many-to-many rather than one-to-one, and any system needs to take this into account by maintaining the data for studies and data objects separately, linking them as appropriate.
 
<br/>
 
Each element has to have a reference to the other element type – a study record has one or more references to linked data object records, whilst a data object includes one or more references to ‘parent’ studies.
 
<br/><br/>
 
The proposed schemas have 42 main data points (though some of these are composite), split into six sections, A – F. Section A has 15 data points relating to study objects, while sections B - F have 27 data points relating to the data objects themselves.
 
<br/><br/>
 
Please note that this page presents summaries of the metadata schemas and does not fully describe how the data would be stored, e.g. within databases or JSON files. In those contexts additional identifiers would be used to provide record keys and to link the data points. For example, in a database some form of join table would be used to link study and data object records, rather than the reference lists used in the schema.
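
The join-table arrangement mentioned above can be sketched as follows. This is only an illustration, not the storage layout actually used by the MDR: the table names, column names and sample records are all assumptions made for the example, and an in-memory SQLite database stands in for whatever system is really used.

```python
import sqlite3

# Hypothetical table and column names, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE studies (
        study_id INTEGER PRIMARY KEY,
        display_title TEXT NOT NULL
    );
    CREATE TABLE data_objects (
        object_id INTEGER PRIMARY KEY,
        display_title TEXT NOT NULL
    );
    -- The join table carries the many-to-many study / data object links.
    CREATE TABLE study_object_links (
        study_id INTEGER NOT NULL REFERENCES studies(study_id),
        object_id INTEGER NOT NULL REFERENCES data_objects(object_id),
        PRIMARY KEY (study_id, object_id)
    );
    """
)

# Invented sample records: object 11 is linked to both studies.
conn.executemany("INSERT INTO studies VALUES (?, ?)",
                 [(1, "Study A"), (2, "Study B")])
conn.executemany("INSERT INTO data_objects VALUES (?, ?)",
                 [(10, "Study A :: Protocol"), (11, "Pooled dataset")])
conn.executemany("INSERT INTO study_object_links VALUES (?, ?)",
                 [(1, 10), (1, 11), (2, 11)])


def objects_for_study(study_id):
    """Return the ids of the data objects linked to a study (cf. A.14)."""
    rows = conn.execute(
        "SELECT object_id FROM study_object_links "
        "WHERE study_id = ? ORDER BY object_id",
        (study_id,),
    ).fetchall()
    return [row[0] for row in rows]


def studies_for_object(object_id):
    """Return the ids of the 'parent' studies of a data object (cf. B.6)."""
    rows = conn.execute(
        "SELECT study_id FROM study_object_links "
        "WHERE object_id = ? ORDER BY study_id",
        (object_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

The reference lists in the schema (A.14 and B.6) can then be regenerated from the one link table whenever a record is exported.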
 
 
 
== Summary tables ==
 
 
<br/>
 
Line 27: Line 10:
 
|- style="vertical-align:top; background-color:white"
 
  
| '''A.1 Display Title'''<span style="color:red;font-size:150%;font-weight:bold"> </span><br/>''{display title, language code<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> }''<br/><br/><br/><br/><br/><br/>'''A.8 Study Type'''<span style="color:blue;font-size:150%;font-weight:bold"> &#10625;</span><br/><br/>'''A.9 Study Status'''<span style="color:blue;font-size:150%;font-weight:bold"> &#10625;</span> '''<br/><br/><br/><br/><br/>A.14 Linked Data Objects'''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/> ''{object identifiers}''<br/> '''<br/>A.15 Provenance data'''<br/>   
+
| '''A.1 Display Title'''<span style="color:red;font-size:150%;font-weight:bold"> </span><br/>''{display title, language code<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> }''<br/><br/><br/><br/><br/><br/>'''A.8 Study Type'''<span style="color:blue;font-size:150%;font-weight:bold"> &#10625;</span><br/><br/>'''A.9 Study Status'''<span style="color:blue;font-size:150%;font-weight:bold"> &#10625;</span> '''<br/><br/><br/><br/><br/>A.14 Linked Data Objects'''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/> ''{object identifiers}''<br/> '''<br/>A.15 Provenance String'''<br/>   
  
 
|| '''A.2 Study Identifiers''' <span style="color:red;font-size:150%;font-weight:bold"> *</span><br/>''{identifier type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , Identifier value, source organisation, date, url link}''<br/><br/>  
 
Line 61: Line 44:
 
| style="color:darkblue;font-weight:bold" colspan="3"| B. Data Object Identifiers   
 
 
|- style="vertical-align:top; background-color:white"
 
| '''B.1 DOI'''<span style="font-size:150%"> </span><br/><br/><br/>'''B.2 Display Title'''<br/><br/>'''B.6 Linked Studies'''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/>''{study identifiers}''<br/><br/>'''B.7 Provenance Data'''<br/><br/>
+
| '''B.1 DOI'''<span style="font-size:150%"> </span><br/><br/><br/>'''B.2 Display Title'''<br/><br/>'''B.6 Linked Studies'''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/>''{study identifiers}''<br/><br/>'''B.7 Provenance String'''<br/><br/>
 
|| '''B.3 Version'''<span style="font-size:150%"> </span>  
 
 
|| '''B.4 Object Identifiers'''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/>''{Identifier type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , Identifier value, source organisation, application date}''<br/> <br/>  
 
Line 70: Line 53:
 
| style="color:darkblue;font-weight:bold" colspan="3"| C. Creators and Contributors   
 
 
|- style="vertical-align:top; background-color:white"
 
| '''C.1 Creators'''<br/>  
+
| '''C.1 Creators'''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/>''{name type, person details OR organisation }''<br/> <br/>
''{name type, person details OR organisation }''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/> <br/>  
+
|| <span style="font-size:150%"> </span> <br/><br/><br/>''person details = given name, family name, full name, ORCID identifier, affiliation <br/>organisation = organisation default name and, if the organisation exists in the context database, the associated  integer id''
person details = given name, family name, full name, identifier, identifier scheme, affiliation, affiliation identifier, affiliation identifier scheme
+
|| '''C.2 Contributors'''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/>''{contribution type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , name type, person details OR organisation }''<br/><br/>For most data objects contributors will be the contributors to the associated study or studies.<br/>
|| For most data objects contributors should be the study contributors.<br/> <br/>  
 
 
For journal articles contributors will be authors, plus organisational study contributors  
 
||  '''C.2 Contributors'''<br/>
 
''{contribution type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , name type, person details OR organisation }''<span style="color:red;font-size:150%;font-weight:bold"> *</span>
 
  
 
|- style="vertical-align:top; background-color:lightblue"
 
 
| style="color:darkblue;font-weight:bold" colspan="3"| D. Object Dates   
 
 
|- style="vertical-align:top; background-color:white"
 
| '''D.1 Publication Year''' ||  ||  '''D.2 Dates'''
+
| '''D.1 Publication Year'''<span style="font-size:150%"> </span>  ||  ||  '''D.2 Dates'''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/>''{date type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , Is range, date as string, start year, start month, start day, end year, end month, end day, comments}''
''{date type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , Is range, date as string, start year, start month, start day, end year, end month, end day, comments}''<span style="color:red;font-size:150%;font-weight:bold"> *</span>
 
  
 
|- style="vertical-align:top; background-color:lightblue"
 
Line 88: Line 67:
 
|- style="vertical-align:top; background-color:white"
 
 
|'''E.1 Class'''<span style="color:blue;font-size:150%;font-weight:bold"> &#10625;</span><br/><br/>'''E.2 Type'''<span style="color:blue;font-size:150%;font-weight:bold"> &#10625;</span><br/>  
 
|| '''E.3 Record key type''' (datasets only)<br/>''{type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , text description}''<br/><br/>'''E.4 De-identification level''' (datasets only)<br/>''{type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , specific actions, text description}''<br/><br/>
+
|| '''E.3 Record key type''' (datasets only)<span style="font-size:150%"> </span> <br/>''{type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , text description}''<br/><br/>'''E.4 De-identification level''' (datasets only)<br/>''{type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , specific actions, text description}''<br/><br/>
'''E.5 Associated consent''' (datasets only)<br/>''{type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , specific restrictions, text description}''<br/><br/>'''E.6 Description'''<br/>''{description type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , label, description text, language code<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , contains html?}''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/><br/>  '''E.7 EOSC Category'''<span style="color:blue;font-size:150%;font-weight:bold"> &#10625;</span><br/><br/>
+
'''E.5 Associated consent''' (datasets only)<br/>''{type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , specific restrictions, text description}''<br/><br/>'''E.6 Description'''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/>''{description type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , label, description text, language code<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , contains html?}''<br/><br/>  '''E.7 EOSC Category'''<span style="color:blue;font-size:150%;font-weight:bold"> &#10625;</span><br/><br/>
'''E.8 Language'''<span style="color:blue;font-size:150%;font-weight:bold"> &#10625;</span><span style="color:red;font-size:150%;font-weight:bold"> *</span><br/><br/>'''E.9 Related Resources'''<br/>''{ relationship type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , target object}''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/><br/>  
+
'''E.8 Language'''<span style="color:blue;font-size:150%;font-weight:bold"> &#10625;</span><span style="color:red;font-size:150%;font-weight:bold"> *</span><br/><br/>'''E.9 Inter-object relationships'''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/>''{ relationship type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , target object}''<br/><br/>  
|| '''E.10 Topics''' (of data object)<br/>''{topic value, topic type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , topic vocabulary<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , topic code}''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/><br/>
+
|| '''E.10 Topics''' (of data object)<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/>''{topic type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , mesh coded?, topic code, topic value, topic qualcode, topic qualvalue, original value}''<br/><br/>
 
For most data objects topics should be the study topics.<br/>Journal articles will normally have their own listed topics
 
  
Line 97: Line 76:
 
| style="color:darkblue;font-weight:bold" colspan="3"| F. Object Location and Access Details   
 
 
|- style="vertical-align:top; background-color:white"
 
| '''F.1 Managing Organisation'''<br/><br/>
+
| '''F.1 Managing Organisation'''<span style="font-size:150%"> </span> <br/><br/>
 
'''F.2 Access Type'''<span style="color:blue;font-size:150%;font-weight:bold"> &#10625;</span><br/><br/>  
 
'''F.3 Access Details'''<br/><br/>
+
'''F.3 Access Details'''<br/>''{description, url of details, date url last checked}''<br/><br/>
'''F.4 Access Details URL''' <br/>
+
'''F.4 Physical Resources'''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/>''{repository organisation, resource url , url accessible?, date url last checked, resource type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , resource size, size units<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span>, comments }''<br/>
''{URL, Date last checked}''<br/><br/>
+
||<br/><br/><br/><br/><br/>(F3 is mandatory if access is non-public)<br/><br/>  ||  '''F.5 Rights'''<br/>''{name, rights uri, comments}''
(F3 and F4 are mandatory if access is non-public)<br/><br/>
 
'''F.5 Resources'''<br/>
 
''{repository organisation, URL, URL accessible, date URL last checked, resource type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , resource size, size units<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span>, resource comments }''<span style="color:red;font-size:150%;font-weight:bold"> *</span> <br/><br/>'''F.7 Provenance Data'''<br/>
 
|| ||  '''F.6 Rights'''
 
''{details, rights URI}''<span style="color:red;font-size:150%;font-weight:bold"> *</span>
 
 
|}
 
|}
  
Line 116: Line 90:
 
<br/>
 
 
====A.1 Display Title (1)====
 
This is by default, the shorter or 'public' title. If there is no such title the full scientific or protocol title needs to be used. Whatever title is used it should also appear within the list of study titles (see A.3), where a fuller set of title attributes can be provided. <br/>
+
This is, by default, the shorter or 'public' title. If there is no such title, the full scientific or protocol title needs to be used. Whatever title is used, it should also appear within the list of study titles (see A.3), where a fuller set of title attributes can be provided.<br/>
The language code indicates the language of the title using the two letter ISO language code, with default value 'en'.
 
<br/>
 
  
 
====A.2 Study Identifiers (0...n)====
 
Line 125: Line 97:
 
These IDs are composite. If provided, they must include  
 
 
* the identifier value,
 
* the identifier type (categorised, as selected from a predetermined list),
+
* the identifier type (categorised, as selected from a predetermined list of code-text pairs),
* the assigning organisation
+
* the assigning organisation (name and where available an Id within a suitable system).
 
* (optionally), the date the identifier was assigned  
 
* (optionally), any associated URL (for instance some public funder Ids in the US will link to a summary page about the grant and its use).  
+
* (optionally), any associated URL (for instance some public funder Ids in the US will link to a summary page about the grant and its use).
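
As a rough illustration of the composite structure listed above, the record could be modelled as shown below. This is a sketch only: the field names, the numeric type code and the sample identifier value are all assumptions made for the example and are not taken from the schema's JSON definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StudyIdentifier:
    """One entry in a study's identifier list (A.2). Field names are hypothetical."""

    identifier_value: str
    identifier_type_code: int           # code half of the code-text pair
    identifier_type_text: str           # text half of the code-text pair
    source_org_name: str                # the assigning organisation
    source_org_id: Optional[int] = None     # Id within a suitable system, if available
    identifier_date: Optional[str] = None   # optional: date the identifier was assigned
    url_link: Optional[str] = None          # optional: any associated URL

    def __post_init__(self):
        # Value, type and assigning organisation must be present if the
        # identifier is provided at all; date and URL are optional.
        for name in ("identifier_value", "identifier_type_text", "source_org_name"):
            if not getattr(self, name):
                raise ValueError(f"{name} is required for a study identifier")


# Placeholder values, not a real registry entry.
example = StudyIdentifier(
    identifier_value="NCT00000000",
    identifier_type_code=11,
    identifier_type_text="Trial registry ID",
    source_org_name="ClinicalTrials.gov",
)
```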
  
 
====A.3 Study Titles (0..n)====
 
Line 135: Line 107:
 
All titles should be included in this list. The type is composite, and should include:
 
 
* the title text,
 
* the title type (categorised, as selected from a predetermined list),
+
* the title type (categorised, as selected from a predetermined list of code-text pairs),
 
* the language of the title, as a 2 character ISO code,
 
 
* (optionally), any additional comments about their genesis (e.g. "authors' translation"),
 
Line 142: Line 114:
 
Most study registry systems require a brief, non-specialist description of the study, which usually ranges from a few lines to a paragraph or two. This can be useful in assessing the relevance of studies to a particular search task and so is included in the study data points.
 
<br/>
 
There should also be an indication of whether the description contains embedded html, so that display systems can interpret any tags correctly, rather than display them as 'raw' text.
+
There should also be an indication of whether the description contains embedded html, so that display systems can interpret any tags correctly, rather than display them as 'raw' text with visible angle brackets.
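
A display system might use the flag along the following lines. This is a minimal sketch: the function and parameter names are hypothetical, and a real system would also need to sanitise any embedded html before trusting it.

```python
import html


def render_description(text: str, contains_html: bool) -> str:
    """Prepare a study description for insertion into a web page.

    If the text is flagged as containing html it is passed through so that
    its tags are interpreted; otherwise special characters are escaped so
    that they display literally and cannot be mistaken for markup.
    """
    if contains_html:
        return text
    return html.escape(text)
```

For example, a plain-text description containing "dose < 5 mg" is escaped so that the "<" is shown as typed, whereas a description flagged as html keeps its tags intact.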
 
<br/>
 
  
Line 148: Line 120:
 
In recent years several trial registries have requested study sponsors and / or leads to indicate if they will make individual participant data and related documents available for sharing, and if so how and when the data would be available. As such a statement is central to the purpose of the MDR it is captured within the study data, so that if present it can be displayed.
 
 
<br/>
 
This data point also includes an indication of whether the data sharing statement contains embedded html, so that the tags can be interpreted correctly.
+
Again this data point should also include an indication of whether the data sharing statement contains embedded html, so that the tags can be interpreted correctly.
 
<br/>
 
  
Line 154: Line 126:
 
None, one or more design features of the study.
 
 
<br/>
 
The design features available will depend on whether the study is interventional or observational. Available types for interventional studies include Phase, Primary Purpose, Allocation method, Intervention Design and Masking. For observational studies the types include Observational Model, Time Perspective, and whether or not Specimens are retained.
+
The design features available will depend on whether the study is interventional or observational. Available types for interventional studies include Phase, Primary Purpose, Allocation method, Intervention Design and Masking. For observational studies the types include Observational Model, Time Perspective, and whether or not specimens are retained.
 
<br/>
 
 
In each case the possible values are categorised, and so restricted to a pre-defined set of values. This makes the feature types useful candidates for filtering of study records within a web portal and / or API.
 
 
<br/>
 
 
The composite study feature record is therefore
 
* feature type (categorised, as selected from a predetermined list), provided as a code-text pair.
+
* feature type (categorised code-text, as selected from a predetermined list).
* feature value, also categorised - each feature type has an associated list of options, each one available as a code-text pair.
+
* feature value, also categorised code-text. Each feature type has an associated list of options.
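
The dependence of the available feature types on the study type could be checked as sketched below. The feature type names are those listed above; the study type labels ("Interventional", "Observational"), the label used for retained specimens and the function names are assumptions made for the example.

```python
# Feature types named in the text above, grouped by the kind of study
# they apply to. The dictionary keys are assumed study type labels.
FEATURE_TYPES_BY_STUDY_TYPE = {
    "Interventional": {
        "Phase",
        "Primary Purpose",
        "Allocation method",
        "Intervention Design",
        "Masking",
    },
    "Observational": {
        "Observational Model",
        "Time Perspective",
        "Specimens retained",
    },
}


def allowed_feature_types(study_type: str) -> set:
    """Return the feature types that may be recorded for a study type."""
    return FEATURE_TYPES_BY_STUDY_TYPE.get(study_type, set())


def is_valid_feature(study_type: str, feature_type: str) -> bool:
    """Check that a feature type is applicable to the given study type."""
    return feature_type in allowed_feature_types(study_type)
```

Because both the types and their values come from fixed lists, the same lookup can drive filter options in a portal or API.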
  
 
====A.7 Study Topics (0…n)====
 
Line 167: Line 139:
 
In the context of clinical research, most data objects – which will not have listed topics associated with them – would, for purposes of discoverability, take on the topics or keywords associated with their parent study. (The exception is journal articles, which almost always do have linked keywords).
 
 
<br/>
 
The listed topics could be free text, but in many cases the text is structured, i.e. selected from a controlled vocabulary.  There are a variety of such controlled vocabularies available (MESH, ICD 10, MedDRA, SnoMed CT etc.). In many such schemes the controlled term is associated with a code. Either topic name or code can be provided, but preferably both should be supplied.  
+
In the context of clinical research, most data objects – which will not have listed topics associated with them – would, for purposes of discoverability, take on the topics or keywords associated with their parent study. (The exception is journal articles, which almost always do have linked keywords).
 
<br/>
 
Topics are also of a certain ‘type’ determining the domain in which they apply, e.g. ‘condition’, ‘organism’, ‘chemical / biological agent’, ‘geographic’ etc.  
+
The topics can be free text, but in many cases the text is structured, i.e. selected from a controlled vocabulary. The vocabulary used most often, by a large margin, is the MESH code system developed by the US National Library of Medicine. This is because MESH codes are applied to both PubMed records and ClinicalTrials.gov trial registry entries. MedDRA and ICD10 are also used by some sources, but in relatively tiny amounts. To try and provide a more consistent coding scheme for topics, non-coded terms are also matched, wherever possible, to MESH terms. Further work is required, however, to reduce the proportion of non-coded items.
<br/>
+
<br/><br/>
The composite study topic record is therefore
+
The study topic record is composite and has the following structure:
* topic type (categorised, as selected from a predetermined list),
+
* topic type (categorised, as selected from a predetermined list of code-text pairs). Topic types include ‘condition’, ‘organism’, ‘chemical / biological agent’, and ‘geographic’.
* topic value, the keyword or topic name
+
* A boolean indicating whether or not the term has been MESH coded,
* (if applicable) topic vocabulary (categorised, as selected from a predetermined list),
+
* the MESH code, if present
* (if applicable) topic code in the vocabulary system.
+
* the topic name or value - either the original or if MESH coded the preferred MESH term
 +
* a MESH qualifier code and qualifier value where one exists (applies only to PubMed articles at the moment),
 +
* the original value.
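
The way an original term, a MESH match and the stored topic value relate to each other can be sketched as follows. This is an illustration under stated assumptions, not the MDR's matching process: the field names, the function name and the lookup dictionary are hypothetical, and the code and preferred term in the usage lines are placeholders rather than real MESH entries.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class StudyTopic:
    """One study topic record (A.7). Field names are hypothetical."""

    topic_type: str                   # e.g. 'condition', 'organism'
    mesh_coded: bool                  # whether the term has been MESH coded
    topic_code: Optional[str]         # the MESH code, if present
    topic_value: str                  # preferred MESH term if coded, else the original
    original_value: str               # the term as supplied by the source
    topic_qualcode: Optional[str] = None    # MESH qualifier code, where one exists
    topic_qualvalue: Optional[str] = None   # MESH qualifier value, where one exists


def build_topic(topic_type: str,
                original_value: str,
                mesh_lookup: Dict[str, Tuple[str, str]]) -> StudyTopic:
    """Build a topic record, MESH coding the term when a match is found.

    mesh_lookup is an assumed mapping from a lower-cased term to a
    (code, preferred term) pair.
    """
    match = mesh_lookup.get(original_value.strip().lower())
    if match is None:
        # No match: keep the original text as the topic value, uncoded.
        return StudyTopic(topic_type, False, None, original_value, original_value)
    code, preferred_term = match
    return StudyTopic(topic_type, True, code, preferred_term, original_value)


# Placeholder lookup entry, not a real MESH code or term.
lookup = {"heart attack": ("D000000", "Placeholder Preferred Term")}
coded = build_topic("condition", "Heart attack", lookup)
uncoded = build_topic("condition", "Some rare syndrome", lookup)
```

Keeping the original value alongside the preferred term means that a term can be matched again later without going back to the source.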
  
 
====A.8 Study Type (1)====
 
Line 197: Line 171:
 
<br/>
 
  
====A.13 Related Studies (0..n)====
+
====A.13 Inter-study relationships (0..n)====
 
Studies can have relationships between themselves, for instance one study can be a feasibility study for a later one, or a study can represent an ‘expanded access’ version of a clinical trial (when a new drug is available for compassionate reasons, even though recipients fail eligibility criteria for the study, and its use is reported on a case-by-case basis), or one study can represent a continuation of another, in an ongoing series. This data can be useful for tracking related studies and their data objects and so is included in the metadata scheme. It is composite, with
* the relationship type (categorised, as selected from a predetermined list)
+
* the relationship type (categorised, as selected from a predetermined code-text list)
* the identifier of the other or ‘target’ study (in a suitable system).
+
* the identifier of the other or ‘target’ study (within a suitable system, normally the same system in which the ‘subject’ study is found).
  
 
====A.14 Linked Data Objects (1..n)====
 
 
The linked data objects (there should be at least one, representing the entry in a trial registry system) are listed as object identifiers, usually accession Ids within an appropriate database system (e.g. the ECRIN MDR).
 
  
====A.15 Provenance Data (1)====
+
====A.15 Provenance String (1)====
 
A string indicating the source or sources of the data (usually a trial registry) and the date-times on which the data was last downloaded from the source or sources.
 
 
<br/><br/>
 
Line 213: Line 187:
 
In line with the DataCite specification, the principal identifier for data objects is seen as a Digital Object Identifier, or DOI, providing a persistent identifier that can be cited in other contexts. This applies to any objects that are available to others (whether publicly or under managed access).
 
<br/>
 
Unfortunately a large number (the majority) of clinical research data objects, apart from journal articles, do not currently have a DOI. It remains to be seen if this situation improves. If it does not, consideration should be given to a mechanism for minting and applying one – if financially feasible and acceptable to the object creators – or alternative identifiers should be explored, especially if a resolvable URL exists which could be used to link immediately to the resource. The extent of this problem needs to be clarified.
+
Unfortunately the great majority of clinical research data objects, apart from journal articles, do not currently have a DOI. If this situation does not improve, consideration may need to be given to a mechanism for minting and applying DOIs – if financially feasible and acceptable to the object creators – or alternative identifiers should be explored, especially if a resolvable URL exists which could be used to link immediately to the resource.
 
<br/>
 
 +
 
====B.2 Display Title (1)====
 
 
 
A title for the object. For a journal article it would be a citation of the article in a standard format (up to 3 authors, title, source journal information). For many other data objects the display title would need to be constructed from the study name followed by the object title or type, because in general such objects do not have unique names. In many situations the study name prefix could be dropped as it would be clear from the context (e.g. the study name would be a heading to the list of data objects). The study name and object type or name should therefore be separated by a clear indicator (‘ :: ’ is used within the MDR) so that if and when necessary the two parts of a composite title can be displayed separately.
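
The composite title convention described above can be sketched as follows. The ' :: ' separator is the one used within the MDR; the function names and sample titles are assumptions made for the example.

```python
from typing import Optional, Tuple

# Separator between the study name and the object name or type,
# including its surrounding spaces.
TITLE_SEPARATOR = " :: "


def build_display_title(study_name: str, object_name: str) -> str:
    """Construct a composite display title from its two parts."""
    return f"{study_name}{TITLE_SEPARATOR}{object_name}"


def split_display_title(display_title: str) -> Tuple[Optional[str], str]:
    """Split a display title into (study name, object name).

    Titles without the separator (e.g. journal article citations) are
    returned unchanged, with None in place of the study name.
    """
    study_name, separator, object_name = display_title.partition(TITLE_SEPARATOR)
    if not separator:
        return None, display_title
    return study_name, object_name
```

A listing that already shows the study name as a heading can then display only the second element of the pair.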
Line 220: Line 195:
  
 
====B.3 Version (0..1)====
 
The version of the data object, in whatever notation was used by the original data object creators. Many versions of a particular dataset or document may have been created in the course of a clinical study, though only the version or versions that are made available for sharing are important in this context.  The normal expectation would be that the final version of a data object (e.g. a protocol) would be the one that was shared with others.
+
The version of the data object, in whatever notation was used by the original data object creators. Many versions of a particular dataset or document may have been created in the course of a clinical study but the normal expectation would be that the final version of a data object (e.g. a protocol) would be the one that was shared with others.<br/>
<br/>
+
Although it is relatively rare for more than one version of a data object to be made available, if that is the case they should be clearly differentiated using version codes (and relevant dates – see D.2 – and possibly descriptions – see E.6). E.8 describes how the relationship to previous or next versions can be made explicit. If a version item exists it should be displayed with the name and other identifiers.<br/>
In some cases multiple versions of the same document or dataset could be made available, or they might be specifically requested. Assuming the data objects have similar names, they will therefore need to be clearly differentiated using version codes (and relevant dates – see D.2 – and possibly descriptions – see E.6). E.8 describes how the relationship to previous or next versions can be made explicit. If multiple versions of the same dataset are available to access the version attribute should be completed and displayed with the name and other identifiers.
+
 
<br/>
 
 
====B.4 Object Identifiers (0...n)====
 
 
This refers to other unique identifiers that have been assigned to the data object in addition to its DOI primary identifier (for instance, for journal articles, a PubMed id). As with studies such IDs would be composite and include:
 
 
* the identifier value,
 
* the identifier type (categorised, as selected from a predetermined list),
+
* the identifier type (categorised, as selected from a predetermined list of code-text pairs),
* the assigning organisation
+
* the assigning organisation (name and where available an Id within a suitable system).
 
* (optionally), the date the identifier was assigned
 
  
Line 239: Line 213:
  
 
====B.6. Linked Studies (1...n)====
 
The linked studies (there should be at least one, or the data object should not be included in the system) are listed as study identifiers, usually accession Ids within an appropriate database system (such as the MDR).
+
The linked studies (there should be at least one, or the data object should not be included in the system) are listed as study identifiers, usually accession Ids within an appropriate database system.
 +
<br/>
 +
 
 +
====B.7. Provenance String (1)====
 +
A string indicating the source or sources of the data and the date-times on which the data was last downloaded from the source or sources.
 
<br/><br/>
 
  
 
== Creators and Contributors ==
 
 
====C.1 Creators (1...n)====
 
The main personnel involved in producing the data, or the authors of a publication. It may be a set of institutional and / or personal names. Each creator description, which is composite, therefore needs to indicate whether or not it refers to an individual or an organisation (or collaboration). If it is a person then fields are available for names, identifiers, and affiliation details. If not the organisation name needs to be provided. The composite structure (which follows closely that in DataCite) is therefore
+
The main personnel involved in producing the data, or the authors of a publication. It may be a set of institutional and / or personal names. Each creator description, which is composite, therefore needs to indicate whether or not it refers to an individual or an organisation (or collaboration). If it is a person then fields are available for names, ORCID and affiliation details. If not the organisation name needs to be provided, with an Id if one is available. The composite structure (which follows closely that in DataCite) is therefore
 
* whether an individual or not,  
 
 
&nbsp;&nbsp;and if they are…
 
** given name,  
+
* given name,  
** family name,  
+
* family name,  
** full name,  
+
* full name,  
** identifier, (ORCID id if available)
+
* ORCID id if available
** affiliation, (string description of department / organisation)
+
* affiliation, (string description of department / organisation)
** affiliation identifier, (if the organisation has a formal identifier)
 
** affiliation identifier scheme (e.g. ‘ISNI’, ‘RINGGOLD’)
 
 
&nbsp;&nbsp;but if they are not….
 
** The organisation name
+
* The organisation (name and where available an Id within a suitable system).
Most data objects, other than journal articles, are unlikely to have creators explicitly identified.
+
Most data objects where the metadata is harvested retrospectively, other than journal articles, are unlikely to have creators explicitly identified.
 
<br/>
 
 +
 
====C.2 Contributors (0...n)====
 
 
From DataCite, contributors are “other institutions and / or persons responsible for collecting, managing, distributing, or otherwise contributing to the development of the data object.”  A contributor record is composite and is essentially the same as that for creators, except that each needs to be prefixed with an indicator of  
 
* contributor type (categorised, as selected from a predetermined list)
+
* contributor type (categorised, as selected from a predetermined list of code-text pairs).
 
The types available include those defined within DataCite, but the list has been extended in the context of clinical research, to include (for example) trial sponsor, trial funder, device provider, central laboratory, public contact, study lead, (site) principal Investigator.  
 
 
<br/>
 
In general the contributor lists for data objects should be derived from the lists stored for their parent study, even though in the system contributors are only presented for data objects, not studies. These will usually include the study lead(s) and sponsors. Where data objects do record their contributors, they would normally take precedence over that of the study, but organisational contributors, in particular study sponsors and funders, should still be added to the data object list.  
+
In general the contributor lists for data objects should be derived from the lists stored for their parent study, even though in the system contributors are only presented for data objects, not studies. These will usually include the study lead(s) and sponsors. Where data objects do record their contributors, they would normally take precedence over that of the study, but organisational contributors, in particular study sponsors and funders, should still be added to the data object list.<br/> 
Any system retrieving creator / contributor data therefore needs to do so for the studies as well as for the data objects themselves.
+
Any system retrieving creator / contributor data therefore needs to collect that data for the studies as well as for the data objects themselves. Creator data is generally collected and stored in exactly the same way as contributor data – the contribution type is simply set as ‘creator’.
 
<br/><br/>
 
  
Line 321: Line 298:
 
None, one or more pieces of additional general information about the data object, so far as that is publicly available (journal abstracts, although an obvious ‘descriptor’, remain the property of the publisher and cannot in general be reproduced within the system). The item is composite, consisting of:
 
 
* description type (categorised, as selected from a predetermined list)
 
* label, a heading that might be applied to the text
+
* label, a heading that might be applied to the text (e.g. as a sub-heading).
 
* description text, the description itself
 
 
* language code, the 2 character ISO code
 
* contains html?, useful to know for display purposes
+
* a boolean indicating whether or not the description contains html, useful to know for display purposes
  
 
====E.7 EOSC Category (0..1)====
 
Line 338: Line 315:
 
<br/>
 
  
====E.9 Related Objects (0..n)====
+
====E.9 Inter-object relationships (0..n)====
 
Data objects can be related to each other – for example one object can be a supplement to another, or a new version of another, or be derived from, or the source of, one or more other data objects. 
 
A particularly important relationship for clinical study data is the pairing of ‘Has Metadata’ and  ‘Is Metadata for’. Metadata in clinical research can include, for example, a data dictionary that provides the metadata for a dataset. Note that the metadata in this context is itself a file, and a data object in its own right.  Each record is composite and must include:
 
* the relationship type (categorised, as selected from a predetermined list)
+
* the relationship type (categorised, as selected from a predetermined code-text list)
 
* the identifier of the other or ‘target’ data object (in a suitable system).
 
 
+
Because few data objects have DOIs, it is usually a requirement that both subject and target objects are stored within the same system. This allows the identifier to be an internal identifier within that system, making navigation to it much simpler.  
To keep things simpler within the MDR, the requirement is that any related resource must also be indexed within the MDR. This allows the identifier to be an internal identifier within the MDR system, making navigation to it much simpler.  
 
 
<br/>
 
  
 
====E.10 Topic (0...n)====
 
None, one or more topic names or phrases, keywords, or classification codes describing the object or aspects of it.  In the context of clinical research, most data objects will not have listed topics associated with them (the exception is journal articles, which almost always do have linked keywords). Data Object topics for non journal articles should be those of the parent study or studies. This introduces a substantial amount of redundancy but it means that topics can be searched and used for filtering across all (rather than just some) data objects.
+
None, one or more topic names or phrases, keywords, or classification codes describing the object or aspects of it.  In the context of clinical research, most data objects will not have listed topics associated with them (the exception is journal articles, which almost always do have linked keywords). Data Object topics for non journal articles should be those of the parent study or studies. This introduces a substantial amount of redundancy but it means that topics can be searched and used for filtering across all (rather than just some) data objects.<br/>
<br/>
+
The structure of each topic item is exactly the same as for study topics:  
The structure of the each topic item is exactly the same as for study topics:  
+
* topic type (categorised, as selected from a predetermined list of code-text pairs). Topic types include ‘condition’, ‘organism’, ‘chemical / biological agent’, and ‘geographic’.
* topic type (categorised, as selected from a predetermined list),
+
* A boolean indicating whether or not the term has been MESH coded,
* topic value, the keyword or topic name
+
* the MESH code, if present
* if applicable) topic vocabulary (categorised, as selected from a predetermined list),
+
* the topic name or value - either the original or if MESH coded the preferred MESH term
* (if applicable) topic code in the vocabulary system.
+
* a MESH qualifier code and qualifier value where one exists (applies only to PubMed articles at the moment), 
 +
* the original value.
 
<br/>
 
  
Line 361: Line 338:
 
<br/>
 
 
====F.1 Managing Organisation (1)====
 
In this schema, this is the organisation that manages access to the document or data object, including making the overall decision about access type (see F.2). For data this would usually be the name of the organisation that was the data controller. For journal papers it would be the name of the company that publishes the journal, and which would normally run the primary web site on which it can be accessed.
+
In this schema, this is the organisation that manages access to the document or data object, including making the overall decision about access type (see F.2). For data this would usually be the name of the organisation that was the data controller. For journal papers it would be the name of the company that publishes the journal, and which would normally run the primary web site on which it can be accessed. In both cases the name would be associated with an id in a suitable system.<br/>
<br/>
 
 
====F.2 Access Type (1)====
 
A categorised value that represents in broad terms the type of access under which the object is available, for example by publicly available download, or restricted download (restricted to members of a specific group) or on screen access after review on a case by case basis.
+
A categorised value (code-text pair) that represents in broad terms the type of access under which the object is available, for example by publicly available download, or restricted download (restricted to members of a specific group) or on screen access after review on a case by case basis.
 
<br/>
 
 
====F.3 Access Details (Mandatory for any of the non-public access types)====
 
A brief summary of the access being offered, for example identifying the groups to which access is granted, the criteria on which a case-by-case decision would be based, any further restrictions on on-screen access, etc. In practice often taken from the managing organisation's web site. It may also reference the access details URL or other web based resources.
+
This is a composite element with three elements:
<br/>
+
* A textual summary of the access being offered, for example identifying the groups to which access is granted, the criteria on which a case-by-case decision would be based, any further restrictions on on-screen access, etc. It may reference web based resources, on the object manager’s web site or elsewhere (see below).
 
+
* A link to a resource that explains how access may be gained, e.g. how a group can be joined, and / or how application can be made for access on an individual basis. This would normally be a link to a web page on the managing organisation’s site, that would explain access procedures or provide an application proforma.  
====F.4 Access Details URL (Mandatory for any of the non-public access types)====
+
* A date, if one is available, representing the last time the URL was checked to be in existence (i.e. returned a 200 ‘success’ code rather than a 404).
A link to a resource that explains how access may be gained, e.g. how a group can be joined, and / or how application can be made for access on an individual basis. This would normally be a link to a web page on the managing organisation’s site, that would explain access procedures or provide an application proforma.  
+
====F.4 Resources (Mandatory unless case-by-case access)====
<br/>
 
The item is composite, and should include a date representing the last time the URL was checked to be valid (i.e. returned a 200 ‘success’ code rather than a 404) - though this does not guarantee that the content of the web page is still appropriate.
 
<br/>
 
 
 
====F.5 Resources (Mandatory unless case-by-case access)====
 
 
The web based resources that represent this data object. Mandatory for public objects, when at least one resource should be listed. For data objects simply listed as existing, but under managed access, this information may not be available for harvesting.  Each record is composite and includes
 
 
* the name of the organisation holding the resource (e.g. a data repository, bibliographic system, trial registry)
 
Line 389: Line 360:
 
* resource comments, provides a free text field to hold further details of the resource, in particular to support machine processing. These could include the schema used for XML files, and / or the character coding used for text files (e.g. UTF-8 versus UTF-16) or the presence and types of any byte order marks.
 
  
====F.6 Rights (0..n)====
+
====F.5 Rights (0..n)====
Any intellectual property rights information for the data object, as a textual statement of the rights management associated with the resource.  The item is composite, and should include the URI for the specific rights management scheme as well as a textual description.
+
Any intellectual property rights information for the data object, as a textual statement of the rights management associated with the resource.  The item is composite, and should include:
 
+
* the name of the rights being applied
====F.7 Provenance Data (1)====
+
* a uri that identifies an information source, usually a url to a web page,
A string indicating the source or sources of the data and the date-times on which the data was last downloaded from the source or sources.
+
* any additional comments or description of the rights regime.
<br/><br/>
+
<br/>
  
 
== References ==
 

Latest revision as of 21:11, 27 October 2020


Summary tables


The Study schema

A. The Source Study

Mandatory:

  • A.1 Display Title {display title, language code }
  • A.8 Study Type
  • A.9 Study Status
  • A.14 Linked Data Objects * {object identifiers}
  • A.15 Provenance String

Recommended:

  • A.2 Study Identifiers * {identifier type , Identifier value, source organisation, date, url link}
  • A.4 Brief Description {description text, ?contains html}
  • A.6 Study Features * {feature type , feature value }
  • A.7 Study Topics * {topic type , mesh coded?, topic code, topic value, topic qualcode, topic qualvalue, original value}

Optional:

  • A.3 Study Titles * {title text, title type , language code , comments}
  • A.5 Data Sharing Statement {statement text, ?contains html}
  • A.10 Study Enrolment Number
  • A.11 Study Gender Eligibility
  • A.12 Min and Max Ages {age, age units}
  • A.13 Inter-study relationships * {relationship type , target study}

* May be repeated Categorised value


The Data Object schema

Mandatory Recommended Optional
B. Data Object Identifiers
B.1 DOI


B.2 Display Title

B.6 Linked Studies *
{study identifiers}

B.7 Provenance String

B.3 Version B.4 Object Identifiers *
{Identifier type , Identifier value, source organisation, application date}

B.5 Object Titles *
{title text, title type , language code , comments }


C. Creators and Contributors
C.1 Creators *
{name type, person details OR organisation }




person details = given name, family name, full name, ORCID identifier, affiliation
organisation = organisation default name and, if the organisation exists in the context database, the associated integer id
C.2 Contributors *
{contribution type , name type, person details OR organisation }

For most data objects contributors will be the contributors to the associated study or studies.

For journal articles contributors will be authors, plus organisational study contributors

D. Object Dates
D.1 Publication Year D.2 Dates *
{date type , Is range, date as string, start year, start month, start day, end year, end month, end day, comments}
E. Object Attributes and Descriptors
E.1 Class

E.2 Type
E.3 Record key type (datasets only)
{type , text description}

E.4 De-identification level (datasets only)
{type , specific actions, text description}

E.5 Associated consent (datasets only)
{type , specific restrictions, text description}

E.6 Description *
{description type , label, description text, language code , contains html?}

E.7 EOSC Category

E.8 Language *

E.9 Inter-object relationships *
{ relationship type , target object}

E.10 Topics (of data object) *
{topic type , mesh coded?, topic code, topic value, topic qualcode, topic qualvalue, original value}

For most data objects topics should be the study topics.
Journal articles will normally have their own listed topics

F. Object Location and Access Details
F.1 Managing Organisation

F.2 Access Type

F.3 Access Details
{description, url of details, date url last checked}

F.4 Physical Resources *
{repository organisation, resource url , url accessible?, date url last checked, resource type , resource size, size units, comments }






(F3 is mandatory if access is non-public)

F.5 Rights
{name, rights uri, comments}

* May be repeated Categorised value

Study Attributes

Strictly speaking these data points are not metadata because they do not describe data – instead they summarise some key attributes of the study, especially those that promote its discoverability.

A.1 Display Title (1)

This is, by default, the shorter or 'public' title. If there is no such title, the full scientific or protocol title needs to be used. Whichever title is used, it should also appear within the list of study titles (see A.3), where a fuller set of title attributes can be provided.

A.2 Study Identifiers (0...n)

None, one or more unique identifiers that have been assigned to the study. For studies entered into trial registries these should include, as a minimum, the registry ID(s), but any IDs that have been externally applied, and that might be useful in identifying the study, can be included, for instance funders' and / or sponsors' ids.
These IDs are composite. If provided, they must include

  • the identifier value,
  • the identifier type (categorised, as selected from a predetermined list of code-text pairs),
  • the assigning organisation (name and where available an Id within a suitable system).
  • (optionally), the date the identifier was assigned
  • (optionally), any associated URL (for instance some public funder Ids in the US will link to a summary page about the grant and its use).
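
The composite identifier described above could be modelled along the following lines. This is only a minimal sketch: the class and field names (`CodedValue`, `Organisation`, `StudyIdentifier`, `org_id` and so on), the type code, and the placeholder identifier value are assumptions made for illustration and are not taken from the schema itself.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class CodedValue:
    """A code-text pair taken from a predetermined list."""
    code: int
    text: str


@dataclass(frozen=True)
class Organisation:
    """An organisation name plus, where available, its Id in a suitable system."""
    name: str
    org_id: Optional[int] = None


@dataclass(frozen=True)
class StudyIdentifier:
    value: str                             # the identifier value (required)
    id_type: CodedValue                    # categorised identifier type (required)
    assigned_by: Organisation              # the assigning organisation (required)
    date_assigned: Optional[date] = None   # optional
    url: Optional[str] = None              # optional


# Hypothetical example record; the value and type code are placeholders.
registry_id = StudyIdentifier(
    value="NCT00000000",
    id_type=CodedValue(code=1, text="Trial registry ID"),
    assigned_by=Organisation(name="ClinicalTrials.gov"),
)
```

The three required parts have no defaults, so a record cannot be constructed without them, while the date and URL can simply be omitted.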

A.3 Study Titles (0..n)

Studies usually have a short or ‘public’ title as well as a full scientific one (as used on the protocol document), and can also be described by an acronym. They may have titles in more than one language.
All titles should be included in this list. The type is composite, and should include:

  • the title text,
  • the title type (categorised, as selected from a predetermined list of code-text pairs),
  • the language of the title, as a 2 character ISO code,
  • (optionally), any additional comments about their genesis (e.g. "authors' translation").
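
The rule in A.1 (use the public title by default, otherwise fall back to the scientific title) can be illustrated with a short sketch over a list of title records. The dictionary keys and the title type labels `"public title"` and `"scientific title"` are assumptions for this example; the actual categorised values come from the predetermined list.

```python
from typing import Optional

# Assumed title type labels, in order of preference for display.
PREFERRED_TYPES = ("public title", "scientific title")


def choose_display_title(titles: list) -> Optional[str]:
    """Return the public title if one exists, else the scientific title."""
    for wanted in PREFERRED_TYPES:
        for title in titles:
            if title["title_type"] == wanted:
                return title["title_text"]
    return None  # no usable title was found


example_titles = [
    {"title_type": "scientific title", "title_text": "A full protocol title", "lang": "en"},
    {"title_type": "public title", "title_text": "A short title", "lang": "en"},
]
display_title = choose_display_title(example_titles)
```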

A.4 Brief Description (0…1)

Most study registry systems require a brief, non-specialist description of the study, which usually ranges from a few lines to a paragraph or two. This can be useful in assessing the relevance of studies to a particular search task and so is included in the study data points.
There should also be an indication of whether the description contains embedded html, so that display systems can interpret any tags correctly, rather than display them as 'raw' text with visible angle brackets.

A.5 Data Sharing Statement (0..1)

In recent years several trial registries have requested study sponsors and / or leads to indicate if they will make individual participant data and related documents available for sharing, and if so how and when the data would be available. As such a statement is central to the purpose of the MDR it is captured within the study data, so that if present it can be displayed.
Again this data point should also include an indication of whether the data sharing statement contains embedded html, so that the tags can be interpreted correctly.

A.6 Study Features(0…n)

None, one or more design features of the study.
The design features available will depend on whether the study is interventional or observational. Available types for interventional studies include Phase, Primary Purpose, Allocation method, Intervention Design and Masking. For observational studies the types include Observational Model, Time Perspective, and whether or not specimens are retained.
In each case the possible values are categorised, and so restricted to a pre-defined set of values. This makes the feature types useful candidates for filtering of study records within a web portal and / or API.
The composite study feature record is therefore

  • feature type (categorised code-text, as selected from a predetermined list).
  • feature value, also categorised code-text. Each feature type has an associated list of options.
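
Because both parts of a feature record are categorised, filtering reduces to an exact match on a type and value pair. The sketch below assumes studies are held as dictionaries with a `features` list; that layout and the example values (`"Phase 2"`, `"Randomised"`, `"Cohort"`) are hypothetical, although the feature type names come from the text above.

```python
def has_feature(study: dict, feature_type: str, feature_value: str) -> bool:
    """True if the study carries the given feature type / value pair."""
    return any(
        f["feature_type"] == feature_type and f["feature_value"] == feature_value
        for f in study.get("features", [])
    )


def filter_studies(studies: list, feature_type: str, feature_value: str) -> list:
    """Keep only the studies that have the requested feature."""
    return [s for s in studies if has_feature(s, feature_type, feature_value)]


studies = [
    {"title": "Study 1", "features": [
        {"feature_type": "Phase", "feature_value": "Phase 2"},
        {"feature_type": "Allocation method", "feature_value": "Randomised"},
    ]},
    {"title": "Study 2", "features": [
        {"feature_type": "Observational Model", "feature_value": "Cohort"},
    ]},
]
phase_2_studies = filter_studies(studies, "Phase", "Phase 2")
```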

A.7 Study Topics (0…n)

None, one or more topic names or phrases, keywords, or classification codes describing the study or aspects of it. Topics is preferred to ‘Subjects’ because within clinical research ‘Study subjects’ is normally understood as referring to the study participants.
In the context of clinical research, most data objects – which will not have listed topics associated with them – would, for purposes of discoverability, take on the topics or keywords associated with their parent study. (The exception is journal articles, which almost always do have linked keywords).
The topics can be free text, but in many cases the text is structured, i.e. selected from a controlled vocabulary. The vocabulary that is used the most, by a large margin, is the MESH code system developed by the US National Library of Medicine. This is because MESH codes are applied to both PubMed records and ClinicalTrials.gov trial registry entries. MedDRA and ICD10 are also used by some sources but in relatively tiny amounts. To provide a more consistent coding scheme for topics, non-coded terms are also matched, wherever possible, to MESH terms. Further work is required, however, to reduce the proportion of non-coded items.

The study topic record is composite and has the following structure:

  • topic type (categorised, as selected from a predetermined list of code-text pairs). Topic types include ‘condition’, ‘organism’, ‘chemical / biological agent’, and ‘geographic’.
  • A boolean indicating whether or not the term has been MESH coded,
  • the MESH code, if present
  • the topic name or value - either the original or if MESH coded the preferred MESH term
  • a MESH qualifier code and qualifier value where one exists (applies only to PubMed articles at the moment),
  • the original value.
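
The way the topic value depends on MESH coding can be sketched as follows. The class and field names are assumptions for illustration, as is the separate `mesh_term` field used here to hold the preferred term; the MESH code shown is a placeholder rather than a real code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Topic:
    topic_type: str                    # e.g. 'condition', 'organism'
    original_value: str                # the term as supplied by the source
    mesh_coded: bool = False           # whether the term was matched to MESH
    mesh_code: Optional[str] = None    # present only if mesh_coded
    mesh_term: Optional[str] = None    # preferred MESH term (assumed field)
    qual_code: Optional[str] = None    # MESH qualifier code, where one exists
    qual_value: Optional[str] = None   # MESH qualifier value, where one exists

    @property
    def topic_value(self) -> str:
        """Preferred MESH term if coded, otherwise the original value."""
        if self.mesh_coded and self.mesh_term:
            return self.mesh_term
        return self.original_value


uncoded = Topic(topic_type="condition", original_value="type II diabetes")
coded = Topic(
    topic_type="condition",
    original_value="type II diabetes",
    mesh_coded=True,
    mesh_code="D000000",               # placeholder code
    mesh_term="Diabetes Mellitus, Type 2",
)
```

Keeping the original value alongside the derived topic value means the source wording is never lost when a term is matched to MESH.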

A.8 Study Type (1)

This is a single term representing – in very broad terms – the type of clinical research study, e.g. ‘interventional’ (= clinical trial), ‘observational’, ‘expanded access’. It is categorised and must be selected from a predefined list. It is included as an aid to filtering records.

A.9 Study Status (1)

This is a single term representing the current status of the study in terms of its life-cycle, e.g. ‘not yet recruiting’, ‘recruiting’, ‘completed’, ‘terminated (early)’. It is categorised and must be selected from a predefined list. It is included as an aid to filtering records.

A.10 Study Enrolment Number (0..1)

This is an integer representing the anticipated or actual number of study participants.

A.11 Study Gender Eligibility (0..1)

This is a code / text pair that indicates whether the study is only open to male or female participants, or both.

A.12 Study Minimum and Maximum ages (0..1)

These are integers representing the minimum and maximum age criteria for study participants, where they exist. In each case they are associated with a term indicating the time units associated with the integer. This is usually 'Years', but, for example for paediatric studies, may be months or weeks, or even days or hours.
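
Since the age limits are stored as an integer plus a unit, comparing them requires conversion to a common unit. The sketch below converts everything to hours; the unit labels and the approximate conversion factors for months and years (a 365.25 day year) are assumptions, and the function names are hypothetical.

```python
from typing import Optional, Tuple

# Approximate hours per unit; months and years assume a 365.25 day year.
HOURS_PER_UNIT = {
    "Hours": 1.0,
    "Days": 24.0,
    "Weeks": 168.0,
    "Months": 730.5,
    "Years": 8766.0,
}


def age_in_hours(value: int, units: str) -> float:
    """Convert an (integer, unit) age into hours for comparison."""
    return value * HOURS_PER_UNIT[units]


def within_age_range(
    age: Tuple[int, str],
    min_age: Optional[Tuple[int, str]],
    max_age: Optional[Tuple[int, str]],
) -> bool:
    """Check an age against optional minimum and maximum limits."""
    hours = age_in_hours(*age)
    if min_age is not None and hours < age_in_hours(*min_age):
        return False
    if max_age is not None and hours > age_in_hours(*max_age):
        return False
    return True
```

A missing limit is passed as `None`, matching the (0..1) cardinality of the element.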

A.13 Inter-study relationships (0..n)

Studies can have relationships between themselves, for instance one study can be a feasibility study for a later one, or a study can represent an ‘expanded access’ version of a clinical trial (when a new drug is available for compassionate reasons, even though recipients fail eligibility criteria for the study, and its use is reported on a case by case basis), or one study can represent a continuation of another, in an ongoing series. This data can be useful for tracking related studies and their data objects and so is included in the metadata scheme. It is composite, with

  • the relationship type (categorised, as selected from a predetermined code-text list)
  • the identifier of the other or ‘target’ study (within a suitable system, normally the same system in which the ‘subject’ study is found).

A.14 Linked Data Objects (1..n)

The linked data objects (there should be at least one, representing the entry in a trial registry system) are listed as object identifiers, usually accession Ids within an appropriate database system (e.g. the ECRIN MDR).

A.15 Provenance String (1)

A string indicating the source or sources of the data (usually a trial registry) and the date-times on which the data was last downloaded from the source or sources.

Data Object identifiers

B.1 Data object identifier (0..1)

In line with the DataCite specification the principal identifier for data objects is seen as a Digital Object identifier or DOI, providing a persistent identifier that can be cited in other contexts. This applies to any objects that are available to others (whether publicly or under managed access).
Unfortunately the great majority of clinical research data objects, apart from journal articles, do not currently have a DOI. If this situation does not improve, consideration may need to be given to a mechanism for minting and applying DOIs – if financially feasible and acceptable to the object creators – or alternative identifiers should be explored, especially if a resolvable URL exists that could be used to link directly to the resource.

B.2 Display Title (1)

A title for the object. For a journal article it would be a citation of the article in a standard format (up to 3 authors, title, source journal information). For many other data objects the display title would need to be constructed from the study name followed by the object title or type, because in general such objects do not have unique names. In many situations the study name prefix could be dropped as it would be clear from the context (e.g. the study name would be a heading to the list of data objects). The study name and object type or name should therefore be separated by a clear indicator (‘ :: ’ is used within the MDR) so that if and when necessary the two parts of a composite title can be displayed separately.
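The composite title convention described above can be sketched as a pair of helper functions. Only the ‘ :: ’ separator comes from the text; the function names are hypothetical, and the sketch assumes that the study name itself does not contain the separator.

```python
from typing import Optional, Tuple

SEPARATOR = " :: "  # the indicator used within the MDR


def make_display_title(study_name: str, object_name: str) -> str:
    """Build a composite display title from the study name and object name or type."""
    return f"{study_name}{SEPARATOR}{object_name}"


def split_display_title(display_title: str) -> Tuple[Optional[str], str]:
    """Split a display title into (study name, object name).

    Titles without the separator (e.g. journal article citations)
    are returned whole, with None as the study name.
    """
    study, sep, obj = display_title.partition(SEPARATOR)
    if not sep:
        return None, display_title
    return study, obj
```

A listing page that already shows the study name as a heading could then display only the second element returned by `split_display_title`.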

B.3 Version (0..1)

The version of the data object, in whatever notation was used by the original data object creators. Many versions of a particular dataset or document may have been created in the course of a clinical study but the normal expectation would be that the final version of a data object (e.g. a protocol) would be the one that was shared with others.
Although it is relatively rare for more than one version of a data object to be made available, if that is the case they should be clearly differentiated using version codes (and relevant dates – see D.2 – and possibly descriptions – see E.6). E.9 describes how the relationship to previous or next versions can be made explicit. If a version item exists it should be displayed with the name and other identifiers.

B.4 Object Identifiers (0...n)

This refers to other unique identifiers that have been assigned to the data object in addition to its DOI primary identifier (for instance, for journal articles, a PubMed id). As with studies such IDs would be composite and include:

  • the identifier value,
  • the identifier type (categorised, as selected from a predetermined list of code-text pairs),
  • the assigning organisation (name and where available an Id within a suitable system).
  • (optionally), the date the identifier was assigned

B.5. Object Titles (0...n)

The complete data for the title(s) of the data object. In most cases there will only be one (the constructed display title), but journal papers may have titles in different languages, and in any case their titles will be different from the display title (which is a full citation). The title description is composite, and should include

  • the title text,
  • the title type (categorised, as selected from a predetermined list),
  • the language of the title, as a 2 character ISO code,
  • (optionally), any additional comments about the title's genesis (e.g. "authors' translation")

B.6. Linked Studies (1...n)

The linked studies (there should be at least one, or the data object should not be included in the system) are listed as study identifiers, usually accession Ids within an appropriate database system.

B.7. Provenance String (1)

A string indicating the source or sources of the data and the date-times on which the data was last downloaded from the source or sources.

Creators and Contributors

C.1 Creators (1...n)

The main personnel involved in producing the data, or the authors of a publication. It may be a set of institutional and / or personal names. Each creator description, which is composite, therefore needs to indicate whether or not it refers to an individual or an organisation (or collaboration). If it is a person then fields are available for names, ORCID and affiliation details. If not the organisation name needs to be provided, with an Id if one is available. The composite structure (which follows closely that in DataCite) is therefore

  • whether an individual or not,

  and if they are…

  • given name,
  • family name,
  • full name,
  • ORCID id if available
  • affiliation, (string description of department / organisation)

  but if they are not….

  • The organisation (name and where available an Id within a suitable system).

Most data objects where the metadata is harvested retrospectively, other than journal articles, are unlikely to have creators explicitly identified.

C.2 Contributors (0...n)

From DataCite, contributors are “other institutions and / or persons responsible for collecting, managing, distributing, or otherwise contributing to the development of the data object.” A contributor record is composite and is essentially the same as that for creators, except that each needs to be prefixed with an indicator of

  • contributor type (categorised, as selected from a predetermined list of code-text pairs).

The types available include those defined within DataCite, but the list has been extended in the context of clinical research, to include (for example) trial sponsor, trial funder, device provider, central laboratory, public contact, study lead, (site) principal investigator.
In general the contributor lists for data objects should be derived from the lists stored for their parent study, even though in the system contributors are only presented for data objects, not studies. These will usually include the study lead(s) and sponsors. Where data objects do record their own contributors, these would normally take precedence over those of the study, but organisational contributors, in particular study sponsors and funders, should still be added to the data object list.
Any system retrieving creator / contributor data therefore needs to collect that data for the studies as well as for the data objects themselves. Creator data is generally collected and stored in exactly the same way as contributor data – the contribution type is simply set as ‘creator’.
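The precedence rule described above can be sketched as follows. This is a minimal illustration, not the MDR's actual implementation: the function name and the dictionary keys (`contrib_type`, `is_individual`, `name`) are assumptions made for the example.

```python
from typing import Dict, List

Contributor = Dict[str, object]


def merge_contributors(object_contribs: List[Contributor],
                       study_contribs: List[Contributor]) -> List[Contributor]:
    """Combine a data object's contributors with those of its parent study.

    - If the object records no contributors, the study's list is used.
    - Otherwise the object's own list takes precedence, but organisational
      study contributors (e.g. sponsors, funders) are still appended.
    """
    if not object_contribs:
        return list(study_contribs)

    merged = list(object_contribs)
    seen = {(c["contrib_type"], c["name"]) for c in merged}
    for c in study_contribs:
        key = (c["contrib_type"], c["name"])
        if not c["is_individual"] and key not in seen:
            merged.append(c)
            seen.add(key)
    return merged
```

Individuals listed only against the study are deliberately left out when the object has its own contributor list, which is one reading of the rule; a system could equally choose to keep them.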

Dates

D.1 Publication year (1)

The year in which the object is made available, i.e. in which it first becomes citable, expressed as 4 digits. This is not necessarily the year in which the object becomes public: ‘available’ simply means that it can be accessed, while the conditions of that access remain in the control of the object’s owners or controllers. Nor is it necessarily the year in which the object was created (which may be present as one of the object’s dates).

D.2 Dates (0...n)

None, one or more dates or date ranges that are relevant to the data object. It is composite and includes both string and integer representations of the date. Year, month and day data is held separately to make it easier to apply date filters when finding data objects. The elements of the composite record are:

  • date type (categorised, as selected from a predetermined list),
  • is range, whether or not it is a single date or a range,
  • date as string, in a standard format yyyy MMM dd, e.g. “2018 Dec 12”, “2012 Mar 7”
  • start year, an integer
  • start month, an integer – may not be present for partial dates
  • start day, an integer – may not be present for partial dates
  • end year, for date ranges only
  • end month, for date ranges only, may not be present if date range partial
  • end day, for date ranges only, may not be present if date range partial
  • comments – any relevant / explanatory comments
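The composite date record listed above could be assembled as in the sketch below. The ‘yyyy MMM dd’ format for single dates follows the examples in the text; the function names, the dictionary keys, and the use of ‘ to ’ to join the two ends of a range are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

_MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

PartialDate = Tuple[int, Optional[int], Optional[int]]  # (year, month, day)


def format_partial_date(year: int, month: Optional[int] = None,
                        day: Optional[int] = None) -> str:
    """Format a full or partial date as 'yyyy MMM dd', e.g. '2012 Mar 7'."""
    parts = [str(year)]
    if month is not None:
        parts.append(_MONTHS[month - 1])
        if day is not None:
            parts.append(str(day))  # day is not zero-padded
    return " ".join(parts)


def make_date_record(date_type: str, start: PartialDate,
                     end: Optional[PartialDate] = None,
                     comments: Optional[str] = None) -> Dict[str, object]:
    """Build a date record holding both string and integer representations."""
    is_range = end is not None
    date_as_string = format_partial_date(*start)
    if is_range:
        # The ' to ' joiner for ranges is an assumption, not from the schema.
        date_as_string += " to " + format_partial_date(*end)
    return {
        "date_type": date_type,
        "is_range": is_range,
        "date_as_string": date_as_string,
        "start_year": start[0], "start_month": start[1], "start_day": start[2],
        "end_year": end[0] if is_range else None,
        "end_month": end[1] if is_range else None,
        "end_day": end[2] if is_range else None,
        "comments": comments,
    }
```

Holding the year, month and day as separate integers means that a filter such as "objects made available in 2018" can be applied without parsing the string form.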



Data Object Attributes

Section E is mainly based on the DataCite metadata specification, though a few extensions (E.3 – E.5) have been added for datasets (as opposed to document based data objects).

E.1 Class (1)

A categorised value, taken from the existing DataCite controlled list for ‘Resource Type General’. For clinical research data objects, the class will usually be one of:

  • Text
  • Dataset

though other options include: Data Paper, Software, Service, Audiovisual, and Interactive Resource.

E.2 Type (1)

A categorised description of the type of data object, at a more specific level than Class. The type and class should form a pair (as with DataCite), e.g. Dataset/census data or Text/conference abstract.
Unlike DataCite, both class and type are mandatory in the ECRIN schema. The types available include the CASRAI classifications of document objects, recommended by DataCite, together with additions to the list that represent object types of particular importance to clinical research (e.g. protocols, clinical study reports, statistical analysis plans, and datasets of various kinds).

E.3 Record key type (1, Datasets only)

This is a composite item that indicates the type of record keys used within the dataset, and in particular whether the dataset is pseudonymised or anonymised. The contents are

  • Record key type (categorised, as selected from a predetermined list)
  • Details – text description to elaborate / clarify details

Note that the categorisation into 'pseudonymised', 'anonymised', 'identifiable' etc. is based upon the description provided by the data controller / manager in the data (if one is supplied). The classification is therefore based on the data controller's understanding of the relevant terms. No attempt is made to apply a categorisation using standard criteria, as the meaning of the words used ('pseudonymised', 'anonymised', etc.) may vary between different legal jurisdictions, over time, and in different usage contexts. The categorisation should therefore be read as only a very approximate guide to any legal requirements associated with the data.

E.4 De-identification level (1, Datasets only)

An item that indicates the amount of de-identification that has been applied to the dataset. The item consists of:

  • De-identification level (categorised, as selected from a predetermined list)
  • Additional actions carried out - boolean data indicating if any of the following applies: a) direct identifiers have been removed, b) US HIPAA rules for de-identification have been applied, c) dates have been rebased or replaced with integers, d) narrative text fields have been removed, and e) k-anonymisation has been carried out.
  • Details – text description to elaborate / clarify details.

E.5 Associated consent (1, Datasets only)

The consent in question is for secondary use of the data - consent for primary use is assumed.
The data item consists of:

  • a coded field that indicates the range of application of consent (if any) available for re-use and sharing associated with the data, selected from a list.
  • Possible additional restrictions, represented as a series of boolean data points: a) if use is limited to non-commercial research, b) if there are any geographical restrictions on re-use, c) if only certain types of research are permitted, d) if only genetic research is allowed, and e) whether or not methodological or tool research (e.g. developing machine learning algorithms) is allowed.
  • Details – text description to elaborate / clarify details, in particular to expand upon any of the additional restrictions listed as being present.

E.6 Description (0..n)

None, one or more pieces of additional general information about the data object, so far as that is publicly available (journal abstracts, although an obvious ‘descriptor’, remain the property of the publisher and cannot in general be reproduced within the system). The item is composite, consisting of:

  • description type (categorised, as selected from a predetermined list)
  • label, a heading that might be applied to the text (e.g. as a sub-heading).
  • description text, the description itself
  • language code, the 2 character ISO code
  • a boolean indicating whether or not the description contains html, useful to know for display purposes

E.7 EOSC Category (0..1)

An integer (0, 1, 2 or 3) that conforms to an EOSC categorisation recommended for data objects. The classification is

  • 0 = Non-personal data. Contains no information that refers to any identified or identifiable living individual.
  • 1 = Anonymised data.
  • 2 = Pseudonymised data.
  • 3 = Sensitive pseudonymised data.

In general, almost all documents expected in the MDR will be categorised as 0, whilst all IPD datasets will be categorised as 3 - unless there is general agreement that they are fully anonymised, in which case they become 1.
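The four categories and the default assignment described above can be expressed compactly, as in the sketch below. The enumeration member names and the function `default_eosc_category` are hypothetical; the rule it encodes is only the general expectation stated in the text, and individual objects may still need to be categorised by hand.

```python
from enum import IntEnum


class EoscCategory(IntEnum):
    NON_PERSONAL = 0
    ANONYMISED = 1
    PSEUDONYMISED = 2
    SENSITIVE_PSEUDONYMISED = 3


def default_eosc_category(is_ipd_dataset: bool,
                          agreed_anonymised: bool = False) -> EoscCategory:
    """Suggest a default category for a data object.

    Documents default to 0; IPD datasets default to 3, or to 1 where
    there is general agreement that they are fully anonymised.
    """
    if not is_ipd_dataset:
        return EoscCategory.NON_PERSONAL
    if agreed_anonymised:
        return EoscCategory.ANONYMISED
    return EoscCategory.SENSITIVE_PSEUDONYMISED
```

Because `IntEnum` members compare equal to their integer values, the result can be stored directly as the integer the schema expects.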

E.8 Language (1..n)

The language or languages of the data object itself (not of a description of the object), using the ISO language codes (e.g. en, de, fr). DataCite assumes a single language but some clinical research data objects (e.g. journal articles) are created in two or more languages. A data object may therefore have more than one language record.

E.9 Inter-object relationships (0..n)

Data objects can be related to each other – for example one object can be a supplement to another, or a new version of another, or be derived from, or the source of, one or more other data objects. A particularly important relationship for clinical study data is the pairing of ‘Has Metadata’ and ‘Is Metadata for’. Metadata in clinical research can include, for example, a data dictionary that provides the metadata for a dataset. Note that the metadata in this context is itself a file, and a data object in its own right. Each record is composite and must include:

  • the relationship type (categorised, as selected from a predetermined code-text list)
  • the identifier of the other or ‘target’ data object (in a suitable system).

Because few data objects have DOIs, it is usually a requirement that both subject and target objects are stored within the same system. This allows the identifier to be an internal identifier within that system, making navigation to it much simpler.

E.10 Topic (0...n)

None, one or more topic names or phrases, keywords, or classification codes describing the object or aspects of it. In the context of clinical research, most data objects will not have listed topics associated with them (the exception is journal articles, which almost always do have linked keywords). Topics for data objects other than journal articles should be those of the parent study or studies. This introduces a substantial amount of redundancy but it means that topics can be searched and used for filtering across all (rather than just some) data objects.
The structure of each topic item is exactly the same as for study topics:

  • topic type (categorised, as selected from a predetermined list of code-text pairs). Topic types include ‘condition’, ‘organism’, ‘chemical / biological agent’, and ‘geographic’.
  • a boolean indicating whether or not the term has been MeSH coded,
  • the MeSH code, if present
  • the topic name or value - either the original or, if MeSH coded, the preferred MeSH term
  • a MeSH qualifier code and qualifier value where one exists (applies only to PubMed articles at the moment),
  • the original value.
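The rule that data objects other than journal articles take their topics from the parent study or studies can be sketched as below. The function name and dictionary keys are hypothetical, and the removal of duplicates where several parent studies share a topic is an assumption made for the example rather than a stated requirement.

```python
from typing import Dict, List, Optional

Topic = Dict[str, Optional[str]]


def topics_for_object(is_journal_article: bool,
                      own_topics: List[Topic],
                      parent_studies_topics: List[List[Topic]]) -> List[Topic]:
    """Return the topic list to store against a data object.

    Journal articles keep their own topics; all other objects take the
    topics of their parent study or studies.
    """
    if is_journal_article:
        return list(own_topics)

    merged: List[Topic] = []
    seen = set()
    for study_topics in parent_studies_topics:
        for topic in study_topics:
            # Prefer the code as the key; fall back to the topic text.
            key = (topic["topic_type"],
                   topic.get("mesh_code") or (topic["topic_value"] or "").lower())
            if key not in seen:
                seen.add(key)
                merged.append(topic)
    return merged
```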


Location and Access details

An area where the existing DataCite schema needs to be extended is in providing a full description of the access arrangements for any data object. The following data points are proposed.

F.1 Managing Organisation (1)

In this schema, this is the organisation that manages access to the document or data object, including making the overall decision about access type (see F.2). For data this would usually be the name of the organisation that was the data controller. For journal papers it would be the name of the company that publishes the journal, and which would normally run the primary web site on which it can be accessed. In both cases the name would be associated with an id in a suitable system.

F.2 Access Type (1)

A categorised value (code-text pair) that represents in broad terms the type of access under which the object is available, for example by publicly available download, or restricted download (restricted to members of a specific group) or on screen access after review on a case by case basis.

F.3 Access Details (Mandatory for any of the non-public access types)

This is a composite element with three components:

  • A textual summary of the access being offered, for example identifying the groups to which access is granted, the criteria on which a case-by-case decision would be based, any further restrictions on on-screen access, etc. It may reference web based resources, on the object manager’s web site or elsewhere (see below).
  • A link to a resource that explains how access may be gained, e.g. how a group can be joined, and / or how application can be made for access on an individual basis. This would normally be a link to a web page on the managing organisation’s site, that would explain access procedures or provide an application proforma.
  • A date, if one is available, representing the last time the URL was checked to be in existence (i.e. returned a 200 ‘success’ code rather than a 404).

F.4 Resources (Mandatory unless case-by-case access)

The web based resources that represent this data object. Mandatory for public objects, when at least one resource should be listed. For data objects simply listed as existing, but under managed access, this information may not be available for harvesting. Each record is composite and includes

  • the name of the organisation holding the resource (e.g. a data repository, bibliographic system, trial registry)
  • the resource type (categorised, for downloadable resources normally based on the file extension)
  • the resource URL
  • whether or not the resource is directly accessible (i.e. is public and not behind a pay wall) - so far as is known
  • the date the URL was last checked as valid

and, if downloadable,

  • the resource size,
  • the resource size units, usually in KB, MB or GB.

In addition...

  • resource comments, provides a free text field to hold further details of the resource, in particular to support machine processing. These could include the schema used for XML files, and / or the character coding used for text files (e.g. UTF-8 versus UTF-16) or the presence and types of any byte order marks.
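The conditional requirements stated in the headings of F.3 and F.4 can be checked with a small validation routine, sketched below. The access type labels used here are placeholders, since the actual categorised values are not listed in this section, and the function name is hypothetical.

```python
from typing import Dict, List, Optional

# Placeholder labels: the real access type terms come from a
# predefined code-text list that is not reproduced here.
PUBLIC = "public download"
CASE_BY_CASE = "case-by-case review"


def access_metadata_problems(access_type: str,
                             access_details: Optional[Dict[str, str]],
                             resources: List[Dict[str, str]]) -> List[str]:
    """Return a list of problems found in an object's access metadata."""
    problems = []
    # F.3: access details are mandatory for any non-public access type.
    if access_type != PUBLIC and not access_details:
        problems.append("access details are required for non-public access types")
    # F.4: at least one resource is mandatory unless access is case-by-case.
    if access_type != CASE_BY_CASE and not resources:
        problems.append("at least one resource is required unless access is case-by-case")
    return problems
```

An empty list indicates that both rules are satisfied. Returning messages rather than raising an error allows a harvesting process to log incomplete records and continue.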

F.5 Rights (0..n)

Any intellectual property rights information for the data object, as a textual statement of the rights management associated with the resource. The item is composite, and should include:

  • the name of the rights being applied
  • a URI that identifies an information source, usually a URL to a web page,
  • any additional comments or description of the rights regime.


References