Difference between revisions of "Schema Description"

From ECRIN-MDR Wiki
(Object Dates)
(Study Attributes)
 
(30 intermediate revisions by the same user not shown)
Line 1: Line 1:
== Summary tables ==
+
<p style="color:blue; text-align:right"><small>'''''Last updated: 20/09/2022'''''</small></p>
 +
The following provides a more detailed description of each of the data points in the two schemas, including the components of composite data points. It does not, however, provide a full description of a ''practical implementation'' of the schemas, e.g. when storing schema data in a database or within JSON files. An implementation level description, which requires record ids and audit fields, is given by the two JSON file definitions, and a discussion of the implementation of the schema in the MDR database is provided in the Data Extraction section of the wiki.<br/>
 
<br/>
 
<br/>
'''The Study schema'''
 
{| class="wikitable" style="width: 85%;"
 
|-
 
! style="width: 33%;" | Mandatory !! style="width: 33%;" | Recommended !! style="width: 33%;" | Optional
 
 
|- style="vertical-align:top; background-color:lightblue"
 
| style="color:darkblue;font-weight:bold" colspan="3"| A. The Source Study
 
|- style="vertical-align:top; background-color:white"
 
 
| '''A.1 Display Title'''<span style="color:red;font-size:150%;font-weight:bold"> </span><br/>''{display title, language code<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> }''<br/><br/><br/><br/><br/><br/>'''A.8 Study Type'''<span style="color:blue;font-size:150%;font-weight:bold"> &#10625;</span><br/><br/>'''A.9 Study Status'''<span style="color:blue;font-size:150%;font-weight:bold"> &#10625;</span> '''<br/><br/><br/><br/><br/>A.14 Linked Data Objects'''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/> ''{object identifiers}''<br/> '''<br/>A.15 Provenance String'''<br/> 
 
 
|| '''A.2 Study Identifiers''' <span style="color:red;font-size:150%;font-weight:bold"> *</span><br/>''{identifier type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , Identifier value, source organisation, date, url link}''<br/><br/>
 
'''A.4 Brief Description'''<br/>
 
''{description text, ?contains html}''<br/><br/>
 
'''A.6 Study Features'''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/>''{feature type <span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , feature value <span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span>}''<br/><br/>
 
'''A.7 Study Topics'''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/>''{topic type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , mesh coded?, topic code, topic value, topic qualcode, topic qualvalue, original value}''<br/><br/>
 
 
|| '''A.3 Study Titles'''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/>''{title text, title type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , language code<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , comments}''
 
<br/>
 
'''A.5 Data Sharing Statement'''<br/>
 
''{statement text, ?contains html}''<br/><br/>
 
'''A.10 Study Enrolment Number'''
 
<br/><br/>
 
'''A.11 Study Gender Eligibility'''
 
<br/><br/>
 
'''A.12 Min and Max Ages'''<br/>
 
''{age, age units<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span>}''
 
<br/><br/>
 
'''A.13 Inter-study relationships'''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/>
 
''{relationship type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , target study}''
 
|}
 
<span style="color:red;font-size:150%;font-weight:bold"> *</span> May be repeated  <span style="color:blue;font-size:150%;font-weight:bold"> &#10625;</span> Categorised value
 
<br/><br/>
 
<br/>
 
'''The Data Object schema'''
 
 
{| class="wikitable" style="width: 85%;"
 
|-
 
! style="width: 33%;" | Mandatory !! style="width: 33%;" | Recommended !! style="width: 33%;" | Optional
 
 
|- style="vertical-align:top; background-color:lightblue"
 
| style="color:darkblue;font-weight:bold" colspan="3"| B. Data Object Identifiers 
 
|- style="vertical-align:top; background-color:white"
 
| '''B.1 DOI'''<span style="font-size:150%"> </span><br/><br/><br/>'''B.2 Display Title'''<br/><br/>'''B.6 Linked Studies'''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/>''{study identifiers}''<br/><br/>'''B.7 Provenance String'''<br/><br/>
 
|| '''B.3 Version'''<span style="font-size:150%"> </span>
 
|| '''B.4 Object Identifiers'''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/>''{Identifier type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , Identifier value, source organisation, application date}''<br/> <br/>
 
'''B.5 Object Titles'''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/>''{title text, title type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , language code<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , comments }''
 
 
 
|- style="vertical-align:top; background-color:lightblue"
 
| style="color:darkblue;font-weight:bold" colspan="3"| C. Creators and Contributors 
 
|- style="vertical-align:top; background-color:white"
 
| '''C.1 Creators'''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/>''{name type, person details OR organisation }''<br/> <br/>
 
|| <span style="font-size:150%"> </span> <br/><br/><br/>''person details = given name, family name, full name, ORCID identifier, affiliation <br/>organisation = organisation default name and, if the organisation exists in the context database, the associated  integer id''
 
||  '''C.2 Contributors'''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/>''{contribution type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , name type, person details OR organisation }''<br/><br/>For most data objects contributors will be the contributors to the associated study or studies.<br/>
 
For journal articles contributors will be authors, plus organisational study contributors
 
 
|- style="vertical-align:top; background-color:lightblue"
 
| style="color:darkblue;font-weight:bold" colspan="3"| D. Object Dates 
 
|- style="vertical-align:top; background-color:white"
 
| '''D.1 Publication Year'''<span style="font-size:150%"> </span>  ||  ||  '''D.2 Dates'''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/>''{date type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , Is range, date as string, start year, start month, start day, end year, end month, end day, comments}''
 
 
|- style="vertical-align:top; background-color:lightblue"
 
| style="color:darkblue;font-weight:bold" colspan="3"| E. Object Attributes and Descriptors 
 
|- style="vertical-align:top; background-color:white"
 
|'''E.1 Class'''<span style="color:blue;font-size:150%;font-weight:bold"> &#10625;</span><br/><br/>'''E.2 Type'''<span style="color:blue;font-size:150%;font-weight:bold"> &#10625;</span><br/>
 
|| '''E.3 Record key type''' (datasets only)<span style="font-size:150%"> </span> <br/>''{type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , text description}''<br/><br/>'''E.4 De-identification level''' (datasets only)<br/>''{type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , specific actions, text description}''<br/><br/>
 
'''E.5 Associated consent''' (datasets only)<br/>''{type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , specific restrictions, text description}''<br/><br/>'''E.6 Description'''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/>''{description type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , label, description text, language code<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , contains html?}''<br/><br/>  '''E.7 EOSC Category'''<span style="color:blue;font-size:150%;font-weight:bold"> &#10625;</span><br/><br/>
 
'''E.8 Language'''<span style="color:blue;font-size:150%;font-weight:bold"> &#10625;</span><span style="color:red;font-size:150%;font-weight:bold"> *</span><br/><br/>'''E.9 Inter-object relationships'''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/>''{ relationship type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , target object}''<br/><br/>
 
|| '''E.10 Topics''' (of data object)<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/>''{topic type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , mesh coded?, topic code, topic value, topic qualcode, topic qualvalue, original value}''<br/><br/>
 
For most data objects topics should be the study topics.<br/>Journal articles will normally have their own listed topics
 
 
|- style="vertical-align:top; background-color:lightblue"
 
| style="color:darkblue;font-weight:bold" colspan="3"| F. Object Location and Access Details 
 
|- style="vertical-align:top; background-color:white"
 
| '''F.1 Managing Organisation'''<span style="font-size:150%"> </span> <br/><br/>
 
'''F.2 Access Type'''<span style="color:blue;font-size:150%;font-weight:bold"> &#10625;</span><br/><br/>
 
'''F.3 Access Details'''<br/>''{description, url of details, date url last checked}''<br/><br/>
 
'''F.4 Physical Resources'''<span style="color:red;font-size:150%;font-weight:bold"> *</span><br/>''{repository organisation, resource url , url accessible?, date url last checked, resource type<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span> , resource size, size units<span style="color:blue;font-style:normal;font-weight:bold"> &#10625;</span>, comments }''<br/>
 
||<br/><br/><br/><br/><br/>(F.3 is mandatory if access is non-public)<br/><br/>  ||  '''F.5 Rights'''<br/>''{name, rights uri, comments}''
 
|}
 
 
<span style="color:red;font-size:150%;font-weight:bold"> *</span> May be repeated  <span style="color:blue;font-size:150%;font-weight:bold"> &#10625;</span> Categorised value
 
<br/><br/>
 
 
 
== Study Attributes ==
 
Strictly speaking these data points are not metadata because they do not describe data – instead they summarise some key attributes of the study, especially those that promote its discoverability.   
 
Line 98: Line 14:
 
* the identifier value,
 
* the identifier type (categorised, as selected from a predetermined list of code-text pairs),
 
* the assigning organisation (name and where available an Id within a suitable system).
+
* the assigning organisation (name and where available the organisation's Id(s) - e.g. ROR Id, ECRIN MDR Id).
 
* (optionally), the date the identifier was assigned  
 
* (optionally), any associated URL (for instance some public funder Ids in the US will link to a summary page about the grant and its use).
 
Line 113: Line 29:
 
'''A.4 Brief Description (0..1)'''<br/>
 
Most study registry systems require a brief, non-specialist description of the study, usually ranging from a few lines to a paragraph or two. This can be useful in assessing the relevance of studies to a particular search task and so is included in the study data points.
<br/>
 
There should also be an indication of whether the description contains embedded html, so that display systems can interpret any tags correctly, rather than display them as 'raw' text with visible angle brackets.
 
 
<br/>
 
<br/>
 
<br/>
 
<br/>
 
'''A.5 Data Sharing Statement (0..1)'''<br/>
 
In recent years several trial registries have requested study sponsors and / or leads to indicate if they will make individual participant data and related documents available for sharing, and if so how and when the data would be available. As such a statement is central to the purpose of the MDR it is captured within the study data, so that if present it can be displayed.
 
<br/>
 
Again this data point should also include an indication of whether the data sharing statement contains embedded html, so that the tags can be interpreted correctly.
 
 
<br/>
 
<br/>
 
<br/>
 
<br/>
Line 136: Line 48:
 
'''A.7 Study Topics (0…n)'''<br/>
 
None, one or more topic names or phrases, keywords, or classification codes describing the study or aspects of it.  Topics is preferred to ‘Subjects’ because within clinical research ‘Study subjects’ is normally understood as referring to the study participants.
 
<br/>
 
In the context of clinical research, most data objects – which will not have listed topics associated with them – would, for purposes of discoverability, take on the topics or keywords associated with their parent study. (The exception is journal articles, which almost always do have linked keywords).
 
 
<br/>
 
<br/>
 
In the context of clinical research, most data objects – which will not have listed topics associated with them – would, for purposes of discoverability, take on the topics or keywords associated with their parent study. (The exception is journal articles, which almost always do have linked keywords).
 
Line 145: Line 55:
 
The study topic record is composite and has the following structure:
 
 
* topic type (categorised, as selected from a predetermined list of code-text pairs). Topic types include ‘condition’, ‘organism’, ‘chemical / biological agent’, and ‘geographic’.
* A boolean indicating whether or not the term has been MESH coded,  
+
* A boolean indicating whether or not the term has been MESH coded, either in the original data or later as part of the MDR extraction process,
* the MESH code, if present
+
* the MESH code and MESH term, if MESH coding was done, either in the original data or later as part of the MDR extraction process,
* the topic name or value - either the original or if MESH coded the preferred MESH term  
+
* a code representing the controlled terminology (CT) originally used (if there was one) and the code within that CT scheme. As explained above, this is most often MESH (CT code 14) together with the MESH code, but in some cases it may be MedDRA, ICD10 or some other scheme, together with the corresponding code in that scheme,
* a MESH qualifier code and qualifier value where one exists (applies only to PubMed articles at the moment),
+
* the original value, i.e. the topic as originally expressed, in whatever, or no, coding scheme.
* the original value.
+
Thus:<br/>
 +
A topic that was MESH coded in the original source material will be structured as<br/>
 +
{topic-type, TRUE, MESH code, MESH term, MESH-CT Code, MESH code, original value (= MESH term)}<br/>
 +
A topic that was ''not'' MESH coded in the original source material, but was able to be coded as part of the extraction process, will be structured as<br/>
 +
{topic-type, TRUE, MESH code, MESH term, (null), (null), original value}, OR, if originally coded using a CT 'X', as {topic-type, TRUE, MESH code, MESH term, X-CT code, code in X, original value},<br/>
 +
A topic that was ''not'' MESH coded in the original source material, and was ''not'' able to be coded as part of the extraction process, will be structured as<br/>
 +
{topic-type, FALSE, (null), (null), (null), (null), original value}, OR, if originally coded using a CT 'X', as {topic-type, FALSE, (null), (null), X-CT code, code in X, original value}.<br/>
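
The three cases above can be sketched in code. The following Python fragment is only an illustration: the names `StudyTopic` and `build_topic`, the field names and the example codes are hypothetical and are not taken from the MDR implementation. Only the value 14 for MESH as a CT code comes from the description above.

```python
from typing import NamedTuple, Optional

MESH_CT_CODE = 14  # CT code for MESH, as given in the description above


class StudyTopic(NamedTuple):
    topic_type: str
    mesh_coded: bool
    mesh_code: Optional[str]
    mesh_term: Optional[str]
    original_ct_code: Optional[int]   # code of the CT originally used, if any
    original_ct_value: Optional[str]  # code within that CT scheme, if any
    original_value: str


def build_topic(topic_type: str, original_value: str,
                mesh_code: Optional[str] = None,
                mesh_term: Optional[str] = None,
                original_ct_code: Optional[int] = None,
                original_ct_value: Optional[str] = None) -> StudyTopic:
    """Assemble a topic record; mesh_coded is True only if a MESH code exists."""
    mesh_coded = mesh_code is not None
    return StudyTopic(
        topic_type,
        mesh_coded,
        mesh_code,
        mesh_term if mesh_coded else None,
        original_ct_code,
        original_ct_value,
        original_value,
    )


# "D000000" is a placeholder, not a real MESH entry.
coded_in_source = build_topic("condition", "Example term", "D000000",
                              "Example term", MESH_CT_CODE, "D000000")
coded_on_extraction = build_topic("condition", "example free text",
                                  "D000000", "Example term")
never_coded = build_topic("condition", "example free text")
```

Deriving the boolean from the presence of a MESH code, rather than storing it independently, is a design choice of this sketch; it keeps the flag and the code from disagreeing.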
 
<br/>
 
<br/>
 
'''A.8 Study Type (1)'''<br/>
 
Line 160: Line 76:
 
<br/>
 
<br/>
 
'''A.10 Study Enrolment Number (0..1)'''<br/>
 
This is an integer representing the anticipated or actual number of study participants.
+
This is a string representing the anticipated or actual number of study participants. It is usually a simple number, but may be a short sentence providing enrolment details (e.g. for different sub-protocols).
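
Because the enrolment value is a string, a consuming system that wants a number has to handle the free-text case. A minimal Python sketch follows; the function name `enrolment_as_int` and the decision to return `None` for anything other than a plain number are assumptions made for this example, not part of the schema.

```python
from typing import Optional


def enrolment_as_int(enrolment: Optional[str]) -> Optional[int]:
    """Return the enrolment as an integer when the string is a plain number.

    Returns None for missing values and for free-text enrolment details.
    """
    if enrolment is None:
        return None
    text = enrolment.strip()
    # isascii() guards against Unicode digit characters that int() rejects.
    if text.isascii() and text.isdigit():
        return int(text)
    return None
```

A caller would normally keep the original string for display and use the integer, when there is one, only for sorting or filtering.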
 
<br/>
 
<br/>
 
<br/>
 
<br/>
Line 180: Line 96:
 
<br/>
 
<br/>
 
'''A.15 Provenance String (1)'''<br/>
 
A string indicating the source or sources of the data (usually a trial registry) and the date-times on which the data was last downloaded from the source or sources.
+
A string indicating the source or sources of the data (usually a trial registry) and the date-times on which the data was last downloaded from the source or sources.<br/>
 +
<br/>
 +
'''A.16 Study Countries (0..n)'''<br/>
 +
The geonames numerical Id and name of the country or countries where recruitment of study participants took place.<br/>
 +
<br/>
 +
'''A.17 Study Sites (0..n)'''<br/>
 +
The clinical sites where the study took place, when that information is available in the source material. The data includes the facility's name (usually a hospital) as a compound organisational record, the geonames id and the name of the city, the geonames id and the name of the country, and the status (recruiting, stopped recruiting etc.) of the site, as of the last data harvest.
 +
<br/><br/>
 +
'''A.18 Study Start Time (0..1)'''<br/>
 +
The year and month that the study began (usually defined as 'first patient first visit'), presented as two integers.
 
<br/><br/>
 
<br/><br/>
 +
'''A.19 Study contributors (0...n)'''<br/>
 +
The main organisations and personnel involved in designing, running, funding and sponsoring a study. It is usually a set of institutional and / or personal names. Each contributor description, which is composite, needs to indicate whether or not it refers to an individual or an organisation (or collaboration). If it is a person then fields are available for names, ORCID and affiliation details. If not the organisation name needs to be provided, with an ECRIN and / or ROR Id if one is available. The composite structure (which follows closely that in DataCite for object contributors) is therefore
 +
* contributor type (categorised, as selected from a predetermined list of code-text pairs).
 +
* whether an individual or not,
 +
&nbsp;&nbsp;and if they are…
 +
* given name,
 +
* family name,
 +
* full name,
 +
* ORCID id if available
 +
* affiliation string, (as provided in source data)
 +
* affiliation organisation, if it can be deduced from the affiliation string (name and Id(s) - e.g. ROR Id, ECRIN MDR id).
 +
&nbsp;&nbsp;but if they are not….
 +
* The organisation (name and Id(s) - e.g. ROR Id, ECRIN MDR id).
 +
The contributor types available include those defined within DataCite, but the list has been extended in the context of clinical research, to include (for example) trial sponsor, trial funder, device provider, central laboratory, public contact, study lead, (site) principal Investigator.
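
The composite contributor structure described above might be represented as follows. This is a sketch in Python under stated assumptions: the class names, field names and the validation rule in `__post_init__` are hypothetical, introduced only to illustrate the person / organisation split, and the example names are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Organisation:
    name: str
    ror_id: Optional[str] = None   # ROR Id, if available
    mdr_id: Optional[int] = None   # ECRIN MDR Id, if available


@dataclass
class Contributor:
    contrib_type: str                  # categorised value, e.g. "trial sponsor"
    is_individual: bool
    given_name: Optional[str] = None
    family_name: Optional[str] = None
    full_name: Optional[str] = None
    orcid: Optional[str] = None
    affiliation_string: Optional[str] = None
    affiliation_org: Optional[Organisation] = None
    organisation: Optional[Organisation] = None

    def __post_init__(self) -> None:
        # Assumed rule: a person carries a name, a non-person carries an
        # Organisation record, and the two sets of fields are not mixed.
        if self.is_individual:
            if not (self.full_name or self.family_name):
                raise ValueError("an individual needs at least a name")
            if self.organisation is not None:
                raise ValueError("an individual should not carry 'organisation'")
        elif self.organisation is None:
            raise ValueError("a non-individual needs an organisation")


sponsor = Contributor("trial sponsor", False,
                      organisation=Organisation("Example University"))
lead = Contributor("study lead", True, given_name="A.",
                   family_name="Person", full_name="A. Person")
```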
 +
<br/>
 +
<br/>
  
 
== Data Object identifiers ==
 
Line 202: Line 143:
 
* the identifier value,
 
* the identifier type (categorised, as selected from a predetermined list of code-text pairs),
 
* the assigning organisation (name and where available an Id within a suitable system).
+
* the assigning organisation (name and where available Id(s) - e.g. ROR Id, ECRIN MDR Id).
 
* (optionally), the date the identifier was assigned
 
 
<br/>
 
<br/>
Line 222: Line 163:
 
== Creators and Contributors ==
 
 
'''C.1 Creators (1...n)'''<br/>
 
The main personnel involved in producing the data, or the authors of a publication. It may be a set of institutional and / or personal names. Each creator description, which is composite, therefore needs to indicate whether or not it refers to an individual or an organisation (or collaboration). If it is a person then fields are available for names, ORCID and affiliation details. If not the organisation name needs to be provided, with an Id if one is available. The composite structure (which follows closely that in DataCite) is therefore
+
The main personnel involved in producing the data, or, much more commonly, the authors of a publication. It may include institutional or collaboration names. Each creator description, which is composite, therefore needs to indicate whether or not it refers to an individual or an organisation. If it is a person then fields are available for names, ORCID and affiliation details. If not the organisation name needs to be provided, with an Id if one is available. The composite structure (which follows closely that in DataCite) is therefore
 
* whether an individual or not,  
 
&nbsp;&nbsp;and if they are…
 
Line 229: Line 170:
 
* full name,  
 
* ORCID id if available
 
* affiliation, (string description of department / organisation)
+
* affiliation string, (as provided in source data)
 +
* affiliation organisation, if it can be deduced from the affiliation string (name and Id(s) - e.g. ROR Id, ECRIN MDR id).
 
&nbsp;&nbsp;but if they are not….
 
* The organisation (name and where available an Id within a suitable system).
+
* The organisation (name and Id(s) - e.g. ROR Id, ECRIN MDR id).
Most data objects where the metadata is harvested retrospectively, other than journal articles, are unlikely to have creators explicitly identified.
+
Most data objects where the metadata is harvested retrospectively, other than journal articles, are unlikely to have creators explicitly identified. The contributors to the associated study, with a role of 'study lead' or 'sponsor', could be viewed as the creators of non-article objects but are no longer exported from the system in that role.
 
<br/>
 
<br/>
 
<br/>
 
<br/>
Line 240: Line 182:
 
The types available include those defined within DataCite, but the list has been extended in the context of clinical research, to include (for example) trial sponsor, trial funder, device provider, central laboratory, public contact, study lead, (site) principal Investigator.  
 
 
<br/>
 
<br/>
In general the contributor lists for data objects should be derived from the lists stored for their parent study, even though in the system contributors are only presented for data objects, not studies. These will usually include the study lead(s) and sponsors. Where data objects do record their contributors, they would normally take precedence over that of the study, but organisational contributors, in particular study sponsors and funders, should still be added to the data object list.<br/> 
+
<br/>
Any system retrieving creator / contributor data therefore needs to collect that data for the studies as well as for the data objects themselves. Creator data is generally collected and stored in exactly the same way as contributor data – the contribution type is simply set as ‘creator’.
 
<br/><br/>
 
  
 
== Object Dates ==
Line 260: Line 200:
 
* end day, for date ranges only, may not be present if date range partial
 
* comments – any relevant / explanatory comments
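
As an illustration of the D.2 date structure (date type, is range, date as string, start and end year / month / day, comments), the record could be sketched in Python as below. The class name `ObjectDate`, the field names, the format of the date strings and the consistency check are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ObjectDate:
    date_type: str                 # categorised value
    is_range: bool
    date_as_string: str            # string format here is illustrative
    start_year: Optional[int] = None
    start_month: Optional[int] = None
    start_day: Optional[int] = None
    end_year: Optional[int] = None   # end parts: date ranges only,
    end_month: Optional[int] = None  # and may be absent if the
    end_day: Optional[int] = None    # range is partial
    comments: Optional[str] = None

    def __post_init__(self) -> None:
        end_parts = (self.end_year, self.end_month, self.end_day)
        if not self.is_range and any(p is not None for p in end_parts):
            raise ValueError("end year / month / day apply to date ranges only")


single = ObjectDate("example date type", False, "2020 Mar 15", 2020, 3, 15)
partial_range = ObjectDate("example date type", True, "2019 to 2021",
                           start_year=2019, end_year=2021)
```

Keeping every date part optional allows partial dates (a year only, or a year and month) to be stored without inventing missing components.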
<br/><br/>
+
<br/>
  
 
== Data Object Attributes ==
 
Section E is mainly based on the DataCite metadata specification, though a few extensions (E.3 – E.5) have been added for datasets (as opposed to document-based data objects).
<br/>
+
<br/><br/>
'''E.1 Class (1)'''
+
'''E.1 Class (1)'''<br/>
 
A categorised value, taken from the existing DataCite controlled list for ‘Resource Type General’. For clinical research data objects, the class will usually be one of:
 
* Text  
 
 
* Dataset
 
 
though other options include: Data Paper, Software, Service, Audiovisual, and Interactive Resource.
 
<br/>
+
<br/><br/>
'''E.2 Type (1)'''
+
'''E.2 Type (1)'''<br/>
 
A categorised description of the type of data object, at a more specific level than Class. The type and class should form a pair (as with DataCite), e.g. Dataset/census data or Text/conference abstract.  
 
 
<br/>
 
<br/>
 
Unlike DataCite, both class and type are mandatory in the ECRIN schema. The types available include the CASRAI classifications of document objects, recommended by DataCite, together with additions to the list that represent object types of particular importance to clinical research (e.g. protocols, clinical study reports, statistical analysis plans, and datasets of various kinds).
 
<br/>
+
<br/><br/>
'''E.3 Record key type (1, Datasets only)'''
+
'''E.3 Record key type (1, Datasets only)'''<br/>
 
This is a composite item that gives the type of record keys used within the dataset, indicating in particular whether it is pseudonymised or anonymised. The contents are:
 
* Record key type (categorised, as selected from a predetermined list)
 
 
* Details – text description to elaborate / clarify details
 
 
Note that the categorisation into 'pseudonymised', 'anonymised', 'identifiable' etc. is based upon the description provided by the data controller / manager in the data (if one is supplied). The classification is therefore based on the data controller's understanding of the relevant terms. No attempt is made to apply a categorisation using standard criteria, as the meaning of the words used ('pseudonymised', 'anonymised', etc.) may vary between different legal jurisdictions, over time, and in different usage contexts. The categorisation should therefore be read as only a very approximate guide to any legal requirements associated with the data.
 
+
<br/><br/>
'''E.4 De-identification level (1, Datasets only)'''
+
'''E.4 De-identification level (1, Datasets only)'''<br/>
 
An item that indicates the amount of de-identification that has been applied to the dataset. The item consists of:
 
* De-identification level (categorised, as selected from a predetermined list)
 
 
* Additional actions carried out - boolean data indicating if any of the following applies: a) direct identifiers have been removed, b) US HIPAA rules for de-identification have been applied, c) dates have been rebased or replaced with integers, d) narrative text fields have been removed , and e) k-anonymisation has been carried out.
 
 
* Details – text description to elaborate / clarify details.
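
The de-identification item, with its five boolean actions, could be held in a record like the one sketched below. The Python class name `DeidentLevel`, the field names and the helper `actions_applied` are hypothetical, chosen for this example; the level code and text shown are placeholders rather than values from the actual category list.

```python
from dataclasses import dataclass
from typing import List, Optional

# One field per action (a) to (e) listed above; names are illustrative.
ACTION_FIELDS = (
    "direct_ids_removed",
    "hipaa_applied",
    "dates_rebased",
    "narrative_removed",
    "k_anonymised",
)


@dataclass
class DeidentLevel:
    level_code: int                # categorised value (code)
    level_text: str                # categorised value (text)
    direct_ids_removed: bool = False
    hipaa_applied: bool = False
    dates_rebased: bool = False
    narrative_removed: bool = False
    k_anonymised: bool = False
    details: Optional[str] = None

    def actions_applied(self) -> List[str]:
        """Names of the additional actions flagged as carried out."""
        return [name for name in ACTION_FIELDS if getattr(self, name)]


example = DeidentLevel(0, "example level",
                       direct_ids_removed=True, dates_rebased=True)
```

Storing each action as its own boolean, as the schema describes, lets a search interface filter on individual actions without parsing the free-text details.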
 
+
<br/>
'''E.5 Associated consent (1, Datasets only)'''
+
'''E.5 Associated consent (1, Datasets only)'''<br/>
 
The consent in question is for secondary use of the data - consent for primary use is assumed.<br/>
 
 
The data item consists of:
 
Line 294: Line 234:
 
* Possible additional restrictions, represented as a series of boolean data points: a) if use is limited to non-commercial research, b) if there are any geographical restrictions on re-use, c) if only certain types of research are permitted, d) if only genetic research is allowed, and f) whether or not methodological or tool research (e.g. developing machine learning algorithms) is allowed,
 
* Details – text description to elaborate / clarify details, in particular to expand upon any of the additional restrictions listed as being present.
 
* Details – text description to elaborate / clarify details, in particular to expand upon any of the additional restrictions listed as being present.
 
+
<br/>
 
'''E.6 Description (0..n)'''
 
'''E.6 Description (0..n)'''
 
None, one or more pieces of additional general information about the data object, so far as that is publicly available (journal abstracts, although an obvious ‘descriptor’, remain the property of the publisher and cannot in general be reproduced within the system). The item is composite, consisting of:
 
None, one or more pieces of additional general information about the data object, so far as that is publicly available (journal abstracts, although an obvious ‘descriptor’, remain the property of the publisher and cannot in general be reproduced within the system). The item is composite, consisting of:
Line 301: Line 241:
 
* description text, the description itself
* language code, the 2 character ISO code
* a boolean indicating whether or not the description contains html, useful to know for display purposes
<br/>
'''E.7 EOSC Category (0..1)'''<br/>
An integer (0, 1, 2 or 3) that conforms to an EOSC categorisation recommended for data objects. The classification is
* 0 = Non-personal data. Contains no information that refers to any identified or identifiable living individual.
Line 310: Line 249:
 
* 3 = Sensitive pseudonymised data.
In general, almost all documents expected in the MDR will be categorised as 0, whilst all IPD datasets will be categorised as 3 - unless there is general agreement that they are fully anonymised, in which case they become 1.
<br/><br/>
'''E.8 Language (1..n)'''<br/>
The language or languages of the data object itself (not of a description of the object), using the ISO language codes (e.g. en, de, fr). DataCite assumes a single language but some clinical research data objects (e.g. journal articles) are created in two or more languages. There may therefore be more than one language record.
<br/>
<br/>
'''E.9 Inter-object relationships (0..n)'''<br/>
Data objects can be related to each other – for example one object can be a supplement to another, or a new version of another, or be derived from, or the source of, one or more other data objects.
A particularly important relationship for clinical study data is the pairing of ‘Has Metadata’ and ‘Is Metadata for’. Metadata in clinical research can include, for example, a data dictionary that provides the metadata for a dataset. Note that the metadata in this context is itself a file, and a data object in its own right. Each record is composite and must include:
Line 322: Line 261:
 
Because few data objects have DOIs, it is usually a requirement that both subject and target objects are stored within the same system. This allows the identifier to be an internal identifier within that system, making navigation to it much simpler.
<br/>
<br/>
'''E.10 Topic (0...n)'''<br/>
None, one or more topic names or phrases, keywords, or classification codes describing the object or aspects of it. In the context of clinical research, most data objects will not have listed topics associated with them (the exception is journal articles, which almost always do have linked keywords). Topics for data objects other than journal articles should be those of the parent study or studies. This introduces a substantial amount of redundancy but it means that topics can be searched and used for filtering across all (rather than just some) data objects.<br/>
The structure of each topic item is exactly the same as for study topics:
<br/><br/>
The study topic record is composite and has the following structure:
* topic type (categorised, as selected from a predetermined list of code-text pairs). Topic types include ‘condition’, ‘organism’, ‘chemical / biological agent’, and ‘geographic’.
* a boolean indicating whether or not the term has been MESH coded, either in the original data or later as part of the MDR extraction process,
* the MESH code and MESH term, if MESH coding was done, either in the original data or later as part of the MDR extraction process,
* a code representing the controlled terminology (CT) originally used (if there was one) and the code within that CT scheme. As explained above, this is most often MESH (CT code 14) together with the MESH code, but in some cases it may be MedDRA, ICD10 or some other scheme, together with the corresponding code in that scheme,
* the original value, i.e. the topic as originally expressed, whether in a coding scheme or as free text.
Thus:<br/>
A topic that was MESH coded in the original source material will be structured as<br/>
{topic-type, TRUE, MESH code, MESH term, MESH-CT Code, MESH code, original value (= MESH term)}<br/>
A topic that was ''not'' MESH coded in the original source material, but was able to be coded as part of the extraction process, will be structured as<br/>
{topic-type, TRUE, MESH code, MESH term, (null), (null), original value}, OR, if originally coded using a CT 'X', as {topic-type, TRUE, MESH code, MESH term, X-CT code, code in X, original value},<br/>
A topic that was ''not'' MESH coded in the original source material, and was ''not'' able to be coded as part of the extraction process, will be structured as<br/>
{topic-type, FALSE, (null), (null), (null), (null), original value}, OR, if originally coded using a CT 'X', as {topic-type, FALSE, (null), (null), X-CT code, code in X, original value}.<br/>
<br/>
== Location and Access details ==
An area where the existing DataCite schema needs to be extended is in providing a full description of the access arrangements for any data object. The following data points are proposed.
<br/><br/>
'''F.1 Managing Organisation (1)'''<br/>
In this schema, this is the organisation that manages access to the document or data object, including making the overall decision about access type (see F.2). For data this would usually be the name of the organisation that was the data controller. For journal papers it would be the name of the company that publishes the journal, and which would normally run the primary web site on which it can be accessed. In both cases the name would be associated with an id or ids in a suitable system (e.g. a ROR Id, ECRIN MDR Id).<br/>
<br/>
'''F.2 Access Type (1)'''<br/>
A categorised value (code-text pair) that represents in broad terms the type of access under which the object is available, for example publicly available download, restricted download (restricted to members of a specific group), or on-screen access after review on a case-by-case basis.<br/>
<br/>
'''F.3 Access Details (Mandatory for any of the non-public access types)'''<br/>
This is a composite element with three elements:
* A textual summary of the access being offered, for example identifying the groups to which access is granted, the criteria on which a case-by-case decision would be based, any further restrictions on on-screen access, etc. It may reference web based resources, on the object manager’s web site or elsewhere (see below).
* A link to a resource that explains how access may be gained, e.g. how a group can be joined, and / or how application can be made for access on an individual basis. This would normally be a link to a web page on the managing organisation’s site that explains access procedures or provides an application proforma.
* A date, if one is available, representing the last time the URL was checked to be in existence (i.e. returned a 200 ‘success’ code rather than a 404).
<br/>
'''F.4 Resources (Mandatory unless case-by-case access)'''<br/>
The web based resources that represent this data object. Mandatory for public objects, for which at least one resource should be listed. For data objects simply listed as existing, but under managed access, this information may not be available for harvesting. Each record is composite and includes
* the name of the organisation holding the resource (e.g. a data repository, bibliographic system, trial registry)
Line 359: Line 308:
 
In addition...
* resource comments, a free text field to hold further details of the resource, in particular to support machine processing. These could include the schema used for XML files, and / or the character encoding used for text files (e.g. UTF-8 versus UTF-16) or the presence and types of any byte order marks.
<br/>
'''F.5 Rights (0..n)'''
Any intellectual property rights information for the data object, as a textual statement of the rights management associated with the resource. The item is composite, and should include:
Line 365: Line 314:
 
* a uri that identifies an information source, usually a url to a web page,
* any additional comments or description of the rights regime.
<br/><br/>
 
 
== References ==
 

Latest revision as of 10:33, 11 November 2022

Last updated: 20/09/2022

The following provides a more detailed description of each of the data points in the two schemas, including the components of composite data points. It does not, however, provide a full description of a practical implementation of the schemas, e.g. when storing schema data in a database or within JSON files. An implementation level description, which requires record ids and audit fields, is given by the two JSON file definitions, and a discussion of the implementation of the schema in the MDR database is provided in the Data Extraction section of the wiki.

Study Attributes

Strictly speaking these data points are not metadata because they do not describe data – instead they summarise some key attributes of the study, especially those that promote its discoverability.

A.1 Display Title (1)
This is, by default, the shorter or 'public' title. If there is no such title, the full scientific or protocol title should be used. Whichever title is used, it should also appear within the list of study titles (see A.3), where a fuller set of title attributes can be provided.

A.2 Study Identifiers (0...n)
None, one or more unique identifiers that have been assigned to the study. For studies entered into trial registries these should include, as a minimum, the registry ID(s), but any IDs that have been externally applied, and that might be useful in identifying the study, can be included, for instance funders' and / or sponsors' ids.
These IDs are composite. If provided, they must include

  • the identifier value,
  • the identifier type (categorised, as selected from a predetermined list of code-text pairs),
  • the assigning organisation (name and where available the organisation's Id(s) - e.g. ROR Id, ECRIN MDR Id).
  • (optionally), the date the identifier was assigned
  • (optionally), any associated URL (for instance some public funder Ids in the US will link to a summary page about the grant and its use).


A.3 Study Titles (0..n)
Studies usually have a short or ‘public’ title as well as a full scientific one (as used on the protocol document), and can also be described by an acronym. They may have titles in more than one language.
All titles should be included in this list. The type is composite, and should include:

  • the title text,
  • the title type (categorised, as selected from a predetermined list of code-text pairs),
  • the language of the title, as a 2 character ISO code,
  • (optionally), any additional comments about their genesis (e.g. "authors' translation"),


A.4 Brief Description (0…1)
Most study registry systems require a brief, non-specialist description of the study, which usually ranges from a few lines to a paragraph or two. This can be useful in assessing the relevance of studies to a particular search task and so is included in the study data points.

A.5 Data Sharing Statement (0..1)
In recent years several trial registries have requested study sponsors and / or leads to indicate if they will make individual participant data and related documents available for sharing, and if so how and when the data would be available. As such a statement is central to the purpose of the MDR it is captured within the study data, so that if present it can be displayed.

A.6 Study Features (0…n)
None, one or more design features of the study.
The design features available will depend on whether the study is interventional or observational. Available types for interventional studies include Phase, Primary Purpose, Allocation method, Intervention Design and Masking. For observational studies the types include Observational Model, Time Perspective, and whether or not specimens are retained.
In each case the possible values are categorised, and so restricted to a pre-defined set of values. This makes the feature types useful candidates for filtering of study records within a web portal and / or API.
The composite study feature record is therefore

  • feature type (categorised code-text, as selected from a predetermined list).
  • feature value, also categorised code-text. Each feature type has an associated list of options.


A.7 Study Topics (0…n)
None, one or more topic names or phrases, keywords, or classification codes describing the study or aspects of it. Topics is preferred to ‘Subjects’ because within clinical research ‘Study subjects’ is normally understood as referring to the study participants.
In the context of clinical research, most data objects – which will not have listed topics associated with them – would, for purposes of discoverability, take on the topics or keywords associated with their parent study. (The exception is journal articles, which almost always do have linked keywords).
The topics can be free text, but in many cases the text is structured, i.e. selected from a controlled vocabulary. The vocabulary that is used the most, by a large margin, is the MESH code system developed by the US National Library of Medicine. This is because MESH codes are applied to both PubMed records and ClinicalTrials.gov trial registry entries. MedDRA and ICD10 are also used by some sources, but far less often. To provide a more consistent coding scheme for topics, non-coded terms are also matched, wherever possible, to MESH terms. Further work is required, however, to reduce the proportion of non-coded items.

The study topic record is composite and has the following structure:

  • topic type (categorised, as selected from a predetermined list of code-text pairs). Topic types include ‘condition’, ‘organism’, ‘chemical / biological agent’, and ‘geographic’.
  • A boolean indicating whether or not the term has been MESH coded, either in the original data or later as part of the MDR extraction process,
  • the MESH code and MESH term, if MESH coding was done, either in the original data or later as part of the MDR extraction process,
  • a code representing the controlled terminology (CT) originally used (if there was one) and the code within that CT scheme. As explained above, this is most often MESH (CT code 14) together with the MESH code, but in some cases it may be MedDRA, ICD10 or some other scheme, together with the corresponding code in that scheme,
  • the original value, i.e. the topic as originally expressed, whether in a coding scheme or as free text.

Thus:
A topic that was MESH coded in the original source material will be structured as
{topic-type, TRUE, MESH code, MESH term, MESH-CT Code, MESH code, original value (= MESH term)}
A topic that was not MESH coded in the original source material, but was able to be coded as part of the extraction process, will be structured as
{topic-type, TRUE, MESH code, MESH term, (null), (null), original value}, OR, if originally coded using a CT 'X', as {topic-type, TRUE, MESH code, MESH term, X-CT code, code in X, original value},
A topic that was not MESH coded in the original source material, and was not able to be coded as part of the extraction process, will be structured as
{topic-type, FALSE, (null), (null), (null), (null), original value}, OR, if originally coded using a CT 'X', as {topic-type, FALSE, (null), (null), X-CT code, code in X, original value}.
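
The three cases above can be sketched as a small Python structure. This is only an illustration: the class name `TopicRecord`, its field names and the helper `make_topic` are assumptions made for this example and are not taken from the schema or from the MDR code. Only the ordering of the elements and the MESH CT code of 14 come from the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MESH_CT_CODE = 14  # code for MESH as the originating CT scheme, as stated above


@dataclass(frozen=True)
class TopicRecord:
    # Field names are illustrative; the schema fixes only order and meaning.
    topic_type: str
    mesh_coded: bool
    mesh_code: Optional[str]
    mesh_term: Optional[str]
    original_ct_code: Optional[int]
    original_ct_value: Optional[str]
    original_value: str


def make_topic(
    topic_type: str,
    original_value: str,
    mesh: Optional[Tuple[str, str]] = None,
    original_ct: Optional[Tuple[int, str]] = None,
) -> TopicRecord:
    """Build a topic record.

    mesh: (MESH code, MESH term) if the term is MESH coded, else None.
    original_ct: (CT scheme code, code within that scheme) if the source
    used a controlled terminology, else None.
    """
    mesh_code, mesh_term = mesh if mesh else (None, None)
    ct_code, ct_value = original_ct if original_ct else (None, None)
    return TopicRecord(
        topic_type, mesh is not None, mesh_code, mesh_term,
        ct_code, ct_value, original_value,
    )


# "D0000000" is a placeholder, not a real MESH code.
coded_in_source = make_topic(
    "condition", "Example term",
    mesh=("D0000000", "Example term"),
    original_ct=(MESH_CT_CODE, "D0000000"),
)
coded_on_extraction = make_topic(
    "condition", "example free text", mesh=("D0000000", "Example term")
)
uncoded = make_topic("condition", "example free text")
```

The boolean is derived from the presence of the MESH pair rather than passed separately, so that the flag and the MESH fields cannot disagree.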

A.8 Study Type (1)
This is a single term representing – in very broad terms – the type of clinical research study, e.g. ‘interventional’ (= clinical trial), ‘observational’, ‘expanded access’. It is categorised and must be selected from a predefined list. It is included as an aid to filtering records.

A.9 Study Status (1)
This is a single term representing the current status of the study in terms of its life-cycle, e.g. ‘not yet recruiting’, ‘recruiting’, ‘completed’, ‘terminated (early)’. It is categorised and must be selected from a predefined list. It is included as an aid to filtering records.

A.10 Study Enrolment Number (0..1)
This is a string representing the anticipated or actual number of study participants. Usually a simple number but may be a short sentence providing enrolment details (e.g. for different sub-protocols).

A.11 Study Gender Eligibility (0..1)
This is a code / text pair that indicates whether the study is only open to male or female participants, or both.

A.12 Study Minimum and Maximum ages (0..1)
These are integers representing the minimum and maximum age criteria for study participants, where they exist. In each case they are associated with a term indicating the time units associated with the integer. This is usually 'Years', but, for example for paediatric studies, may be months or weeks, or even days or hours.
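
Because the limits may be expressed in different time units, comparing them needs a common scale. The sketch below shows one possible way to do that in Python. The names `AgeLimit` and `age_eligible`, the unit strings other than 'Years', and the approximate conversion factors are all assumptions made for illustration; the schema itself defines only an integer and an associated unit term.

```python
from dataclasses import dataclass
from typing import Optional

# Approximate days per unit, assumed here purely so that limits in
# different units can be compared; the schema does not define these factors.
DAYS_PER_UNIT = {
    "Years": 365.25,
    "Months": 30.4375,
    "Weeks": 7.0,
    "Days": 1.0,
    "Hours": 1.0 / 24.0,
}


@dataclass(frozen=True)
class AgeLimit:
    value: int
    unit: str = "Years"

    def in_days(self) -> float:
        # Raises KeyError for a unit that is not in the table above.
        return self.value * DAYS_PER_UNIT[self.unit]


def age_eligible(
    age: AgeLimit,
    minimum: Optional[AgeLimit] = None,
    maximum: Optional[AgeLimit] = None,
) -> bool:
    """True if age lies within the limits; a missing limit imposes no bound."""
    days = age.in_days()
    if minimum is not None and days < minimum.in_days():
        return False
    if maximum is not None and days > maximum.in_days():
        return False
    return True
```

For example, `age_eligible(AgeLimit(6, "Months"), AgeLimit(4, "Weeks"), AgeLimit(2, "Years"))` compares a six month old against a paediatric range given in weeks and years.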

A.13 Inter-study relationships (0..n)
Studies can have relationships between themselves, for instance one study can be a feasibility study for a later one, or a study can represent an ‘expanded access’ version of a clinical trial (when a new drug is available for compassionate reasons, even though recipients fail eligibility criteria for the study, and its use is reported on a case by case basis), or one study can represent a continuation of another, in an ongoing series. This data can be useful for tracking related studies and their data objects and so is included in the metadata scheme. It is composite, with

  • the relationship type (categorised, as selected from a predetermined code-text list)
  • the identifier of the other or ‘target’ study (within a suitable system, normally the same system in which the ‘subject’ study is found).


A.14 Linked Data Objects (1..n)
The linked data objects (there should be at least one, representing the entry in a trial registry system) are listed as object identifiers, usually accession Ids within an appropriate database system (e.g. the ECRIN MDR).

A.15 Provenance String (1)
A string indicating the source or sources of the data (usually a trial registry) and the date-times on which the data was last downloaded from the source or sources.

A.16 Study Countries (0..n)
The geonames numerical Id and name of the country or countries where recruitment of study participants took place.

A.17 Study Sites (0..n)
The clinical sites where the study took place, when that information is available in the source material. The data includes the facility's name (usually a hospital) as a compound organisational record, the geonames id and the name of the city, the geonames id and the name of the country, and the status (recruiting, stopped recruiting etc.) of the site, as of the last data harvest.

A.18 Study Start Time (0..1)
The year and month that the study began (usually defined as 'first patient first visit'), presented as two integers.

A.19 Study contributors (0...n)
The main organisations and personnel involved in designing, running, funding and sponsoring a study. It is usually a set of institutional and / or personal names. Each contributor description, which is composite, needs to indicate whether or not it refers to an individual or an organisation (or collaboration). If it is a person then fields are available for names, ORCID and affiliation details. If not the organisation name needs to be provided, with an ECRIN and / or ROR Id if one is available. The composite structure (which follows closely that in DataCite for object contributors) is therefore

  • contributor type (categorised, as selected from a predetermined list of code-text pairs).
  • whether an individual or not,

  and if they are…

  • given name,
  • family name,
  • full name,
  • ORCID id if available
  • affiliation string, (as provided in source data)
  • affiliation organisation, if it can be deduced from the affiliation string (name and Id(s) - e.g. ROR Id, ECRIN MDR id).

  but if they are not….

  • The organisation (name and Id(s) - e.g. ROR Id, ECRIN MDR id).

The contributor types available include those defined within DataCite, but the list has been extended in the context of clinical research, to include (for example) trial sponsor, trial funder, device provider, central laboratory, public contact, study lead, (site) principal Investigator.
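
The individual / organisation split described above could be modelled along the following lines. This is a hedged sketch in Python: the class and field names are hypothetical, and the validation rules (an organisation record must carry an organisation and no personal fields, and a person record must not carry a contributing organisation) are one reading of the text rather than rules stated by the schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Organisation:
    name: str
    ror_id: Optional[str] = None
    ecrin_id: Optional[str] = None


@dataclass
class Contributor:
    contributor_type: str  # e.g. 'trial sponsor', 'study lead'
    is_individual: bool
    # Fields used when the contributor is a person
    given_name: Optional[str] = None
    family_name: Optional[str] = None
    full_name: Optional[str] = None
    orcid_id: Optional[str] = None
    affiliation_string: Optional[str] = None
    affiliation_organisation: Optional[Organisation] = None
    # Field used when the contributor is an organisation
    organisation: Optional[Organisation] = None

    def __post_init__(self) -> None:
        person_fields = (
            self.given_name, self.family_name, self.full_name,
            self.orcid_id, self.affiliation_string,
            self.affiliation_organisation,
        )
        if self.is_individual:
            if self.organisation is not None:
                raise ValueError("a person uses the affiliation fields, not 'organisation'")
        else:
            if self.organisation is None:
                raise ValueError("an organisational contributor needs an organisation")
            if any(field is not None for field in person_fields):
                raise ValueError("personal fields do not apply to an organisation")
```

Keeping both shapes in one record, with a flag, mirrors the composite structure listed above; two separate classes would be an equally reasonable design.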

Data Object identifiers

B.1 Data object identifier (0..1)
In line with the DataCite specification the principal identifier for data objects is seen as a Digital Object identifier or DOI, providing a persistent identifier that can be cited in other contexts. This applies to any objects that are available to others (whether publicly or under managed access).
Unfortunately the great majority of clinical research data objects, apart from journal articles, do not currently have a DOI. If this situation does not improve, consideration may need to be given to a mechanism for minting and applying DOIs – if financially feasible and acceptable to the object creators – or alternative identifiers should be explored, especially if a resolvable URL exists which could be used to link immediately to the resource.

B.2 Display Title (1)
A title for the object. For a journal article it would be a citation of the article in a standard format (up to 3 authors, title, source journal information). For many other data objects the display title would need to be constructed from the study name followed by the object title or type, because in general such objects do not have unique names. In many situations the study name prefix could be dropped as it would be clear from the context (e.g. the study name would be a heading to the list of data objects). The study name and object type or name should therefore be separated by a clear indicator (‘ :: ’ is used within the MDR) so that if and when necessary the two parts of a composite title can be displayed separately.
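
A minimal Python sketch of building and splitting such a composite title is given below. Only the ‘ :: ’ separator comes from the text; the function names are made up for this example, and the split assumes that the study name does not itself contain the separator.

```python
from typing import Optional, Tuple

TITLE_SEPARATOR = " :: "  # separator used within the MDR, as noted above


def make_display_title(study_name: str, object_name: str) -> str:
    """Join the study name and the object name or type into one display title."""
    return f"{study_name}{TITLE_SEPARATOR}{object_name}"


def split_display_title(display_title: str) -> Tuple[Optional[str], str]:
    """Return (study_name, object_name).

    study_name is None when the title has no separator, e.g. for a journal
    article whose display title is a citation.
    """
    study_name, separator, object_name = display_title.partition(TITLE_SEPARATOR)
    if not separator:
        return None, display_title
    return study_name, object_name
```

A display layer that already shows the study name as a heading could then call `split_display_title` and show only the second element.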

B.3 Version (0..1)
The version of the data object, in whatever notation was used by the original data object creators. Many versions of a particular dataset or document may have been created in the course of a clinical study but the normal expectation would be that the final version of a data object (e.g. a protocol) would be the one that was shared with others.
Although it is relatively rare for more than one version of a data object to be made available, if that is the case they should be clearly differentiated using version codes (and relevant dates – see D.2 – and possibly descriptions – see E.6). E.9 describes how the relationship to previous or next versions can be made explicit. If a version item exists it should be displayed with the name and other identifiers.

B.4 Object Identifiers (0...n)
This refers to other unique identifiers that have been assigned to the data object in addition to its DOI primary identifier (for instance, for journal articles, a PubMed id). As with studies such IDs would be composite and include:

  • the identifier value,
  • the identifier type (categorised, as selected from a predetermined list of code-text pairs),
  • the assigning organisation (name and where available Id(s) - e.g. ROR Id, ECRIN MDR Id).
  • (optionally), the date the identifier was assigned


B.5 Object Titles (0...n)
The complete data for the title(s) of the data object. In most cases there will only be one (the constructed display title), but journal papers may have titles in different languages, and in any case their titles will differ from the display title (which is a full citation). The title description is composite, and should include

  • the title text,
  • the title type (categorised, as selected from a predetermined list),
  • the language of the title, as a 2 character ISO code,
  • (optionally), any additional comments about their genesis (e.g. "authors' translation")


B.6 Linked Studies (1...n)
The linked studies (there should be at least one, or the data object should not be included in the system) are listed as study identifiers, usually accession Ids within an appropriate database system.

B.7 Provenance String (1)
A string indicating the source or sources of the data and the date-times on which the data was last downloaded from the source or sources.

Creators and Contributors

C.1 Creators (1...n)
The main personnel involved in producing the data, or, much more commonly, the authors of a publication. It may include institutional or collaboration names. Each creator description, which is composite, therefore needs to indicate whether or not it refers to an individual or an organisation. If it is a person then fields are available for names, ORCID and affiliation details. If not the organisation name needs to be provided, with an Id if one is available. The composite structure (which follows closely that in DataCite) is therefore

  • whether an individual or not,

  and if they are an individual…

  • given name,
  • family name,
  • full name,
  • ORCID id, if available,
  • affiliation string (as provided in source data),
  • affiliation organisation, if it can be deduced from the affiliation string (name and Id(s) - e.g. ROR Id, ECRIN MDR id).

  but if they are not…

  • the organisation (name and Id(s) - e.g. ROR Id, ECRIN MDR id).

Most data objects where the metadata is harvested retrospectively, other than journal articles, are unlikely to have creators explicitly identified. The contributors to the associated study, with a role of 'study lead' or 'sponsor', could be viewed as the creators of non-article objects but are no longer exported from the system in that role.
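
As an illustration only, the creator structure described above could be sketched as a pair of Python dataclasses. The class and field names are hypothetical (they are not taken from the MDR implementation), and the validation rules, such as requiring at least a full or family name for an individual, are assumptions made for the example rather than part of the schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Organisation:
    name: str
    ror_id: Optional[str] = None      # ROR Id, if known
    ecrin_id: Optional[str] = None    # ECRIN MDR Id, if known


@dataclass
class Creator:
    is_individual: bool
    # Used only when is_individual is True
    given_name: Optional[str] = None
    family_name: Optional[str] = None
    full_name: Optional[str] = None
    orcid_id: Optional[str] = None
    affiliation_string: Optional[str] = None
    affiliation_organisation: Optional[Organisation] = None
    # Used only when is_individual is False
    organisation: Optional[Organisation] = None

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the record is consistent."""
        problems = []
        if self.is_individual:
            # Assumed rule: a person must carry some form of name
            if not (self.full_name or self.family_name):
                problems.append("an individual needs at least a full or family name")
            if self.organisation is not None:
                problems.append("'organisation' applies only to non-individuals")
        else:
            if self.organisation is None or not self.organisation.name:
                problems.append("an organisational creator needs an organisation name")
            if any([self.given_name, self.family_name, self.full_name, self.orcid_id]):
                problems.append("personal fields apply only to individuals")
        return problems
```

A contributor record (C.2) could reuse the same shape with one extra field for the contributor type.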

C.2 Contributors (0...n)
From DataCite, contributors are “other institutions and / or persons responsible for collecting, managing, distributing, or otherwise contributing to the development of the data object.” A contributor record is composite and is essentially the same as that for creators, except that each needs to be prefixed with an indicator of

  • contributor type (categorised, as selected from a predetermined list of code-text pairs).

The types available include those defined within DataCite, but the list has been extended in the context of clinical research, to include (for example) trial sponsor, trial funder, device provider, central laboratory, public contact, study lead, (site) principal investigator.

Object Dates

D.1 Publication year (1)
The year in which the object is made available, i.e. in which it first becomes citable, expressed as 4 digits. This is not necessarily the year in which the object becomes public: ‘available’ simply means that it can be accessed, while the conditions of that access remain in the control of the object’s owners or controllers. Nor is it necessarily the year in which the object was created (which may be present as one of the object’s dates).

D.2 Dates (0...n)
None, one or more dates or date ranges that are relevant to the data object. Each date record is composite and includes both string and integer representations of the date. Year, month and day data are held separately to make it easier to apply date filters when finding data objects. The elements of the composite record are:

  • date type (categorised, as selected from a predetermined list),
  • is range, whether or not it is a single date or a range,
  • date as string, in a standard format yyyy MMM dd, e.g. “2018 Dec 12”, “2012 Mar 7”
  • start year, an integer
  • start month, an integer – may not be present for partial dates
  • start day, an integer – may not be present for partial dates
  • end year, for date ranges only
  • end month, for date ranges only, may not be present if date range partial
  • end day, for date ranges only, may not be present if date range partial
  • comments – any relevant / explanatory comments
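
The relationship between the integer parts and the string form of a date can be sketched as follows. This is a minimal illustration, not the MDR implementation: the class and field names are hypothetical, 'is range' is derived here from the presence of an end year, and the " to " separator used for ranges is an assumption, since the text above only specifies the format of a single date.

```python
from dataclasses import dataclass
from typing import Optional

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def format_partial_date(year: int, month: Optional[int] = None,
                        day: Optional[int] = None) -> str:
    """Build 'yyyy', 'yyyy MMM' or 'yyyy MMM d' from whichever parts are present."""
    parts = [str(year)]
    if month is not None:
        parts.append(MONTHS[month - 1])
        if day is not None:          # a day is only meaningful alongside a month
            parts.append(str(day))
    return " ".join(parts)


@dataclass
class ObjectDate:
    date_type: str
    start_year: int
    start_month: Optional[int] = None
    start_day: Optional[int] = None
    end_year: Optional[int] = None
    end_month: Optional[int] = None
    end_day: Optional[int] = None
    comments: Optional[str] = None

    @property
    def is_range(self) -> bool:
        # Assumption: a record is a range exactly when an end year is given
        return self.end_year is not None

    @property
    def date_as_string(self) -> str:
        start = format_partial_date(self.start_year, self.start_month, self.start_day)
        if not self.is_range:
            return start
        end = format_partial_date(self.end_year, self.end_month, self.end_day)
        return f"{start} to {end}"   # range separator is an assumption
```

Because the year, month and day are plain integers, a filter such as "all objects with a date in or after 2016" reduces to a comparison on `start_year`, without any string parsing.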


Data Object Attributes

Section E is mainly based on the DataCite metadata specification, though a few extensions (E.3 – E.5) have been added for datasets (as opposed to document based data objects).

E.1 Class (1)
A categorised value, taken from the existing DataCite controlled list for ‘Resource Type General’. For clinical research data objects, the class will usually be one of:

  • Text
  • Dataset

though other options include: Data Paper, Software, Service, Audiovisual, and Interactive Resource.

E.2 Type (1)
A categorised description of the type of data object, at a more specific level than Class. The type and class should form a pair (as with DataCite), e.g. Dataset/census data or Text/conference abstract.
Unlike DataCite, both class and type are mandatory in the ECRIN schema. The types available include the CASRAI classifications of document objects, recommended by DataCite, together with additions to the list that represent object types of particular importance to clinical research (e.g. protocols, clinical study reports, statistical analysis plans, and datasets of various kinds).

E.3 Record key type (1, Datasets only)
This is a composite item that indicates the type of record keys used within the dataset and, in particular, whether the dataset is pseudonymised or anonymised. The contents are

  • Record key type (categorised, as selected from a predetermined list)
  • Details – text description to elaborate / clarify details

Note that the categorisation into 'pseudonymised', 'anonymised', 'identifiable' etc. is based upon the description provided by the data controller / manager in the data (if one is supplied). The classification is therefore based on the data controller's understanding of the relevant terms. No attempt is made to apply a categorisation using standard criteria, as the meaning of the words used ('pseudonymised', 'anonymised', etc.) may vary between different legal jurisdictions, over time, and in different usage contexts. The categorisation should therefore be read as only a very approximate guide to any legal requirements associated with the data.

E.4 De-identification level (1, Datasets only)
An item that indicates the amount of de-identification that has been applied to the dataset. The item consists of:

  • De-identification level (categorised, as selected from a predetermined list)
  • Additional actions carried out - boolean data indicating if any of the following applies: a) direct identifiers have been removed, b) US HIPAA rules for de-identification have been applied, c) dates have been rebased or replaced with integers, d) narrative text fields have been removed, and e) k-anonymisation has been carried out.
  • Details – text description to elaborate / clarify details.


E.5 Associated consent (1, Datasets only)
The consent in question is for secondary use of the data - consent for primary use is assumed.
The data item consists of:

  • a coded field that indicates the range of application of consent (if any) available for re-use and sharing associated with the data, selected from a list.
  • Possible additional restrictions, represented as a series of boolean data points: a) if use is limited to non-commercial research, b) if there are any geographical restrictions on re-use, c) if only certain types of research are permitted, d) if only genetic research is allowed, and e) whether or not methodological or tool research (e.g. developing machine learning algorithms) is allowed,
  • Details – text description to elaborate / clarify details, in particular to expand upon any of the additional restrictions listed as being present.


E.6 Description (0..n)
None, one or more pieces of additional general information about the data object, so far as that is publicly available (journal abstracts, although an obvious ‘descriptor’, remain the property of the publisher and cannot in general be reproduced within the system). The item is composite, consisting of:

  • description type (categorised, as selected from a predetermined list)
  • label, a heading that might be applied to the text (e.g. as a sub-heading).
  • description text, the description itself
  • language code, the 2 character ISO code


E.7 EOSC Category (0..1)
An integer (0, 1, 2 or 3) that conforms to an EOSC categorisation recommended for data objects. The classification is

  • 0 = Non-personal data. Contains no information that refers to any identified or identifiable living individual.
  • 1 = Anonymised data.
  • 2 = Pseudonymised data.
  • 3 = Sensitive pseudonymised data.

In general, almost all documents expected in the MDR will be categorised as 0, whilst all IPD datasets will be categorised as 3 - unless there is general agreement that they are fully anonymised, in which case they become 1.
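
The default assignment described above can be sketched as a small lookup. The enum member names, the function name and the `object_kind` labels ("document", "ipd_dataset") are hypothetical and introduced only for illustration; objects that fit neither case are left uncategorised, consistent with the (0..1) cardinality.

```python
from enum import IntEnum
from typing import Optional


class EoscCategory(IntEnum):
    NON_PERSONAL = 0
    ANONYMISED = 1
    PSEUDONYMISED = 2
    SENSITIVE_PSEUDONYMISED = 3


def suggested_eosc_category(object_kind: str,
                            agreed_fully_anonymised: bool = False
                            ) -> Optional[EoscCategory]:
    """Suggest a default category; None means no default applies."""
    if object_kind == "document":
        return EoscCategory.NON_PERSONAL
    if object_kind == "ipd_dataset":
        # IPD datasets default to 3 unless agreed to be fully anonymised
        if agreed_fully_anonymised:
            return EoscCategory.ANONYMISED
        return EoscCategory.SENSITIVE_PSEUDONYMISED
    return None
```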

E.8 Language (1..n)
The language or languages of the data object itself (not of a description of the object), using the ISO language codes (e.g. en, de, fr). DataCite assumes a single language but some clinical research data objects (e.g. journal articles) are created in two or more languages. The record may therefore be multiple.

E.9 Inter-object relationships (0..n)
Data objects can be related to each other – for example one object can be a supplement to another, or a new version of another, or be derived from, or the source of, one or more other data objects. A particularly important relationship for clinical study data is the pairing of ‘Has Metadata’ and ‘Is Metadata for’. Metadata in clinical research can include, for example, a data dictionary that provides the metadata for a dataset. Note that the metadata in this context is itself a file, and a data object in its own right. Each record is composite and must include:

  • the relationship type (categorised, as selected from a predetermined code-text list)
  • the identifier of the other or ‘target’ data object (in a suitable system).

Because few data objects have DOIs, it is usually a requirement that both subject and target objects are stored within the same system. This allows the identifier to be an internal identifier within that system, making navigation to it much simpler.
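
Since the relationships come in reciprocal pairs, a record stored on one object implies a matching record on the other. The sketch below shows one way that pairing might be expressed, using DataCite-style relation names for the examples mentioned above. The function name and the internal identifiers are hypothetical, and the actual code-text list used in the MDR may differ.

```python
# Reciprocal pairs for the relationships mentioned in the text
INVERSE_RELATIONSHIP = {
    "HasMetadata": "IsMetadataFor",
    "IsSupplementTo": "IsSupplementedBy",
    "IsNewVersionOf": "IsPreviousVersionOf",
    "IsDerivedFrom": "IsSourceOf",
}
# Make the lookup work in both directions
INVERSE_RELATIONSHIP.update({v: k for k, v in list(INVERSE_RELATIONSHIP.items())})


def reciprocal(subject_id: str, relationship_type: str, target_id: str):
    """Return the (subject, relationship, target) record implied on the target object."""
    return (target_id, INVERSE_RELATIONSHIP[relationship_type], subject_id)


# A dataset that has a data dictionary as its metadata
record = ("dataset-001", "HasMetadata", "dictionary-002")
inverse = reciprocal(*record)
```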

E.10 Topic (0...n)
None, one or more topic names or phrases, keywords, or classification codes describing the object or aspects of it. In the context of clinical research, most data objects will not have listed topics associated with them (the exception is journal articles, which almost always do have linked keywords). Topics for data objects other than journal articles should be those of the parent study or studies. This introduces a substantial amount of redundancy but it means that topics can be searched and used for filtering across all (rather than just some) data objects.
The structure of each topic item is exactly the same as for study topics. The record is composite and has the following structure:


  • topic type (categorised, as selected from a predetermined list of code-text pairs). Topic types include ‘condition’, ‘organism’, ‘chemical / biological agent’, and ‘geographic’,
  • a boolean indicating whether or not the term has been MESH coded, either in the original data or later as part of the MDR extraction process,
  • the MESH code and MESH term, if MESH coding was done, either in the original data or later as part of the MDR extraction process,
  • a code representing the controlled terminology (CT) originally used (if there was one) and the code within that CT scheme. As explained above this is most often MESH (CT code 14) together with the MESH code, but in some cases it may be MedDRA, ICD10 or some other scheme, together with the corresponding code in that scheme,
  • the original value, i.e. the topic as originally expressed, in whatever, or no, coding scheme.

Thus:
A topic that was MESH coded in the original source material will be structured as
{topic-type, TRUE, MESH code, MESH term, MESH-CT Code, MESH code, original value (= MESH term)}
A topic that was not MESH coded in the original source material, but was able to be coded as part of the extraction process, will be structured as
{topic-type, TRUE, MESH code, MESH term, (null), (null), original value}, OR, if originally coded using a CT 'X', as {topic-type, TRUE, MESH code, MESH term, X-CT code, code in X, original value},
A topic that was not MESH coded in the original source material, and was not able to be coded as part of the extraction process, will be structured as
{topic-type, FALSE, (null), (null), (null), (null), original value}, OR, if originally coded using a CT 'X', as {topic-type, FALSE, (null), (null), X-CT code, code in X, original value}.
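
The three cases above can be reproduced with a short helper that fills the seven positions of the topic record. This is a sketch under stated assumptions: the `Topic` type, the `make_topic` function and its parameter names are hypothetical, and the MESH code and topic text are used purely as illustrative values. Only the CT code 14 for MESH is taken from the description above.

```python
from typing import NamedTuple, Optional, Tuple

MESH_CT_CODE = 14  # code for the MESH terminology, as given in the text


class Topic(NamedTuple):
    topic_type: str
    mesh_coded: bool
    mesh_code: Optional[str]
    mesh_term: Optional[str]
    original_ct_code: Optional[int]
    original_ct_value: Optional[str]
    original_value: str


def make_topic(topic_type: str, original_value: str,
               mesh: Optional[Tuple[str, str]] = None,
               original_ct: Optional[Tuple[int, str]] = None) -> Topic:
    """mesh is (MESH code, MESH term); original_ct is (CT code, code in that CT)."""
    mesh_code, mesh_term = mesh if mesh else (None, None)
    ct_code, ct_value = original_ct if original_ct else (None, None)
    return Topic(topic_type, mesh is not None, mesh_code, mesh_term,
                 ct_code, ct_value, original_value)


# Case 1: MESH coded in the source material
coded_at_source = make_topic("condition", "Asthma",
                             mesh=("D001249", "Asthma"),
                             original_ct=(MESH_CT_CODE, "D001249"))

# Case 2: free text in the source, MESH coded during extraction
coded_later = make_topic("condition", "asthma (bronchial)",
                         mesh=("D001249", "Asthma"))

# Case 3: free text in the source, could not be coded
uncoded = make_topic("condition", "wheezy chest")
```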

Location and Access details

An area where the existing DataCite schema needs to be extended is in providing a full description of the access arrangements for any data object. The following data points are proposed.

F.1 Managing Organisation (1)
In this schema, this is the organisation that manages access to the document or data object, including making the overall decision about access type (see F.2). For data this would usually be the name of the organisation that was the data controller. For journal papers it would be the name of the company that publishes the journal, and which would normally run the primary web site on which it can be accessed. In both cases the name would be associated with an id or ids in a suitable system (e.g. a ROR Id, ECRIN MDR Id).

F.2 Access Type (1)
A categorised value (code-text pair) that represents in broad terms the type of access under which the object is available, for example by publicly available download, or restricted download (restricted to members of a specific group) or on screen access after review on a case by case basis.

F.3 Access Details (Mandatory for any of the non-public access types)
This is a composite element with three components:

  • A textual summary of the access being offered, for example identifying the groups to which access is granted, the criteria on which a case-by-case decision would be based, any further restrictions on on-screen access, etc. It may reference web based resources, on the object manager’s web site or elsewhere (see below).
  • A link to a resource that explains how access may be gained, e.g. how a group can be joined, and / or how application can be made for access on an individual basis. This would normally be a link to a web page on the managing organisation’s site, that would explain access procedures or provide an application proforma.
  • A date, if one is available, representing the last time the URL was checked and found to exist (i.e. returned a 200 ‘success’ code rather than a 404).


F.4 Resources (Mandatory unless case-by-case access)
The web based resources that represent this data object. Mandatory for public objects, for which at least one resource should be listed. For data objects simply listed as existing, but under managed access, this information may not be available for harvesting. Each record is composite and includes

  • the name of the organisation holding the resource (e.g. a data repository, bibliographic system, trial registry)
  • the resource type (categorised, for downloadable resources normally based on the file extension)
  • the resource URL
  • whether or not the resource is directly accessible (i.e. is public and not behind a pay wall) - so far as is known
  • the date the URL was last checked as valid

and, if downloadable,

  • the resource size,
  • the resource size units, usually in KB, MB or GB.

In addition...

  • resource comments, provides a free text field to hold further details of the resource, in particular to support machine processing. These could include the schema used for XML files, and / or the character coding used for text files (e.g. UTF-8 versus UTF-16) or the presence and types of any byte order marks.
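
The conditional requirements stated in the headings of F.3 and F.4 can be expressed as a simple check. The sketch below is illustrative only: the access type labels are hypothetical stand-ins for the categorised values in F.2, and the function name and parameters are assumptions, not part of the schema.

```python
from typing import List

# Hypothetical labels for three of the access types described in F.2
PUBLIC = "public download"
RESTRICTED = "restricted download"
CASE_BY_CASE = "case by case"


def missing_access_items(access_type: str, has_access_details: bool,
                         resource_count: int) -> List[str]:
    """List the conditionally mandatory items that are absent from a record."""
    missing = []
    # F.3 is mandatory for any non-public access type
    if access_type != PUBLIC and not has_access_details:
        missing.append("F.3 Access Details")
    # F.4 is mandatory unless access is decided case by case
    if access_type != CASE_BY_CASE and resource_count < 1:
        missing.append("F.4 Resources")
    return missing
```

For example, a restricted-download object with neither access details nor any listed resource would be reported as missing both items, whereas a case-by-case object needs only the access details.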


F.5 Rights (0..n)
Any intellectual property rights information for the data object, as a textual statement of the rights management associated with the resource. The item is composite, and should include:

  • the name of the rights being applied,
  • a URI that identifies an information source, usually a URL to a web page,
  • any additional comments or description of the rights regime.