Schema Description

From ECRIN-MDR Wiki

The following provides a more detailed description of each of the data points in the two schemas, including the components of composite data points. It does not, however, provide a full description of a practical implementation of the schemas, e.g. when storing schema data in a database or within JSON files. An implementation level description, which requires record ids and audit fields, is given by the two JSON file definitions, and a discussion of the implementation of the schema in the MDR database is provided in the Data Extraction section of the wiki.

Study Attributes

Strictly speaking these data points are not metadata because they do not describe data – instead they summarise some key attributes of the study, especially those that promote its discoverability.

A.1 Display Title (1)
This is, by default, the shorter or 'public' title. If there is no such title, the full scientific or protocol title is used instead. Whichever title is used, it should also appear within the list of study titles (see A.3), where a fuller set of title attributes can be provided.

A.2 Study Identifiers (0..n)
None, one or more unique identifiers that have been assigned to the study. For studies entered into trial registries these should include, as a minimum, the registry ID(s), but any IDs that have been externally applied, and that might be useful in identifying the study, can be included, for instance funders' and / or sponsors' IDs.
These IDs are composite. If provided, they must include:

  • the identifier value,
  • the identifier type (categorised, as selected from a predetermined list of code-text pairs),
  • the assigning organisation (name and, where available, an Id within a suitable system),
  • (optionally), the date the identifier was assigned
  • (optionally), any associated URL (for instance some public funder Ids in the US will link to a summary page about the grant and its use).
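
As an illustration only, the composite identifier described above could be modelled along the following lines. This is a minimal Python sketch, not the MDR's actual implementation: the class and field names, the type code (11) and its label, and the placeholder registry ID are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class CodeText:
    """A categorised value: a code plus its display text."""
    code: int
    text: str


@dataclass(frozen=True)
class Organisation:
    """An organisation name, with an Id in a suitable system where available."""
    name: str
    org_id: Optional[str] = None


@dataclass(frozen=True)
class StudyIdentifier:
    value: str                            # the identifier value (mandatory)
    id_type: CodeText                     # categorised identifier type (mandatory)
    assigned_by: Organisation             # the assigning organisation (mandatory)
    date_assigned: Optional[date] = None  # optional
    url: Optional[str] = None             # optional

    def __post_init__(self) -> None:
        # Reject blank identifier values; the other mandatory parts are
        # enforced simply by being required constructor arguments.
        if not self.value.strip():
            raise ValueError("identifier value must not be empty")


# Hypothetical example: the ID value, type code and label are placeholders.
registry_id = StudyIdentifier(
    value="NCT00000000",
    id_type=CodeText(11, "Trial registry ID"),
    assigned_by=Organisation("ClinicalTrials.gov"),
)
```

The two optional components default to None, so a record can be created from the three mandatory parts alone.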


A.3 Study Titles (0..n)
Studies usually have a short or ‘public’ title as well as a full scientific one (as used on the protocol document), and can also be described by an acronym. They may have titles in more than one language.
All titles should be included in this list. The type is composite, and should include:

  • the title text,
  • the title type (categorised, as selected from a predetermined list of code-text pairs),
  • the language of the title, as a 2 character ISO code,
  • (optionally), any additional comments about their genesis (e.g. "authors' translation").


A.4 Brief Description (0..1)
Most study registry systems require a brief, non-specialist description of the study, which usually ranges from a few lines to a paragraph or two. This can be useful in assessing the relevance of studies to a particular search task and so is included in the study data points.
There should also be an indication of whether the description contains embedded html, so that display systems can interpret any tags correctly, rather than display them as 'raw' text with visible angle brackets.
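
One way a display system might act on such a flag is sketched below. This is only an assumption about how a consumer could use the indicator: the function name and the `target` parameter are hypothetical, and a real system would normally sanitise any markup before passing it through to a web page.

```python
import html
import re

# Very simple tag matcher, adequate for a sketch but not a full HTML parser.
_TAG = re.compile(r"<[^>]+>")


def render_description(text: str, contains_html: bool, target: str = "html") -> str:
    """Prepare a description for display, using the 'contains html' flag.

    target="html": flagged text is passed through unchanged (a real system
                   should sanitise it first); unflagged text is escaped so
                   that any angle brackets are shown literally.
    target="text": flagged text has its tags stripped and entities decoded;
                   unflagged text is returned as it is.
    """
    if target == "html":
        return text if contains_html else html.escape(text)
    if target == "text":
        return html.unescape(_TAG.sub("", text)) if contains_html else text
    raise ValueError(f"unknown target: {target!r}")
```

The same helper could be applied to any other text field that carries an embedded-html indicator.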

A.5 Data Sharing Statement (0..1)
In recent years several trial registries have requested study sponsors and / or leads to indicate if they will make individual participant data and related documents available for sharing, and if so how and when the data would be available. As such a statement is central to the purpose of the MDR it is captured within the study data, so that if present it can be displayed.
Again this data point should also include an indication of whether the data sharing statement contains embedded html, so that the tags can be interpreted correctly.

A.6 Study Features (0..n)
None, one or more design features of the study.
The design features available will depend on whether the study is interventional or observational. Available types for interventional studies include Phase, Primary Purpose, Allocation method, Intervention Design and Masking. For observational studies the types include Observational Model, Time Perspective, and whether or not specimens are retained.
In each case the possible values are categorised, and so restricted to a pre-defined set of values. This makes the feature types useful candidates for filtering of study records within a web portal and / or API.
The composite study feature record is therefore:

  • feature type (categorised code-text, as selected from a predetermined list),
  • feature value, also categorised code-text. Each feature type has an associated list of options.
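
The dependency of feature types on study type, and of feature values on feature type, can be sketched as a simple validation step. The feature type names below come from the text above, but the option lists are illustrative placeholders only (the real lists are defined within the MDR), and the function name is hypothetical.

```python
from typing import List

# Feature types available for each study type, as named in the text above.
FEATURE_TYPES_BY_STUDY_TYPE = {
    "interventional": {
        "Phase", "Primary Purpose", "Allocation method",
        "Intervention Design", "Masking",
    },
    "observational": {
        "Observational Model", "Time Perspective", "Specimens retained",
    },
}

# Illustrative option lists only; these are assumptions, not the MDR's lists.
FEATURE_VALUES = {
    "Phase": {"Phase 1", "Phase 2", "Phase 3", "Phase 4"},
    "Time Perspective": {"Prospective", "Retrospective"},
}


def validate_feature(study_type: str, feature_type: str, feature_value: str) -> List[str]:
    """Return a list of problems found; an empty list means the pair is acceptable."""
    problems = []
    allowed_types = FEATURE_TYPES_BY_STUDY_TYPE.get(study_type)
    if allowed_types is None:
        problems.append(f"no feature types are defined for study type '{study_type}'")
    elif feature_type not in allowed_types:
        problems.append(f"'{feature_type}' is not a feature of {study_type} studies")
    allowed_values = FEATURE_VALUES.get(feature_type)
    if allowed_values is not None and feature_value not in allowed_values:
        problems.append(f"'{feature_value}' is not an option for '{feature_type}'")
    return problems
```

Because both parts of the pair are drawn from fixed lists, the same tables could also drive filter controls in a portal or API.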


A.7 Study Topics (0..n)
None, one or more topic names or phrases, keywords, or classification codes describing the study or aspects of it. Topics is preferred to ‘Subjects’ because within clinical research ‘Study subjects’ is normally understood as referring to the study participants.
In the context of clinical research, most data objects – which will not have listed topics associated with them – would, for purposes of discoverability, take on the topics or keywords associated with their parent study. (The exception is journal articles, which almost always do have linked keywords).
The topics can be free text, but in many cases the text is structured, i.e. selected from a controlled vocabulary. The vocabulary used most often, by a large margin, is the MESH code system developed by the US National Library of Medicine, because MESH codes are applied to both PubMed records and ClinicalTrials.gov trial registry entries. MedDRA and ICD10 are also used by some sources, but in relatively tiny amounts. To provide a more consistent coding scheme for topics, non-coded terms are also matched, wherever possible, to MESH terms. Further work is required, however, to reduce the proportion of non-coded items.

The study topic record is composite and has the following structure:

  • topic type (categorised, as selected from a predetermined list of code-text pairs). Topic types include ‘condition’, ‘organism’, ‘chemical / biological agent’, and ‘geographic’.
  • A boolean indicating whether or not the term has been MESH coded,
  • the MESH code, if present
  • the topic name or value - either the original or if MESH coded the preferred MESH term
  • a MESH qualifier code and qualifier value where one exists (applies only to PubMed articles at the moment),
  • the original value.
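
The relationship between the original value, the MESH flag and the displayed topic value can be sketched as follows. This is an assumption-laden illustration rather than the MDR's matching process: the class, the function and the lookup table are hypothetical, matching is reduced to a case-insensitive dictionary lookup, and any code placed in the lookup table is supplied by the caller.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class StudyTopic:
    topic_type: str                  # e.g. 'condition', 'organism'
    mesh_coded: bool                 # has the term been MESH coded?
    mesh_code: Optional[str]         # the MESH code, if present
    topic_value: str                 # preferred MESH term if coded, else the original
    original_value: str              # always the value as supplied by the source
    mesh_qualifier_code: Optional[str] = None
    mesh_qualifier_value: Optional[str] = None


def make_topic(
    topic_type: str,
    original_value: str,
    mesh_lookup: Dict[str, Tuple[str, str]],
) -> StudyTopic:
    """Build a topic record, replacing the value with the preferred MESH term
    when the (lower-cased) original is found in the lookup table.

    mesh_lookup maps a lower-cased term to (MESH code, preferred MESH term).
    """
    match = mesh_lookup.get(original_value.strip().lower())
    if match is None:
        return StudyTopic(topic_type, False, None, original_value, original_value)
    code, preferred_term = match
    return StudyTopic(topic_type, True, code, preferred_term, original_value)
```

Keeping the original value alongside the preferred term means that a coded record can always be traced back to the source text.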


A.8 Study Type (1)
This is a single term representing – in very broad terms – the type of clinical research study, e.g. ‘interventional’ (= clinical trial), ‘observational’, ‘expanded access’. It is categorised and must be selected from a predefined list. It is included as an aid to filtering records.

A.9 Study Status (1)
This is a single term representing the current status of the study in terms of its life-cycle, e.g. ‘not yet recruiting’, ‘recruiting’, ‘completed’, ‘terminated (early)’. It is categorised and must be selected from a predefined list. It is included as an aid to filtering records.

A.10 Study Enrolment Number (0..1)
This is an integer representing the anticipated or actual number of study participants.

A.11 Study Gender Eligibility (0..1)
This is a code / text pair that indicates whether the study is only open to male or female participants, or both.

A.12 Study Minimum and Maximum ages (0..1)
These are integers representing the minimum and maximum age criteria for study participants, where they exist. In each case they are associated with a term indicating the time units associated with the integer. This is usually 'Years', but, for example for paediatric studies, may be months or weeks, or even days or hours.
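
Because the age limits may use different time units, a system comparing them needs a common scale. The sketch below converts each limit to days; the conversion factors are approximations, the unit labels other than 'Years' are assumed spellings, and the function names are hypothetical. Both limits are treated as inclusive, which is a simplification.

```python
from typing import Optional, Tuple

# Approximate number of days per unit; only 'Years' is a label taken from
# the text above, the other spellings are assumptions.
_DAYS_PER_UNIT = {
    "Years": 365.25,
    "Months": 365.25 / 12,
    "Weeks": 7.0,
    "Days": 1.0,
    "Hours": 1.0 / 24,
}

AgeLimit = Tuple[int, str]  # (value, unit), e.g. (18, "Years")


def age_in_days(value: int, unit: str) -> float:
    """Convert an age limit to an approximate number of days."""
    try:
        return value * _DAYS_PER_UNIT[unit]
    except KeyError:
        raise ValueError(f"unknown age unit: {unit!r}") from None


def is_age_eligible(
    age_days: float,
    min_age: Optional[AgeLimit] = None,
    max_age: Optional[AgeLimit] = None,
) -> bool:
    """True if the age lies within the limits; a missing limit means no bound."""
    if min_age is not None and age_days < age_in_days(*min_age):
        return False
    if max_age is not None and age_days > age_in_days(*max_age):
        return False
    return True
```

For example, a study open to participants from 28 days to 2 years would be checked with `is_age_eligible(age, (28, "Days"), (2, "Years"))`.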

A.13 Inter-study relationships (0..n)
Studies can be related to each other. For instance, one study can be a feasibility study for a later one; a study can represent an ‘expanded access’ version of a clinical trial (when a new drug is made available for compassionate reasons, even though recipients fail the eligibility criteria for the study, and its use is reported on a case by case basis); or one study can represent a continuation of another, in an ongoing series. This data can be useful for tracking related studies and their data objects and so is included in the metadata scheme. It is composite, with:

  • the relationship type (categorised, as selected from a predetermined code-text list)
  • the identifier of the other or ‘target’ study (within a suitable system, normally the same system in which the ‘subject’ study is found).


A.14 Linked Data Objects (1..n)
The linked data objects (there should be at least one, representing the entry in a trial registry system) are listed as object identifiers, usually accession Ids within an appropriate database system (e.g. the ECRIN MDR).

A.15 Provenance String (1)
A string indicating the source or sources of the data (usually a trial registry) and the date-times on which the data was last downloaded from the source or sources.

Data Object Identifiers

B.1 Data object identifier (0..1)
In line with the DataCite specification the principal identifier for data objects is seen as a Digital Object Identifier or DOI, providing a persistent identifier that can be cited in other contexts. This applies to any objects that are available to others (whether publicly or under managed access).
Unfortunately the great majority of clinical research data objects, apart from journal articles, do not currently have a DOI. If this situation does not improve, consideration may need to be given to a mechanism for minting and applying DOIs (if financially feasible and acceptable to the object creators), or alternative identifiers should be explored, especially if a resolvable URL exists which could be used to link immediately to the resource.

B.2 Display Title (1)
A title for the object. For a journal article it would be a citation of the article in a standard format (up to 3 authors, title, source journal information). For many other data objects the display title would need to be constructed from the study name followed by the object title or type, because in general such objects do not have unique names. In many situations the study name prefix could be dropped as it would be clear from the context (e.g. the study name would be a heading to the list of data objects). The study name and object type or name should therefore be separated by a clear indicator (‘ :: ’ is used within the MDR) so that if and when necessary the two parts of a composite title can be displayed separately.
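
A composite display title of this kind can be built and taken apart with a couple of small helpers. The separator is the one stated above; the function names are hypothetical, and the sketch assumes the separator does not also occur inside the study name.

```python
from typing import Optional, Tuple

SEPARATOR = " :: "  # the indicator used within the MDR, as described above


def make_display_title(study_name: str, object_name: str) -> str:
    """Join the study name and the object name or type into one display title."""
    return f"{study_name}{SEPARATOR}{object_name}"


def split_display_title(display_title: str) -> Tuple[Optional[str], str]:
    """Split a display title into (study name, object part).

    Titles without the separator (e.g. a journal article citation) are
    returned whole, with None as the study name.
    """
    study_name, separator, object_part = display_title.partition(SEPARATOR)
    if not separator:
        return None, display_title
    return study_name, object_part
```

A listing that already shows the study name as a heading could then display only the second element of the returned pair.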

B.3 Version (0..1)
The version of the data object, in whatever notation was used by the original data object creators. Many versions of a particular dataset or document may have been created in the course of a clinical study but the normal expectation would be that the final version of a data object (e.g. a protocol) would be the one that was shared with others.
Although it is relatively rare for more than one version of a data object to be made available, if that is the case they should be clearly differentiated using version codes (and relevant dates, see D.2, and possibly descriptions, see E.6). E.9 describes how the relationship to previous or next versions can be made explicit. If a version item exists it should be displayed with the name and other identifiers.

B.4 Object Identifiers (0..n)
This refers to other unique identifiers that have been assigned to the data object in addition to its DOI primary identifier (for instance, for journal articles, a PubMed id). As with studies, such IDs would be composite and include:

  • the identifier value,
  • the identifier type (categorised, as selected from a predetermined list of code-text pairs),
  • the assigning organisation (name and, where available, an Id within a suitable system),
  • (optionally), the date the identifier was assigned


B.5 Object Titles (0..n)
The complete data for the title(s) for the data object. In most cases there will only be one (the constructed display title), but journal papers may have titles in different languages, and in any case their titles will differ from the display title (which is a full citation). The title description is composite, and should include:

  • the title text,
  • the title type (categorised, as selected from a predetermined list),
  • the language of the title, as a 2 character ISO code,
  • (optionally), any additional comments about their genesis (e.g. "authors' translation")


B.6 Linked Studies (1..n)
The linked studies (there should be at least one, or the data object should not be included in the system) are listed as study identifiers, usually accession Ids within an appropriate database system.

B.7 Provenance String (1)
A string indicating the source or sources of the data and the date-times on which the data was last downloaded from the source or sources.

Creators and Contributors

C.1 Creators (1..n)
The main personnel involved in producing the data, or the authors of a publication. It may be a set of institutional and / or personal names. Each creator description, which is composite, therefore needs to indicate whether it refers to an individual or to an organisation (or collaboration). If it is a person then fields are available for names, ORCID and affiliation details. If not, the organisation name needs to be provided, with an Id if one is available. The composite structure (which follows closely that in DataCite) is therefore:

  • whether an individual or not,

  and, if an individual:

  • given name,
  • family name,
  • full name,
  • ORCID id if available
  • affiliation, (string description of department / organisation)

  but, if not an individual:

  • The organisation (name and where available an Id within a suitable system).

Most data objects where the metadata is harvested retrospectively, other than journal articles, are unlikely to have creators explicitly identified.

C.2 Contributors (0..n)
From DataCite, contributors are “other institutions and / or persons responsible for collecting, managing, distributing, or otherwise contributing to the development of the data object.” A contributor record is composite and is essentially the same as that for creators, except that each needs to be prefixed with an indicator of

  • contributor type (categorised, as selected from a predetermined list of code-text pairs).

The types available include those defined within DataCite, but the list has been extended in the context of clinical research, to include (for example) trial sponsor, trial funder, device provider, central laboratory, public contact, study lead, and (site) principal investigator.
In general the contributor lists for data objects should be derived from the lists stored for their parent study, even though in the system contributors are only presented for data objects, not studies. These will usually include the study lead(s) and sponsors. Where data objects do record their contributors, those would normally take precedence over the contributors of the study, but organisational contributors, in particular study sponsors and funders, should still be added to the data object list.
Any system retrieving creator / contributor data therefore needs to collect that data for the studies as well as for the data objects themselves. Creator data is generally collected and stored in exactly the same way as contributor data – the contribution type is simply set as ‘creator’.
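
The precedence rule described above can be sketched as a small merge function. This is one possible reading of the rule, not the MDR's own logic: the class and function names are hypothetical, and the sketch inherits every organisational contributor from the study, not only sponsors and funders.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Contributor:
    contrib_type: str    # e.g. 'creator', 'trial sponsor', 'study lead'
    name: str
    is_individual: bool  # False for organisations and collaborations


def effective_contributors(
    object_contributors: Sequence[Contributor],
    study_contributors: Sequence[Contributor],
) -> List[Contributor]:
    """Combine the contributors recorded for a data object and its parent study.

    - If the object records no contributors, the study's list is used.
    - Otherwise the object's own list takes precedence, and only the study's
      organisational contributors are appended (skipping exact duplicates).
    """
    if not object_contributors:
        return list(study_contributors)
    merged = list(object_contributors)
    for contributor in study_contributors:
        if not contributor.is_individual and contributor not in merged:
            merged.append(contributor)
    return merged
```

A data object linked to several studies would need the same merge applied for each parent study in turn.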

Object Dates

D.1 Publication year (1)
The year in which the object is made available, i.e. in which it first becomes citable, expressed as 4 digits. This is not the same as the year in which an object becomes public: ‘available’ simply means that it can be accessed, while the conditions of that access remain in the control of the object’s owners or controllers. Nor is it necessarily the year in which the object was created (which may be present as one of the object’s dates).

D.2 Dates (0..n)
None, one or more dates or date ranges that are relevant to the data object. It is composite and includes both string and integer representations of the date. Year, month and day data is held separately to make it easier to apply date filters when finding data objects. The elements of the composite record are:

  • date type (categorised, as selected from a predetermined list),
  • is range, whether or not it is a single date or a range,
  • date as string, in a standard format yyyy MMM dd, e.g. “2018 Dec 12”, “2012 Mar 7”
  • start year, an integer
  • start month, an integer – may not be present for partial dates
  • start day, an integer – may not be present for partial dates
  • end year, for date ranges only
  • end month, for date ranges only, may not be present if date range partial
  • end day, for date ranges only, may not be present if date range partial
  • comments – any relevant / explanatory comments
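
The link between the integer components and the string representation can be sketched as below. The output follows the two examples given above (so days are not zero-padded); the function names are hypothetical, and the handling of partial dates (year only, or year and month) is an assumption about how such values might be rendered.

```python
from typing import Optional

MONTH_ABBR = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def format_partial_date(year: int, month: Optional[int] = None,
                        day: Optional[int] = None) -> str:
    """Build the string form of a full or partial date from its components."""
    if month is None:
        return str(year)
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    text = f"{year} {MONTH_ABBR[month - 1]}"
    return text if day is None else f"{text} {day}"


def starts_within(start_year: int, start_month: Optional[int],
                  from_year: int, to_year: int,
                  month: Optional[int] = None) -> bool:
    """Example filter on the integer fields: does the date start in the given
    range of years and, if a month is requested, in that month?"""
    if not from_year <= start_year <= to_year:
        return False
    return month is None or start_month == month
```

Holding the components as separate integers means a filter such as `starts_within` never has to parse the string form.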


Data Object Attributes

Section E is mainly based on the DataCite metadata specification, though a few extensions (E.3 to E.5) have been added for datasets (as opposed to document based data objects).

E.1 Class (1)
A categorised value, taken from the existing DataCite controlled list for ‘Resource Type General’. For clinical research data objects, the class will usually be one of:

  • Text
  • Dataset

though other options include: Data Paper, Software, Service, Audiovisual, and Interactive Resource.

E.2 Type (1)
A categorised description of the type of data object, at a more specific level than Class. The type and class should form a pair (as with DataCite), e.g. Dataset/census data or Text/conference abstract.
Unlike DataCite, both class and type are mandatory in the ECRIN schema. The types available include the CASRAI classifications of document objects, recommended by DataCite, together with additions to the list that represent object types of particular importance to clinical research (e.g. protocols, clinical study reports, statistical analysis plans, and datasets of various kinds).

E.3 Record key type (1, Datasets only)
This is a composite item that indicates the type of record keys used within the dataset, and in particular whether the dataset is pseudonymised or anonymised. The contents are:

  • Record key type (categorised, as selected from a predetermined list)
  • Details – text description to elaborate / clarify details

Note that the categorisation into 'pseudonymised', 'anonymised', 'identifiable' etc. is based upon the description provided by the data controller / manager in the data (if one is supplied). The classification is therefore based on the data controller's understanding of the relevant terms. No attempt is made to apply a categorisation using standard criteria, as the meaning of the words used ('pseudonymised', 'anonymised', etc.) may vary between different legal jurisdictions, over time, and in different usage contexts. The categorisation should therefore be read as only a very approximate guide to any legal requirements associated with the data.

E.4 De-identification level (1, Datasets only)
An item that indicates the amount of de-identification that has been applied to the dataset. The item consists of:

  • De-identification level (categorised, as selected from a predetermined list)
  • Additional actions carried out - boolean data indicating if any of the following applies: a) direct identifiers have been removed, b) US HIPAA rules for de-identification have been applied, c) dates have been rebased or replaced with integers, d) narrative text fields have been removed, and e) k-anonymisation has been carried out.
  • Details – text description to elaborate / clarify details.


E.5 Associated consent (1, Datasets only)
The consent in question is for secondary use of the data - consent for primary use is assumed.
The data item consists of:

  • a coded field that indicates the range of application of consent (if any) available for re-use and sharing associated with the data, selected from a list.
  • Possible additional restrictions, represented as a series of boolean data points: a) if use is limited to non-commercial research, b) if there are any geographical restrictions on re-use, c) if only certain types of research are permitted, d) if only genetic research is allowed, and e) whether or not methodological or tool research (e.g. developing machine learning algorithms) is allowed.
  • Details – text description to elaborate / clarify details, in particular to expand upon any of the additional restrictions listed as being present.


E.6 Description (0..n)
None, one or more pieces of additional general information about the data object, so far as that is publicly available (journal abstracts, although an obvious ‘descriptor’, remain the property of the publisher and cannot in general be reproduced within the system). The item is composite, consisting of:

  • description type (categorised, as selected from a predetermined list)
  • label, a heading that might be applied to the text (e.g. as a sub-heading).
  • description text, the description itself
  • language code, the 2 character ISO code
  • a boolean indicating whether or not the description contains html, useful to know for display purposes


E.7 EOSC Category (0..1)
An integer (0, 1, 2 or 3) that conforms to an EOSC categorisation recommended for data objects. The classification is

  • 0 = Non-personal data. Contains no information that refers to any identified or identifiable living individual.
  • 1 = Anonymised data.
  • 2 = Pseudonymised data.
  • 3 = Sensitive pseudonymised data.

In general, almost all documents expected in the MDR will be categorised as 0, whilst all IPD datasets will be categorised as 3 - unless there is general agreement that they are fully anonymised, in which case they become 1.
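
The four categories, and the default assignment just described, could be represented as follows. The enumeration values are those listed above; the member names and the helper function are hypothetical, and the helper only encodes the general rule of thumb, so individual objects may still need to be categorised by hand.

```python
from enum import IntEnum


class EoscCategory(IntEnum):
    NON_PERSONAL = 0             # no identified or identifiable living individual
    ANONYMISED = 1
    PSEUDONYMISED = 2
    SENSITIVE_PSEUDONYMISED = 3


def default_eosc_category(is_ipd_dataset: bool,
                          agreed_fully_anonymised: bool = False) -> EoscCategory:
    """Suggest a default category following the rule of thumb above.

    Documents (anything that is not an IPD dataset) default to 0; IPD datasets
    default to 3, or to 1 where there is general agreement that they are
    fully anonymised.
    """
    if not is_ipd_dataset:
        return EoscCategory.NON_PERSONAL
    if agreed_fully_anonymised:
        return EoscCategory.ANONYMISED
    return EoscCategory.SENSITIVE_PSEUDONYMISED
```

Because `IntEnum` members compare equal to plain integers, the stored value remains the integer 0 to 3 required by the schema.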

E.8 Language (1..n)
The language or languages of the data object itself (not of a description of the object), using the ISO language codes (e.g. en, de, fr). DataCite assumes a single language but some clinical research data objects (e.g. journal articles) are created in two or more languages. The record may therefore be multiple.

E.9 Inter-object relationships (0..n)
Data objects can be related to each other. For example, one object can be a supplement to another, or a new version of another, or be derived from, or the source of, one or more other data objects. A particularly important relationship for clinical study data is the pairing of ‘Has Metadata’ and ‘Is Metadata for’. Metadata in clinical research can include, for example, a data dictionary that provides the metadata for a dataset. Note that the metadata in this context is itself a file, and a data object in its own right. Each record is composite and must include:

  • the relationship type (categorised, as selected from a predetermined code-text list)
  • the identifier of the other or ‘target’ data object (in a suitable system).

Because few data objects have DOIs, it is usually a requirement that both subject and target objects are stored within the same system. This allows the identifier to be an internal identifier within that system, making navigation to it much simpler.
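
Paired relationship types such as ‘Has Metadata’ and ‘Is Metadata for’ lend themselves to a simple inverse lookup, sketched below. Only that first pair of labels is taken from the text above; the remaining labels are assumed wordings modelled on DataCite relation types, the names are hypothetical, and whether a system stores both directions or derives one from the other is left open.

```python
from typing import NamedTuple

# Relationship types listed as inverse pairs. Labels other than the first
# pair are assumptions for illustration.
_PAIRS = [
    ("Has Metadata", "Is Metadata for"),
    ("Is supplement to", "Is supplemented by"),
    ("Is new version of", "Is previous version of"),
    ("Is derived from", "Is source of"),
]

# Build a two-way lookup so that either member of a pair finds the other.
INVERSE = {}
for forward, backward in _PAIRS:
    INVERSE[forward] = backward
    INVERSE[backward] = forward


class ObjectRelationship(NamedTuple):
    subject_id: str         # internal identifier of the 'subject' object
    relationship_type: str
    target_id: str          # internal identifier of the 'target' object


def inverse_of(relationship: ObjectRelationship) -> ObjectRelationship:
    """Return the same relationship as seen from the target object."""
    return ObjectRelationship(
        subject_id=relationship.target_id,
        relationship_type=INVERSE[relationship.relationship_type],
        target_id=relationship.subject_id,
    )
```

With both objects held in the same system, the identifiers in such a record can simply be that system's internal Ids.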

E.10 Topic (0..n)
None, one or more topic names or phrases, keywords, or classification codes describing the object or aspects of it. In the context of clinical research, most data objects will not have listed topics associated with them (the exception is journal articles, which almost always do have linked keywords). Data Object topics for non journal articles should be those of the parent study or studies. This introduces a substantial amount of redundancy but it means that topics can be searched and used for filtering across all (rather than just some) data objects.
The structure of each topic item is exactly the same as for study topics:

  • topic type (categorised, as selected from a predetermined list of code-text pairs). Topic types include ‘condition’, ‘organism’, ‘chemical / biological agent’, and ‘geographic’.
  • A boolean indicating whether or not the term has been MESH coded,
  • the MESH code, if present
  • the topic name or value - either the original or if MESH coded the preferred MESH term
  • a MESH qualifier code and qualifier value where one exists (applies only to PubMed articles at the moment),
  • the original value.


Location and Access details

An area where the existing DataCite schema needs to be extended is in providing a full description of the access arrangements for any data object. The following data points are proposed.

F.1 Managing Organisation (1)
In this schema, this is the organisation that manages access to the document or data object, including making the overall decision about access type (see F.2). For data this would usually be the name of the organisation that was the data controller. For journal papers it would be the name of the company that publishes the journal, and which would normally run the primary web site on which it can be accessed. In both cases the name would be associated with an id in a suitable system.

F.2 Access Type (1)
A categorised value (code-text pair) that represents in broad terms the type of access under which the object is available, for example by publicly available download, or restricted download (restricted to members of a specific group) or on screen access after review on a case by case basis.

F.3 Access Details (Mandatory for any of the non-public access types)
This is a composite element with three components:

  • A textual summary of the access being offered, for example identifying the groups to which access is granted, the criteria on which a case-by-case decision would be based, any further restrictions on on-screen access, etc. It may reference web based resources, on the object manager’s web site or elsewhere (see below).
  • A link to a resource that explains how access may be gained, e.g. how a group can be joined, and / or how application can be made for access on an individual basis. This would normally be a link to a web page on the managing organisation’s site, that would explain access procedures or provide an application proforma.
  • A date, if one is available, representing the last time the URL was checked to be in existence (i.e. returned a 200 ‘success’ code rather than a 404).


F.4 Resources (Mandatory unless case-by-case access)
The web based resources that represent this data object. Mandatory for public objects, when at least one resource should be listed. For data objects simply listed as existing, but under managed access, this information may not be available for harvesting. Each record is composite and includes

  • the name of the organisation holding the resource (e.g. a data repository, bibliographic system, trial registry)
  • the resource type (categorised, for downloadable resources normally based on the file extension)
  • the resource URL
  • whether or not the resource is directly accessible (i.e. is public and not behind a pay wall) - so far as is known
  • the date the URL was last checked as valid

and, if downloadable,

  • the resource size,
  • the resource size units, usually in KB, MB or GB.

In addition...

  • resource comments, provides a free text field to hold further details of the resource, in particular to support machine processing. These could include the schema used for XML files, and / or the character coding used for text files (e.g. UTF-8 versus UTF-16) or the presence and types of any byte order marks.


F.5 Rights (0..n)
Any intellectual property rights information for the data object, as a textual statement of the rights management associated with the resource. The item is composite, and should include:

  • the name of the rights being applied
  • a URI that identifies an information source, usually a URL to a web page,
  • any additional comments or description of the rights regime.