JSON files v3 to v4 changes
The Study Metadata definition
For studies, the changes were:
a) The introduction of study features: Previously ‘study topics’ included not only MESH codes and keywords, but also study attributes such as enrolment numbers and minimum and maximum ages for study participants, as well as study design features such as the type of treatment allocation, masking or observation model used.
These study attributes and features have now been factored out as separate data points. The following are now separate additional attributes of the study entity:
- Enrolment number (proposed or actual – the distinction does not seem to be made),
- Gender eligibility (i.e. if the study is open to male, female or both). This is a coded value present as both the code and the text decode in the json file.
- Minimum age for study participants, plus age units. Age units are also coded and present as both the code and the text decode in the json file.
- Maximum age for study participants, plus age units, organised the same as minimum age.
Study design features are now available as a repeating composite called study_features. They were not simply added as further study attributes because a study may list more than one feature of a particular type, e.g. more than one observational model. The data therefore had to be brought out in a one-to-many relationship with the parent study. Although not all design features can repeat, it made sense to have a single, consistent structure for all of them. Each repeat includes:
- A coded value / text pair indicating the type of the feature (e.g. study phase, masking etc.)
- A coded value / text pair indicating the category of the feature within that type (e.g. Phase 1, Randomised, etc.)
These features are mainly used for filtering. Extracting them in this way should make the filters easier and quicker to operate, and should also reduce the size of the Study topics dataset, making it easier to search.
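As an illustration of how a client might filter on the repeating composite, the following Python sketch assumes each repeat is an object with `feature_type` and `feature_value` members, each holding an `id` / `name` pair. Only `study_features` is named in the text above; the other key names, the ids and the sample values are assumptions made for the example and are not taken from the v4 definition.

```python
import json

# Hypothetical study record. Only "study_features" is named in the text;
# every other key, id and value here is an assumption for illustration.
SAMPLE_STUDY = json.loads("""
{
  "id": 1001,
  "study_features": [
    {"feature_type": {"id": 20, "name": "phase"},
     "feature_value": {"id": 110, "name": "Phase 1"}},
    {"feature_type": {"id": 30, "name": "observational model"},
     "feature_value": {"id": 600, "name": "Cohort"}},
    {"feature_type": {"id": 30, "name": "observational model"},
     "feature_value": {"id": 605, "name": "Case-control"}}
  ]
}
""")


def feature_values(study, type_name):
    """Return the names of all feature values of one type for a study.

    A list is returned because a feature type may repeat
    (the one-to-many relationship with the parent study).
    """
    return [
        f["feature_value"]["name"]
        for f in study.get("study_features", [])
        if f["feature_type"]["name"] == type_name
    ]


def has_feature(study, type_name, value_name):
    """True if the study lists the given value for the given feature type."""
    return value_name in feature_values(study, type_name)


print(feature_values(SAMPLE_STUDY, "observational model"))
print(has_feature(SAMPLE_STUDY, "phase", "Phase 1"))
```

Because every feature shares the same type / value structure, a single pair of helpers covers phase, masking, allocation and observational model alike, whether or not the feature type can repeat.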
b) The introduction of study provenance data: This is an additional string field included to hold a listing of the source or sources (usually a trial registry) from which the data for the study has been drawn, and the date-time(s) when the data was last downloaded. The intention is to make this information available to MDR users (e.g. as a pop-up) so that they can easily see the provenance of the data. It also allows the MDR to fulfil requirements of many source systems, which allow their data to be used on the condition that they are acknowledged as the source.
c) Introduction of 'contains_html' tags for some text fields: Boolean tags have been added to the study brief description and data sharing text fields, to indicate whether or not they contain embedded HTML (some source systems allow this in the original material). The boolean tags allow client systems to detect and handle embedded HTML more easily; otherwise the markup tends to be displayed as 'raw' HTML. At the same time the 'contains_html' tag was removed from study titles, as any HTML tags in titles are now removed during data processing (subscript and superscript tags in titles are replaced by the appropriate Unicode characters).
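A client that cannot render HTML might use the boolean tag to decide whether a field needs its markup stripped before display. The Python sketch below is one possible approach using only the standard library; the function name and the way the flag is passed in are assumptions for illustration, not part of the MDR definition.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text content of an HTML fragment, dropping tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def to_plain_text(text, contains_html):
    """Return display-ready plain text for a description field.

    If the boolean tag is False the text is returned untouched, so that
    characters such as '<' in ordinary prose are not misread as markup.
    """
    if not contains_html:
        return text
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


# Hypothetical field values, for illustration only.
print(to_plain_text("<p>Safety &amp; efficacy of <b>drug X</b></p>", True))
print(to_plain_text("Participants with BMI < 30", False))
```

A client that can render HTML would instead pass flagged fields to its renderer (after sanitising them) and escape unflagged fields as ordinary text.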
The Data Object Metadata definition
For data objects, the main changes were:
a) Introduction of an EOSC category field: An integer (0, 1, 2 or 3) that conforms to an EOSC categorisation recommended for data objects. The classification is:
0 = Non-personal data. Contains no information that refers to any identified or identifiable living individual.
1 = Anonymised data.
2 = Pseudonymised data.
3 = Sensitive pseudonymised data.
In general, almost all documents expected in the MDR will be categorised as 0, whilst all IPD datasets will be categorised as 3, unless there is general agreement that they are fully anonymised, in which case they become 1.
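The four values and the general rule above can be sketched in Python as an enumeration plus a default-assignment helper. The class, member and function names are assumptions for illustration, and the helper simplifies "almost all documents" to "all non-dataset objects"; real assignments may need case-by-case review.

```python
from enum import IntEnum


class EoscCategory(IntEnum):
    """EOSC category values as listed above (names are illustrative)."""
    NON_PERSONAL = 0
    ANONYMISED = 1
    PSEUDONYMISED = 2
    SENSITIVE_PSEUDONYMISED = 3


def default_eosc_category(is_ipd_dataset, agreed_fully_anonymised=False):
    """Suggest a default category following the general rule in the text.

    Assumption: anything that is not an IPD dataset is treated as a
    document and given category 0.
    """
    if not is_ipd_dataset:
        return EoscCategory.NON_PERSONAL
    if agreed_fully_anonymised:
        return EoscCategory.ANONYMISED
    return EoscCategory.SENSITIVE_PSEUDONYMISED


print(default_eosc_category(is_ipd_dataset=True).name)
print(int(default_eosc_category(is_ipd_dataset=False)))
```

Using an `IntEnum` keeps the stored value an integer, as in the JSON file, while giving client code readable names.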
b) Revision of dataset specific fields: The fields associated with dataset data objects were modified:
- Record key type. This was simplified, with fewer categories available. The description of the categories (anonymised, pseudonymised, identifiable) was changed to indicate that they are as claimed by the data controllers / managers, based on their own understanding of the terms as applied in their local context, rather than assigned using standardised criteria. The categorisation should therefore be read as only a very approximate guide to any legal requirements associated with the data. The textual description field is retained for any further details.
- De-identification level. The category question was again simplified, but supplemented by 5 boolean data points as well as a free text description field. The boolean fields indicate whether specific additional actions have been taken: a) direct identifiers have been removed, b) US HIPAA rules for de-identification have been applied, c) dates have been rebased or replaced with integers, d) narrative text fields have been removed, and e) k-anonymisation has been carried out.
- Associated consent. This question was extensively revised using the DUO categorisation of consents in biomedical research. The category question is retained, to give a broad indication of the level of consent, and conforms to DUO categories. It is supplemented by 5 boolean data points that allow possible additional restrictions to be recorded: a) if use is limited to non-commercial research, b) if there are any geographical restrictions on re-use, c) if only certain types of research are permitted, d) if only genetic research is allowed, and e) whether or not methodological or tool research (e.g. developing machine learning algorithms) is allowed. The text description is used to elaborate / clarify details, in particular to expand upon any of the additional restrictions listed as being present.
Note that the names of all the dataset specific fields in both the database and the json definition have been changed.
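As a rough illustration of how a client might hold the five de-identification booleans, the Python sketch below groups them in a small data class. The revised field names are not given in this section, so every attribute name here is a placeholder assumption rather than the actual v4 name.

```python
from dataclasses import dataclass, fields


@dataclass
class DeidentificationFlags:
    """The five boolean data points described above (placeholder names)."""
    direct_ids_removed: bool = False   # a) direct identifiers removed
    hipaa_applied: bool = False        # b) US HIPAA de-identification rules applied
    dates_rebased: bool = False        # c) dates rebased or replaced with integers
    narrative_removed: bool = False    # d) narrative text fields removed
    k_anonymised: bool = False         # e) k-anonymisation carried out

    def applied(self):
        """Return the names of the actions flagged as carried out."""
        return [f.name for f in fields(self) if getattr(self, f.name)]


flags = DeidentificationFlags(direct_ids_removed=True, dates_rebased=True)
print(flags.applied())
```

The consent restrictions could be modelled in the same way, with one boolean per additional restriction alongside the DUO category and the free text description.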
c) Addition of resource comments field: This is a free text field to hold further details of the resource, in particular to support machine processing. These could include the schema used for XML files, and / or the character encoding used for text files (e.g. UTF-8 versus UTF-16), or the presence and types of any byte order marks.
d) The introduction of object provenance data: An additional string field is included to hold a listing of the source or sources from which the data for the object has been drawn, and the date-time(s) when the data was last downloaded. The intention is to make this information available to MDR users (e.g. as a pop-up) so that they can easily see the provenance of the data. Most data objects have only one source.