JSON files v5 to v6 changes
Removal of the ‘contains_html’ attributes
The ‘contains_html’ attribute has been removed from the brief_description and data_sharing_statement properties of the study entity. These two fields become simple strings rather than compound objects. The ‘contains_html’ attribute has also been removed from the data object description record.
This attribute had been present as a flag for consuming applications, so that they could interpret HTML tags (which are often embedded in the source material) correctly. All these tags are now removed during extraction and replaced, where appropriate, with carriage returns. Consuming applications do still need, however, to ensure that they process the carriage returns (\r\n or \r, depending on source) correctly.
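As an illustration of the carriage return handling described above, a consuming application might normalise both forms of line break before display. This is only a minimal sketch: the function names are hypothetical and are not part of the MDR.

```python
def normalise_line_breaks(text: str) -> str:
    """Convert CRLF pairs and bare CR characters to a single LF."""
    # Replace the two-character sequence first, so that it does not
    # end up as two separate line breaks.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def description_paragraphs(text: str) -> list[str]:
    """Split a description string into its non-empty paragraphs."""
    lines = normalise_line_breaks(text).split("\n")
    return [line.strip() for line in lines if line.strip()]
```

The list returned by description_paragraphs could then be rendered as separate paragraphs by whatever display layer the application uses.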
Study enrolment change from integer to string
The study_enrolment property of a study has become a string rather than an integer field. Although most enrolments were given as simple numbers, a small proportion were provided as text, often giving further details (e.g. planned sub-protocol numbers). Changing this field to a string enables all of the data to be captured.
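Applications that previously relied on a numeric enrolment may want to recover the integer where the string is a plain number, and keep the text otherwise. A minimal sketch, assuming a hypothetical helper name and that a missing value arrives as None:

```python
import re
from typing import Optional


def enrolment_as_int(study_enrolment: Optional[str]) -> Optional[int]:
    """Return the enrolment as an int if the string is a plain number, else None."""
    if study_enrolment is None:
        return None
    value = study_enrolment.strip()
    # Accept digits only, so that free-text enrolments are left
    # untouched for display as they were provided.
    if re.fullmatch(r"\d+", value):
        return int(value)
    return None
```

A return value of None here means only that the enrolment is not a simple number; the original string should still be shown to the user.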
Simplification and name changes for topic records
For ‘topic’ records, both study_topics and object_topics, the fields topic_code and topic_value have been renamed to mesh_code and mesh_value. This is simply more accurate, because these fields always hold MeSH codes or terms when these are available, as they are for the majority of topics.
The topic_qualcode and topic_qualvalue fields have been removed to simplify the tables. These fields were only ever used by PubMed, and provided details of the specific aspect of a listed MeSH topic being referenced in a paper. They were not used within the MDR's search strategy, however, and their inclusion led to a large duplication of the base topic terms. Removing them therefore not only allows a simpler and smaller table structure, it also greatly reduces the data volume for PubMed derived topics (currently from 4.2 million records to 2.5 million).
Note that the original_value field, which holds the topic as originally expressed, remains the same and always has a value.
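For anyone holding v5 topic records locally, the change can be pictured as a rename, a removal and a de-duplication. The sketch below is illustrative only: the function names are hypothetical, the MeSH code in the usage example is a placeholder, and the de-duplication rule (dropping records that become identical once the qualifiers are removed) is an assumption rather than a description of the MDR's own processing.

```python
import json

RENAMED_FIELDS = {"topic_code": "mesh_code", "topic_value": "mesh_value"}
REMOVED_FIELDS = {"topic_qualcode", "topic_qualvalue"}


def migrate_topic(v5_topic: dict) -> dict:
    """Rename the MeSH fields and drop the qualifier fields of one v5 topic."""
    return {
        RENAMED_FIELDS.get(key, key): value
        for key, value in v5_topic.items()
        if key not in REMOVED_FIELDS
    }


def migrate_topics(v5_topics: list) -> list:
    """Migrate a list of v5 topics, keeping one copy of each resulting record."""
    seen = set()
    v6_topics = []
    for v5_topic in v5_topics:
        v6_topic = migrate_topic(v5_topic)
        # Records that differed only in their qualifiers are now identical.
        fingerprint = json.dumps(v6_topic, sort_keys=True)
        if fingerprint not in seen:
            seen.add(fingerprint)
            v6_topics.append(v6_topic)
    return v6_topics
```

Two v5 records for the same base term with different qualifiers therefore collapse into a single v6 record, which is the source of the reduction in volume for PubMed derived topics.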
Introduction of ROR ids for organisations
A major change is the introduction of ROR (Research Organisation Registry) ids for most organisation data within the system. In most cases the id / name object for organisations has been replaced by a new triplet object: id / name / ror_id. The ror_id is a URL, and so can be used by a consuming application to provide a hyperlink to the ROR data about that organisation.
Not all organisations have a ROR id – they are available and applied to many academic and healthcare organisations, but only to a minority of companies or government agencies, and not at all to registries and data repositories.
This change applies to
- the identifier_org object of both study and object identifiers,
- the managing_organisation attribute of data objects, and
- the organisation object of object_contributors.
It does not apply, however, to the repository_org attribute of object_instances (because none of the registries or repositories currently have ror_ids).
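Because the ror_id is a URL, a consuming application can turn the id / name / ror_id triplet directly into a hyperlink, falling back to plain text where no ROR id is present. A minimal sketch, assuming a hypothetical function name and that a missing ror_id arrives as None or is absent from the object:

```python
from html import escape


def organisation_link(org: dict) -> str:
    """Render an organisation triplet as an HTML link, or plain text without a ROR id."""
    name = escape(org.get("name") or "")
    ror_id = org.get("ror_id")
    if ror_id:
        # The ror_id is already a URL, so it can be used as the link target.
        return f'<a href="{escape(ror_id)}">{name}</a>'
    return name
```

The same function can be applied to identifier_org, managing_organisation and the organisation object of object_contributors, since all three now share the triplet structure.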
Object contributors
The fields for individual contributors have been revised, with the system attempting to identify the organisation of the individual’s affiliation from the affiliation string provided. Previously it only did this when an explicit code was provided for the organisation, which was very rare.
The person object (within object contributors) retains the family_name, given_name, full_name, and orcid attributes as before. The affiliation attribute, i.e. the affiliation as provided in the source, has been renamed affiliation_string, to clarify its role. There are then three additional attributes:
- affiliation_org_id - The id of the organisation within the MDR, if it can be identified from the affiliation string
- affiliation_org_name - The name of the organisation to which the person is affiliated, as deduced from the affiliation string
- affiliation_org_ror_id - The id of the organisation to which the person is affiliated, if known, within the ROR (Research Organisation Registry) system. This is a URL linking to the ROR resource.
During extraction the system uses a set of simple rules to try and identify the organisation name from the affiliation string, and then tries to match that organisation with those known to the system. If it succeeds the MDR id is inserted and, if one is available, the ROR id.
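Since the matching may or may not succeed, a consuming application needs to cope with person objects where some or all of the three new attributes are empty. The sketch below shows one possible approach, preferring the matched organisation name and falling back to the affiliation string as provided. The function name and the shape of the returned dictionary are hypothetical, and unmatched attributes are assumed to be None or absent.

```python
def affiliation_display(person: dict) -> dict:
    """Summarise a v6 person object's affiliation for display."""
    org_name = person.get("affiliation_org_name")
    return {
        # Prefer the deduced organisation name, else the source string.
        "label": org_name or person.get("affiliation_string") or "",
        # True only when the organisation was matched to an MDR id.
        "matched": person.get("affiliation_org_id") is not None,
        # URL of the ROR record, or None if no ROR id was found.
        "ror_url": person.get("affiliation_org_ror_id"),
    }
```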
date_is_range change
The boolean is_date_range attribute of object dates has been renamed date_is_range. This was to improve consistency with the underlying database.
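An application that has to read both v5 and v6 files during a transition could look for the new name first and fall back to the old one. A minimal sketch, with a hypothetical function name and an assumed default of False when neither attribute is present:

```python
def date_is_range(object_date: dict) -> bool:
    """Read the range flag from a v6 object date, or from a v5 one as a fallback."""
    if "date_is_range" in object_date:
        return bool(object_date["date_is_range"])
    # v5 files used the older attribute name.
    return bool(object_date.get("is_date_range", False))
```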