Difference between revisions of "JSON files v6 to v7 changes"

From ECRIN-MDR Wiki
Removal of the ‘contains_html’ attributes
The ‘contains_html’ attribute has been removed from the brief_description and data_sharing_statement properties of the study entity. These two fields become simple strings rather than compound objects. The ‘contains_html’ attribute has also been removed from the data object description record.
This attribute had been present as a flag for consuming applications, so that they could correctly interpret the HTML tags that are often embedded in the source material. All these tags are now removed during extraction and replaced, where appropriate, with carriage returns. Consuming applications do, however, need to ensure that they process the carriage returns (\r\n or \r, depending on source) correctly.
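As an illustration of the carriage return handling mentioned above, a consuming application might normalise both forms to a single line-break character before display. This is a minimal sketch only; the function name is hypothetical and not part of the MDR.

```python
import re


def normalise_line_breaks(text: str) -> str:
    """Convert Windows-style (\\r\\n) and bare (\\r) carriage returns to \\n.

    The \\r\\n alternative is listed first so that a pair is replaced by a
    single line break rather than two.
    """
    return re.sub(r"\r\n|\r", "\n", text)
```

A web client could then, for example, split the result on "\n" and render each part as a separate paragraph.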
  
Study enrolment change from integer to string
The study_enrolment property of a study has become a string rather than an integer field. Although most enrolments were given as simple numbers, a small proportion were provided as text, often explaining further details (e.g. for planned sub-protocol numbers). Changing this field to a string enables all of the data to be captured.
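Applications that previously relied on study_enrolment being an integer may want to recover a number where the string is purely numeric, and fall back to showing the text otherwise. The sketch below shows one possible approach; the function name is hypothetical, and treating only all-digit strings as numeric is an assumption about what a consumer would want.

```python
import re
from typing import Optional


def enrolment_as_int(study_enrolment: Optional[str]) -> Optional[int]:
    """Return the enrolment as an integer if the string is a plain number.

    Returns None for missing values and for free-text enrolments, which
    should then be displayed as text rather than used numerically.
    """
    if study_enrolment is None:
        return None
    cleaned = study_enrolment.strip()
    if re.fullmatch(r"[0-9]+", cleaned):
        return int(cleaned)
    return None
```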
 
 
Simplification and name changes for topic records
 
For ‘topic’ records – both study_topics and object_topics – the fields topic_code and topic_value have been renamed to mesh_code and mesh_value. This is simply more accurate, because these fields always hold MESH codes or terms when these are available, as they are for the majority of topics. The topic_qualcode and topic_qualvalue fields have been removed to simplify the tables. These fields were only ever used by PubMed and provide details of the specific aspect of a listed MESH topic that is being referenced in a paper. They were not used within the MDR's search strategy, however, and their inclusion led to a large duplication of the base topic terms. Removing them therefore not only allows a simpler and smaller table structure, it also makes for a much smaller data volume for PubMed-derived topics (currently reducing them from 4.2 million records to 2.5 million). Note that the original_value field, which holds exactly that – the topic as originally expressed – remains the same and always has a value.
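For consumers holding topic records in the older form, the renaming and simplification described above could be applied along the following lines. This is a sketch rather than the MDR's own migration code: the function name is hypothetical, and collapsing records that become identical once the qualifier fields are dropped is an assumption based on the duplication described in the text.

```python
import json


def simplify_topics(topics):
    """Rename topic_code/topic_value, drop qualifier fields, remove duplicates."""
    renamed = {"topic_code": "mesh_code", "topic_value": "mesh_value"}
    dropped = {"topic_qualcode", "topic_qualvalue"}
    seen = set()
    result = []
    for topic in topics:
        record = {
            renamed.get(key, key): value
            for key, value in topic.items()
            if key not in dropped
        }
        # Records that differed only in their qualifiers are now identical;
        # a canonical JSON string is used to detect and skip them.
        fingerprint = json.dumps(record, sort_keys=True)
        if fingerprint not in seen:
            seen.add(fingerprint)
            result.append(record)
    return result
```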
 

Revision as of 19:54, 20 September 2022

Addition of country and location attributes

For studies, the country or countries where participants were recruited are now included.
In addition, where the data exists in the source material (for the moment only within ClinicalTrials.gov data), the clinical sites for the study are also listed, including the city and country of each site and their status as of the most recent data harvesting.
Within the system, integer geocode ids are used for countries and cities (see https://geocode.xyz/). For display, and within the schema, the city and country names are also included.
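
Because the country names are carried in the schema alongside the ids, a consuming application can summarise sites without any geocode lookup. The sketch below counts listed sites per country. The function name and the key "country_name" are hypothetical, chosen for illustration only; the actual property names should be taken from the v7 schema.

```python
from collections import Counter


def sites_per_country(sites):
    """Count clinical sites by country name.

    `sites` is assumed to be a list of dicts, one per clinical site, each
    with a "country_name" key (a hypothetical name, not confirmed by the
    schema). Sites without a country are grouped under "Unknown".
    """
    return Counter(site.get("country_name") or "Unknown" for site in sites)
```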

Changes for topic records

For ‘topic’ records – both study_topics and object_topics – the original controlled terminology (CT) and the original CT code have been restored to the schema (these were never removed from the data). In most cases the CT will be MESH (code = 14), but in some cases MedDRA and ICD codes, and very occasionally a few other CTs, are used. Returning these data points to the schema simply allows them to be displayed if and when required.
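
As a sketch of how the restored data points might be displayed, the function below builds a short label from a topic record. Only the value 14 for MESH comes from the text above; the function name and the keys "original_ct_id" and "original_ct_code" are hypothetical, and the ids of other terminologies are not given here, so they are shown by number.

```python
MESH_CT_ID = 14  # from the text: MESH is identified by code 14


def describe_original_term(topic):
    """Return a display label such as "MESH: D003920", or None if absent."""
    ct_id = topic.get("original_ct_id")      # hypothetical key name
    ct_code = topic.get("original_ct_code")  # hypothetical key name
    if ct_id is None or ct_code is None:
        return None
    # Only the MESH id is known from the text; others are labelled by id.
    ct_name = "MESH" if ct_id == MESH_CT_ID else f"CT {ct_id}"
    return f"{ct_name}: {ct_code}"
```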