JSON files v4 to v5 changes

From ECRIN-MDR Wiki
Revision as of 15:57, 24 October 2020 by Admin

In General

Both schemas have been slightly simplified and better structured to make them more coherent and consistent. Many of the changes impact the exported JSON data rather than the metadata schema itself. In particular:

  • Fields that are supplementary to others, such as ‘contains html’ for text fields, or ‘date last checked’ for url data, have been grouped together with the ‘parent’ fields into composite element structures.
  • A few other groups of fields that are closely logically connected have been assembled together in composite elements, for instance those dealing with access and those dealing with the physical characteristics of objects.
  • Both of the above changes reduce the number of top level fields in the schema definitions and make the overall structure of the JSON files much easier to see. In some cases they bring the JSON files into closer alignment with the metadata schema, which has always made more use of composite objects.
  • Some fields that have little or no data in them, and no realistic chance of being used in the near future, have been dropped. This applies mainly to fields about people in the ‘object contributors’ element. These fields remain in the database but are no longer exported as part of the schema.
  • The organisation element has been simplified to an id and a single name, rather than an id and an array of alternate names. This has no real impact on the data but does make the schema definitions a little simpler. Alternative names for organisations are still collected and remain important, but will need to be accessed and used outside the main data collection process.
  • A few field names in the database and the JSON definitions have been changed to reflect the purpose of the data more accurately, to bring them into line with the underlying data, or for consistency across the system. For example, language (of an object, study title etc.) is now always a simple string and consistently called ‘lang_code’.
  • In the JSON definition files the ‘definitions’ (i.e. re-usable elements, referenced in the main documents) have been removed, and the associated JSON elements are now included ‘in-line’. This makes the JSON definitions a little simpler and easier to read.
  • The JSON definitions now include descriptive statements for all elements at all levels.
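
The removal of ‘definitions’ in favour of in-line elements can be illustrated with a short Python sketch. It assumes the earlier definition files followed the common JSON Schema convention of a top level ‘definitions’ block referenced through "$ref": "#/definitions/<name>" pointers; that convention, the function name inline_definitions and the small example schema are all assumptions made for illustration, not a description of the actual MDR files.

```python
import copy


def inline_definitions(schema):
    """Return a copy of schema with local definition references in-lined."""
    definitions = schema.get("definitions", {})

    def resolve(node):
        if isinstance(node, dict):
            ref = node.get("$ref")
            if isinstance(ref, str) and ref.startswith("#/definitions/"):
                name = ref[len("#/definitions/"):]
                # Copy the definition so each use gets its own element, then
                # resolve any references nested inside it.
                # Assumption: definitions do not refer to each other in a cycle.
                return resolve(copy.deepcopy(definitions[name]))
            return {key: resolve(value) for key, value in node.items()}
        if isinstance(node, list):
            return [resolve(item) for item in node]
        return node

    # Drop the top level 'definitions' block; everything else is resolved.
    body = {key: value for key, value in schema.items() if key != "definitions"}
    return resolve(body)


# Hypothetical schema fragment in the older, reference-based style.
with_definitions = {
    "definitions": {
        "organisation": {
            "type": "object",
            "properties": {
                "id": {"type": "integer"},
                "name": {"type": "string"},
            },
        }
    },
    "type": "object",
    "properties": {"repository_org": {"$ref": "#/definitions/organisation"}},
}

in_line = inline_definitions(with_definitions)
```

After the call, in_line has no ‘definitions’ block and repository_org carries the organisation element directly, which is the shape the v5 definition files are described as having.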

Topic Data and MESH codes

The topic data for both studies and objects has been modified to reflect the application of MESH coding, where possible. The data points for both study and object topic data now include:
a) the topic type (e.g. condition, chemical agent, organism),
b) whether or not the term has been MESH coded,
c) the MESH code, if present,
d) the topic name or value: either the original value or, if MESH coded, the preferred MESH term,
e) a MESH qualifier code and value where one exists,
f) the original value.
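
As a rough illustration of the six data points listed above, the following Python sketch builds a topic record. All field names (topic_type, mesh_coded, mesh_code, topic_value, mesh_qualcode, mesh_qualvalue, original_value) and the helper make_topic are hypothetical; the names actually used in the JSON files may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Topic:
    # Field names are assumptions for illustration only.
    topic_type: str                 # e.g. condition, chemical agent, organism
    mesh_coded: bool                # whether the term has been MESH coded
    mesh_code: Optional[str]        # the MESH code, if present
    topic_value: str                # preferred MESH term if coded, else original
    mesh_qualcode: Optional[str]    # MESH qualifier code, where one exists
    mesh_qualvalue: Optional[str]   # MESH qualifier value, where one exists
    original_value: str             # the value as found in the source


def make_topic(topic_type, original_value, mesh_code=None, mesh_term=None,
               qual_code=None, qual_value=None):
    """Build a topic record, preferring the MESH term when one is available."""
    coded = mesh_code is not None and mesh_term is not None
    return Topic(
        topic_type=topic_type,
        mesh_coded=coded,
        mesh_code=mesh_code if coded else None,
        topic_value=mesh_term if coded else original_value,
        # Qualifiers only make sense alongside a MESH code.
        mesh_qualcode=qual_code if coded else None,
        mesh_qualvalue=qual_value if coded else None,
        original_value=original_value,
    )


coded_topic = make_topic("condition", "Heart attack",
                         mesh_code="D009203",
                         mesh_term="Myocardial Infarction")
uncoded_topic = make_topic("condition", "Heart attack")
```

In the coded case the record carries the preferred MESH term as its value while the original wording is kept alongside it; in the uncoded case the value and the original are the same string.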

The reasons for using MESH over other schemes were:

  • Almost all of the coded topic data in the source material, in particular almost all topics listed within PubMed and a high proportion of those in ClinicalTrials.gov, is already coded using MESH. Other schemes are used (MedDRA in EUCTR data, ICD10 in DRKS) but the volumes are relatively insignificant.
  • MESH is a comprehensive system, whereas many of the other vocabularies available (e.g. MedDRA, ICD10, LOINC) are limited to particular topic types. SNOMED CT is similarly comprehensive, but does not appear to be used by any of the existing source repositories.
  • MESH is already familiar to many researchers as a tool for searching literature, and searching the MDR is a similar process.

Where possible, uncoded terms have the relevant MESH code applied to them. About 20% of the topic data remains to be coded, and possible mappings between these terms and existing MESH terms need to be explored.

In Studies

Specific changes in studies (other than the changes to study topics described above) include:
a) Simplification of the composite display_title element back to a simple string. The lang_code has been removed because it was redundant: the same information is present in the study_titles data. This also makes the display_title consistent with the string field of the same name used for data objects.
b) Combination of the brief_description and bd_contains_html fields into a single brief_description composite element, with fields ‘text’ and ‘contains_html’.
c) Combination of the data_sharing_statement and dss_contains_html fields into a single data_sharing_statement composite element, with fields ‘text’ and ‘contains_html’.
d) Combination of the 3 fields dealing with minimum age into a single composite element called min_age, with fields ‘value’, ‘unit_id’ and ‘unit_name’. Previously this was a single integer field called min_age followed by a composite min_age_units element (with id and name).
e) Analogous changes to max_age data as described above.
f) Change of name of ‘related studies’ to ‘study_relationships’, for greater consistency with the database.
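
The study changes listed above can be sketched as a small Python conversion from a v4-style record to a v5-style one. This is a minimal sketch under stated assumptions: the function name convert_study_v4_to_v5 is hypothetical, the key holding the title text inside the old composite display_title is assumed to be ‘title’, the old units element for maximum age is assumed to be max_age_units, and the old relationships key is assumed to be related_studies.

```python
def convert_study_v4_to_v5(v4):
    """Return a v5-style copy of a v4-style study dictionary (illustrative)."""
    v5 = dict(v4)  # shallow copy, so the input is left unchanged

    # a) display_title: composite element back to a simple string.
    title = v5.get("display_title")
    if isinstance(title, dict):
        v5["display_title"] = title.get("title")  # 'title' key is an assumption

    # b), c) text fields and their 'contains html' flags become composites.
    for text_key, flag_key in (("brief_description", "bd_contains_html"),
                               ("data_sharing_statement", "dss_contains_html")):
        flag = v5.pop(flag_key, None)
        if text_key in v5:
            v5[text_key] = {"text": v5[text_key], "contains_html": flag}

    # d), e) age value plus units element become one composite element.
    for age_key in ("min_age", "max_age"):
        units = v5.pop(age_key + "_units", None) or {}
        if age_key in v5:
            v5[age_key] = {
                "value": v5[age_key],
                "unit_id": units.get("id"),
                "unit_name": units.get("name"),
            }

    # f) rename for consistency with the database.
    if "related_studies" in v5:  # old key name is an assumption
        v5["study_relationships"] = v5.pop("related_studies")

    return v5


# Hypothetical v4-style record; the values are invented for illustration.
v4_study = {
    "display_title": {"title": "A trial of X", "lang_code": "en"},
    "brief_description": "Some text",
    "bd_contains_html": False,
    "min_age": 18,
    "min_age_units": {"id": 17, "name": "Years"},
    "related_studies": [],
}
v5_study = convert_study_v4_to_v5(v4_study)
```

The sketch only rearranges data that is already present, which matches the description above: no new information is introduced, but supplementary fields now travel with their parent fields.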

In data objects

Specific changes in objects (other than the changes to object topics described above) include:
a) Object access details have been made into a composite element called access_details, which includes the description, the url and the date the url was last checked. Previously these were three separate top level fields (access_details, access_details_url, url_last_checked).
b) ‘language_code’ has been changed to lang_code in object_descriptions, and changed from an array of strings to a simple string. This string may occasionally be a comma separated list of two letter language codes, though normally only a single such code will be present. This change allows the object_languages table to be deleted from the underlying databases.
c) ‘dataset_identifiers’ has been renamed to ‘dataset_deident_level’, the latter being seen as more accurate and more in line with the names of the included fields.
d) ‘dataset_consents’ has been renamed to ‘dataset_consent’, again as being more accurate and more in line with the included fields.
e) ‘contains html’ has been removed from object titles. This should have been done earlier – the processing of titles was changed to remove or substitute html tags some months ago.
f) Object_instances has been substantially revised, with just 3 top level elements, along with an id field. The composite elements include the repository_org (as before), a new access_details element (containing fields describing whether access is direct or not, the access url, and the date the url was last checked), and a new resource_details element. The latter contains two fields detailing the id and name of the resource type, 2 fields for the resource’s size and the units used for that size, and a fifth comments field. This is all the same data as previously, but arranged in a more coherent form that is easier to inspect and understand.
g) The person element within object_contributors has been simplified. The (person) id attribute has been dropped because there was no data in it and little hope of generating any in the near future. The previous composite ‘identifier’ element has been replaced by a simple string called ‘orcid’, as the only identifiers harvested have been ORCID Ids. The previous composite ‘affiliation’ element has also been simplified to a simple string, for the affiliation as expressed in the source data. Fields for organisation ids and the scheme in which the Ids were assigned have been dropped (but are retained in the database) because only a very small proportion of affiliation source material includes this data. This may be an area to revisit if and when schemes for PIDs for organisations improve.
h) Object rights has been expanded slightly, with better named fields. At the moment, however, none of this data is harvested.
i) Related_objects has been renamed object_relationships, for better consistency with the underlying database.
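
Some of the object changes above, namely the access_details composite, the lang_code change and the three renames, can be sketched in Python as follows. This is only a sketch under assumptions: the function name convert_object_v4_to_v5 is hypothetical, the inner field names of the new access_details element (description, url, url_last_checked) are guesses based on the old top level names, language_code is treated as a top level field of the object, and ", " is assumed as the separator for multiple language codes. The revised object_instances and object_contributors elements are not covered.

```python
# Old name -> new name, as listed in the changes above.
RENAMED_FIELDS = {
    "dataset_identifiers": "dataset_deident_level",
    "dataset_consents": "dataset_consent",
    "related_objects": "object_relationships",
}


def convert_object_v4_to_v5(v4):
    """Return a v5-style copy of a v4-style data object dictionary (illustrative)."""
    v5 = dict(v4)  # shallow copy, so the input is left unchanged

    # Three separate top level fields become one composite element.
    # Inner field names are assumptions.
    v5["access_details"] = {
        "description": v5.pop("access_details", None),
        "url": v5.pop("access_details_url", None),
        "url_last_checked": v5.pop("url_last_checked", None),
    }

    # Array of language codes becomes a single string called lang_code.
    langs = v5.pop("language_code", None)
    if langs is not None:
        v5["lang_code"] = ", ".join(langs) if isinstance(langs, list) else langs

    # Simple renames.
    for old_name, new_name in RENAMED_FIELDS.items():
        if old_name in v5:
            v5[new_name] = v5.pop(old_name)

    return v5


# Hypothetical v4-style record; the values are invented for illustration.
v4_object = {
    "access_details": "Available on request",
    "access_details_url": "https://example.org/access",
    "url_last_checked": "2020-10-01",
    "language_code": ["en", "fr"],
    "dataset_consents": {},
    "related_objects": [],
}
v5_object = convert_object_v4_to_v5(v4_object)
```

A single code such as ["en"] becomes the plain string "en", which is the normal case described in item b); the comma separated form only arises when more than one code is present.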