JSON files v2 to v3 changes

From ECRIN-MDR Wiki

After examining several data sources in detail, it was clear that some aspects of the metadata schema needed to be changed. The broad structure remained the same but changes were made to:

  • Better conform to the nature of the data
  • Correct earlier errors

Most of the changes were additions, but some fields were renamed. JSON files generated from the database (also modified, to support these changes) will need to conform to the revised schema.
The changes are detailed below:

The Study schema

scientific_title renamed as display_title

For studies this is redefined as being, by default, the public or short title if there is one, otherwise the scientific title, and otherwise an acronym. This change is partly because: a) 100% of ClinicalTrials.gov records have a public title, whereas 10,000+ do not have a scientific title listed; b) the public title is easier for the system to display and for the user to read. Note that all titles will be stored in the study_titles table, so that they can all be searched; the display_title is simply the one selected from the set of all titles to be used for display purposes.
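The order of preference described above can be sketched as a small helper. This is only an illustration: the function name and its three parameters are hypothetical and do not come from the MDR code base.

```python
from typing import Optional


def choose_display_title(public_title: Optional[str],
                         scientific_title: Optional[str],
                         acronym: Optional[str]) -> Optional[str]:
    """Return the first non-empty title, in the documented order of preference.

    Order: public (or short) title, then scientific title, then acronym.
    Returns None if no usable title is available.
    """
    for candidate in (public_title, scientific_title, acronym):
        # Treat None, empty and whitespace-only strings as missing.
        if candidate and candidate.strip():
            return candidate.strip()
    return None
```

All of the candidate titles would still be kept in study_titles; the helper only decides which one is shown.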

Additional field: brief_description

Added because it makes it much easier to assess the relevance of any found study to the current review or search task. A string of maximum length 5000 but normally much shorter than that – designed to be displayed with the study, e.g. below the display title but above any associated data objects.

Additional field: data_sharing_statement

Added as an important data point in the context of the MDR. Trial registries are now requesting that study managers indicate if, when and how they intend to share IPD. This is clearly central to the purpose of the MDR and if such a statement exists it should be captured and displayed. May be displayed under the brief description and above any associated data objects.

study_other_titles renamed as study_titles

This array now holds all study titles and not just the non default one, as was the case in the past. This allows a larger and more consistent set of data to be held for all titles in the database.

Additional fields in study_titles

A variety of additional data points have been added to the study_titles table, but only two need to be included in the json metadata:

  • An additional ‘comments’ field. This would be for any textual clarification about the nature of a title (e.g. ‘previously known as’, or ‘also referred to as’) which might be displayed in brackets underneath a title. For study titles such qualifications are rare.
  • An additional ‘contains_html’ field. Some titles may have super or subscripts, or use italics (e.g. for gene names), signalled by html tags. Where such tags exist the system should be aware of them in case there are implications for display processing.


Additional field in study_identifiers

Some identifiers are associated with a url, for example grant IDs that link to an NIH page that provides more details about the grant. An identifier_link field has been added as a string to the database, and to the json metadata schema, for possible use within the display.

identifier_date format changed

Because the dates of identifiers are unlikely to be used for filtering / searching, so are mainly for display only, they have been turned into a string 'yyyy MMM dd' format, e.g. '2015 Dec 12', for improved legibility and slightly easier processing.
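A minimal sketch of producing the 'yyyy MMM dd' string from a full date is shown below. The function name is hypothetical. A fixed list of English month abbreviations is used rather than a locale-dependent formatter, so that the output does not vary with the machine's locale settings.

```python
from datetime import date

# English three-letter month abbreviations, as in the '2015 Dec 12' example.
MONTH_ABBREVIATIONS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                       "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def format_identifier_date(d: date) -> str:
    """Format a full date as 'yyyy MMM dd', with a two-digit day."""
    return f"{d.year:04d} {MONTH_ABBREVIATIONS[d.month - 1]} {d.day:02d}"
```

For example, format_identifier_date(date(2015, 12, 12)) gives '2015 Dec 12'.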

Addition of related_studies

Although the source database has always had a study_relationships table, capturing inter-relationships between studies (e.g. ‘is a sub-study of’, ‘is a feasibility project for’), no information of this kind had been collected and it did not appear in the metadata. Some data of this kind is now available and needs to be included in the json file.
The structure of related_studies is analogous to related objects (for data objects) and consists of a triplet of source study id, relationship type (a composite of relationship type id and name in the json data) and target study id.

Simplification of linked data object records

The current linked_data_object array is an array of json objects, each of which contains a single integer field. To make things simpler and clearer, the array has been changed to a simple array of integers.
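Converting an existing v2 array to the new form could look like the sketch below. The name of the single field inside each v2 object is not given here, so the function does not rely on it and simply takes the only value present; the key used in the usage example is made up for illustration.

```python
from typing import Any, Dict, List, Union


def flatten_single_field_objects(items: List[Union[int, Dict[str, Any]]]) -> List[int]:
    """Turn [{"<key>": 101}, {"<key>": 102}] into [101, 102]."""
    flat: List[int] = []
    for item in items:
        if isinstance(item, int):
            # Already in the v3 form; keep as is.
            flat.append(item)
            continue
        values = list(item.values())
        if len(values) != 1:
            raise ValueError(f"expected exactly one field, got {item!r}")
        flat.append(int(values[0]))
    return flat
```

For example, flatten_single_field_objects([{"data_object_id": 101}, {"data_object_id": 102}]) returns [101, 102].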

Clarification of names

To avoid ambiguities and possible clashes with system key words, the following have been renamed:
In study_identifiers:

  • “value” => “identifier_value”
  • “type” => “identifier_type”
  • “date” => “identifier_date”
  • “organization” => “identifier_org”

In study_topics :

  • “value” => “topic_value”
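The renamings listed above can be applied to existing v2 records with a simple key mapping. The sketch below is one possible way of doing so; the function and constant names are hypothetical, and only the old and new field names are taken from the lists.

```python
from typing import Any, Dict

# Old name -> new name, as listed for study_identifiers and study_topics.
STUDY_IDENTIFIER_RENAMES = {
    "value": "identifier_value",
    "type": "identifier_type",
    "date": "identifier_date",
    "organization": "identifier_org",
}
STUDY_TOPIC_RENAMES = {"value": "topic_value"}


def rename_keys(record: Dict[str, Any], renames: Dict[str, str]) -> Dict[str, Any]:
    """Return a copy of record with keys renamed; unlisted keys are kept unchanged."""
    return {renames.get(key, key): value for key, value in record.items()}
```

The same helper could be reused for the data object renamings by supplying a different mapping.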

The revised top level structure of the study object in the json file is therefore:

The Data object schema

data_object_title changed to display_title

For consistency with the study title this has been renamed display title. All titles are now stored in the object_titles table, though most data objects only have one title and that is often constructed from the study title and the object type. For some objects it may make sense to construct the display title rather than use any of the available titles as the default display title. For instance for journal articles it may be better to construct a standard citation rather than just giving the article title (to be discussed further).

Clarification of names

To simplify names and make them more consistent the following elements have been renamed:

  • “data_object_other_identifiers” => “object_identifiers”
  • “data_object_other_titles” => “object_titles”
  • “data_object_contributors” => “object_contributors”
  • “data_object_dates” => “object_dates”
  • “data_object_descriptions” => “object_descriptions”
  • “data_object_instances” => “object_instances”
  • “data_object_rights” => “object_rights”


Date formats, partial and string dates

Up to now most dates in the schema have used a standard full date representation, e.g. "2019-10-13", the normal way of representing dates in JSON and one which translates easily to and from database date fields.
The difficulty is that many dates can be partial - usually missing the day - or strings listing time periods such as seasons (e.g. "Summer 2008"). Dates of publication, in particular, often use these non-standard date formats.
Because of this, an additional date_as_string representation has been added to object_dates. This is applicable to all dates in the system, and allows all dates to be displayed accurately. The PubMed format (yyyy MMM dd, e.g. "2017 Sep 23") is used, with variations (e.g. "2016 Dec", or "Spring, Summer 2015") only when the source data was partial or in a non-standard date format.
But data object dates can also be used to filter records and we need to support that filtering (e.g. "records published after March 2015", "records before 2018"). Most systems can filter on full dates, because they are stored internally as numbers, but have problems with partial dates and other representations that are strings. A way of handling dates that can handle both non-standard formats and filtering is required.
Along with the date_as_string representation, therefore, for all dates in the system likely to be used in filtering operations, separate integer fields have been developed for year, month and day. These are applied to both start and end dates, creating start_year, start_month, start_day, end_year, end_month and end_day fields, in both the database and the json fields. The year fields should always have 4 digits.
For dates that are given as a range the start and end dates should represent the inclusive limits of the range (as is normally the case in the source data). Partial dates will not populate the full triplet of fields. Rules should be developed for translating dates given as seasons or other periods into start and end years and months, e.g. 'Summer' is June - August inclusive, 'Winter' is December - February inclusive. Complete automation may not be possible immediately, but a collection of such rules can be developed over time.
Filtering before a certain month / year would make use of the start date year and month fields. Filtering after a month / year would compare against the start date year and month fields for single dates, but use the end date year and month fields for date ranges. This ensures objects dated with a date range are included in the appropriate filter.
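The season rules and the filtering logic described in this section can be sketched as follows. This is an illustration under several assumptions, not the system's actual implementation: the function names are hypothetical; a season that wraps the year end (Winter) is assumed to finish in the following year; and when a month is missing, the comparison assumes January for 'before' filters and December for 'after' filters, so that year-only dates are not excluded.

```python
from typing import Dict, Optional

# Season -> (first month, last month), inclusive, as given in the text.
SEASON_MONTHS = {"summer": (6, 8), "winter": (12, 2)}

DateParts = Dict[str, Optional[int]]


def season_to_date_parts(season: str, year: int) -> DateParts:
    """Translate e.g. ('Summer', 2008) into start/end year and month fields.

    Day fields are left out, because a season is a partial date.
    """
    start_month, end_month = SEASON_MONTHS[season.lower()]
    # Assumption: a season crossing the year end finishes in the next year.
    end_year = year + 1 if end_month < start_month else year
    return {"start_year": year, "start_month": start_month,
            "end_year": end_year, "end_month": end_month}


def is_before(record: DateParts, year: int, month: int) -> bool:
    """'Before' filters use the start year and month fields."""
    start = (record["start_year"], record.get("start_month") or 1)
    return start < (year, month)


def is_after(record: DateParts, year: int, month: int) -> bool:
    """'After' filters use the end fields for ranges, the start fields otherwise."""
    if record.get("end_year") is not None:
        reference = (record["end_year"], record.get("end_month") or 12)
    else:
        reference = (record["start_year"], record.get("start_month") or 12)
    return reference > (year, month)
```

With these rules a record dated "Summer 2015" is matched by "records published after March 2015" (its end month is August 2015) but not by "records before March 2015" (its start month is June 2015).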

Multiple languages of data objects

Most data objects are in one language, but it appears that some use multiple languages – for instance a small proportion of articles in PubMed are listed as being in 2 or 3 languages. The language code element for the Data Object should therefore really be an array of language codes, to match a database structure where ‘object_languages’ will need to be factored out as a separate table.

Distinguish contributor type

The original metadata specification included a field in the object contributor structure that indicated if the contributor was an individual or an organisation / group. This seems to have become lost. It would be useful to have this explicit indicator, either as a boolean (e.g. is_individual) or a string or type (e.g. “individual” | “research group” | “organisation”). The latter is probably more flexible but different types of groups may be difficult to distinguish. A boolean has therefore been added.

Removal of the person email attribute

This attribute was added because it is in the DataCite schema. Although these emails are all publicly available, displaying them may generate privacy concerns and expose ECRIN to allegations of data misuse, so the attribute should be removed from the json schema. Emails should still be collected and stored internally, however, because email addresses are globally unique and can therefore be useful indicators of a person's identity.

Better person data

The current list of fields for a person’s data, in the context of object contributors, is neither very clear nor comprehensive. Unfortunately this stems from problems with the original ECRIN metadata definition. The actions taken have been to:

  • Change the name of the person_data element itself to just ‘person’
  • Change first_name to given_name (more accurate and in line with DataCite, as well as the original ECRIN metadata specification),
  • Change last_name to family_name (more accurate and in line with DataCite and the metadata specification),
  • Remove the email field from the JSON object, (see above).
  • Retain full_name as it is.
  • Add a person identifier and identifier type fields, mostly for ORCID identifiers, though other identifier schemes may be used in some cases. This would match DataCite and should have been included from the beginning.
  • Add an organisational identifier and scheme ID to the Affiliation element, (though use of this seems very limited).

These changes should have occurred before the current structure was agreed, but were missed at the time. They make the person data match the DataCite specification much more closely.
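The renaming and removal steps listed above could be applied to an existing contributor record roughly as shown below. This is a sketch only: it assumes that person_data is nested directly inside each contributor object, and the function name and the contributor_type key in the usage example are hypothetical. The new identifier fields are not populated, because their exact names are not given here.

```python
from typing import Any, Dict

# Old name -> new name for fields inside the person element.
PERSON_RENAMES = {"first_name": "given_name", "last_name": "family_name"}


def migrate_contributor(contributor: Dict[str, Any]) -> Dict[str, Any]:
    """Return a v3-style copy of a v2 contributor record."""
    # Copy everything except the old person_data element.
    migrated = {k: v for k, v in contributor.items() if k != "person_data"}
    old_person = contributor.get("person_data")
    if old_person is not None:
        person: Dict[str, Any] = {}
        for key, value in old_person.items():
            if key == "email":
                continue  # email is no longer exposed in the json
            person[PERSON_RENAMES.get(key, key)] = value  # full_name is kept as is
        migrated["person"] = person
    return migrated
```

For example, a record with person_data containing first_name, last_name, full_name and email becomes a record with a person element containing given_name, family_name and full_name.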

Changing object_titles – additional elements

As with study titles, the underlying table has been changed to include all titles, so the name has changed from object_other_titles to object_titles. This allows the same richer set of data to be stored about each title.
The impact on the json has been to add two new fields (the same as for the study json file)

  • An additional ‘comments’ field. This would be for any textual clarification about the nature of a title (e.g. ‘previously known as’, or ‘also referred to as’) which might be displayed in brackets underneath a title. For study titles such qualifications are rare.
  • An additional ‘contains_html’ field. Some titles may have super or subscripts, or use italics (e.g. for gene names), signalled by html tags. Where such tags exist the system should be aware of them in case there are implications for display processing.


Changing object_descriptions – additional elements

Two additional elements have been added to object_descriptions

  • An additional ‘description_label’ field. This is to provide a short description of the accompanying description text, in the way that headings are used to structure abstracts. The labels may be within the source data or constructed during the extraction process.
  • An additional ‘contains_html’ field. Some text will contain super or subscripts, or use italics (e.g. for gene names), signalled by html tags. Where such tags exist the system should be aware of them in case there are implications for display processing.


URL check date added to data objects

For consistency, a url_last_checked field has been added to the data_object element, as a string. All URLs should be checked periodically in the system to ensure that they are still ‘live’, and the most recent check date should be visible to users to give them greater confidence in the system. Adding this field is rectifying an earlier error.

Simplifying related objects

The current linked_studies array is an array of json objects, each of which contains a single integer field. To make things simpler and clearer, the array has been changed to a simple array of integers.

Changing object_identifiers

As with the study identifiers, the identifier_date field has had its format changed to a string rather than a date. In addition, the following fields have had their names changed:

  • “value” => “identifier_value”
  • “type” => “identifier_type”
  • “date” => “identifier_date”
  • “organization” => “identifier_org”


Clarification of names, avoiding keywords

To avoid ambiguities and possible clashes with system key words, the following have been renamed:
In data_objects:

  • “class” => “object_class”
  • “type” => “object_type”

In object_dates

  • “start” => “start_date” (now a composite object, see above)
  • “end” => “end_date” (now a composite object, see above)

In object_topics :

  • “value” => “topic_value”

In object_instances :

  • “size” => “resource_size”

The top level structure of the data object json file is shown on the next page.