Difference between revisions of "Contextual Data"
(→Organisation Data) |
(→Organisation Data) |
||
Line 30: | Line 30: | ||
</pre> | </pre> | ||
</div> | </div> | ||
+ | while a few rows from the table are shown in figure 1. In reality, few of the organisations listed have the full range of associated data, e.g. with notes and a notes source - though the first 6 fields are ususally complete. | ||
[[File:organisations.png|center||]] | [[File:organisations.png|center||]] | ||
<div style="text-align:center">'''Figure 1:''' Sample rows from the ctx.organisations table</div> | <div style="text-align:center">'''Figure 1:''' Sample rows from the ctx.organisations table</div> | ||
− | + | The structiujre of other org_other_names is: | |
<div style="font-family: monospace; font-size: 13px"> | <div style="font-family: monospace; font-size: 13px"> | ||
<pre> | <pre> |
Revision as of 20:10, 17 November 2020
Contents
Introduction
All the systems and databases within the MDR need to access so called 'contextual' data, which is of two main types:
- Context data: Data relating to the 'landscape' in which clinical research is carried out - organisations and people, countries and regions, systems for classifying keywords and topic etc.
- Controlled terminology: Data relating to the options available for categorised questions within the MDR itself.
Rather than replicate such data in every database, all of these data are stored in a separate context database, in the ctx and lup (look up) schemas respectively.
Postgres does not allow statements made in one database context to reference objects in another DB directly, so when processes in one of the source data databases require access to contextual data, one or both context schemas are imported into the database within a 'foreign table wrapper'. This allows the data, but not the table designs, to be read and manipulated. It does risk, however, cluttering up the databases with additional schema, so within the MDR such foreign table wrappers are always created on a temporary basis, when required, and are torn down again once the processing is completed (see the Managing Links between databases section in Data_Structures).
The sections below provide further details on the various types of contextual data in the system.
Contextual Data
Organisation Data
The context database contains several tables concerned with organisational data, though only three are important in the current context:
- ctx.organisations
- ctx.org_other_names
- ctx.org_type_links
The organisations table itself lists organisations that are legal entities (e.g. universities, healthcare providers, publishers, pharmaceutical companies), that are part of such entities (e.g. trials units, laboratories, hospitals), or are government agencies (e.g. regulatory authorities, ethical review organisations), or are collaborations or research projects (e.g. H2020 projects). In other words an 'organisation' is loosely defined as any entity that is not an individual, but which is made up of people with a common purpose, that appears in the data. A group is defined as an organisation when it is convenient to do so, at a level that makes most sense in the context, not because it is legally or otherwise externally defined. In fact some 'organisations' in the system, such as an international drugs company, are defined as a single organisation when legally they m,ay be composed of many different organisations - e.g. subsidiaries in different countries.
The organisations table has the structure:
id integer primary key , default_name varchar , lang_code varchar two letter language code , display_suffix varchar usually location, when location is not obvious , geog_id integer id from the geog_entities table , is_current integer 1 by default, 0 if no longer exists , notes varchar general description , notes_source varchar e.g. web site , source_date date date information last retrieved , temp_type varchar temporary field - to be dropped , type_id integer temporary field - to be dropped
while a few rows from the table are shown in figure 1. In reality, few of the organisations listed have the full range of associated data, e.g. with notes and a notes source - though the first 6 fields are ususally complete.
The structiujre of other org_other_names is:
id integer primary key , org_id integer id of the organisation recvord , default_name varchar default name - redundant but useful for isnpection , qualifier_id integer the nature of the other name, e.g. abbreviation, translation , other_name varchar the other name itself , changes_language integer whether or not it changes the language from the defualt name , lang_code varchar the language cod of the other name
Topic Data
The topic or keyword data associated with studies and journal papers is an obvious are for searching and filtering, but that function is not helped by the fact that a 'topic' can often be expressed in many ways. Synonym files can be used (and within the MDR portal they are used) to support search functions, but it would also be good to try and rationalise the topic data in the database, converting terms to a standard vocabulary wherever possible, linking terms to a code and removing duplicates.
Because so much of the data already has MESH coding applied this system is the natural first choice to use for this process. It also has the advantages of being a well known, widely used, comprehensive system covering all aspects of clinical research. Most other systems are either relatively specialised (LOINC, MedDRA, ICD etc.) or are much less widely used - at least in the context of the source data systems (e.g. SNOMED CT).
The context database contains two MESH source tables, downloaded from the US National Library of Medicine, one for descriptive terms and the other for supplementary terns. These are combined into a table called mesh_lookup, with over 500,000 records but with just 4 fields:
- entry - a lower case text term that is one of the synonyms for a MESH code
- code - the MSH code itself
- term - the 'preferred' MESH term
- source - whether from the descriptive or supplementary list
As is explained in the Updating contextual data section of Harvesting Data, this table is used in a lookup process to replace entry terms in the source data with the corresponding MESH term, although the original term is also preserved in the data. At the same time terms that have not been matched are stored, to see if they can be mapped to a MESH code in the future.
Geographical Data
Geographical entities - countries, regions, continents, states, etc. - are referenced both within the main MDR, as a possible topic or keyword applied to some studies or papers, and within the contextual data , because organisations have a 'geographic scope'. They can be entered into the system under a variety of different names (US, U.S., USA. United States, United States of America, etc.) and it is therefore better to standardise and code them where possible.
The contextual data includes three tables - geog_entities, geog_entity_types, and geog_groupings - that list the countries and their codes, list the various types of entity available, and list the members of a composite entity (e.g. the states inside a country). The entity types are as follows:
- 0 System
- 1 Global
- 2 Continental
- 3 Regional (country grouping)
- 4 National
- 5 State (The largest of the administrative region types within a country, e.g. state, province, devolved nation, land)
- 6 County (The second tier administrative area in the country - e.g. county, départment)
- 7 Mixed
A section of the geog_entities is shown in Figure 1:
The iso codes (2 letter, 3 letter and numeric) associated with countries refer to the ISO-3166 country codes. Then geoname ids refer to the Ids to be found within the GeoNames geographical database (see http://www.geonames.org/). These include states as well as countries. Within the MDR, however, the id used for all geographical entities, e.g. within the organisations table, is the id field from this table, the internal accession number. This provides the greatest flexibility in labelling the geographic scope of entities.
The geog_groupings table is designed to support aggregation of geographically related data, by listing the constituent parts of entities. A small section of the table is shown in figure 2 below, in this case illustrating the member states of the EU:
Publisher Data
The context schema includes three tables that relate print issns (pissns), electronic issns (eissns) and journal names to a publisher id, and a fourth table that gives the publisher name against that id. The data is largely derived from CrossRef, although the publisher's id is derived from its entry in the organisations table. The tables are required because PubMed citations identify the journal name, and its pissn and / or eissn, but not the publisher. The publisher is usually the managing organisation for a journal paper - in the sense that it is the publisher rather than the editorial board that decides whether a journal or paper is open access or not. The publisher's identity is therefore an important part of the metadata for a published paper. These four look up tables are therefore used to identify the publisher.
People Data
Ideally, there would be a 'contextual' list of people that would relate people's names, titles and affiliations to ORCID identifiers, and only the ORCID Ids would be used within the system. Unfortunately only a very small proportion of the names in the source data (as study leads or contacts, or as journal paper authors) come with ORCID identifiers. Names come in a variety of different formats - sometimes with titles, sometimes with first name rather than initials, sometimes with the family name before the given name - and there is no obvious pattern or consistency to this, even within a single source. In general, therefore, standardisation and rationalisation of name and affiliation data is not possible in the MDR, and such data remains largely as it is presented in the source data (though titles are dropped when they can be identified). There are no plans at the moment to create a contextual resource for people, unless and until adoption of ORCID identifiers becomes much more widespread.
Controlled Terminology
Many of the data fields in the MDR system are categorised, i.e. they are constrained to hold one of a predefined set of possible values.
For example, a clinical study may fall into one of several different main ‘types’, according to the broad methodology used:
- Interventional
- Observational
- Observational [Patient Registry]
- Expanded access
- Funded programme
Although clinical studies are usually categorised in similar ways, different systems can use different terms for those categories. Using the MDR’s controlled terminology, studies labelled as ‘clinical trial’, ‘randomised trial’, or ‘active intervention’ would all be mapped to a study type of ‘Interventional’, while other studies, described variously as ‘off-label’, ‘compassionate use’, or ‘pre-licence’, would all be mapped to ‘Expanded Access’. The options available for any categorised data item are stored in a 'look up' table.
Structure of Look Up tables
The MDR system uses a collection of these ‘look up’ tables. In the context of a user interface, each lookup table would often be used as the source of a dropdown or list box that displayed the categories for search or filtering purposes. Each look up table lists the options available for that data field, and almost all have a core common structure:
- An integer id field, that provides a key for each option. The database only has to store the id to represent the category being used, making storage much more efficient.
- A string name field, with the short name of the category, as it would appear – for instance – in a dropdown box that displayed the alternatives available.
- A text description field, that provides, when necessary, a brief explanation of what the category meant.
- An integer list_order field, which allows the default order in a display or report (normally that of the id, as the primary key) to be over-ridden, by ordering the records according to the value of this field.
A few look up tables have one or two additional fields but the great majority follow the pattern above.
The categories used are normally as already exist within DataCite or other key systems, augmented if necessary by ECRIN to better cover the full range of categories required for clinical research data objects. For instance the listed study types are taken from the study type classification found within the ClinicalTrials.gov trials registry.
The lookup tables contain two additional ‘audit’ field:
- A string source field, denoting the system from which the term was originally taken. Terms created by the MDR team are labelled as ‘ECRIN’
- A date date_added field, that indicates the date a category was agreed as being required by the MDR system (though full implementation may be later). If a term has been proposed but not yet agreed this date is null.
The study_types table, below, illustrates how this structure works in the case of study categories.
id | name | description | list_order | source | date_added |
---|---|---|---|---|---|
0 | Not yet known | Dummy value supplied by default on entity creation. | 99 | ECRIN | 2019-02-08 |
11 | Interventional | A clinical trial. | 10 | ClinicalTrials.gov | 2019-02-08 |
12 | Observational | Any form of non-interventional research. | 20 | ClinicalTrials.gov | 2019-02-08 |
13 | Observational Patient Registry | Collecting data for a designated registry. | 30 | ClinicalTrials.gov | 2019-02-08 |
14 | Expanded access | Off label usage of a new product for individuals. | 40 | ClinicalTrials.gov | 2019-02-08 |
15 | Funded programme | With a single or linked series of grants. | 50 | ClinicalTrials.gov | 2019-02-08 |
Listing of Look Up tables
At present the MDR system contains 29 look up tables, with most indicating that they refer to categorised fields by their name, usually <something>_types, occasionally _codes, categories or _classes.
- study attributes (study_types, study_statuses, gender_eligibility_types)
- study features (feature_types and categories of phase, allocation method, masking, time perspective etc.)
- study_relationship_types
- object_classes
- object_types
- object_filter_types
- object_access_types
- object_instance_types
- object_relationship_types
- contribution_types
- dataset_consent_types
- dataset_de-identification_levels
- dataset_record_key_types
- date_types
- description_types
- doi_status_types
- identifier_types
- language_codes
- language_usage_types
- resource_types
- title_types
- topic_types
- topic_vocabularies
- units (for size and time)
- composite_hash_types
Each of these tables is described in more detail on its own wiki page.