Difference between revisions of "Contextual Data"
(→Geographical Data) |
(→Publisher Data) |
||
(55 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
+ | <p style="color:blue; text-align:right"><small>'''''Last updated: 10/05/2022'''''</small></p> | ||
===Introduction=== | ===Introduction=== | ||
All the systems and databases within the MDR need to access so called 'contextual' data, which is of two main types: | All the systems and databases within the MDR need to access so called 'contextual' data, which is of two main types: | ||
− | * Context data: Data relating to the 'landscape' in which clinical research is carried out - organisations and people, countries and regions, systems for classifying keywords and | + | * Context data: Data relating to the 'landscape' in which clinical research is carried out - organisations and people, countries and regions, systems for classifying keywords and topics etc. |
* Controlled terminology: Data relating to the options available for categorised questions within the MDR itself. | * Controlled terminology: Data relating to the options available for categorised questions within the MDR itself. | ||
Rather than replicate such data in every database, all of these data are stored in a separate context database, in the ctx and lup (look up) schemas respectively.<br/> | Rather than replicate such data in every database, all of these data are stored in a separate context database, in the ctx and lup (look up) schemas respectively.<br/> | ||
− | Postgres does not allow statements made in one database context to reference objects in another DB directly, so when processes in one of the source data databases require access to contextual data, one or both context schemas are imported into the database within a 'foreign table wrapper'. This allows the data, but not the table designs, to be read and manipulated. It does risk, however, cluttering up the databases with additional schema, so within the MDR such foreign table wrappers are always created on a temporary basis, when required, and are torn down again once the processing is completed (see the Managing Links between databases section in '''[[Data_Structures]]''').<br/> | + | Postgres does not allow statements made in one database context to reference objects in another DB directly, so when processes in one of the source data databases require access to contextual data, one or both context schemas are imported into the database within a 'foreign table wrapper'. This allows the data, but not the table designs, to be read and manipulated. It does risk, however, cluttering up the databases with additional schema, so within the MDR such foreign table wrappers are always created on a temporary basis, as and when required, and are torn down again once the processing is completed (see the Managing Links between databases section in '''[[Data_Structures]]''').<br/> |
The sections below provide further details on the various types of contextual data in the system. | The sections below provide further details on the various types of contextual data in the system. | ||
− | |||
====Organisation Data==== | ====Organisation Data==== | ||
+ | The context database contains several tables concerned with organisational data, though the main two are | ||
+ | * ctx.organisations, and | ||
+ | * ctx.org_names | ||
+ | The organisations table itself lists organisations that are legal entities (e.g. universities, healthcare providers, publishers, pharmaceutical companies), that are part of such entities (e.g. trials units, laboratories, hospitals), or are government agencies (e.g. regulatory authorities, ethical review organisations), or are collaborations or research projects (e.g. H2020 projects). In other words an 'organisation' in the MDR is loosely defined as any entity that is not an individual, but which is made up of people with a common purpose, that appears in the data. A group is defined as an organisation when it is convenient to do so, at a level that makes most sense in the context, not because it is legally or otherwise externally defined. In addition, some 'organisations' in the system, such as a multi-national drugs company, are defined as a single organisation when legally they may be composed of many different organisations - e.g. subsidiaries in different countries.<br/> | ||
+ | The organisations table has the structure: | ||
+ | <div style="font-family: monospace; font-size: 13px"> | ||
+ | <pre> | ||
+ | id integer primary key | ||
+ | , default_name varchar | ||
+ | , ror_id varchar the id within the ROR organisational database, if there is one | ||
+ | , display_suffix varchar usually location, when location is not obvious from the name | ||
+ | , scope_id integer an id from the ctx.geog_entities table that indictes any explicit geographical 'scope' of the organisation | ||
+ | , scope_notes varchar any additonal comments on the geographical scope | ||
+ | , is_current integer 1 by default, 0 if no longer exists | ||
+ | , year_established integer if known | ||
+ | , year_ceased integer if applicable and known | ||
+ | , created_on timestamptz created automatically | ||
+ | </pre> | ||
+ | </div> | ||
+ | while a few rows from the table are shown in figure 1. <br/> | ||
+ | Note that the geographical scope is applied when there is an ''explicit'' geographical scope attached to an organisation's work, often for national or multi-national institutions. Most organisations, such as universities, hospitals, or commercial companies, while they are based in a country or countries, are not restrictred to a particular area in terms of the work they do, for example they may take students, patients and customers from outside their own geographical area.<br/> | ||
+ | The ROR (Research Organisation Registry) Id is available for many organisations that are also legal entities, and the URL that is the Id provides a link through to the set of associated ROR data - see [https://ror.org/ ROR]. As the ROR system expands it is hoped that a greater proportion of the organisations in the system will be linked to ROR Ids. | ||
+ | [[File:orgs 2.png|center||]] | ||
+ | <div style="text-align:center">'''Figure 1:''' Sample rows from the ctx.organisations table</div> | ||
+ | The structure of other org_names table is: | ||
+ | <div style="font-family: monospace; font-size: 13px"> | ||
+ | <pre> | ||
+ | id integer primary key | ||
+ | , org_id integer id of the organisation recvord | ||
+ | , qualifier_id integer the nature of the other name, e.g. abbreviation, translation | ||
+ | , lang_code varchar the language code of the other name | ||
+ | , script_code varchar the code of the script being used (for future use) | ||
+ | , name varchar the name itself | ||
+ | </pre> | ||
+ | </div> | ||
+ | Figure 2 shows some sample rows, in this case illustrating the different names for the WHO and some of the trial registries in the system. The qualifier_id field is used to indicate the 'type' of the name compared to the default (qualifier id = 1), for example an abbreviation, a translation, former name etc. The script_code field will be used to indicate the identity of any non-latin script, which, because of the use of translitteration schemes for chinese, japanese, russian etc., may vary independently of the language. | ||
+ | [[File:orgnames 2.png|center||]] | ||
+ | <div style="text-align:center">'''Figure 2:''' Sample rows from the ctx.org_other_names table</div> | ||
+ | Different names are required in the system for any particular organisation because the names are almost always input into source systems as free text, allowing for a huge variety of spellings and forms. The updating contextual data section of '''[[Harvesting Data]]''' explains how the organisation tables are used to make the organisational data within various tables = so far as is possible - more standardised.<br/> | ||
+ | The org_type_membership table is designed to show the type(s) of the organisation, which may be multiple - for example an organisation may be both a funder and a charity, or both a research institution and a healthcare provider. The table is therefore essentially a join table between organisations and the lup.org_types listing of possible organisation types. Other tables include org_locations and org_links. These tables, and others, are likely to play a bigger role in the future as integration with the ROR system increases and systems are developed to update organisational and other contextual data independently of the MDR, using a separate interface.<br/> | ||
+ | The ROR data, integrated with the organisational data in the ctx schema, is downloaded periodically from the ROR web site and is stored in tables in the '''ror''' schema. | ||
+ | |||
+ | ====Topic Data==== | ||
+ | The topics or keywords associated with studies and journal papers is an obvious pool of data for searching and filtering, but that functionality is not helped by the fact that a 'topic' can often be expressed in many ways. Synonym files can be used (and within the MDR portal they are used) to support search functions, but it would also be good to try and rationalise the topic data in the database, converting terms to a standard vocabulary wherever possible, linking terms to a code and removing duplicates.<br/> | ||
+ | Because so much of the data already has MESH coding applied this system is the natural first choice to use for this process. It also has the advantages of being a well known, widely used and comprehensive system covering all aspects of clinical research. Most other systems are either relatively specialised (LOINC, MedDRA, ICD etc.) or are much less widely used - at least in the context of the source data systems (e.g. SNOMED CT).<br/> | ||
+ | The MESH data is downloaded periodically as part of a larger periodic download of UMLS data, stored in the and '''umls''' schema, and includes two main MESH source tables, one for descriptive terms and the other for supplementary terns. These are combined into a table called mesh_lookup, with over 500,000 records but with just 4 fields: | ||
+ | * entry - a lower case text term that is one of the synonyms for a MESH code | ||
+ | * code - the MESH code itself | ||
+ | * term - the 'preferred' MESH term | ||
+ | * source - whether from the descriptive or supplementary list | ||
+ | Figure 3 provides some examples of the rowes in this table, referring to 3 MESH codes but 16 entry terms. | ||
+ | [[File:mesh codes.png|center||]] | ||
+ | <div style="text-align:center">'''Figure 3:''' Sample rows from the ctx.mesh_lookup table</div> | ||
+ | As is explained in the Updating contextual data section of '''[[Harvesting Data]]''', this table is used in a lookup process to replace entry terms in the source data with the corresponding MESH term, although the original term is also preserved in the data. At the same time terms that have not been matched are also stored, to see if they can be mapped to a MESH code in the future. | ||
====Geographical Data==== | ====Geographical Data==== | ||
− | Geographical entities - countries, regions, continents, states, etc. - are referenced both within the main MDR, as a possible topic or keyword applied to some studies or papers, and within the contextual data , because organisations have a | + | Geographical entities - countries, regions, continents, states, etc. - are referenced both within the main MDR, as a possible topic or keyword applied to some studies or papers, and within the contextual data, because organisations may have a geographic scope_id. They can be entered into the source systems under a variety of different names (US, U.S., USA. United States, United States of America, etc.) and it is therefore better to standardise and code them where possible.<br/> |
− | The contextual data includes | + | The contextual data includes two tables - ctx.geog_entities and lup.geog_entity_types - that list countries and other geographical entities. The entity types are as follows: |
− | + | * 1 Global : A single entity with a global extent | |
− | * 1 | + | * 2 Continental : Geonames listed continents |
− | * 2 | + | * 3 Regional : Geonames country groupings |
− | * 3 | + | * 4 National : Geonames countries |
− | * 4 | + | * 5 State : Geonames admin1s, the top-level administrative regions |
− | * 5 | + | * 6 Political or economic : Groupings with a political or economic purpose, rather than geographical proximity |
− | * 6 | + | * 7 Other : Any entity not included in other categories |
− | * 7 | + | A section of the geog_entities is shown in Figure 4, and includes some regional groupings as well as some countries. In fact the most common type of entity is 'State', the main adminitrative division listed within each country. The source of the geographic data is the geonames website ([https://www.geonames.org/ geonames]) and the data is periodically downloaded from there and stored in the '''geonames''' schema, before some basic processing to create the geog_entities table.<br/> |
− | A section of the geog_entities is shown | + | [[File:countries 2.png|center||]] |
− | + | <div style="text-align:center">'''Figure 4:''' Sample rows from the ctx.geog_entities table</div> | |
− | The | + | <br/> |
− | |||
− | |||
====Publisher Data==== | ====Publisher Data==== | ||
− | The context schema includes | + | The context schema includes tables that relate print issns (pissns), electronic issns (eissns) and journal names to a publisher id, and a fourth table that gives the publisher name against that id. The data is largely derived from CrossRef, although the publisher's id is derived from its entry in the organisations table. The tables are required because PubMed citations identify the journal name, and its pissn and / or eissn, but not the publisher. The publisher is usually the managing organisation for a journal paper - in the sense that it is the publisher rather than the editorial board that decides whether a journal or paper is open access or not. The publisher's identity is therefore an important part of the metadata for a published paper. These four look up tables are therefore used to identify the publisher. The 'source' tables for this data, as periodically downloaded from crossref, are stored in the '''crossref''' and '''pubs''' schemas. |
====People Data==== | ====People Data==== | ||
Ideally, there would be a 'contextual' list of people that would relate people's names, titles and affiliations to ORCID identifiers, and only the ORCID Ids would be used within the system. Unfortunately only a very small proportion of the names in the source data (as study leads or contacts, or as journal paper authors) come with ORCID identifiers. Names come in a variety of different formats - sometimes with titles, sometimes with first name rather than initials, sometimes with the family name before the given name - and there is no obvious pattern or consistency to this, even within a single source. In general, therefore, standardisation and rationalisation of name and affiliation data is not possible in the MDR, and such data remains largely as it is presented in the source data (though titles are dropped when they can be identified). There are no plans at the moment to create a contextual resource for people, unless and until adoption of ORCID identifiers becomes much more widespread. | Ideally, there would be a 'contextual' list of people that would relate people's names, titles and affiliations to ORCID identifiers, and only the ORCID Ids would be used within the system. Unfortunately only a very small proportion of the names in the source data (as study leads or contacts, or as journal paper authors) come with ORCID identifiers. Names come in a variety of different formats - sometimes with titles, sometimes with first name rather than initials, sometimes with the family name before the given name - and there is no obvious pattern or consistency to this, even within a single source. In general, therefore, standardisation and rationalisation of name and affiliation data is not possible in the MDR, and such data remains largely as it is presented in the source data (though titles are dropped when they can be identified). There are no plans at the moment to create a contextual resource for people, unless and until adoption of ORCID identifiers becomes much more widespread. | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− |
Latest revision as of 17:30, 11 May 2022
Last updated: 10/05/2022
Contents
Introduction
All the systems and databases within the MDR need to access so called 'contextual' data, which is of two main types:
- Context data: Data relating to the 'landscape' in which clinical research is carried out - organisations and people, countries and regions, systems for classifying keywords and topics etc.
- Controlled terminology: Data relating to the options available for categorised questions within the MDR itself.
Rather than replicate such data in every database, all of these data are stored in a separate context database, in the ctx and lup (look up) schemas respectively.
Postgres does not allow statements made in one database context to reference objects in another DB directly, so when processes in one of the source data databases require access to contextual data, one or both context schemas are imported into the database within a 'foreign table wrapper'. This allows the data, but not the table designs, to be read and manipulated. It does risk, however, cluttering up the databases with additional schema, so within the MDR such foreign table wrappers are always created on a temporary basis, as and when required, and are torn down again once the processing is completed (see the Managing Links between databases section in Data_Structures).
The sections below provide further details on the various types of contextual data in the system.
Organisation Data
The context database contains several tables concerned with organisational data, though the main two are
- ctx.organisations, and
- ctx.org_names
The organisations table itself lists organisations that are legal entities (e.g. universities, healthcare providers, publishers, pharmaceutical companies), that are part of such entities (e.g. trials units, laboratories, hospitals), or are government agencies (e.g. regulatory authorities, ethical review organisations), or are collaborations or research projects (e.g. H2020 projects). In other words an 'organisation' in the MDR is loosely defined as any entity that is not an individual, but which is made up of people with a common purpose, that appears in the data. A group is defined as an organisation when it is convenient to do so, at a level that makes most sense in the context, not because it is legally or otherwise externally defined. In addition, some 'organisations' in the system, such as a multi-national drugs company, are defined as a single organisation when legally they may be composed of many different organisations - e.g. subsidiaries in different countries.
The organisations table has the structure:
id integer primary key , default_name varchar , ror_id varchar the id within the ROR organisational database, if there is one , display_suffix varchar usually location, when location is not obvious from the name , scope_id integer an id from the ctx.geog_entities table that indictes any explicit geographical 'scope' of the organisation , scope_notes varchar any additonal comments on the geographical scope , is_current integer 1 by default, 0 if no longer exists , year_established integer if known , year_ceased integer if applicable and known , created_on timestamptz created automatically
while a few rows from the table are shown in figure 1.
Note that the geographical scope is applied when there is an explicit geographical scope attached to an organisation's work, often for national or multi-national institutions. Most organisations, such as universities, hospitals, or commercial companies, while they are based in a country or countries, are not restrictred to a particular area in terms of the work they do, for example they may take students, patients and customers from outside their own geographical area.
The ROR (Research Organisation Registry) Id is available for many organisations that are also legal entities, and the URL that is the Id provides a link through to the set of associated ROR data - see ROR. As the ROR system expands it is hoped that a greater proportion of the organisations in the system will be linked to ROR Ids.
The structure of other org_names table is:
id integer primary key , org_id integer id of the organisation recvord , qualifier_id integer the nature of the other name, e.g. abbreviation, translation , lang_code varchar the language code of the other name , script_code varchar the code of the script being used (for future use) , name varchar the name itself
Figure 2 shows some sample rows, in this case illustrating the different names for the WHO and some of the trial registries in the system. The qualifier_id field is used to indicate the 'type' of the name compared to the default (qualifier id = 1), for example an abbreviation, a translation, former name etc. The script_code field will be used to indicate the identity of any non-latin script, which, because of the use of translitteration schemes for chinese, japanese, russian etc., may vary independently of the language.
Different names are required in the system for any particular organisation because the names are almost always input into source systems as free text, allowing for a huge variety of spellings and forms. The updating contextual data section of Harvesting Data explains how the organisation tables are used to make the organisational data within various tables = so far as is possible - more standardised.
The org_type_membership table is designed to show the type(s) of the organisation, which may be multiple - for example an organisation may be both a funder and a charity, or both a research institution and a healthcare provider. The table is therefore essentially a join table between organisations and the lup.org_types listing of possible organisation types. Other tables include org_locations and org_links. These tables, and others, are likely to play a bigger role in the future as integration with the ROR system increases and systems are developed to update organisational and other contextual data independently of the MDR, using a separate interface.
The ROR data, integrated with the organisational data in the ctx schema, is downloaded periodically from the ROR web site and is stored in tables in the ror schema.
Topic Data
The topics or keywords associated with studies and journal papers is an obvious pool of data for searching and filtering, but that functionality is not helped by the fact that a 'topic' can often be expressed in many ways. Synonym files can be used (and within the MDR portal they are used) to support search functions, but it would also be good to try and rationalise the topic data in the database, converting terms to a standard vocabulary wherever possible, linking terms to a code and removing duplicates.
Because so much of the data already has MESH coding applied this system is the natural first choice to use for this process. It also has the advantages of being a well known, widely used and comprehensive system covering all aspects of clinical research. Most other systems are either relatively specialised (LOINC, MedDRA, ICD etc.) or are much less widely used - at least in the context of the source data systems (e.g. SNOMED CT).
The MESH data is downloaded periodically as part of a larger periodic download of UMLS data, stored in the and umls schema, and includes two main MESH source tables, one for descriptive terms and the other for supplementary terns. These are combined into a table called mesh_lookup, with over 500,000 records but with just 4 fields:
- entry - a lower case text term that is one of the synonyms for a MESH code
- code - the MESH code itself
- term - the 'preferred' MESH term
- source - whether from the descriptive or supplementary list
Figure 3 provides some examples of the rowes in this table, referring to 3 MESH codes but 16 entry terms.
As is explained in the Updating contextual data section of Harvesting Data, this table is used in a lookup process to replace entry terms in the source data with the corresponding MESH term, although the original term is also preserved in the data. At the same time terms that have not been matched are also stored, to see if they can be mapped to a MESH code in the future.
Geographical Data
Geographical entities - countries, regions, continents, states, etc. - are referenced both within the main MDR, as a possible topic or keyword applied to some studies or papers, and within the contextual data, because organisations may have a geographic scope_id. They can be entered into the source systems under a variety of different names (US, U.S., USA. United States, United States of America, etc.) and it is therefore better to standardise and code them where possible.
The contextual data includes two tables - ctx.geog_entities and lup.geog_entity_types - that list countries and other geographical entities. The entity types are as follows:
- 1 Global : A single entity with a global extent
- 2 Continental : Geonames listed continents
- 3 Regional : Geonames country groupings
- 4 National : Geonames countries
- 5 State : Geonames admin1s, the top-level administrative regions
- 6 Political or economic : Groupings with a political or economic purpose, rather than geographical proximity
- 7 Other : Any entity not included in other categories
A section of the geog_entities is shown in Figure 4, and includes some regional groupings as well as some countries. In fact the most common type of entity is 'State', the main adminitrative division listed within each country. The source of the geographic data is the geonames website (geonames) and the data is periodically downloaded from there and stored in the geonames schema, before some basic processing to create the geog_entities table.
Publisher Data
The context schema includes tables that relate print issns (pissns), electronic issns (eissns) and journal names to a publisher id, and a fourth table that gives the publisher name against that id. The data is largely derived from CrossRef, although the publisher's id is derived from its entry in the organisations table. The tables are required because PubMed citations identify the journal name, and its pissn and / or eissn, but not the publisher. The publisher is usually the managing organisation for a journal paper - in the sense that it is the publisher rather than the editorial board that decides whether a journal or paper is open access or not. The publisher's identity is therefore an important part of the metadata for a published paper. These four look up tables are therefore used to identify the publisher. The 'source' tables for this data, as periodically downloaded from crossref, are stored in the crossref and pubs schemas.
People Data
Ideally, there would be a 'contextual' list of people that would relate people's names, titles and affiliations to ORCID identifiers, and only the ORCID Ids would be used within the system. Unfortunately only a very small proportion of the names in the source data (as study leads or contacts, or as journal paper authors) come with ORCID identifiers. Names come in a variety of different formats - sometimes with titles, sometimes with first name rather than initials, sometimes with the family name before the given name - and there is no obvious pattern or consistency to this, even within a single source. In general, therefore, standardisation and rationalisation of name and affiliation data is not possible in the MDR, and such data remains largely as it is presented in the source data (though titles are dropped when they can be identified). There are no plans at the moment to create a contextual resource for people, unless and until adoption of ORCID identifiers becomes much more widespread.