Difference between revisions of "Data Structures"

From ECRIN-MDR Wiki
Jump to navigation Jump to search
(Differences between core and other schemas)
Line 146: Line 146:
 
* ''Hash fields:'' Tables in the sd and ad schemas all contain additional record_hash fields that summarise in a single field the contents of the rest of the record. They play a central role in comparing records, to identify edits that have occurred. There are also additional composite hash tables in each source sd and ad schema, that hold a 'total hash' for each attribute type, for each study or object, used to identify more easily what attributes have changed. Both study and data object records also contain a 'full_hash' field, that provides an overall indicator of change, for the full study or object data (i.e. including all attributes). The ways in which the hashes are used is detailed in '''[[Missing PIDs and Hashing]]'''.  
 
* ''Hash fields:'' Tables in the sd and ad schemas all contain additional record_hash fields that summarise in a single field the contents of the rest of the record. They play a central role in comparing records, to identify edits that have occurred. There are also additional composite hash tables in each source sd and ad schema, that hold a 'total hash' for each attribute type, for each study or object, used to identify more easily what attributes have changed. Both study and data object records also contain a 'full_hash' field, that provides an overall indicator of change, for the full study or object data (i.e. including all attributes). The ways in which the hashes are used is detailed in '''[[Missing PIDs and Hashing]]'''.  
 
The code (a series of SQL statements called in succession) used to create the sd and ad tables is available at '''[[SD table creation code]]''' and '''[[AD table creation code]]'''.
 
The code (a series of SQL statements called in succession) used to create the sd and ad tables is available at '''[[SD table creation code]]''' and '''[[AD table creation code]]'''.
 +
 +
===Backing up the data and the databases===
 +
An account of how both the data in the databases and the local XML files that are the data sources can be found at '''[[Backing up the data]]'''

Revision as of 15:47, 21 December 2020

This page provides some details of the data infrastructure required to support the MDR. It also outlines the structure of the individual databases used at each stage of the extraction process (most of which are very similar) and describes when and how they are constructed, and how references across databases are established and removed as necessary. Links are provided to individual code segments that build the main tables involved.

Databases and Data Schemas Infrastructure

Figure 1: Pattern of data collection databases and schemas

The MDR system is based on PostgreSQL. In Postgres data is held within multiple databases, each of which can be subdivided into different data schemas, or groups of related tables. Tables in each schema have a different prefix, separated by a '.' before the table name.
The main databases required are (see figure 1):

  • A separate database for each data source, named after that source.
  • A 'context' database, that acts as a central store of contextual data (e.g. organisations, countries) and the look up tables used within the system.
  • A 'mon' (monitor) logging and tracking database, used to monitor the state of each source data package in the system.
  • The mdr database itself, which contains the aggregated data.


In more detail, the schemas required are:

  • In each data source database
    • A schema (sd) that holds the data as harvested each time.
    • A schema (ad) for the accumulated data from that source.
    • In some cases it may be useful to have a third, supplementary schema (pp), to hold source specific support tables.
  • In the context database
    • A schema (lup) that holds a central store of lookup tables, i.e. controlled terminology.
    • A schema (ctx) that holds contextual data relating to organisations and people, countries and regions, and MESH codes.
  • In the mon database
    • A schema (sf), to hold a list of data sources and a log record of the downloads, harvests, imports and aggregations carried out. 'sf' stands for 'search and fetch', which was the initial activity stored in this schema.
Figure 2: Databases on the PostgreSQL installation
  • In the mdr database
    • A schema (st) that holds aggregate study data points, i.e. from all data sources (mostly trial registries), and links them together so that study duplications are resolved.
    • A schema (ob) that holds aggregate data object data points, i.e. from all sources, and links them together so that data object duplications are resolved.
    • A schema (nk) that holds the linkage data, between studies and data objects, i.e. from all sources, and links them together so that duplications are resolved.
    • The metadata repository itself, (schema = core), the final end point of the import processes and the data flowing through them. It brings together the data from the three 'aggregating' schemas.


The list of databases that emerges is shown in figure 2, which shows the PostgreSQL installation, with the context, mon and mdr databases, and then a DB for each source (as of October 2020).

Managing Links between databases

Postgres allows direct links between schemas in the same database - for example queries can be written that reference tables in different schemas. Having 4 schemas within the main mdr database - st, ob, nk and core - simplifies the interactions between these data stores. Postgres does not allow, however, simple references between databases. Instead schemas from an external database can be imported as 'foreign tables' (not copies but references to the source tables and their data) into pre-existing schemas in the importing database. By convention these foreign table schemas are distinct from the 'native' schemas and data, and usually take the same name as the source schema.

In the data systems required for the MDR, each of the source specific databases and the mdr database requires the mon (logging tables), the lup (look up tables) and ctx (context data) available as 'foreign tables'. In that way it is necessary to store only one copy of this important system or contextual data.
In addition the central mdr database, as well as its 4 native schemas, needs the ad schemas of each of the source databases to be available as foreign table schemas, so that the data can be easily transferred during the aggregation process.

Rather than fill the system with these additional imported 'foreign table' schemas, which in some cases could overwhelm the 'native' schemas in each database (for instance the mdr database needs to be able to access every one of 20 - in the future many more - source ad schemas), the data extraction system sets these foreign table links up as temporary features during a particular process, as required, and then tears them down again at the end. Typical code for this operation is shown below. In the aggregation process, each source is considered in turn, and a temporary set of foreign tables representing the ad tables of that source (or for the mon database the sf tables) isestablished in the mdr database to enable data transfer. The operation is carried out by constructing a series of SQL statements that are executed using the Dapper micro-ORM (conn.Execute()), in the context of a connection to the relevant database, (connString, username and password are all established within the containing class's constructor). The foreign table schema name is returned from the function, to be fed back as a parameter in later transfer routines.

        public string SetUpTempFTW(string database_name)
        {
            using (var conn = new NpgsqlConnection(connString))
            {
                string sql_string = @"CREATE EXTENSION IF NOT EXISTS postgres_fdw
                                     schema core;";
                conn.Execute(sql_string);

                sql_string = @"CREATE SERVER IF NOT EXISTS " + database_name
                           + @" FOREIGN DATA WRAPPER postgres_fdw
                             OPTIONS (host 'localhost', dbname '" + database_name + "');";
                conn.Execute(sql_string);

                sql_string = @"CREATE USER MAPPING IF NOT EXISTS FOR CURRENT_USER
                     SERVER " + database_name
                     + @" OPTIONS (user '" + username + "', password '" + password + "');";
                conn.Execute(sql_string);
                string schema_name = "";
                if (database_name == "mon")
                {
                    schema_name = database_name + "_sf";
                    sql_string = @"DROP SCHEMA IF EXISTS " + schema_name + @" cascade;
                     CREATE SCHEMA " + schema_name + @";
                     IMPORT FOREIGN SCHEMA sf
                     FROM SERVER " + database_name +
                         @" INTO " + schema_name + ";";
                }
                else
                {
                    schema_name = database_name + "_ad";
                    sql_string = @"DROP SCHEMA IF EXISTS " + schema_name + @" cascade;
                     CREATE SCHEMA " + schema_name + @";
                     IMPORT FOREIGN SCHEMA ad
                     FROM SERVER " + database_name +
                         @" INTO " + schema_name + ";";
                }
                conn.Execute(sql_string);
                return schema_name;
            }
        }

At the end of each source specific transfer the temporary schema and its foreign tables are destroyed, leaving the way clear for the next set of source tables, and keeping the mdr database 'clean' and relatively uncluttered by additional schemas:

        public void DropTempFTW(string database_name)
        {
            using (var conn = new NpgsqlConnection(connString))
            {
                string schema_name = "";
                if (database_name == "mon")
                {
                    schema_name = database_name + "_sf";
                }
                else
                {
                    schema_name = database_name + "_ad";
                }

                string sql_string = @"DROP USER MAPPING IF EXISTS FOR CURRENT_USER
                     SERVER " + database_name + ";";
                conn.Execute(sql_string);

                sql_string = @"DROP SERVER IF EXISTS " + database_name + " CASCADE;";
                conn.Execute(sql_string);

                sql_string = @"DROP SCHEMA IF EXISTS " + schema_name;
                conn.Execute(sql_string);
            }
        }

Overview of database structure

The structure of the tables in the sd and ad schemas of each source, and that in the mdr database, (split between the st, ob and nk schemas or combined in the core schema), is very similar. This is to make data transfer between these tables as simple as possible.
The final 'core' schema in the mdr is considered below, and then the differences between this and the other schemas and databases are described.
The core schema has 18 main tables - 6 that are concerned with study data, 11 that hold data object data, and one that connects the studies and data objects tables. (The core schema, but only the core schema, also has 2 very simple tables, not shown in figure 3, that hold the json strings derived for studies and data objects respectively).

Figure 3: The MDR Core tables

The 6 'study tables' comprise studies itself, and then

  • study_contributors
  • study_identifiers
  • study_relationships
  • study_titles
  • study_topics

As shown in figure 3, there is a one-to-many relationship between study records and the records in these tables. In the case of study_relationships there are two such joins - one for the 'subject' and the other for the 'target' study of the relationship. All the study tables have a foreign key (study_id) which links them back to the central studies table.

The 11 'data object tables' include data_objects itself, plus

  • object_datasets
  • object_contributors
  • object_dates
  • object_descriptions
  • object_identifiers
  • object_instances
  • object_relationships
  • object_rights
  • object_titles
  • object_topics

All the object tables have a foreign key (object_id) that links them back to the data_objects table. The relationship is again one-to-many, with the exception of the object_datasets table, which has a 1-to-1 link with data_objects, and the object_relationships table which has two joins - for the 'subject' and 'target' objects respectively. The object_datasets fields could be incorporated within data objects, but as only a relatively small proportion of the total data objects are datasets it is much more efficient to factor this data out as a separate table.
As shown by figure 3, the two groups of tables therefore form 2 clusters, one around the studies table and the other around the data_objects table. Each cluster corresponds to the central entity, studies or data objects, with its repeating attributes, as defined by the ECRIN metadata schema. The clusters are joined by a central study_object_links table, which holds the many-to-many relationships between studies and data objects.

The tables are created, in the core schema and elsewhere, by calling a series of SQL statements from the C# code. Each statement, one per table, is passed along in a Dapper call to the PostgreSQL database to be executed (tables, if they exist, are always explicitly dropped beforehand). The code for creating the core tables, which includes the detailed definition for each field, can be found at core table creation code.

Differences between core and other schemas

In some ways the core schema is the simplest schema in the system - it represents the 'end-product' of the extraction pipeline and is the closest to the metadata schema. Although the other schemas include broadly the same tables, with the same data fields, the following lists some of the ways in which those other schemas differ from this core pattern.

  • Ids in the source databases: The integer Ids in the core tables are generated within the aggregating schemas st, ob and nk (details are explained in the page on Aggregating Data). Within the source databases 'source data ids' are used instead. These are strings rather than integers and are derived from the source data. For trial registry sources the source data study id - in the database as sd_sid - is the registry id. For study based data repositories a repository generated id is used, although this cannot always be guaranteed to be persistent. For objects, the 'source data object id', sd_oid in the database, is usually created by the system using a hashing mechanism, because in general data objects do not have an intrinsic identifier. The exception is PubMed citations, where the PubMed Id is available.
  • Coded values include text equivalents in the sd tables: As an aid in checking the accuracy of the download and harvesting processes, the tables in the sd schemas include both the code and the text decode for categorised data points. In the rest of the system the text values are removed, to be added back only during final json production.
  • Source databases include a subset of tables: Although when a table exists in a source database it always has a consistent structure, not all databases require all tables. Source databases using the WHO data, for example, require only a few data object tables because the object data in the WHO source is sparse. The puubmed source database, on the other hand, has almost the full set of data object tables but no study tables at all. The system accesses a table (sf.source_parameters) that details the tables required in each sd and ad schema in the source databases, and uses that data to create only the tables necessary in each case.
  • Source databases can include an additional study_references table: Several of the source databases include this table, that holds the pubmed identifiers of journal articles referenced within study based records. This data is not needed in the core schema, but it plays an important role in defining part of the linkage between studies and journal articles, and is used within the aggregation process.
  • Audit fields in the ad schemas: The ad schema tables include three date-time fields - date_added, date_(last)_edited, date_exported (to the mdr database) which proivide a record of the date and time of these events.
  • Date data in the aggregate schemas: Within the mdr database a datetime aggregated_on field is present on all records in the st and ob schemas, though this disappears from the core tables. The 'date time of data fetch', which is carried by the main study and data object records through the system, is made into a sometimes composite provenance string, using the date time fetch data from however many source records contributed to the final core record. At the moment this applies only to the study records - most data objects have only a single source.
  • Link data: Within the sd and ad schemas, for study based sources, there is no separate link data table. Instead the links between data objects and studies are held within the data object records, as a reference to the 'parent' study. In the pubmed database (and any other object based source) a separate additional table is required to hold the many-to-many relationships that may occur. (This is in addition to the references data found in the study based data sources). The very simple study_object_links data in the core database are a simplified version of the more comprehensive link data that exists in the nk schema. There study-study and study-object links play a critical role in the aggregation process and therefore require several more fields. The core schema table, conversely, is just a simple join table. In the aggregation process the study links in the data objects are removed (they are only valid, and can be only be guaranteed to be unique, within a single source), and as described above the source data ids are replaced with integer accession Ids.
  • Hash fields: Tables in the sd and ad schemas all contain additional record_hash fields that summarise in a single field the contents of the rest of the record. They play a central role in comparing records, to identify edits that have occurred. There are also additional composite hash tables in each source sd and ad schema, that hold a 'total hash' for each attribute type, for each study or object, used to identify more easily what attributes have changed. Both study and data object records also contain a 'full_hash' field, that provides an overall indicator of change, for the full study or object data (i.e. including all attributes). The ways in which the hashes are used is detailed in Missing PIDs and Hashing.

The code (a series of SQL statements called in succession) used to create the sd and ad tables is available at SD table creation code and AD table creation code.

Backing up the data and the databases

An account of how both the data in the databases and the local XML files that are the data sources can be found at Backing up the data