Difference between revisions of "Data Structures"

From ECRIN-MDR Wiki
Jump to navigation Jump to search
(Managing Links between databases)
Line 106: Line 106:
 
         }
 
         }
 
</pre>
 
</pre>
 +
<br/>
 +
==Overview of database structure==
 +
The core database has 19 tables - 6 that are concerned with study data, and 12 that hold data object data, and one that connects the studies and data objects table.
 +
[[File:Table arrangement.png|thumb|600px|<b>Figure 1: Links between the MDR Core tables</b>]]
 +
The 6 'study tables' comprise ''studies'' itself, and then
 +
* study_contributors
 +
* study_identifiers
 +
* study_relationships
 +
* study_titles
 +
* study_topics
 +
The 12 'data object tables' include data_objects itself, plus
 +
* dataset_properties
 +
* object_contributors
 +
* object_dates
 +
* object_descriptions
 +
* object_identifiers
 +
* object_instances
 +
* object_languages
 +
* object_relationships
 +
* object_rights
 +
* object_titles
 +
* object_topics
 +
The final linking table is study_object_links.<br/><br/>
 +
As shown by figure 1, the two groups of tables form 2 clusters, one ariound the studies table and the other around the data_objects table.<br/>
 +
All the study tables have a foreign key (study_id) which links them back to the central studies table.<br/>
 +
All the object tables (bar one) have a foreign key (object_id) that links them back to the data_objects table. The exception is the datset_properties table, which has a 1-to-1 link with data_objects, through the id fields. <br/>
 +
In both cases, the '_relationships' tables have a double link to the central entitiy table, as both the 'source' and 'target entities for the relationship have to be linked.<br/>
 
<br/>
 
<br/>

Revision as of 02:35, 31 October 2020

This page provides some details of the data infrastructure required to support the MDR. It also outlines the structure of the individual databases used at each stage of the extraction process (most of which are very similar) and describes when and how they are constructed, and how references across databases are established and removed as necessary. Links are provided to individual code segments that build the main tables involved.

Databases and Data Schemas Infrastructure

Figure 1: Pattern of data collection databases and schemas

The MDR system is based on PostgreSQL. In Postgres data is held within multiple databases, each of which can be subdivided into different data schemas, or groups of related tables. Tables in each schema have a different prefix, separated by a '.' before the table name.
The main databases required are (see figure 1):

  • A separate database for each data source, named after that source.
  • A 'context' database, that acts as a central store of contextual data (e.g. organisations, countries) and the look up tables used within the system.
  • A 'mon' (monitor) logging and tracking database, used to monitor the state of each source data package in the system.
  • The mdr database itself, which contains the aggregated data.


In more detail, the schemas required are:

  • In each data source database
    • A schema (sd) that holds the data as harvested each time.
    • A schema (ad) for the accumulated data from that source.
    • In some cases it may be useful to have a third, supplementary schema (pp), to hold source specific support tables.
  • In the context database
    • A schema (lup) that holds a central store of lookup tables, i.e. controlled terminology.
    • A schema (ctx) that holds contextual data relating to organisations and people, countries and regions, and MESH codes.
  • In the mon database
    • A schema (sf), to hold a list of data sources and a log record of the downloads, harvests, imports and aggregations carried out. 'sf' stands for 'search and fetch', which was the initial activity stored in this schema.
Figure 2: Databases on the PostgreSQL installation
  • In the mdr database
    • A schema (st) that holds aggregate study data points, i.e. from all data sources (mostly trial registries), and links them together so that study duplications are resolved.
    • A schema (ob) that holds aggregate data object data points, i.e. from all sources, and links them together so that data object duplications are resolved.
    • A schema (nk) that holds the linkage data, between studies and data objects, i.e. from all sources, and links them together so that duplications are resolved.
    • The metadata repository itself, (schema = core), the final end point of the import processes and the data flowing through them. It brings together the data from the three 'aggregating' schemas.


The list of databases that emerges is shown in figure 2, which shows the PostgreSQL installation, with the context, mon and mdr databases, and then a DB for each source (as of October 2020).

Managing Links between databases

Postgres allows direct links between schemas in the same database - for example queries can be written that reference tables in different schemas. Having 4 schemas within the main mdr database - st, ob, nk and core - simplifies the interactions between these data stores. Postgres does not allow, however, simple references between databases. Instead schemas from an external database can be imported as 'foreign tables' (not copies but references to the source tables and their data) into pre-existing schemas in the importing database. By convention these foreign table schemas are distinct from the 'native' schemas and data, and usually take the same name as the source schema.

In the data systems required for the MDR, each of the source specific databases and the mdr database requires the mon (logging tables), the lup (look up tables) and ctx (context data) available as 'foreign tables'. In that way it is necessary to store only one copy of this important system or contextual data.
In addition the central mdr database, as well as its 4 native schemas, needs the ad schemas of each of the source databases to be available as foreign table schemas, so that the data can be easily transferred during the aggregation process.

Rather than fill the system with these additional imported 'foreign table' schemas, which in some cases could overwhelm the 'native' schemas in each database (for instance the mdr database needs to be able to access every one of 20 - in the future many more - source ad schemas), the data extraction system sets these foreign table links up as temporary features during a particular process, as required, and then tears them down again at the end. Typical code for this operation is shown below. In the aggregation process, each source is considered in turn, and a temporary set of foreign tables representing the ad tables of that source (or for the mon database the sf tables) isestablished in the mdr database to enable data transfer. The operation is carried out by constructing a series of SQL statements that are executed using the Dapper micro-ORM (conn.Execute()), in the context of a connection to the relevant database, (connString, username and password are all established within the containing class's constructor). The foreign table schema name is returned from the function, to be fed back as a parameter in later transfer routines.

        public string SetUpTempFTW(string database_name)
        {
            using (var conn = new NpgsqlConnection(connString))
            {
                string sql_string = @"CREATE EXTENSION IF NOT EXISTS postgres_fdw
                                     schema core;";
                conn.Execute(sql_string);

                sql_string = @"CREATE SERVER IF NOT EXISTS " + database_name
                           + @" FOREIGN DATA WRAPPER postgres_fdw
                             OPTIONS (host 'localhost', dbname '" + database_name + "');";
                conn.Execute(sql_string);

                sql_string = @"CREATE USER MAPPING IF NOT EXISTS FOR CURRENT_USER
                     SERVER " + database_name
                     + @" OPTIONS (user '" + username + "', password '" + password + "');";
                conn.Execute(sql_string);
                string schema_name = "";
                if (database_name == "mon")
                {
                    schema_name = database_name + "_sf";
                    sql_string = @"DROP SCHEMA IF EXISTS " + schema_name + @" cascade;
                     CREATE SCHEMA " + schema_name + @";
                     IMPORT FOREIGN SCHEMA sf
                     FROM SERVER " + database_name +
                         @" INTO " + schema_name + ";";
                }
                else
                {
                    schema_name = database_name + "_ad";
                    sql_string = @"DROP SCHEMA IF EXISTS " + schema_name + @" cascade;
                     CREATE SCHEMA " + schema_name + @";
                     IMPORT FOREIGN SCHEMA ad
                     FROM SERVER " + database_name +
                         @" INTO " + schema_name + ";";
                }
                conn.Execute(sql_string);
                return schema_name;
            }
        }


At the end of each source specific transfer the temporary schema and its foreign tables are destroyed, leaving the way clear for the next set of source tables, and keeping the mdr database 'clean' and relatively uncluttered by additional schemas:

        public void DropTempFTW(string database_name)
        {
            using (var conn = new NpgsqlConnection(connString))
            {
                string schema_name = "";
                if (database_name == "mon")
                {
                    schema_name = database_name + "_sf";
                }
                else
                {
                    schema_name = database_name + "_ad";
                }

                string sql_string = @"DROP USER MAPPING IF EXISTS FOR CURRENT_USER
                     SERVER " + database_name + ";";
                conn.Execute(sql_string);

                sql_string = @"DROP SERVER IF EXISTS " + database_name + " CASCADE;";
                conn.Execute(sql_string);

                sql_string = @"DROP SCHEMA IF EXISTS " + schema_name;
                conn.Execute(sql_string);
            }
        }


Overview of database structure

The core database has 19 tables - 6 that are concerned with study data, and 12 that hold data object data, and one that connects the studies and data objects table.

Figure 1: Links between the MDR Core tables

The 6 'study tables' comprise studies itself, and then

  • study_contributors
  • study_identifiers
  • study_relationships
  • study_titles
  • study_topics

The 12 'data object tables' include data_objects itself, plus

  • dataset_properties
  • object_contributors
  • object_dates
  • object_descriptions
  • object_identifiers
  • object_instances
  • object_languages
  • object_relationships
  • object_rights
  • object_titles
  • object_topics

The final linking table is study_object_links.

As shown by figure 1, the two groups of tables form 2 clusters, one ariound the studies table and the other around the data_objects table.
All the study tables have a foreign key (study_id) which links them back to the central studies table.
All the object tables (bar one) have a foreign key (object_id) that links them back to the data_objects table. The exception is the datset_properties table, which has a 1-to-1 link with data_objects, through the id fields.
In both cases, the '_relationships' tables have a double link to the central entitiy table, as both the 'source' and 'target entities for the relationship have to be linked.