Data Structures
This page provides some details of the data infrastructure required to support the MDR. It also outlines the structure of the individual databases used at each stage of the extraction process (most of which are very similar) and describes when and how they are constructed, and how references across databases are established and removed as necessary. Links are provided to individual code segments that build the main tables involved.
Contents
Databases and Data Schemas Infrastructure
The MDR system is based on PostgreSQL. In Postgres data is held within multiple databases, each of which can be subdivided into different data schemas, or groups of related tables. Tables in each schema have a different prefix, separated by a '.' before the table name.
The main databases required are (see figure 1):
- A separate database for each data source, named after that source.
- A 'context' database, that acts as a central store of contextual data (e.g. organisations, countries) and the look up tables used within the system.
- A 'mon' (monitor) logging and tracking database, used to monitor the state of each source data package in the system.
- The mdr database itself, which contains the aggregated data.
In more detail, the schemas required are:
- In each data source database
- A schema (sd) that holds the data as harvested each time.
- A schema (ad) for the accumulated data from that source.
- In some cases it may be useful to have a third, supplementary schema (pp), to hold source specific support tables.
- In the context database
- A schema (lup) that holds a central store of lookup tables, i.e. controlled terminology.
- A schema (ctx) that holds contextual data relating to organisations and people, countries and regions, and MESH codes.
- In the mon database
- A schema (sf), to hold a list of data sources and a log record of the downloads, harvests, imports and aggregations carried out. 'sf' stands for 'search and fetch', which was the initial activity stored in this schema.
- A schema (sf), to hold a list of data sources and a log record of the downloads, harvests, imports and aggregations carried out. 'sf' stands for 'search and fetch', which was the initial activity stored in this schema.
- In the mdr database
- A schema (st) that holds aggregate study data points, i.e. from all data sources (mostly trial registries), and links them together so that study duplications are resolved.
- A schema (ob) that holds aggregate data object data points, i.e. from all sources, and links them together so that data object duplications are resolved.
- A schema (nk) that holds the linkage data, between studies and data objects, i.e. from all sources, and links them together so that duplications are resolved.
- The metadata repository itself, (schema = core), the final end point of the import processes and the data flowing through them. It brings together the data from the three 'aggregating' schemas.
The list of databases that emerges is shown in figure 2, which shows the PostgreSQL installation, with the context, mon and mdr databases, and then a DB for each source (as of October 2020).
Managing Links between databases
Postgres allows direct links between schemas in the same database - for example queries can be written that reference tables in different schemas. Having 4 schemas within the main mdr database - st, ob, nk and core - simplifies the interactions between these data stores. Postgres does not allow, however, simple references between databases. Instead schemas from an external database can be imported as 'foreign tables' (not copies but references to the source tables and their data) into pre-existing schemas in the importing database. By convention these foreign table schemas are distinct from the 'native' schemas and data, and usually take the same name as the source schema.
In the data systems required for the MDR, each of the source specific databases and the mdr database requires the mon (logging tables), the lup (look up tables) and ctx (context data) available as 'foreign tables'. In that way it is necessary to store only one copy of this important system or contextual data.
In addition the central mdr database, as well as its 4 native schemas, needs the ad schemas of each of the source databases to be available as foreign table schemas, so that the data can be easily transferred during the aggregation process.
Rather than fill the system with these additional imported 'foreign table' schemas, which in some cases could overwhelm the 'native' schemas in each database (for instance the mdr database needs to be able to access every one of 20 - in the future many more - source ad schemas), the data extraction system sets these foreign table links up as temporary features during a particular process, as required, and then tears them down again at the end.
Typical code for this operation is shown below. In the aggregation process, each source is considered in turn, and a temporary set of foreign tables representing the ad tables of that source (or for the mon database the sf tables) isestablished in the mdr database to enable data transfer. The operation is carried out by constructing a series of SQL statements that are executed using the Dapper micro-ORM (conn.Execute()), in the context of a connection to the relevant database, (connString, username and password are all established within the containing class's constructor). The foreign table schema name is returned from the function, to be fed back as a parameter in later transfer routines.
public string SetUpTempFTW(string database_name) { using (var conn = new NpgsqlConnection(connString)) { string sql_string = @"CREATE EXTENSION IF NOT EXISTS postgres_fdw schema core;"; conn.Execute(sql_string); sql_string = @"CREATE SERVER IF NOT EXISTS " + database_name + @" FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'localhost', dbname '" + database_name + "');"; conn.Execute(sql_string); sql_string = @"CREATE USER MAPPING IF NOT EXISTS FOR CURRENT_USER SERVER " + database_name + @" OPTIONS (user '" + username + "', password '" + password + "');"; conn.Execute(sql_string); string schema_name = ""; if (database_name == "mon") { schema_name = database_name + "_sf"; sql_string = @"DROP SCHEMA IF EXISTS " + schema_name + @" cascade; CREATE SCHEMA " + schema_name + @"; IMPORT FOREIGN SCHEMA sf FROM SERVER " + database_name + @" INTO " + schema_name + ";"; } else { schema_name = database_name + "_ad"; sql_string = @"DROP SCHEMA IF EXISTS " + schema_name + @" cascade; CREATE SCHEMA " + schema_name + @"; IMPORT FOREIGN SCHEMA ad FROM SERVER " + database_name + @" INTO " + schema_name + ";"; } conn.Execute(sql_string); return schema_name; } }
At the end of each source specific transfer the temporary schema and its foreign tables are destroyed, leaving the way clear for the next set of source tables, and keeping the mdr database 'clean' and relatively uncluttered by additional schemas:
public void DropTempFTW(string database_name) { using (var conn = new NpgsqlConnection(connString)) { string schema_name = ""; if (database_name == "mon") { schema_name = database_name + "_sf"; } else { schema_name = database_name + "_ad"; } string sql_string = @"DROP USER MAPPING IF EXISTS FOR CURRENT_USER SERVER " + database_name + ";"; conn.Execute(sql_string); sql_string = @"DROP SERVER IF EXISTS " + database_name + " CASCADE;"; conn.Execute(sql_string); sql_string = @"DROP SCHEMA IF EXISTS " + schema_name; conn.Execute(sql_string); } }
Overview of database structure
The structure of the tables in the sd and ad schemas of each source, and that in the mdr database, (split between the st, ob and nk schemas or combined in the core schema), is very similar. This is to make data transfer between these tables as simple as possible.
The final 'core' schema in the mdr is considered below, and then the differences between this and the other schemas and databases are described.
The core schema has 18 main tables - 6 that are concerned with study data, 11 that hold data object data, and one that connects the studies and data objects tables. (The core schema, but only the core schema, also has 2 very simple tables, not shown in figure 3, that hold the json strings derived for studies and data objects respectively).
The 6 'study tables' comprise studies itself, and then
- study_contributors
- study_identifiers
- study_relationships
- study_titles
- study_topics
As shown in figure 3, there is a one-to-many relationship between study records and the records in these tables. In the case of study_relationships there are two such joins - one for the 'subject' and the other for the 'target' study of the relationship. All the study tables have a foreign key (study_id) which links them back to the central studies table.
The 11 'data object tables' include data_objects itself, plus
- object_datasets
- object_contributors
- object_dates
- object_descriptions
- object_identifiers
- object_instances
- object_relationships
- object_rights
- object_titles
- object_topics
All the object tables have a foreign key (object_id) that links them back to the data_objects table. The relationship is again one-to-many, with the exception of the object_datasets table, which has a 1-to-1 link with data_objects, and the object_relationships table which has two joins - for the 'subject' and 'target' objects respectively. The object_datasets fields could be incorporated within data objects, but as only a relatively small proportion of the total data objects are datasets it is much more efficient to factor this data out as a separate table.
As shown by figure 3, the two groups of tables therefore form 2 clusters, one around the studies table and the other around the data_objects table. Each cluster corresponds to the central entity, studies or data objects, with its repeating attributes, as defined by the ECRIN metadata schema. The clusters are joined by a central study_object_links table, which holds the many-to-many relationships between studies and data objects.
The tables are created, in the core schema and elsewhere, by calling a series of SQL statements from the C# code. Each statement, one per table, is passed along in a Dapper call to the PostgreSQL database to be executed (tables, if they exist, are always explicitly dropped beforehand). The code for creating the core tables, which includes the detailed definition for each field, can be found at core table creation code.
Differences between core and other schemas
In some ways the core schema is the simplest schema in the system - it represents the 'end-product' of the extraction pipeline and is the closest schema to the metadata schema. Although the other schemas include broadly the same tables, with the same data fields, the following lists some of the ways in which those other schemas differ from this core pattern.
- Ids in the source databases: The integer Ids in the core tables are generated within the aggregating schemas st, ob and nk (details are explained in the page on Aggregating Data). Within the source databases 'source data ids' are used instead. These are strings rather than integers and are derived from the source data. For trial registry sources the source data study id - in the database as sd_sid - is the registry id. For study based data repositories a repository generated id is used, although this cannot always be guaranteed to be persistent. For objects, the 'source data object id, sd_oid in the database, is usually created by the system using a hashing mechanism, because in general data objects do not have an intrinsic identifier. The exception is PubMed citations, where the PubMed Id is available.