Data Structures


This page provides some details of the data infrastructure required to support the MDR. It also outlines the structure of the individual databases used at each stage of the extraction process (most of which are very similar) and describes when and how they are constructed, and how references across databases are established and removed as necessary. Links are provided to individual code segments that build the main tables involved.

Databases and Data Schemas Infrastructure

The MDR system is based on PostgreSQL. In Postgres, data is held in multiple databases, each of which can be subdivided into different data schemas, or groups of related tables. Tables in each schema are referenced using the schema name as a prefix, separated from the table name by a '.' (an illustrative query using such prefixes is given after the schema lists below).
The main databases required are (see figure 1):

Figure 1: Data collection schemas
  • A separate database for each data source, named after that source.
  • A 'context' database, that acts as a central store of contextual data (e.g. organisations, countries) and the look up tables used within the system.
  • A 'mon' (monitor) logging and tracking database, used to monitor the state of each source data package in the system.
  • The mdr database itself, which contains the aggregated data.


In more detail, the schemas required are:

  • In each data source database
    • A schema (sd) that holds the data as harvested each time.
    • A schema (ad) for the accumulated data from that source.
    • In some cases it may be useful to have a third, supplementary schema (pp), to hold source specific support tables.
  • In the context database
    • A schema (lup) that holds a central store of lookup tables, i.e. controlled terminology.
    • A schema (ctx) that holds contextual data relating to organisations and people, countries and regions, and MeSH codes.
  • In the mon database
    • A schema (sf), to hold a list of data sources and a log record of the downloads, harvests, imports and aggregations carried out. 'sf' stands for 'search and fetch', which was the initial activity stored in this schema.
  • In the mdr database
    • A schema (st) that holds aggregate study data points, i.e. from all data sources (mostly trial registries), and links them together so that study duplications are resolved.
    • A schema (ob) that holds aggregate data object data points, i.e. from all sources, and links them together so that data object duplications are resolved.
    • A schema (nk) that holds the linkage data between studies and data objects, from all sources, so that duplications are resolved.
    • The metadata repository itself (schema = core), the final end point of the import processes and of the data flowing through them. It brings together the data from the three 'aggregating' schemas.
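
As a simple illustration of the schema prefix convention described above, the sketch below queries across two of the mdr schemas from C#, using the same Npgsql and Dapper combination as the code elsewhere on this page. The table and column names (st.studies, nk.study_study_links, study_id) are assumptions made for the example only, not the actual MDR table definitions.

// Illustrative sketch only - the table and column names below are assumed
// for the example, not the actual MDR table definitions.
using Dapper;
using Npgsql;

public class SchemaPrefixExample
{
    private readonly string connString;   // connection string for the mdr database

    public SchemaPrefixExample(string connString)
    {
        this.connString = connString;
    }

    public int CountLinkedStudies()
    {
        using (var conn = new NpgsqlConnection(connString))
        {
            // Tables in different schemas of the same database can be referenced
            // directly, using the schema name as a prefix before the table name.
            string sql_string = @"SELECT COUNT(*)
                                  FROM st.studies s
                                  INNER JOIN nk.study_study_links k
                                  ON s.id = k.study_id;";
            return conn.ExecuteScalar<int>(sql_string);
        }
    }
}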


Managing Links between databases

Postgres allows direct links between schemas in the same database - for example, queries can be written that reference tables in different schemas. Having 4 schemas within the main mdr database - st, ob, nk and core - simplifies the interactions between these data stores. Postgres does not, however, allow simple references between databases. Instead, schemas from an external database can be imported as 'foreign tables' (not copies, but references to the source tables and their data) into pre-existing schemas in the importing database. By convention these foreign table schemas are kept distinct from the 'native' schemas and data, and usually take the same name as the source schema.
In the data systems required for the MDR, each of the source specific databases and the mdr database requires the mon (logging tables), lup (look up tables) and ctx (context data) schemas to be available as foreign tables (a sketch of how this can be done is given after the set-up code below). In that way it is necessary to store only one copy of this important system and contextual data.
In addition the central mdr database, as well as its 4 native schemas, needs the ad schemas of each of the source databases to be available as foreign table schemas, so that the data can be easily transferred during the aggregation process.
Rather than fill the system with these additional imported 'foreign table' schemas, which in some cases could overwhelm the 'native' schemas in each database (for instance the mdr database needs to be able to access every one of 20 - in the future many more - source ad schemas), the data extraction system sets up these links temporarily, as needed during a particular process, and tears them down again at the end. Typical code for this operation is shown below. In the aggregation process, each source is considered in turn, and a temporary set of foreign tables representing the ad tables of that source is established in the mdr database to enable data transfer (a sketch of the overall calling loop is given at the end of this section). The resulting schema name is returned from the function, to be fed back as a parameter to the transfer routines.

public string SetUpTempFTW(string database_name)
{
    using (var conn = new NpgsqlConnection(connString))
    {
        // Ensure the foreign data wrapper extension is available,
        // creating its objects in the core schema.
        string sql_string = @"CREATE EXTENSION IF NOT EXISTS postgres_fdw
                             schema core;";
        conn.Execute(sql_string);

        // Create a server object pointing at the source database,
        // plus a user mapping for the current user.
        sql_string = @"CREATE SERVER IF NOT EXISTS " + database_name
                   + @" FOREIGN DATA WRAPPER postgres_fdw
                     OPTIONS (host 'localhost', dbname '" + database_name + "');";
        conn.Execute(sql_string);

        sql_string = @"CREATE USER MAPPING IF NOT EXISTS FOR CURRENT_USER
             SERVER " + database_name
             + @" OPTIONS (user '" + username + "', password '" + password + "');";
        conn.Execute(sql_string);

        // Import the relevant schema (sf for the mon database, ad for source databases)
        // into a new foreign table schema named <database>_sf or <database>_ad.
        string schema_name = "";
        if (database_name == "mon")
        {
            schema_name = database_name + "_sf";
            sql_string = @"DROP SCHEMA IF EXISTS " + schema_name + @" cascade;
             CREATE SCHEMA " + schema_name + @";
             IMPORT FOREIGN SCHEMA sf
             FROM SERVER " + database_name +
                 @" INTO " + schema_name + ";";
        }
        else
        {
            schema_name = database_name + "_ad";
            sql_string = @"DROP SCHEMA IF EXISTS " + schema_name + @" cascade;
             CREATE SCHEMA " + schema_name + @";
             IMPORT FOREIGN SCHEMA ad
             FROM SERVER " + database_name +
                 @" INTO " + schema_name + ";";
        }
        conn.Execute(sql_string);

        // The schema name is passed back so that the transfer routines
        // can reference the imported foreign tables.
        return schema_name;
    }
}
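
The same foreign data wrapper mechanism can be used to make the shared context data available within a source database, as described earlier. The sketch below is illustrative rather than taken from the system's code: it assumes the context database is named 'context' and sits on the same host, reuses the connString, username and password fields shown above, and imports the lup and ctx schemas under their own names.

public void SetUpContextFTWs()
{
    using (var conn = new NpgsqlConnection(connString))
    {
        // Ensure the foreign data wrapper extension is available in this database.
        conn.Execute("CREATE EXTENSION IF NOT EXISTS postgres_fdw;");

        // Create a server object pointing at the context database, plus a user mapping.
        string sql_string = @"CREATE SERVER IF NOT EXISTS context
             FOREIGN DATA WRAPPER postgres_fdw
             OPTIONS (host 'localhost', dbname 'context');";
        conn.Execute(sql_string);

        sql_string = @"CREATE USER MAPPING IF NOT EXISTS FOR CURRENT_USER
             SERVER context
             OPTIONS (user '" + username + "', password '" + password + "');";
        conn.Execute(sql_string);

        // Import the lup and ctx schemas as foreign table schemas of the same name,
        // so that only one copy of the look up and contextual data needs to be stored.
        sql_string = @"DROP SCHEMA IF EXISTS lup cascade;
             CREATE SCHEMA lup;
             IMPORT FOREIGN SCHEMA lup FROM SERVER context INTO lup;
             DROP SCHEMA IF EXISTS ctx cascade;
             CREATE SCHEMA ctx;
             IMPORT FOREIGN SCHEMA ctx FROM SERVER context INTO ctx;";
        conn.Execute(sql_string);
    }
}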

At the end of each source-specific transfer the temporary schema and its foreign tables are destroyed, leaving the way clear for the next set of source tables and keeping the mdr database 'clean' and relatively uncluttered by additional schemas.

public void DropTempFTW(string database_name)
{
    using (var conn = new NpgsqlConnection(connString))
    {
        // Reconstruct the name of the temporary foreign table schema created by SetUpTempFTW.
        string schema_name = "";
        if (database_name == "mon")
        {
            schema_name = database_name + "_sf";
        }
        else
        {
            schema_name = database_name + "_ad";
        }

        // Remove the user mapping, then the server object
        // (the cascade also removes the imported foreign tables).
        string sql_string = @"DROP USER MAPPING IF EXISTS FOR CURRENT_USER
             SERVER " + database_name + ";";
        conn.Execute(sql_string);

        sql_string = @"DROP SERVER IF EXISTS " + database_name + " CASCADE;";
        conn.Execute(sql_string);

        // Finally drop the now empty temporary schema.
        sql_string = @"DROP SCHEMA IF EXISTS " + schema_name + ";";
        conn.Execute(sql_string);
    }
}
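
Finally, the sketch below indicates how these two functions might be combined within the aggregation loop described above. The list of source database names and the TransferData routine are hypothetical stand-ins for the actual aggregation code.

public void AggregateSources(IEnumerable<string> source_database_names)
{
    foreach (string database_name in source_database_names)
    {
        // Create the temporary foreign table schema for this source's ad tables.
        string schema_name = SetUpTempFTW(database_name);

        // Pass the schema name back to the transfer routines
        // (TransferData is a hypothetical stand-in for those routines).
        TransferData(schema_name);

        // Tear the temporary server, user mapping and schema down again,
        // keeping the mdr database uncluttered by additional schemas.
        DropTempFTW(database_name);
    }
}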