Data Structures
This page provides some details of the data infrastructure required to support the MDR. It also outlines the structure of the individual databases used at each stage of the extraction process (most of which are very similar) and describes when and how they are constructed, and how references across databases are established and removed as necessary. Links are provided to individual code segments that build the main tables involved.
Databases and Data Schemas Infrastructure
The MDR system is based on PostgreSQL. In Postgres data is held within multiple databases, each of which can be subdivided into different data schemas, or groups of related tables. Tables in each schema have a different prefix, separated by a '.' before the table name.
The main databases required are (see figure 1):
- A separate database for each data source, named after that source.
- A 'context' database, that acts as a central store of contextual data (e.g. organisations, countries) and the look up tables used within the system.
- A 'mon' (monitor) logging and tracking database, used to monitor the state of each source data package in the system.
- The mdr database itself, which contains the aggregated data.
In more detail, the schemas required are:
- In each data source database
- A schema (sd) that holds the data as harvested each time.
- A schema (ad) for the accumulated data from that source.
- In some cases it may be useful to have a third, supplementary schema (pp), to hold source specific support tables.
- In the context database
- A schema (lup) that holds a central store of lookup tables, i.e. controlled terminology.
- A schema (ctx) that holds contextual data relating to organisations and people, countries and regions, and MESH codes.
- In the mon database
- A schema (sf), to hold a list of data sources and a log record of the downloads, harvests, imports and aggregations carried out. 'sf' stands for 'search and fetch', which was the initial activity stored in this schema.
- A schema (sf), to hold a list of data sources and a log record of the downloads, harvests, imports and aggregations carried out. 'sf' stands for 'search and fetch', which was the initial activity stored in this schema.
- In the mdr database
- A schema (st) that holds aggregate study data points, i.e. from all data sources (mostly trial registries), and links them together so that study duplications are resolved.
- A schema (ob) that holds aggregate data object data points, i.e. from all sources, and links them together so that data object duplications are resolved.
- A schema (nk) that holds the linkage data, between studies and data objects, i.e. from all sources, and links them together so that duplications are resolved.
- The metadata repository itself, (schema = core), the final end point of the import processes and the data flowing through them. It brings together the data from the three 'aggregating' schemas.
Postgres allows direct links between schemas in the same database - for example queries can be written that reference tables in different schemas. Having 5 schemas within the main mdr database - mn, st, ob, nk and core - simplifies the interactions between these data stores. Postgres does not allow, however, simple references between databases. Instead schemas from an external database can be imported as 'foreign tables' (not copies but references to the source tables and their data) into pre-existing schemas in the importing database. By convention these foreign table schemas are distinct from the 'native' schemas and data, and usually take the same name as the source schema.
In the data system required for the MDR, each of the source specific databases and the mdr database requires the lup (look up tables) and ctx (context data) avalable as foreign tables. In that way it is necessary to store only one copy of this important contextual data.
In addition the central mdr database, as well as its 5 native schemas, needs the ad schemas of each of the source databases to be available as foreign table schemas, so that the data can be easily transferred.
Figure 2 summarises the schemas and their interactions.
In a fully developed system, that might have many source systems, (possibly up to a 100) it may be necessary to 'stage' the aggregation process and include intermediate database layers between the source and the main databases, to reduce the number of foreign table schemas in any one database. For instance different intermediate databases could be established for trial registries, or for institutional or bibliographic systems.