Aggregating Data
Contents
Introduction
The program takes all the data within the ad tables in the various source databases and loads it to central tables within the mdr database, dealing with multiple entries for studies and creating the link data information between studies and data objects. The aggregated data is held within tables in the st (study), ob (object) and nk (links) schemas. A fourth schema, 'core' is then populated as a direct import from the others, to provide a single simplified mdr dataset that can be exported to other systems.
Note that the aggregation process starts from scratch each time - there is no attempt to edit existing data. All of the central tables in the st, ob and nk schemas are dropped and then re-created during the main aggregation processes (triggered by -D). All of the core tables are dropped and re-created when the data is transferred to the core schema (triggered by -C). This makes each aggregation longer (it takes about 1 hour in total) but simplifies the processes involved, allowing a focus on the aggregation itself, without a need to consider updates and edits, and it makes the system much easier to maintain.
Although an aggregation can be run at any time it makes most sense to do so after the following sequence of events:
- Downloads are run for all data sources to get the local source file collections as up to date as possible.
- Contextual data is updated and / or augmented, so far as resources and time allow. (This is a gradual process).
- Harvests and imports are run for all study based sources to get the 'baseline' study data as up to date as possible, in each of the study source databases.
- Harvests and imports are run for all object based sources to get additional data object data as up to date as possible, in each of the object source databases.
In other words the aggregation module should be scheduled to run after the application of all other modules.
N.B. The aggregation code can be found at https://github.com/ecrin-github/DataAggregator
Parameters
The program is a console app, to enable it to be more easily scheduled. There are a variety of flag type parameters, that can be used alone or in combination (though only some combinations make sense). These include:
-D: which indicates that the aggregating data transfer should take place, from the source ad tables to the tables in the st (studies), ob (objects) and nk (links) schemas. This is the necessary first step of the aggregation process.
-C: indicates that the core tables should be created and filled from the aggregate tables, i.e. data is combined from the st, ob and nk schemas in to a single, simpler core schema.
-J: indicates that the core data be used to create JSON versions of the data within the core database.
-F: indicates that the core data should be used to create JSON files of two types, one for each study and another for each data object. It has no effect unless the -J parameter is also supplied.
-S: collects statistics about the existing data, from both the ad tables and the central aggregated tables.
-Z: zips the json files created by the -F parameter into a series of zip files, with up to 100,000 files in each. This is for ease of transfer to other systems.
The -S parameter can be provided at any time or on its own. It makes little sense to trigger the other processes without an initial call using -D. A -C call would then normally follows, and then -J (-F), and finally -Z. The system can cope with multiple parameters, and applies them in the order given: -D -C -J -F -S -Z. It is easier to see what is happening, however, to make multiple calls to the program working through the parameter list as described.
Overview
Initial Setup
The main aggregation process, as triggered by -D, begins with the creation of a new set of central tables in the st, ob and nk schemas.
After that the program interrogates the mon database to retrieve a set of 'Source' objects, each one corresponding to a source database. The list is in order of decreasing 'preference', where preference indicates the usefulness of the source as the primary data source for duplicated studies (see the Preferred Source concept section in Study-study links]). It runs through those sources to first obtain a list of the 'other registry ids' in each database. In other words it builds up a list of all studies that are found in more than one data source. About 25,000 studies are registered in 2 registries and about another 1,000 are registered in 3 or more registries. The details are provided in Study-study links.
Some studies (several hundred) have more complex relationships - for example are registered in two registries but in one of those are entered as a group of related studies rather than having a simple 1-to-1 relationship with the other registry study. These are removed and instead added to the study-study relationship data.
Aggregating Study Data
- The study data is then added to the aggregate tables, in the order of most preferred source, working through the list to the least preferred. Apart from the first (ClinicalTriuals.gov) the study id of any imported study is checked against the table of poly-registered studies. If it exists in that table it is not added as a separate record but instead is given the same id as that of the most preferred version of that study.
- Study data, including all attribute data, of studies that are genuinely new to the system are simply added to the aggregate data. Also immediately added are all associated data objects and their attribute data.
- Study data for a study that already exists in the system is checked first to see if it represents new data. The main study data record is not added - that can only come from the 'preferred' source. Study attributes are only added if they do not already exist, so far as that can be readily checked by the program. Data objects in the 'non-preferred' versions of the study may already exist but the nature of the data is that genuine duplication of data objects from different sources is extremely rare. Almost all data objects are therefore added. Studies with multiple entries in different registries therefore have their data built up from a single 'preferred' source for the main study record, from potentially multiple registries for study attributes, and definitely from multiple registries for the associated data objects.
Aggregating Link Data
- The link between data objects and studies - found within the source data object data - is transferred to link tables. The 'parent study' id is transformed into its most 'preferred' form if and when necessary, to ensure that the links work properly in the aggregated system. Also transferred to link tables is the provenance data that indicates when each study and data object was last retrieved from the source.
Aggregating Object Based Data
- For sources where there are no studies - just data objects - the process is necessarily different. It must also follow after the aggregation of study data, to ensure that all studies are in the central system.
- This only applies to PubMed data at the moment. For PubMed, the links between the PubMed data and the studies are first identified. Two sources are used - the 'bank id' data within the PubMed data itself, referring to trial registry ids, and the 'source id' data in the study based sources, where references are provided to relevant papers. These two sets of data are combined and de-duplicated, and two final sets of data are created: the distinct list of PubMed data objects, and the list of links between those objects and studies. Unlike most data objects in the study based resources, PubMed data objects can be linked to multiple studies, and of course studies may have multiple article references. The linkage is therefore complex and requires considerable additional processing.
Creating the Core Tables
Most of the other options provided by the program are relatively simple and self contained. The -C option copies the data from the aggregating schema (st, ob, and nk) to the core schema without any processing, other than creating the provenance strings for both studies and data objects. The latter may be composite if more than one source was involved.