Data Collection Overview
The Overall Strategy
Data is collected from a growing number of clinical trial registries and data object repositories (collectively known as 'data sources'), transformed into ECRIN schema metadata, and stored in a central database so that it can be accessed by the web portal or APIs. In fact there are four distinct processes involved in data collection and extraction, which apply to all data sources, and which are shown in figure 1.
Data Download
All data used by the MDR is first downloaded onto an ECRIN managed server, and stored as an XML file. The data may start as an XML file at the source (as for ClinicalTrials.gov or PubMed), in which case downloading the file is relatively straightforward, using API calls to identify the files required. The data may be in a downloadable csv file (e.g. WHO ICTRP data), which is then processed to generate an XML file per record (row) in the file. Or the data may need to be scraped from one or more web pages, in which case an XML file is again constructed for each record.
Data files that are created rather than simply downloaded demand more processing, but that processing can be used to start the process of cleaning and transforming the data from its 'raw' state into one that matches the ECRIN schema. The XML files generated are therefore relatively easy to harvest in the next stage of the process, compared with the 'native' XML files from, for example, PubMed.
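As an illustration of the csv case described above, the following is a minimal sketch of turning each row of a csv file into its own XML record. The function name, the 'record' root element and the sample column names are assumptions made for the example, not the MDR's actual code or file layout.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_xml_records(csv_text, id_field):
    """Return a dict mapping each record id to an XML string for that row.

    Hypothetical helper: the element names simply mirror the csv column names.
    """
    records = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        root = ET.Element("record")
        for field, value in row.items():
            # Light cleaning (trimming whitespace) can start at this stage.
            # A real source would also need column names checked as valid XML tags.
            child = ET.SubElement(root, field)
            child.text = (value or "").strip()
        records[row[id_field]] = ET.tostring(root, encoding="unicode")
    return records


# Invented sample data: two rows, so two XML records are produced.
sample = "trial_id,title\nT001, A first study \nT002,A second study\n"
files = csv_to_xml_records(sample, "trial_id")
print(files["T001"])
# <record><trial_id>T001</trial_id><title>A first study</title></record>
```

In practice each string would be written to its own file in the folder for the relevant source, rather than held in memory.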
Successive download operations result in steadily growing collections of source data, stored locally on the MDR database server. Each source has its own folder, or set of folders. For the WHO data the download process splits the records up and distributes the resulting files to different folders according to the source registry.
For small sources, it is often simpler to re-download the whole set of source data and replace the existing files. For large sources this takes too much time, and only a subset of records is downloaded each time - usually those revised or added since the last download. This does not prevent even large datasets being completely replaced at intervals - perhaps annually - to ensure synchronisation between the source and the MDR's version of it. Either way, at any one time, the MDR has the totality of the relevant material from each source available locally.
Data Harvesting
At intervals the local data can be processed and inserted into a database, a process that in this system is known as 'harvesting'. The harvested data is inserted into - effectively - a temporary holding database, so that it can be examined and additional processing carried out as required.
In the MDR system each source has a single database but uses at least two schemas, or distinct sets of tables. The schema that the harvested data is placed into is known as the session data schema (the name of each table in it is prefixed with 'sd.'). This differentiates it from the accumulated data schema tables (all prefixed with 'ad.'). As the name suggests, the accumulated data tables hold the totality of data obtained from the source. They are usually created when the source is first accessed and then gradually grow and are revised over time. The session data tables, on the other hand, hold only the data from the last harvest session. These tables are dropped and recreated each time a harvest session takes place.
In most circumstances, harvests are set up to process only files that have been added or revised since the most recent import operation (described below). This means that only data that is potentially new to the system is harvested and placed in the sd tables. In most cases therefore, the sd tables will hold a small fraction of the volume in the ad tables, but it will be the data of current interest, because it is data recently changed in or added to the source. (It is possible to do a '100% harvest', but this would be relatively rare in normal operations).
The other important aspect of harvesting is that it completes the transformation of the data into the structure of the ECRIN metadata schema. The different databases will have different numbers of tables in their sd and ad schemas (some sources are more complex than others), but a table of a particular type will be the same in all the databases, i.e. contain the same fields, and those fields will conform to the ECRIN schema. For the XML files generated by the download process this second transformation stage is usually straightforward. For ClinicalTrials.gov and PubMed files, all the transformation has to be done during harvesting, which can therefore be relatively complex.
Figure 1: Data collection data flows. Data flow is shown for two sources, followed by aggregation. The number of systems required at each stage is shown, along with some supplementary systems that are used with all sources.
Data Import
The data import process brings the data into the accumulated data 'ad' tables. The ad and sd tables are broadly the same in structure (though the ad tables have more audit fields) so the transfer is relatively straightforward. The initial step is to identify what data is new and what has been changed in the sd tables compared to the ad tables. While new data is easy to spot, edited data is not so simple to identify when most data objects do not have intrinsic identifiers, and when data is not necessarily presented in the same order between harvest sessions. The system uses a series of hashing techniques to summarise record content, and this allows changes in records, or closely related groups of records, to be picked up for editing to take place - i.e. replacement of the relevant portion of ad data by the new sd data.
If, but only if, the harvest has been a full 100% harvest, so that the sd tables represent all the data available, it is also possible to see if any study or object data has been deleted from the source, and should therefore be deleted from the ad tables. This is relatively rare, but can occur with a few (non trial registry) sources.
Once import has taken place the ad tables should once again be synchronised with the source material.
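The comparison of sd and ad data by content hash can be sketched roughly as follows. This is an illustration only: the choice of SHA-256, the JSON canonical form, and the function names are assumptions, and the real system's hashing scheme may differ. Each record is hashed independently of key order, and each group of related records is hashed independently of record order, so that re-ordered but otherwise identical data is not flagged as changed.

```python
import hashlib
import json


def record_hash(record):
    """Hash one record's content; key order does not affect the result."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def group_hash(records):
    """Hash a group of related records independently of their order."""
    combined = "".join(sorted(record_hash(r) for r in records))
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()


def compare(sd, ad, full_harvest=False):
    """Compare session data (sd) with accumulated data (ad).

    Both arguments map a study id to its list of attribute records.
    Deletions can only be inferred when the harvest covered everything.
    """
    new = sorted(k for k in sd if k not in ad)
    changed = sorted(
        k for k in sd if k in ad and group_hash(sd[k]) != group_hash(ad[k])
    )
    deleted = sorted(k for k in ad if k not in sd) if full_harvest else []
    return new, changed, deleted


# Invented example data.
ad = {
    "S1": [{"type": "title", "value": "A"}, {"type": "id", "value": "1"}],
    "S2": [{"type": "title", "value": "B"}],
    "S4": [{"type": "title", "value": "D"}],
}
sd = {
    "S1": [{"value": "1", "type": "id"}, {"type": "title", "value": "A"}],  # re-ordered only
    "S2": [{"type": "title", "value": "B revised"}],                        # edited
    "S3": [{"type": "title", "value": "C"}],                                # new
}
print(compare(sd, ad))                     # (['S3'], ['S2'], [])
print(compare(sd, ad, full_harvest=True))  # (['S3'], ['S2'], ['S4'])
```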
Data Aggregation
The accumulated data from the different sources now needs to be brought together to form the central, aggregated database. The problem is that both studies and (some types of) data objects can be referenced in more than one source. About 27,000 studies, for example, are registered in more than one trial registry. Simply aggregating records would cause these studies to have duplicate records (sometimes three or more) in the system. Similarly, some journal articles are referenced by multiple studies in different sources, and simple aggregation would cause confusing duplication of records. The aggregation process must therefore guard against duplication. For studies this is done by identifying the links between study registrations and ensuring that all registrations of a study found in more than one registry share the same id. When studies are added they are checked to see if they are one of the linked studies, and if they already exist in the system. If they do, their id is changed to match the study already present. For PubMed published journal articles all the possible study-article links are first collected together and then de-duplicated using the articles' PubMed identifier, to produce a distinct set. Once that has been done the articles can be added to the system safely.
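A rough sketch of this de-duplication logic is given below. It is not the MDR's implementation: the function names and sample identifiers are invented, and a simple union-find structure is assumed as one way of giving every group of linked registrations a single shared id, including groups of three or more.

```python
def build_id_map(links):
    """links: iterable of (registry_id, equivalent_registry_id) pairs.

    Returns a dict mapping every linked id to one shared id for its group.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    for a, b in links:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Arbitrary but stable rule: the smaller id string represents the group.
            parent[max(ra, rb)] = min(ra, rb)
    return {x: find(x) for x in list(parent)}


def aggregate_studies(registry_ids, id_map):
    """Return one entry per shared study id, listing the registrations behind it."""
    studies = {}
    for registry_id in registry_ids:
        shared_id = id_map.get(registry_id, registry_id)
        studies.setdefault(shared_id, []).append(registry_id)
    return studies


def distinct_article_links(links, id_map):
    """links: iterable of (registry_id, pmid). Returns distinct (study id, pmid) pairs."""
    return sorted({(id_map.get(s, s), pmid) for s, pmid in links})


# Invented identifiers: one study registered in three registries, one in a single registry.
id_map = build_id_map([("NCT001", "EUCTR-9"), ("EUCTR-9", "ISRCTN5")])
studies = aggregate_studies(["NCT001", "EUCTR-9", "ISRCTN5", "NCT777"], id_map)
articles = distinct_article_links(
    [("NCT001", "123"), ("EUCTR-9", "123"), ("NCT777", "456")], id_map
)
print(studies)   # {'EUCTR-9': ['NCT001', 'EUCTR-9', 'ISRCTN5'], 'NCT777': ['NCT777']}
print(articles)  # [('EUCTR-9', '123'), ('NCT777', '456')]
```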
The aggregation process always starts from scratch - there is no editing of existing data involved. The aggregate tables are dropped and recreated and all data is added to them. This is for simplicity and ease of maintenance - dealing with de-duplication is complex enough without having to deal with possible edits and deletes! At present (October 2020) it takes about 45 minutes for the aggregation process to create all tables and fill them with about 1.5 million study and data object records, with about a further 10 million attribute records.
After initial aggregation the data exists within three different schemas within the main mdr database - one for study data, one for data objects and one for link related data, between studies and data objects and between studies and studies. In general, links between data objects (i.e. objects existing in different sources but with different names) do not yet exist in the source data. One final step is to import slightly simplified versions of this data into a single 'core' schema, which can be used as the data source for the web site. The generation of the core schema also sees the production of the 'provenance strings' for each study and object. For studies these may be composite, because the data has been collected from more than one source. Also at the end of the aggregation process, after the core tables have been created, the system can generate the JSON versions of study and data object data.
Other key aspects
There are a few other 'high level' aspects of the data collection system that are worth pointing out. These include:
'Study' versus 'Data Object' data sources
In broad terms there are two types of data sources. 'Study' sources, which are the great majority at the moment, contain data about studies and, if sometimes only implicitly, associated data objects. Examples are the trial registries, which contain various data points about studies (though they usually include a WHO defined subset) and some basic information about data objects - in the case of registries the registry entries themselves, as well as registry based results summaries in many cases, and for some registries related documents (e.g. protocols) that have been uploaded. Many data repositories are also study based (e.g. Yoda, BioLINCC), in the sense that the documents and datasets that are stored are organised around the source study, with the repository usually displaying a web based study summary (the study 'landing page') that lists the data objects available.
Other data sources are 'Object' sources, because they contain only data objects. PubMed is the only current data source of this type. The data about each data object tends to be much richer than that available for 'study sourced' objects, but the data objects can only be added into the system if they include an explicit link to a study, which only some PubMed records do - or if they are explicitly referenced by a study, which is also the case with many PubMed records - though interestingly there is relatively little overlap between the two groups. The intention is to increase the proportion of Object based sources in the system.
Processing differs for the two types of source. Object sources obviously have no study data and generate no study tables, but linking their objects to studies, often in multiple ways, can make the processing more complex. The aggregation processing can also be more complex with objects that have many-to-many relationships with studies. Both of these issues occur with PubMed data.
Modular functionality
The 'pipeline' is broken up into four main modules that can operate independently of each other, and were specifically designed to do so. This makes the system much easier to operate - it is not necessary to perform operations in a fixed order, though it is sometimes sensible to do so.
Thus, data download for any source can take place at any time: the local data store is simply updated / augmented or, for a 100% download, replaced.
Harvesting into the session data tables can also take place, and be repeated, at any time. The session tables are recreated each time the harvest process is run, so there are no side effects from repeating the exercise. If the harvest is looking only for updated / new data, and there have been no additional downloads since the last import into the ad tables, then it will not generate any data - in a sense there is no point, because the system (the ad tables) already 'knows about' all the data that is present. Other than that, harvesting simply takes the new / updated data (or optionally all the data) and makes it available for import.
Import can occur any time there is some data in the sd tables to import. The process does not destroy the sd table data (only a new harvest will do that) so in theory the import could be repeated. But repeating an import will not change the ad tables, as nothing further has changed in the source. An 'update' harvest collects all the data from the local source that has been added or changed since the last import (not the last harvest). Thus an update harvest should always provide all the data available for import, however many downloads have been done in the meantime.
Finally, the aggregation process is a recreation process - all central tables are destroyed and all of the data is re-aggregated by re-adding from each source. It can therefore take place at any time. Aggregation is best done after all the ad tables are updated, and the system is designed to aggregate all the study based sources before the object based ones, so that all study data is present when attempting to link objects, but that is the only ordering built into the processing.
Logging and co-ordination
Although the modules are designed to be scheduled and run independently, they do need to be able to access a central logging / tracing system that identifies the status of each study (for study based sources) or object (for object based sources) in the system, because the data is processed in packets around each of these 'source' studies or objects. Thus a central database called 'mon', for monitor, has tables that hold a record for each source study and object, across all sources. The 'source data' tables - sf.source_data_studies and sf.source_data_objects - also contain details of remote web URLs and local file locations, the date-times of the most recent downloads, harvests and imports, and the ids of the associated download ('saf' = search and fetch), harvest and import events. In other words these tables indicate where each package of data is in the pipeline.
The modules use these tables to identify the appropriate actions to take at each stage. The download module adds new records to these tables and updates existing ones. The harvest module can select the local file locations of records whose download date is later than the last import date (or for which there is not yet an import date). The import process is driven by differences in the data, but records the imports that occur. Only aggregation ignores these tables, but can produce a set of statistics of its own within the mon database, summarising both the source and the destination tables in any aggregation process.
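The selection rule used by the harvest module can be illustrated with a small sketch. The SourceRecord class and its field names are hypothetical stand-ins for rows in the source data tables, whose actual column names are not given here. A file is selected for an update harvest if it has been downloaded and has either never been imported or been downloaded since its last import.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class SourceRecord:
    """Hypothetical stand-in for one row of a source data table."""
    sd_id: str
    local_path: str
    last_downloaded: Optional[datetime]
    last_imported: Optional[datetime] = None


def files_to_harvest(records, harvest_all=False):
    """Return local file paths to harvest; harvest_all=True gives a '100% harvest'."""
    paths = []
    for r in records:
        if r.last_downloaded is None:
            continue  # nothing stored locally yet
        if (
            harvest_all
            or r.last_imported is None
            or r.last_downloaded > r.last_imported
        ):
            paths.append(r.local_path)
    return paths


# Invented example records.
records = [
    SourceRecord("A1", "/data/src/A1.xml", datetime(2020, 9, 1), datetime(2020, 9, 5)),
    SourceRecord("A2", "/data/src/A2.xml", datetime(2020, 10, 1), datetime(2020, 9, 5)),
    SourceRecord("A3", "/data/src/A3.xml", datetime(2020, 10, 2)),
]
print(files_to_harvest(records))
# ['/data/src/A2.xml', '/data/src/A3.xml']
```

Because the comparison is against the last import rather than the last harvest, repeating the selection after further downloads still returns everything not yet imported.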
Contextual data
Some data is common to all sources, and is therefore factored out into a separate database, called 'context'. This has two schemas - 'lup' for lookup tables and 'ctx' for context tables. The lookup tables are a set of about 20 relatively small tables that hold the codes and values for various 'look ups' in the system, that would normally be presented to the user as options within drop down boxes or lists of check boxes, to select or filter studies and/or objects. The look up tables include, for example, tables with the codes (usually integers) and values for 'study types', 'resource types', 'dataset consent types', 'contribution types', 'time units', 'language codes', etc., etc.
The context schema contains a variety of contextual data. This includes codes and names for geographical entities (continents, regions, countries and states), lookup tables that relate publishers' names and codes to e-issn and p-issn numbers (used to find the publishers of journal articles), a table of MeSH topic codes, and several tables detailing the names, types and other details of organisations. Organisation names appear at several places within the source data, but are often written in several different ways. The system tries to standardise organisation names into a single default form by looking them up within these tables. Developing the context tables and making them as comprehensive as possible is an ongoing exercise.
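The idea of standardising organisation names by lookup can be sketched as below. This is only an illustration under stated assumptions: the normalisation rule (lower-casing, dropping punctuation, collapsing spaces), the function names and the sample organisation data are all invented, and the real context tables are database tables rather than in-memory dictionaries.

```python
def normalise(name):
    """Lower-case, replace punctuation with spaces and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())


def build_lookup(organisations):
    """organisations: dict mapping a default name to a list of known variants."""
    lookup = {}
    for default_name, variants in organisations.items():
        for variant in [default_name, *variants]:
            lookup[normalise(variant)] = default_name
    return lookup


def standardise(name, lookup):
    """Return the default form if the name is known, else the trimmed original."""
    return lookup.get(normalise(name), name.strip())


# Invented example data.
lookup = build_lookup(
    {"University of Oxford": ["Oxford University", "Univ. of Oxford"]}
)
print(standardise("  oxford university ", lookup))  # University of Oxford
print(standardise("UNIV OF OXFORD", lookup))        # University of Oxford
print(standardise("Acme Trials Ltd", lookup))       # Acme Trials Ltd
```

Names that are not found are left as they are, which is why extending the tables steadily improves the proportion of names that can be standardised.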