Difference between revisions of "Data Collection Overview"

From ECRIN-MDR Wiki
Jump to navigation Jump to search
(Data Import)
 
(47 intermediate revisions by the same user not shown)
Line 1: Line 1:
'''Needs to be updated...'''
+
<p style="color:blue; text-align:right"><small>'''''Last updated: 10/05/2022'''''</small></p>
  
 
== The Overall Strategy ==
 
== The Overall Strategy ==
Data is collected from a growing number of clinical trial registries and data object repositories (collectively known as 'data sources'), transformed into ECRIN schema metadata, and stored in a central database so that it can be accessed by the web portal or APIs. In fact there are four distinct processes involved in data collection and extraction, which apply to all data sources, and which are shown in figure 1. <br/>
+
Data is collected from a growing number of clinical trial registries and data object repositories (collectively known as 'data sources'), transformed into ECRIN schema metadata, and stored in a central database so that it can be accessed by the web portal or APIs. (The individual data sources are listed at '''[[MDR Data Sources]]''', which has further links to details about each source.) In fact there are four distinct processes involved in data collection and extraction, which apply to all data sources, and which are shown in figure 1. <br/>
 
===Data Download===
 
===Data Download===
 
All data used by the MDR is first downloaded onto an ECRIN managed server, and stored as an XML file. The data may start as an XML file at the source (as for ClinicalTrials.gov or Pubmed) in which case downloading the file is relatively straightforward, using API calls to identify the files required. The data may be in a downloadable csv file (e.g. WHO ICTRP data) which is then processed to generate an XML file per record (row) in the file. The data may be need to be scraped from one or more web pages, in which case an XML file is again constructed for each record.<br/>
 
All data used by the MDR is first downloaded onto an ECRIN managed server, and stored as an XML file. The data may start as an XML file at the source (as for ClinicalTrials.gov or Pubmed) in which case downloading the file is relatively straightforward, using API calls to identify the files required. The data may be in a downloadable csv file (e.g. WHO ICTRP data) which is then processed to generate an XML file per record (row) in the file. The data may be need to be scraped from one or more web pages, in which case an XML file is again constructed for each record.<br/>
 
Data files that are created rather than simply downloaded demand more processing, but that processing can be used to start the process of cleaning and transforming the data from its 'raw' state into one that matches the ECRIN schema. The XML files generated are therefore relatively easy to harvest in the next stage of the process, compared with the 'native' XML files from, for example, PubMed.<br/>
 
Data files that are created rather than simply downloaded demand more processing, but that processing can be used to start the process of cleaning and transforming the data from its 'raw' state into one that matches the ECRIN schema. The XML files generated are therefore relatively easy to harvest in the next stage of the process, compared with the 'native' XML files from, for example, PubMed.<br/>
 
Successive download operations result in steadily growing collections of source data, stored locally on the MDR database server. Each source has its own folder, or set of folders. For the WHO data the download process splits the records up and distributes the resulting files to different folders according to the source registry.<Br/>
 
Successive download operations result in steadily growing collections of source data, stored locally on the MDR database server. Each source has its own folder, or set of folders. For the WHO data the download process splits the records up and distributes the resulting files to different folders according to the source registry.<Br/>
For small sources, it is often simpler to re-download the whole set of source data and replace the existing files. For large sources this takes too much time, and just a subset of records are downloaded each time - usually those revised or added since the last download. This does not prevent even large datasets being completely replaced at intervals - perhaps annually - to ensure synchronisation between the source and the MDR's version of it. Either way, at any one time, the MDR has the totality of the relevant material from each source available locally.
+
For small sources, it is often simpler to re-download the whole set of source data and replace the existing files. For large sources this takes too much time, and just a subset of records are downloaded each time - usually those revised or added since the last download. This does not prevent even large datasets being completely replaced at intervals - perhaps annually - to ensure synchronisation between the source and the MDR's version of it. Either way, at any one time, the MDR has the totality of the relevant material from each source available locally. For more details on the download process, see the '''[[Downloading Data]]''' page.
 
<br/>
 
<br/>
 
===Data Harvesting===
 
===Data Harvesting===
At intervals the local data can be processed and inserted into a database, a process that in this system is known as 'harvesting'. The harvested data is inserted into - effectively - a temporary holding database, so that it can be examined and additional processing carried out as required.<br/> In the mdr system each source has a single database but uses at least two schemas, or distinct sets of tables. The schema that the harvested data is placed into is known as the ''session data'' schema (the name of each table in it is prefixed with 'sd.'). This differentiates it from the ''accumulated data'' schema tables (all prefixed with 'ad.'). As the name suggests the accumulated data tables hold the totality of data obtained from the source. They are usually created when the source is first accessed and then gradually grow and are revised over time. The session data tables, on the other hand, hold only the data from the last harvested session. These tables are dropped and recreated each time a harvest session takes place.<br/>
+
At intervals the local data can be processed and inserted into a database, a process that in this system is known as 'harvesting'. The harvested data is inserted into - effectively - a temporary holding or staging database, so that it can be examined and additional processing carried out as required.<br/> In the mdr system each source has a single database but uses at least two schemas, or distinct sets of tables. The schema that the harvested data is placed into is known as the ''session data'' schema (the name of each table in it is prefixed with 'sd.'). This differentiates it from the ''accumulated data'' schema tables (all prefixed with 'ad.'). As the name suggests the accumulated data tables hold the totality of data obtained from the source. They are usually created when the source is first accessed and then gradually grow and are revised over time. The session data tables, on the other hand, hold only the data from the last harvest session. These tables are dropped and recreated each time a harvest session takes place.<br/>
In most circumstances, harvests are set up to process only files that have been added or revised since the most recent import operation (described below). This means that only data that is potentially new to the system is harvested and placed in the sd tables. In most cases therefore, the sd tables will hold a small fraction of the volume in the ad tables, but it will be the data of current interest, because it is data recently changed in or added to the source. (It is possible to do a '100% harvest', but this would be relatively rare in normal operations).
+
In most circumstances, harvests are set up to process only files that have been added or revised since the most recent import operation (described below). This means that only data that is potentially new to the system (specifically to the ad tables in the system) is harvested and placed in the sd tables, for later transfer to the ad tables. In most cases therefore, the sd tables will hold a small fraction of the volume in the ad tables, but it will be the data of current interest, because it is data recently changed or added to the source. (It is possible to do a '100% harvest', but for most sources this would be relatively rare in normal operations).
The other important aspect of harvesting is that it completes the transformation of the data into the structure of the ECRIN metadata schema. The different databases will have different numbers of tables in their sd and ad schemas, (some sources are more complex than others) but a table of a particular type will be the ''same'' in all the databases, i.e. contain the same fields, and those fields will conform to the ECRIN schema. For the XML files generated by the download process this second transformation stage is usually straightforward. For ClinicalTriuals.gov and PubMed files, all the transformation has to be done during harvesting, which can therefore be relatively complex.
+
The other important aspect of harvesting is that it completes the transformation of the data into the structure of the ECRIN metadata schema. The different databases will have different numbers of tables in their sd and ad schemas, (some sources are more complex than others) but a table of a particular type will be the ''same'' in all the databases, i.e. contain the same fields, and those fields will conform to the ECRIN schema. For the XML files generated by the download process this second transformation stage is usually straightforward. For ClinicalTrials.gov and PubMed files, all the transformation has to be done during harvesting, which can therefore be relatively complex. For Further details please see the '''[[Harvesting Data]]''' page.
<br/>
 
[[File:Data collection workflow.png|thumb|500px|<b>Figure 1: Data collection data flows</b>]]
 
 
<br/>
 
<br/>
 +
 +
[[File:Full mdr system 20201026.png|left|700px|<b>Figure 1: Data collection data flows</b>]]
 +
 +
<br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/>
 +
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<b>Figure 1: Data collection data flows</b><br/>
 +
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Data flow is shown for two sources,<br/>
 +
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;followed by aggregation. The number of<br/>
 +
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;systems required at each stage is shown,<br/>
 +
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;along with some supplementary systems that<br/>
 +
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;are used with all sources.<br/>
 +
<br/><br/><br/><br/><br/><br/><br/><br/><br/><br/><br/>
 +
<br/><br/><br/><br/><br/><br/><br/><br/><br/><br/>
 +
 
===Data Import===
 
===Data Import===
The data import process brings the data into the accumulated data 'ad' tables. The ad and sd tables are broadly the same in structure (though the ad tables have more audit fields) so the transfer is relatively straightforward. The initial step is to identify what data is new and what has been changed
+
The data import process brings the data into the accumulated data 'ad' tables. The ad and sd tables are broadly the same in structure (though the ad tables have more audit fields) so the transfer is relatively straightforward. The initial step is to identify what data is new and what has been changed in the sd tables compared to the ad tables. While new data is easy to spot, edited data is not so simple to identify when most data objects do not have intrinsic identifiers, and when data is not necessarily presented in the same order between harvest sessions. The system uses a series of hashing techniques to summarise record content, and this allows changes in records, or closely related groups of records, to be picked up for editing to take place - i.e. replacement of the relevant portion of ad data by the new sd data. See '''[[Missing PIDs and Hashing]]''' for more details.<br/>
 +
If, but only if, the harvest has been a full 100% harvest, so that the sd tables represent ''all'' the data available, it is possible to see if any study or object data has been deleted from the source, and which therefore should be deleted from the ad tables. This is relatively rare, but can occur with a few (non trial registry) sources.<br/>
 +
Once import has taken place the ad tables should once again be synchronised with the source material. Further details on data import can be found at '''[[Importing Data]]'''.
 
<br/>
 
<br/>
  
 
===Data Aggregation===
 
===Data Aggregation===
 +
The accumulated data from the different sources now need to be brought together to form the central, or core, aggregated database. The problem is that both studies and (some types of ) data objects can be referenced in more than one source. About 27,000 studies, for example, are registered in more than one trial registry. Simply aggregating records would cause these studies to have duplicated (sometimes three or more) records in the system. Similarly, some journal articles are referenced by multiple studies in different sources, and simple aggregation would cause confusing duplication of records. The aggregation process must therefore prevent this duplication. For studies this is done by identifying the links between study registrations and ensuring all studies present in more than one registry are aggregated to the same record in the core MDR database. When studies are added they are checked to see if they have been previously identified as the same to another study, and if they already exist in the system. If they do, their id is changed to match the study already present (see '''[[Identifying Links between Studies]]'''. For PubMed published journal articles all the possible study-article links are first collected together and then de-duplicated using the articles' PubMed identifier, to produce a distinct set. Once that has been done the articles can be added to the system safely. Duplication of data objects other than published papers is rare but checked in a similar way to studies.<br/>
 +
The aggregation process always starts from scratch - there is no editing of existing data involved. The aggregate tables are dropped and then recreated, and all data is added to them. This is for simplicity and ease of maintenance - dealing with de-duplication on addition is complex enough without having to deal with possible edits and deletes! The exception to this is the integer identifiers applied to studies and data objects in the core system. These are in effect database accesssion numbers and are kept constant between each weekly aggregation - they therefore act as persistent ECRIN identifiers. At present (May 2022) it takes about 50 minutes for the aggregation process to create all tables and fill them with about 1.75 million study and data object records, along with an approximate 12 million further attribute records. <br/>
 +
In the initial aggregation phase the data exists within three different schemas within the main mdr database - one for study data, one for data objects and one for link related data, between both studies and data objects and between studies and studies. The second phase involves importing slightly simplified versions of this data into a single 'core' schema, that can be used as the data source for the web site. The generation of the core schema also sees the production of the 'provenance strings' for each study and object. For studies these may be composite, because the data has been collected from more than one source. At the end of the aggregation process, after the core tables have been created, the system generates the json versions of study and data object data. A fuller description of the aggregation process is available at '''[[Aggregating Data]]'''.<br/><br/>
  
== Workflow Principles ==
+
== Other key aspects ==
 +
There are a few other 'high level' aspects of the data collection system that are worth pointing out. These include:<br/>
 +
=== 'Study' versus 'Data Object' data sources ===
 +
In broad terms there are two types of data sources. 'Study' sources, which are the great majority at the moment, contain data about studies ''and'', if sometimes only implicitly, associated data objects. Examples are the trial registries, which contain various data points about studies (though they usually include a WHO defined subset) and some basic information about data objects - in the case of registries the registry entries themselves, as well as registry based results summaries in many cases, and for some registries related documents (e.g. protocols) that have been uploaded. Many data repositories are also study based (e.g. Yoda, BioLINCC), in the sense that the documents and datasets that are stored are organised around the source study, with the repository usually displaying a web based study summary (the study 'landing page') that lists the data objects available.<br/>
 +
Other data sources are 'Object' sources, because they contain only data objects. PubMed is the only current data source of this type. The data about each data object tends to be much richer than that available for 'study sourced' objects, but the data objects can only be added into the system if they include an explicit link to a study, which only some PubMed records do - or if they are explicitly referenced by a study, which is also the case with many PubMed records. The intention is to increase the proportion of Object based sources in the system.<br/>
 +
Processing differs for the two types of source. Object studies obviously have no study data and generate no study tables, but linking them to studies, often in multiple ways, can make the processing more complex. The aggregation processing can also be more complex with objects that have many-to-many relationships with studies. Both of these issues occur with PubMed data. <br/>
  
 +
===Modular functionality and Ordering of Extraction===
 +
The four main modules can operate independently of each other, and were specifically designed to do so. This is to make system operation much easier to operate - it is not necessary to perform operations in a fixed order, though it is often sensible to do so. The 'pipeline' is therefore broken up into four distinct modules.<br/>
 +
Thus, data download for any source can take place at any time, the local data store is simply updated / augmented, or for a 100% download, replaced.<br/>
 +
Harvesting into the session data tables can also take place, and be repeated, at any time. The session tables are recreated each time the harvest process is run, so there are no side effects from repeating the exercise. If havesting only updated / new data (the default), when there have been no additional downloads since the last import into the ad tables, then the harvest will not generate any data - in a sense there is no point, because the system (the ad tables) already 'knows about' all the data that is present. Other than that Harvesting simply takes the new / updated data (or optionally all the data) and makes it available for import in the sd staging databases.<br/>
 +
Import can occur any time there is some data in the sd tables to import. The process does not destroy the SD table data (only a new harvest will do that) so in theory the import could be repeated. But repeating an import will not change the ad tables, as nothing has further has changed in the source. An 'update' harvest collects all the data from the local source that has been added or changed since the last ''import'' (not the last harvest). Thus an update harvest should always provide all the data available for import, however many downloads have been done in the meantime.<br/>
 +
Finally, the aggregation process is a recreation process - all central tables are destroyed (apart from the ones that retain the accession Ids of studies and objects) and all of the data is re-aggregated by re-adding from each source. It can therefore take place at any time. Aggregation is best done after all the ad tables are updated, and the system is designed to aggregate ''all the study based sources before the object based ones'', so that all study data is present when attempting to link objects, but that is the only ordering built into the processing.
 +
<br/>
  
3.&nbsp;&nbsp;'''Comparison of the most recent data import with the data already collected.'''<br>
+
===Logging and co-ordination===
The 'data already collected' are stored in their own tables, which are called accumulated data (schema = '''ad''') tables. The ad tables need to be structured in a way that is close enough to the structure of the sd 'session' tables for comparison to be easy, but near enough to the ECRIN metadata scheme for later steps to be straightforward. Each ad table requires an associated procedure that allows the 'source' sd table, or tables, to be compared across the relevant data points in the ad table. If not done earlier in the process, this is the final stage at which new and / or revised data is identified.
+
If the modules are designed to be scheduled and run independently, they do need to be able access a central logging / tracing system that identifies the status of each study (for study based sources) or object (for object based sources) in the system, because the data is processed in packets around each of these 'source' studies or objects. Thus a central database called 'mon', for monitor, includes tables that include a record for each source study and object, across all sources. The 'source data' tables - sf.source_data_studies and sf.source_data_objects, also contain details of remote web URLs and local file locations, the date-times of the most recent downloads, harvests and imports, and the ids of the associated download (or 'saf' event = search and fetch), harvest and import events. In other words these tables indicate where each source package of data is in the pipeline.<br/>
<br>
+
The modules use these tables to identify the appropriate actions to take at each stage. The download module adds new records to these tables, and updates existing ones. The harvest module is able to select the local file locations of records whose download date is greater than the last import date (or for whom there is not yet an import date). The import process is driven by differences in the data, but records the imports that occur. Only aggregation ignores these tables, but does produce a set of statistics of its own that area stored within the mon database, summarising the results of aggregation for each source and for the systerm as a whole. Further details on logging can be found at '''[[Logging and Tracking]]'''.
4.&nbsp;&nbsp;'''Transfer of new and edited data'''<br>
 
The new or revised data points in the sd data are '''added''' to the ad tables. This means that the ad tables contain ''all'' versions of any data that has changed - which therefore calls for indicators on the data to make it clear which is the current version of any data item. The ad tables are intended to contain a permanent and slowly growing collection of the data obtained from a particular source.
 
<br>
 
5.&nbsp;&nbsp;'''Final coding / restructuring of the new data to match the ECRIN metadata scheme'''<br>
 
Data added to the ad tables would undergo any additional processing and / or coding required to bring it into full compliance with the ECRIN metadata schema, or at least to a state that is easily mappable to that schema. This is necessary for the next stage of the process.
 
<br/><br/>
 
Note there will be no single consistent route through these processes, because each data source will demand a different strategy. The approach taken will be dependent on the nature of the source data, the API facilities available or the web scraping that is possible, any additional data collection, the identifiers and links present in the data, the coding required, etc., etc.
 
<br/><br/>
 
''and then working with all the data...''
 
<br/><br/>
 
6.&nbsp;&nbsp;'''Splitting and transfer of new data.'''<br>
 
The next stage involves aggregating data from different sources.<br>
 
Registry data, by definition, contains study related data, but it also contains at least one data object dataset, relating to the registry entry itself, and often contains additional references to other data objects. Data repository data will tend only to have data object data, but may include a mix of such objects, and will also normally contain additional information about which studies are linked to the data objects.<br>
 
The data from different data sources needs to be combined and compared, and it is simpler to use three different aggregating schemas (all within the main mdr database) - one for study data, one for data objects, and one for the links between them. The most recent data from each source is imported into the relevant aggregating schema, to be added to the existing data there.
 
 
<br/>
 
<br/>
7.&nbsp;&nbsp;'''Resolution of study and data object duplication'''<br/>
 
References to the same study or the same data object (but originating in different data sources) then need to be resolved. Study and data object attribute data may also be updated and / or expanded as new data points become available from different sources. This will require a complex set of comparison mechanisms and procedures to be developed, but only once this is done can the entities receive their final identifiers and the links between studies and data objects, studies and studies, and data objects and data objects be finalised.
 
<br>
 
8.&nbsp;&nbsp;'''Transfer of the new data from the aggregate databases to the metadata repository'''<br>
 
The final step is to transfer the linked and de-duplicated data from the three aggregate databases to the mdr core database itself.<br>
 
Once there where it can be queried, either directly or through export of JSON files to the elastic search system.
 
<br/><br/>
 
  
== Documentation support ==
+
===Scheduling===
Each data source will require its own unique set of processes to extract the data and pass it on to the aggregation process, and these will therefore need to be documented.<br/>
+
The various operations, applied to the different sources, are scheduled using Windows Task Manager. Each operation runs once weekly, with the final aggregated dataset being made available each Sunday evening. The details of the current schedule can be found at '''[[System Scheduling]]'''.
One approach to documentation would be to use the points listed above to structure a document list. Together with introductory information, this creates a list that covers:
 
* The nature of the data source, including its size and scope.
 
* Any legal or other constraints on the use / re-use of the data, and the arrangements put into place by ECRIN to ensure legal requirements have been met.
 
* A description of how relevant data will be identified and selected (if it can be) in the source data system.
 
* The structure of the source data, including the identification of the data points that will be extracted, either because they will form part of the final ECRIN dataset or because they can help in its processing and tracking (or in some cases because they may be potentially useful in the future, even if not required now).
 
* The processes by which the raw data (e.g. the retrieved XML, JSON or files) is filtered / extracted / transformed into the session data, including the coding that is applied. Thius should include or reference a detailed description of the session data tables.<br/>
 
* The comparison mechanisms that take place to turn the sd table data into ad table data, providing (or referencing) a detailed description of the ad tables.
 
* How the ad data, once transferred into the aggregate databases, is processed and compared to identify possible duplicates, integrate data points, and establish final identifiers.
 
* (Periodically), the numbers of records imported and transferred into the core MDR system, and the nature of those records (e.g. data object types).
 
In many cases some of the documents listed above could be combined.
 
<br/><br/>
 
 
 
== Databases and Data Schemas Required ==
 
In Postgres data can held within multiple databases, each of which can be subdivided into different data schemas, or groups of related tables.
 
[[File:Data collection schemas.png|thumb|500px|<b>Figure 2: Data collection schemas</b>]]
 
The databases required will be:
 
* A separate database for each data source, named after that source.
 
* A 'context' database, that acts as a central store of contextual data (e.g. organisations, countries) and the look up tables used within the system.
 
* The mdr database itself, which aggregates all the data into a single set of tables.
 
<br/>
 
In more detail, the schemas required are:<br/>
 
* '' In each data source database''
 
** A schema ('''sd''') that holds the data as imported each time.
 
** A schema ('''ad''') for the accumulated data from that source.
 
** In some cases it may be useful to have a third, supplementary schema ('''pp'''), to hold source specific support tables and procedures.
 
* ''In the context database''
 
** A schema ('''lup''') that holds a central store of lookup tables, i.e. controlled terminology.
 
** A schema ('''ctx''') that holds contextual data relating to organisations and people, countries and regions, and languages.
 
* ''In the mdr database''
 
** A schema ('''st''') that holds aggregate study data points, i.e. from all data sources (mostly trial registries), and links them together so that study duplications are resolved.
 
** A schema ('''ob''') that holds aggregate data object data points, i.e. from all sources, and links them together so that data object duplications are resolved.
 
** A schema ('''nk''') that holds the linkage data, between studies and data objects, i.e. from all sources, and links them together so that duplications are resolved.  
 
** The metadata repository itself, (schema = '''core'''), the final end point of the import processes and the data flowing through them. It brings together the data from the three 'aggregating' schemas. Data processing within this schema, beyond simple import, should be minimal.
 
** A monitor / logging schema ('''mn'''), to hold a list of data sources and a log record of the imports carried out. This assumes a central scheduling system that runs the import routines as required and which logs the reported results to this database.<br/>
 
 
<br/>
 
<br/>
Postgres allows direct links between schemas in the same database - for example queries can be written that reference tables in different schemas. Having 5 schemas within the main mdr database - ''mn, st, ob, nk'' and ''core'' - simplifies the interactions between these data stores. Postgres does not allow, however, simple references ''between'' databases. Instead schemas from an external database can be imported as 'foreign tables' (not copies but references to the source tables and their data) into pre-existing schemas in the importing database. By convention these foreign table schemas are distinct from the 'native' schemas and data, and usually take the same name as the source schema.<br/>
 
In the data system required for the MDR, each of the source specific databases ''and'' the mdr database requires the lup (look up tables) and ctx (context data) avalable as foreign tables. In that way it is necessary to store only one copy of this important contextual data.<br/>
 
In addition the central mdr database, as well as its 5 native schemas, needs the ad schemas of each of the source databases to be available as foreign table schemas, so that the data can be easily transferred.<br/>
 
Figure 2 summarises the schemas and their interactions.<br/>
 
In a fully developed system, that might have many source systems, (possibly up to a 100) it may be necessary to 'stage' the aggregation process and include intermediate database layers between the source and the main databases, to reduce the number of foreign table schemas in any one database. For instance different intermediate databases could be established for trial registries, or for institutional or bibliographic systems.
 
<br/><br/>
 
 
== Identifiers and audit mechanisms required ==
 
 
'' The source data identifier, sd_id''<br/>
 
For each source registry the data is imported into the ‘session data’ (sd) tables. In most cases there will be a ‘source data identifier’ present as part of the data, for example for registry data it will normally be the registry Id, for PubMed data it will be the PMID. This field, imported into the tables as '''sd_id''', links records together across the multiple session data tables. (If there is no source data identifier, as may be the case with some data object sources, then such an identifier will need to be manufactured, by a consistent method, on each import). Because these tables are truncated and refilled on each import, there is no need at this stage to apply a system generated identifier to the records.
 
<br><br>
 
'' The accumulated data identifier, ad_id''<br/>
 
The session data is compared with the accumulated data (ad) tables, in the same source database.<br>
 
Data with a completely new source data identifier can be identified as completely new and transferred to the ad tables.<br>
 
Data that has the same source data identifier as existing ad records is compared to see if any edits have taken place, on a table by table basis. If so the most recent record, i.e. the one from the sd tables, is ''added'' to the ad tables. The ad tables therefore contain every version of the data as imported. (N.B. It may be necessary to exclude some sd fields from this exercise, if they are altered automatically on each import – e.g. they may simply record the date of export from the source system).
 
<br><br>
 
As ad tables may contain multiple instances of data for the same source study the sd_id can no longer be used to uniquely identify records. Instead a new identifier, '''ad_id''', must be generated for each new record added to the ad 'core' entitiy tables (ad.studies and ad.data_objects), as a simple integer accession number. It will also need to be applied to the other 'attribute' tables, when required, to link the records (the sd_id is also retained). This data is a permanent part of the system so it makes sense to apply a system generated initial accession number.
 
[[File:Data collection ids.png|thumb|500px|<b>Figure 3: Scopes of the data collection ids</b>]]
 
'' Accumulated data audit fields''<br/>
 
As well as ad_id field, ad tables also need to include the following audit and support fields:<br>
 
* added_on: Datetime first added – when this records was first added to the ad table. No records are edited in the ad tables - both completely new and new versions of records are simply added.
 
* last_confirmed_on: Datetime this data was last confirmed. For newly added or edited data will be the same as added_on. For data that is unchanged will be the date time the comparison process reported no change.
 
* is_latest_version: a boolean set to true for a new or edited record added to the ad table (in the latter case any existing record with the same sd_id will need to have this value set to false). This field is used to identify the subset of the AD records that must be used when new SD data is compared with it.
 
* record_status_id: An integer that indicates if this record is an unchanged record (=0), is a completely new record (=1) or a new version of an existing record (=2). Therefore used to identify which records need to be transferred on to the rest of the data transformation process (status > 0).
 
* source_id: An integer indicating the source registry or data object source, taken from the master list in the monitor ''mn'' schema. It could be added during the transfer process to the aggregating schemas but probably easier to make it a field within all AD tables.
 
The ad tables in each registry / source database should therefore end up with a comprehensive record of all the unique data gathered from a particular source, including all versions of that data, and the datetime in which it was gathered and last confirmed.
 
<br><br>
 
'' Aggregate schema identifiers''<br/>
 
''For studies:'' In the aggregate study database the study records from different registries will need to be merged so that any study only has a single entry. To do this, in addition to tables that represent the mdr study tables, the database requires a table that links the ad_ids, the source ids, and the sd_id fields imported from different sources to a single id – again created as an integer accession number (this becomes the final ''id'' in the mdr system). The ad_ids cannot be used because they are not necessarily unique across source schemas. The new id, once generated, also needs to be applied to the records in attribute and linkage tables. <br/><br>
 
A record transferred to the aggregate study schema that represents an edit will therefore require a look up in the ad_id – id linkage table, to discover to which study record the update should apply.<br>
 
A record transferred to the aggregate study DB table that represents a new record will need to be checked to see if it really is a new record or just a new registry entry for a study that already exists in the system. The algorithm for this process needs to be developed but it will need to start by considering the study_identifiers table, to see if the sd_id matches an identifier that has not itself been used as a sd_id. If that is the case the record has been matched with an ‘other identifier’ listed elsewhere. A secondary check, e.g. using a processed version of the study title, could also be used.
 
<br><br>
 
''For data objects:'' An analogous process needs to take place, ensuring that objects of the same type, linked to the same study, are identified and checked to see if they represent a new, distinct object or the same object but described in a different data source (as far as that can be checked given the often limited information available). The algorithms needed for this process will need to be developed. For some types of data objects, such as journal articles, the presence of a doi or similar PID may make comparisons easy. For other types, e.g. protocols and SAPs, there may need to be assumptions made about the numbers of such objects expected (though versioning if present may make this more difficult). Again a central lookup table will be required to allow the system to generate (for new records) or use (for edited records) the definitive id that has been assigned to any data object.
 
<br><br>
 
'' Aggregate schema audit data''<br/>
 
The records in the aggregate databases also need some audit fields, though less than the ad tables:<br>
 
* date_of_data: Datetime added or edited; an indication of the date this record was included in the system in this form.
 
* record_status_id: An integer that indicates if this record is unchanged (=0) newly added (=1) or newly edited (=2).
 
* source_id: An integer indicating the source registry or data object source, taken from the master list in the mn schema.
 
Once all the records are linked to a new study id in the aggregate database, the study – data object link records, in the aggregate data object database, need to be modified to reflect that new id.
 
<br><br>
 
''Core mdr audit data''<br/>
 
As a final step the new or edited data is transferred to the mdr system itself, from the aggrgeate databases. In the mdr the necessary audit fields are <br>
 
* date_of_data: The date this record was included in the system in this form. This is necessary, for example, to identify the records that require new or replacement json files to be generated for them.<br>
 
* source_id: An integer indicating the source registry or data object source.<br>
 
The audit fields in the mdr system, plus the ability to link back to the aggregating and source databases, means that it is possible to check (and if required display) the provenance of any data in the system.<br>
 
Figure 3 illustrates the overlapping 'scope' of each id used within the system, that allows this provenance to be tracked. Initially source data ids are used within sd_id fields, then the source specific 'accumulated data' ids are used in ad_id fields, before the definitive mdr ids are applied in the aggregating schemas and used in the core schema.
 
  
== Initial strategy - Early development ==
+
===Contextual data===
The strategy above describes a relatively mature system, with data pipelines in place and periodic imports taking place. The MDR has not yet reached this state, as we  are still in the process of establishing the pipelines and repeated, periodic imports have not yet been set up. <br/>
+
Some data is common to all sources, and is therefore factored out into a separate database, called 'context'. This has several schemas - including 'lup' for lookup tables and 'ctx' for context tables. The lookup tables are a set of about 25 relatively small tables that hold the codes and values for various 'look ups' in the system, that would normally be presented to the user as options within drop down boxes or lists of check boxes, to select or filter studies and/or objects. The look up tables include, for example, tables with the codes (usually integers) and values for 'study types', 'resource types', 'dataset consent types', 'contribution types', 'time units', 'language codes', etc., etc.<br/>
Instead, at the moment, imports are 'one-off' snapshots, with all ad tables being set up new each time, along with the sd tables. This allows us to focus on the processes required and their documentation, and still generate a central MDR, but leave the more complex comparison mechanisms for later development.
+
The context schema contains a variety of contextual data. This includes codes and names for geographical entities (continents, regions, countries and states), lookup tables that relate publishers names and codes to e-issn and p-issn numbers (used to find the publishers of journal articles), a table of MESH topic codes, and several tables detailing the names, types and other details of organisations. Organisation names appear at several places within the source data, but are often written in several different ways. The system tries to standardise organisation names into a single default form by looking them up within these tables. Developing the context tables and making them as comprehensive as possible is an ongoing exercise. The types of contextual data are listed at  '''[[Contextual Data]]''', which includes links to more details about each type of data.

Latest revision as of 15:37, 10 May 2022

Last updated: 10/05/2022

The Overall Strategy

Data is collected from a growing number of clinical trial registries and data object repositories (collectively known as 'data sources'), transformed into ECRIN schema metadata, and stored in a central database so that it can be accessed by the web portal or APIs. (The individual data sources are listed at MDR Data Sources, which has further links to details about each source.) In fact there are four distinct processes involved in data collection and extraction, which apply to all data sources, and which are shown in figure 1.

Data Download

All data used by the MDR is first downloaded onto an ECRIN managed server, and stored as an XML file. The data may start as an XML file at the source (as for ClinicalTrials.gov or Pubmed) in which case downloading the file is relatively straightforward, using API calls to identify the files required. The data may be in a downloadable csv file (e.g. WHO ICTRP data) which is then processed to generate an XML file per record (row) in the file. The data may be need to be scraped from one or more web pages, in which case an XML file is again constructed for each record.
Data files that are created rather than simply downloaded demand more processing, but that processing can be used to start the process of cleaning and transforming the data from its 'raw' state into one that matches the ECRIN schema. The XML files generated are therefore relatively easy to harvest in the next stage of the process, compared with the 'native' XML files from, for example, PubMed.
Successive download operations result in steadily growing collections of source data, stored locally on the MDR database server. Each source has its own folder, or set of folders. For the WHO data the download process splits the records up and distributes the resulting files to different folders according to the source registry.
For small sources, it is often simpler to re-download the whole set of source data and replace the existing files. For large sources this takes too much time, and just a subset of records are downloaded each time - usually those revised or added since the last download. This does not prevent even large datasets being completely replaced at intervals - perhaps annually - to ensure synchronisation between the source and the MDR's version of it. Either way, at any one time, the MDR has the totality of the relevant material from each source available locally. For more details on the download process, see the Downloading Data page.

Data Harvesting

At intervals the local data can be processed and inserted into a database, a process that in this system is known as 'harvesting'. The harvested data is inserted into - effectively - a temporary holding or staging database, so that it can be examined and additional processing carried out as required.
In the mdr system each source has a single database but uses at least two schemas, or distinct sets of tables. The schema that the harvested data is placed into is known as the session data schema (the name of each table in it is prefixed with 'sd.'). This differentiates it from the accumulated data schema tables (all prefixed with 'ad.'). As the name suggests the accumulated data tables hold the totality of data obtained from the source. They are usually created when the source is first accessed and then gradually grow and are revised over time. The session data tables, on the other hand, hold only the data from the last harvest session. These tables are dropped and recreated each time a harvest session takes place.
In most circumstances, harvests are set up to process only files that have been added or revised since the most recent import operation (described below). This means that only data that is potentially new to the system (specifically to the ad tables in the system) is harvested and placed in the sd tables, for later transfer to the ad tables. In most cases therefore, the sd tables will hold a small fraction of the volume in the ad tables, but it will be the data of current interest, because it is data recently changed or added to the source. (It is possible to do a '100% harvest', but for most sources this would be relatively rare in normal operations). The other important aspect of harvesting is that it completes the transformation of the data into the structure of the ECRIN metadata schema. The different databases will have different numbers of tables in their sd and ad schemas, (some sources are more complex than others) but a table of a particular type will be the same in all the databases, i.e. contain the same fields, and those fields will conform to the ECRIN schema. For the XML files generated by the download process this second transformation stage is usually straightforward. For ClinicalTrials.gov and PubMed files, all the transformation has to be done during harvesting, which can therefore be relatively complex. For Further details please see the Harvesting Data page.

Figure 1: Data collection data flows

















     Figure 1: Data collection data flows
     Data flow is shown for two sources,
     followed by aggregation. The number of
     systems required at each stage is shown,
     along with some supplementary systems that
     are used with all sources.





















Data Import

The data import process brings the data into the accumulated data 'ad' tables. The ad and sd tables are broadly the same in structure (though the ad tables have more audit fields) so the transfer is relatively straightforward. The initial step is to identify what data is new and what has been changed in the sd tables compared to the ad tables. While new data is easy to spot, edited data is not so simple to identify when most data objects do not have intrinsic identifiers, and when data is not necessarily presented in the same order between harvest sessions. The system uses a series of hashing techniques to summarise record content, and this allows changes in records, or closely related groups of records, to be picked up for editing to take place - i.e. replacement of the relevant portion of ad data by the new sd data. See Missing PIDs and Hashing for more details.
If, but only if, the harvest has been a full 100% harvest, so that the sd tables represent all the data available, it is possible to see if any study or object data has been deleted from the source, and which therefore should be deleted from the ad tables. This is relatively rare, but can occur with a few (non trial registry) sources.
Once import has taken place the ad tables should once again be synchronised with the source material. Further details on data import can be found at Importing Data.

Data Aggregation

The accumulated data from the different sources now need to be brought together to form the central, or core, aggregated database. The problem is that both studies and (some types of ) data objects can be referenced in more than one source. About 27,000 studies, for example, are registered in more than one trial registry. Simply aggregating records would cause these studies to have duplicated (sometimes three or more) records in the system. Similarly, some journal articles are referenced by multiple studies in different sources, and simple aggregation would cause confusing duplication of records. The aggregation process must therefore prevent this duplication. For studies this is done by identifying the links between study registrations and ensuring all studies present in more than one registry are aggregated to the same record in the core MDR database. When studies are added they are checked to see if they have been previously identified as the same to another study, and if they already exist in the system. If they do, their id is changed to match the study already present (see Identifying Links between Studies. For PubMed published journal articles all the possible study-article links are first collected together and then de-duplicated using the articles' PubMed identifier, to produce a distinct set. Once that has been done the articles can be added to the system safely. Duplication of data objects other than published papers is rare but checked in a similar way to studies.
The aggregation process always starts from scratch - there is no editing of existing data involved. The aggregate tables are dropped and then recreated, and all data is added to them. This is for simplicity and ease of maintenance - dealing with de-duplication on addition is complex enough without having to deal with possible edits and deletes! The exception to this is the integer identifiers applied to studies and data objects in the core system. These are in effect database accesssion numbers and are kept constant between each weekly aggregation - they therefore act as persistent ECRIN identifiers. At present (May 2022) it takes about 50 minutes for the aggregation process to create all tables and fill them with about 1.75 million study and data object records, along with an approximate 12 million further attribute records.
In the initial aggregation phase the data exists within three different schemas within the main mdr database - one for study data, one for data objects and one for link related data, between both studies and data objects and between studies and studies. The second phase involves importing slightly simplified versions of this data into a single 'core' schema, that can be used as the data source for the web site. The generation of the core schema also sees the production of the 'provenance strings' for each study and object. For studies these may be composite, because the data has been collected from more than one source. At the end of the aggregation process, after the core tables have been created, the system generates the json versions of study and data object data. A fuller description of the aggregation process is available at Aggregating Data.

Other key aspects

There are a few other 'high level' aspects of the data collection system that are worth pointing out. These include:

'Study' versus 'Data Object' data sources

In broad terms there are two types of data sources. 'Study' sources, which are the great majority at the moment, contain data about studies and, if sometimes only implicitly, associated data objects. Examples are the trial registries, which contain various data points about studies (though they usually include a WHO defined subset) and some basic information about data objects - in the case of registries the registry entries themselves, as well as registry based results summaries in many cases, and for some registries related documents (e.g. protocols) that have been uploaded. Many data repositories are also study based (e.g. Yoda, BioLINCC), in the sense that the documents and datasets that are stored are organised around the source study, with the repository usually displaying a web based study summary (the study 'landing page') that lists the data objects available.
Other data sources are 'Object' sources, because they contain only data objects. PubMed is the only current data source of this type. The data about each data object tends to be much richer than that available for 'study sourced' objects, but the data objects can only be added into the system if they include an explicit link to a study, which only some PubMed records do - or if they are explicitly referenced by a study, which is also the case with many PubMed records. The intention is to increase the proportion of Object based sources in the system.
Processing differs for the two types of source. Object studies obviously have no study data and generate no study tables, but linking them to studies, often in multiple ways, can make the processing more complex. The aggregation processing can also be more complex with objects that have many-to-many relationships with studies. Both of these issues occur with PubMed data.

Modular functionality and Ordering of Extraction

The four main modules can operate independently of each other, and were specifically designed to do so. This is to make system operation much easier to operate - it is not necessary to perform operations in a fixed order, though it is often sensible to do so. The 'pipeline' is therefore broken up into four distinct modules.
Thus, data download for any source can take place at any time, the local data store is simply updated / augmented, or for a 100% download, replaced.
Harvesting into the session data tables can also take place, and be repeated, at any time. The session tables are recreated each time the harvest process is run, so there are no side effects from repeating the exercise. If havesting only updated / new data (the default), when there have been no additional downloads since the last import into the ad tables, then the harvest will not generate any data - in a sense there is no point, because the system (the ad tables) already 'knows about' all the data that is present. Other than that Harvesting simply takes the new / updated data (or optionally all the data) and makes it available for import in the sd staging databases.
Import can occur any time there is some data in the sd tables to import. The process does not destroy the SD table data (only a new harvest will do that) so in theory the import could be repeated. But repeating an import will not change the ad tables, as nothing has further has changed in the source. An 'update' harvest collects all the data from the local source that has been added or changed since the last import (not the last harvest). Thus an update harvest should always provide all the data available for import, however many downloads have been done in the meantime.
Finally, the aggregation process is a recreation process - all central tables are destroyed (apart from the ones that retain the accession Ids of studies and objects) and all of the data is re-aggregated by re-adding from each source. It can therefore take place at any time. Aggregation is best done after all the ad tables are updated, and the system is designed to aggregate all the study based sources before the object based ones, so that all study data is present when attempting to link objects, but that is the only ordering built into the processing.

Logging and co-ordination

If the modules are designed to be scheduled and run independently, they do need to be able access a central logging / tracing system that identifies the status of each study (for study based sources) or object (for object based sources) in the system, because the data is processed in packets around each of these 'source' studies or objects. Thus a central database called 'mon', for monitor, includes tables that include a record for each source study and object, across all sources. The 'source data' tables - sf.source_data_studies and sf.source_data_objects, also contain details of remote web URLs and local file locations, the date-times of the most recent downloads, harvests and imports, and the ids of the associated download (or 'saf' event = search and fetch), harvest and import events. In other words these tables indicate where each source package of data is in the pipeline.
The modules use these tables to identify the appropriate actions to take at each stage. The download module adds new records to these tables, and updates existing ones. The harvest module is able to select the local file locations of records whose download date is greater than the last import date (or for whom there is not yet an import date). The import process is driven by differences in the data, but records the imports that occur. Only aggregation ignores these tables, but does produce a set of statistics of its own that area stored within the mon database, summarising the results of aggregation for each source and for the systerm as a whole. Further details on logging can be found at Logging and Tracking.

Scheduling

The various operations, applied to the different sources, are scheduled using Windows Task Manager. Each operation runs once weekly, with the final aggregated dataset being made available each Sunday evening. The details of the current schedule can be found at System Scheduling.

Contextual data

Some data is common to all sources, and is therefore factored out into a separate database, called 'context'. This has several schemas - including 'lup' for lookup tables and 'ctx' for context tables. The lookup tables are a set of about 25 relatively small tables that hold the codes and values for various 'look ups' in the system, that would normally be presented to the user as options within drop down boxes or lists of check boxes, to select or filter studies and/or objects. The look up tables include, for example, tables with the codes (usually integers) and values for 'study types', 'resource types', 'dataset consent types', 'contribution types', 'time units', 'language codes', etc., etc.
The context schema contains a variety of contextual data. This includes codes and names for geographical entities (continents, regions, countries and states), lookup tables that relate publishers names and codes to e-issn and p-issn numbers (used to find the publishers of journal articles), a table of MESH topic codes, and several tables detailing the names, types and other details of organisations. Organisation names appear at several places within the source data, but are often written in several different ways. The system tries to standardise organisation names into a single default form by looking them up within these tables. Developing the context tables and making them as comprehensive as possible is an ongoing exercise. The types of contextual data are listed at Contextual Data, which includes links to more details about each type of data.