Difference between revisions of "Importing Data"

From ECRIN-MDR Wiki
Jump to navigation Jump to search
(Identifying new or Changed Data=)
(Identifying new or Changed Data)
Line 18: Line 18:
  
 
===Overview===
 
===Overview===
====Identifying new or Changed Data====
+
====Identifying New or Changed Data====
The system compares the study data ids and content in the session and accumulated data and identifies a) those studies in the sd tables that are new, b) those studies that have been edited in any way, including any change in a study attribute, c) those studies that are unchanged, and d) those studies that have disappeared from the sd data. The last is relatively rare and can only be estimated if the previous harvest was of 100% of the source.<br/>
+
The initial stage of the import process is to recreate some
 +
 
 +
he system compares the study data ids and content in the session and accumulated data and identifies a) those studies in the sd tables that are new, b) those studies that have been edited in any way, including any change in a study attribute, c) those studies that are unchanged, and d) those studies that have disappeared from the sd data. The last is relatively rare and can only be estimated if the previous harvest was of 100% of the source.<br/>
 +
        public void IdentifyNewStudies()
 +
        {
 +
            string sql_string = @"INSERT INTO sd.to_ad_study_recs (sd_sid, status)
 +
                    SELECT s.sd_sid, 1 from sd.studies s
 +
                    LEFT JOIN ad.studies a
 +
                    on s.sd_sid = a.sd_sid
 +
                    WHERE a.sd_sid is null;";
 +
            ExecuteSQL(sql_string);
 +
        }
 +
 
 +
 
 +
        public void IdentifyEditedStudies()
 +
        {
 +
            string sql_string = @"INSERT INTO sd.to_ad_study_recs (sd_sid, status)
 +
                SELECT s.sd_sid, 2 from sd.studies s
 +
                INNER JOIN ad.studies a
 +
                on s.sd_sid = a.sd_sid
 +
                where s.study_full_hash <> a.study_full_hash;";
 +
            ExecuteSQL(sql_string);
 +
        }
 +
 
 +
 
 +
        public void IdentifyIdenticalStudies()
 +
        {
 +
            string sql_string = @"INSERT INTO sd.to_ad_study_recs (sd_sid, status)
 +
                SELECT s.sd_sid, 3 from sd.studies s
 +
                INNER JOIN ad.studies a
 +
                on s.sd_sid = a.sd_sid
 +
                where s.study_full_hash = a.study_full_hash;";
 +
            ExecuteSQL(sql_string);
 +
        }
 +
 
 +
 
 +
        public void IdentifyDeletedStudies()
 +
        {
 +
            string sql_string = @"INSERT INTO sd.to_ad_study_recs(sd_sid, status)
 +
                SELECT a.sd_sid, 4 from ad.studies a
 +
                LEFT JOIN sd.studies s
 +
                on a.sd_sid = s.sd_sid
 +
                WHERE s.sd_sid is null;";
 +
            ExecuteSQL(sql_string);
 +
        }
 +
 
 +
 
 
The same 4 categories are then constructed for the data objects.<br/>  
 
The same 4 categories are then constructed for the data objects.<br/>  
 
For studies that have been edited in any way (this is discovered by comparing composite hash values constructed from the totality of each study's data) the nature of the edit is then identified - i.e. whether or not it involves changes to the main study record, and / or if it involves an addition, edit, or deletion of any study attribute record. Again this process is repeated for the data object data.<br/>  
 
For studies that have been edited in any way (this is discovered by comparing composite hash values constructed from the totality of each study's data) the nature of the edit is then identified - i.e. whether or not it involves changes to the main study record, and / or if it involves an addition, edit, or deletion of any study attribute record. Again this process is repeated for the data object data.<br/>  
 
At the end of this process the system has - within the ad schema of the source database - 4 temporary tables that hold this information, in effect providing a complete record of the differences between the session data and accumulated data sets.<br/>  
 
At the end of this process the system has - within the ad schema of the source database - 4 temporary tables that hold this information, in effect providing a complete record of the differences between the session data and accumulated data sets.<br/>  
 
<br/>
 
<br/>
 +
 
====Actioning the changes====
 
====Actioning the changes====
 
The system then works through the various categories of data that have been identified:
 
The system then works through the various categories of data that have been identified:

Revision as of 11:34, 18 November 2020

Introduction

The program takes the data in the session data (sd) tables in a source database (the 'session data' having been created by the most recent harvest operation), and compares it with the accumulated data for each source, stored in the accumulated data (ad) tables. New and revised data are then transferred to the ad tables. Unlike the preceding download and harvest steps, the import process is essentially the same for all data sources, because the sd and ad tables have the same structure in each database. The listing of tables present in each DB varies but not the table structures themselves.

N.B. The download code can be found at https://github.com/ecrin-github/DataImporter

Parameters

The system is a console app, (to more easily support being scheduled) and can take takes 2 parameters:
-s, followed by a comma separated list of integer ids, each representing a data source within the system: The program takes each of these ids in turn, and carries out the sd to ad import process for each of them.
-T: as a flag. If present, forces the recreation of a new set of accumulated data tables. This parameter should therefore only be used when creating the ad tables for the first time, or when the entire ad schema is recreated from harvested data that represents the totality of the data available from the source. In other words when carrying out a full rebuild of the source's data from a full harvest.
Thus, the parameter string
     -s "100116, 100124, 100132"
will cause data to be imported from the session data tables for each of the Australian, German and Dutch trial registries, in that order.
The parameter string
     -s "101900, 101901" -T
will cause, for the BioLinncc and Yoda repositories respectively, the current ad tables to be dropped and then rebuilt, before being filled with the most recent session data harvested from those repositories.
Note there is no parameter to allow using only part of the harvested data in the sd tables - all of the sd data is always used. The way in which different data is imported, e.g. data that has been revised or added after a certain date, is by controlling the data harvest processing that precedes the import step.

Overview

Identifying New or Changed Data

The initial stage of the import process is to recreate some

he system compares the study data ids and content in the session and accumulated data and identifies a) those studies in the sd tables that are new, b) those studies that have been edited in any way, including any change in a study attribute, c) those studies that are unchanged, and d) those studies that have disappeared from the sd data. The last is relatively rare and can only be estimated if the previous harvest was of 100% of the source.

       public void IdentifyNewStudies()
       {
           string sql_string = @"INSERT INTO sd.to_ad_study_recs (sd_sid, status)
                   SELECT s.sd_sid, 1 from sd.studies s
                   LEFT JOIN ad.studies a
                   on s.sd_sid = a.sd_sid 
                   WHERE a.sd_sid is null;";
           ExecuteSQL(sql_string);
       }


       public void IdentifyEditedStudies()
       {
           string sql_string = @"INSERT INTO sd.to_ad_study_recs (sd_sid, status)
               SELECT s.sd_sid, 2 from sd.studies s
               INNER JOIN ad.studies a
               on s.sd_sid = a.sd_sid
               where s.study_full_hash <> a.study_full_hash;";
           ExecuteSQL(sql_string);
       }


       public void IdentifyIdenticalStudies()
       {
           string sql_string = @"INSERT INTO sd.to_ad_study_recs (sd_sid, status)
               SELECT s.sd_sid, 3 from sd.studies s
               INNER JOIN ad.studies a
               on s.sd_sid = a.sd_sid
               where s.study_full_hash = a.study_full_hash;";
           ExecuteSQL(sql_string);
       }


       public void IdentifyDeletedStudies()
       {
           string sql_string = @"INSERT INTO sd.to_ad_study_recs(sd_sid, status)
               SELECT a.sd_sid, 4 from ad.studies a
               LEFT JOIN sd.studies s
               on a.sd_sid = s.sd_sid
               WHERE s.sd_sid is null;";
           ExecuteSQL(sql_string);
       }


The same 4 categories are then constructed for the data objects.
For studies that have been edited in any way (this is discovered by comparing composite hash values constructed from the totality of each study's data) the nature of the edit is then identified - i.e. whether or not it involves changes to the main study record, and / or if it involves an addition, edit, or deletion of any study attribute record. Again this process is repeated for the data object data.
At the end of this process the system has - within the ad schema of the source database - 4 temporary tables that hold this information, in effect providing a complete record of the differences between the session data and accumulated data sets.

Actioning the changes

The system then works through the various categories of data that have been identified:

  • Any new studies are directly imported into the ad tables along with all their attributes.
  • For edited studies, the nature of the edit is examined. If a change has occured in the main (singleton) study record, that record is replaced with the new version from the sd data. If a change has occured in a study attribute, *all* the attribute records of that type are replaced by *all* the attribute records of that type in the sd data. There is no attempt to try and match individual attribute records to see which specifically have been changed / added / deleted - partly because of the lack of persistent identifiers for these records. Instead the whole set of attributes is replaced. If a completely new type of attribute appears for a study all the records of that attribvute type are added. If an attribute type completely disapears from a study all the corresponding attribute records are removed.
  • For unchanged studies the 'date of data fetch' is updated to match that in the sd data but no other changes are applied. This indicates the last date the data was examined, even if the data was unchanged. The same update, of 'date of data fetch' is made to edited records and is contained automatically within new records.
  • For deleted studies, if it has been possible to identify these, the entire study and all attributes are removed from the ad tables.

All 4 steps are then repeated for the data object data.