Harvesting Data

From ECRIN-MDR Wiki
Revision as of 12:27, 17 November 2020 by Admin (talk | contribs) (Overview of harvesting process)
Jump to navigation Jump to search

Introduction

In the harvest phase the XML files for a data source, or more normally a subset (which one is controlled by the associated parameters) are converted into data in the 'sd' schema (= session data) tables within each source database. On each harvest 'run' the sd tables are dropped and created anew, and thus only ever contain the data from the most recent harvest. The tables present will vary in different databases, though if a table is present it will have a consistent structure in every database. The conversion to sd data therefore represents the second and final stage of the conversion of the source data into the consistent ECRIN schema. For that reason the detailed code for different sources can vary widely.

N.B. The harvest code can be found at https://github.com/ecrin-github/DataHarvester

Parameters

The system is a console app, and thus easily schedulable, and takes the following parameters:
-s, followed by a comma delimited string of integer ids, each representing a source: The source systems to be harvested.
-t, followed by 1 or 2: indicates the type of harvesting to be carried out:
If 1, the harvesting will be of all files in the source folder, representing 100% of the data that has been downloaded.
If 2, the harvest is only of files that have been (re-)downloaded, because they represent new or changed source data, after the date time of the last import process. Note that it is the last import event (for that source) which is important here, not the last harvest event. Multiple harvests between imports do not affect, therefore, the files that are harvested.
-G: is a flag that can be applied that prevents a normal harvest occurring, so that the sd tables are not recreated and reloaded. Instead they are updated using revised contextual data, so that - for example - organisation Ids and topic data codes can be re-applied. The option provides a relatively efficient way of updating data, though obviously works better if preceded with a type 1 full harvest of all data. Because the data is revised the various composite hash values that summarise data content also have to be re-created (see Missing PIDs and Hashing).

Thus, the parameter string
     -s "100120" -t2
will harvest data from source 100120 (ClinicalTrials.gov)that has been (re-)downloaded since the last import process, but the parameter string
     -s "100126" -t1
will harvest all the data from source 100126 (ISRCTN) The parameter string
      -s "101900, 101901" -G
will update the organisation, topic and other context related data for BioLincc and Yoda sd data, but will not re-harvest that data first.

Overview of harvesting process

Unless the -G flag has been applied the initial stage is to recreate the sd tables, to receive the harvested data. The system fires off a series of SQL statements that first drops all sd tables and then drops them. After that...

  • The program selects the relevant source data records from the monitor database - depending on the t parameter. For type 1 all records are selected, though for the larger sources this is done in batches, for type 2 the date time of the last import process - also available from the monitor database - is used to compare against the datetime of the XML file's download / production. Only files downloaded or created since the last import are harvested. The relevant code is provided in Logging and Tracking
  • The program then loops through each of the source XML files, using the local path value in the source data records to locate the file.
  • The xml file is deserialised into a C# object for further processing. The processing that follows is very different for different sources, although all the sources that are derived from WHO ICTRP files are processed in a very similar way. Most current sources are study based, so the usual pattern of processing is to identify the main study data points from the source data first, and then the study attributes (identifiers, titles, topics, features, contributors etc.) and then the associated data objects (registry entries, results entries, linked documents, etc.) and their attributes (instance details, titles, dates, etc.). This results in the creation of a study object
   public class Study
   {
       public string sd_sid { get; set; }
       public string display_title { get; set; }
       public string title_lang_code { get; set; }
       public string brief_description { get; set; }
       public bool bd_contains_html { get; set; }
       public string data_sharing_statement { get; set; }
       public bool dss_contains_html { get; set; }
       public int? study_start_year { get; set; }
       public int? study_start_month { get; set; }
       public int? study_type_id { get; set; }
       public string study_type { get; set; }
       public int? study_status_id { get; set; }
       public string study_status { get; set; }
       public int? study_enrolment { get; set; }
       public int? study_gender_elig_id { get; set; }
       public string study_gender_elig { get; set; }
       public int? min_age { get; set; }
       public int? min_age_units_id { get; set; }
       public string min_age_units { get; set; }
       public int? max_age { get; set; }
       public int? max_age_units_id { get; set; }
       public string max_age_units { get; set; }
       public DateTime? datetime_of_data_fetch { get; set; }
       public List<StudyIdentifier> identifiers { get; set; }
       public List<StudyTitle> titles { get; set; }
       public List<StudyContributor> contributors { get; set; }
       public List<StudyReference> references { get; set; }
       public List<StudyTopic> topics { get; set; }
       public List<StudyFeature> features { get; set; }
       public List<StudyRelationship> relationships { get; set; }
       public List<StudyLink> studylinks { get; set; }
       public List<AvailableIPD> ipd_info { get; set; }
       public List<DataObject> data_objects { get; set; }
       public List<ObjectDataset> object_datasets { get; set; }
       public List<ObjectTitle> object_titles { get; set; }
       public List<ObjectDate> object_dates { get; set; }
       public List<ObjectInstance> object_instances { get; set; }
   }
  • Cleaning of common data / formatting errors occurs as part of the harvesting process, for example the removal of values that inappropriately say 'none' or 'nil' for common study variables, or the extraction of other registry ids from the strings in which they can be buried (using regular expressions).
  • In each case the requirement is to end up with the data that is compatible with the ECRIN metadata schema. For some sources the harvest step is the second part of this conversion process, the first taking place during file generation. For others, where data are downloaded as pre-formed XML files from the source - at the moment ClinicaTrials.gov and PubMed - all the conversion process takes place in the harvesting step.
  • For PubMed data in particular substantial additional processing is necessary. For instance the source data contains information about the journal but not the publisher of the article - that has to be obtained from separate lookup processes using contextual databases, that take place after the main harvesting process, using pissn and eissn numbers in the PubMNed files.

Updating contextual data

  • After the harvest has created the session data in the sd tables, it is necessary to update organisation data in various tables. This uses an internal database of organisations to try and identify as many as possible of the cited organisations in a standardised form - i.e. with a default name and a system id. Without this step the organisation data becomes difficult to interrogate, as the same organisation can be present in the system in so many different forms.
  • Topic data is also standardised as much as possible, with the application of MESH codes (to complement the MESH coding that exists inthe source data) to as many topic terms as possible.
  • Once all data has been harvested and updated, the final steps involve the creation of md5 hashes for the key fields in the sd tables. For attribute records, for both studies and data objects, these hashes are then 'rolled up' to form a single hash for each set of attribute data, and then finally all the hashes for a single study or data object are rolled up into a single 'full hash' for that entity. These hash values play a critical role in identifying edited data during the data import process (see the MDR wiki for more details).