Logging and Tracking

The need for logging and tracking

The 'mon' (for monitoring) database contains various tables that hold data to support logging and tracking. Not all these systems are fully developed, but the main elements are in place.
Here 'logging' refers to recording what has happened within the system, during any major operation, and 'tracking' refers to maintaining a record of the state of the system, in particular the status of the various source data packages.
For study based data sources, a 'source data package' is the complete block of data obtained about an individual study, i.e. the study details, the study attribute data, the linked data object details and the object attribute data.
For object based data sources, such as PubMed, each source data package consists only of the object details and the object attribute data.
In both cases these are distinct packages that are downloaded and harvested as a unit, even if they are split into different components before being imported and aggregated. There are currently (November 2020) about 600,000 study source data packages tracked by the system, and about 200,000 object source data packages.
Because the four major processes (download, harvest, import and aggregation) operate independently of each other, they need to use, and update, a common set of reference data that provides the status of each source data package. This allows, for example, the harvest system to locate the files previously downloaded or created during data download, or to identify those that have been downloaded since the last import for any particular source.
The logging systems are required to provide feedback on the system's activity, to check the correct functioning of the system, and to identify apparent errors and anomalies.

System status: The source data tables

The source data tables are used to store the details about the source data packages, and are therefore the chief stores of the system's state. There are two of them - one for studies (sf.source_data_studies) and one for objects (sf.source_data_objects), though they both have exactly the same structure, as listed below.

  • an integer id, created as an identity field
  • the source id, the integer id of the data source
  • the object's own id in the source data (e.g. a registry identifier or PubMed id)
  • the URL of its record on the web, if it has one (this applies even to data that is not collected directly from the web, such as data from the WHO csv files)
  • the local path where the downloaded or created XML file is stored
  • the date-time the record was last revised, if available
  • a boolean indicating whether the record is assumed complete (used when no revision date is available)
  • the download status, an integer, where 0 indicates found in a search but not yet (re)downloaded, and 2 indicates downloaded
  • the id of the fetch / search event in which the data package was last downloaded / created
  • the date-time of that fetch / search
  • the id of the harvest event in which it was last harvested
  • the date-time of that harvest
  • the id of the import event in which it was last imported
  • the date-time of that import
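
In the code, each selected row of these tables is materialised as an instance of a StudyFileRecord class, via the Query<StudyFileRecord>() calls shown further down this page. The class itself is not reproduced here, so the sketch below is an assumption: the property names follow the column names in the select list used below, and the types are inferred from the field descriptions above.

        // A sketch of the StudyFileRecord class onto which the rows are
        // mapped. Property names follow the column names in the select
        // list below; the types are inferred from the field descriptions
        // above and may differ from the actual class.
        public class StudyFileRecord
        {
            public int id { get; set; }                    // identity field
            public int source_id { get; set; }             // integer id of the data source
            public string sd_id { get; set; }              // the object's own id in the source data
            public string remote_url { get; set; }         // URL of the record on the web, if any
            public DateTime? last_revised { get; set; }    // last revision date-time, if available
            public bool? assume_complete { get; set; }     // used when no revision date is available
            public int download_status { get; set; }       // 0 = found, not yet (re)downloaded; 2 = downloaded
            public string local_path { get; set; }         // where the downloaded / created XML file is stored
            public int? last_saf_id { get; set; }          // id of the last search and fetch event
            public DateTime? last_downloaded { get; set; } // date-time of that fetch / search
            public int? last_harvest_id { get; set; }      // id of the last harvest event
            public DateTime? last_harvested { get; set; }  // date-time of that harvest
            public int? last_import_id { get; set; }       // id of the last import event
            public DateTime? last_imported { get; set; }   // date-time of that import
        }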

Figure 1 provides an image of a few of the rows in the source_data_studies table.

[Image: Source data studies.png]
Figure 1: Sample lines from the source_data_studies table

These two tables are accessed and manipulated in the download, harvest and import processes using a data repository class (logging_repo) that connects to the mon database, and which includes a number of standard or similar functions.
For instance, in the download system the logging repo contains a function that either creates a new source data record (for a newly discovered study or data object) or updates an existing record - in particular the revision date, the download status, and the event id and date-time of the most recent download. In the harvesting system an analogous function updates the id and date-time of the last harvest, while the import system updates the import event id and the import date-time.
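
The create-or-update function itself is not reproduced on this page. As an illustration only, a minimal sketch of such an upsert, written in the same Dapper / Npgsql style as the code below, might look like the following - the method name, the exact SQL, and the assumption of a unique constraint on (source_id, sd_id) are all hypothetical.

        // A sketch only - an upsert assuming a unique constraint on
        // (source_id, sd_id). The method name and SQL are illustrative,
        // not the actual logging_repo code.
        public void StoreDownloadRecord(StudyFileRecord r)
        {
            string sql_string = @"insert into sf.source_data_studies
                     (source_id, sd_id, remote_url, last_revised,
                      download_status, local_path, last_saf_id, last_downloaded)
                 values (@source_id, @sd_id, @remote_url, @last_revised,
                      @download_status, @local_path, @last_saf_id, @last_downloaded)
                 on conflict (source_id, sd_id) do update
                 set remote_url = excluded.remote_url,
                     last_revised = excluded.last_revised,
                     download_status = excluded.download_status,
                     local_path = excluded.local_path,
                     last_saf_id = excluded.last_saf_id,
                     last_downloaded = excluded.last_downloaded;";

            using (NpgsqlConnection Conn = new NpgsqlConnection(connString))
            {
                Conn.Execute(sql_string, r);    // Dapper binds the @ parameters
            }                                   // from the record's properties
        }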
The logging_repo also accesses the source data tables to identify the files to be harvested, comparing the dates of last download and import to identify the relevant files. As the code below shows, a SQL statement is constructed using a standard select list (set up in the class's constructor), with a GetWhereClause() function generating most of the remaining SQL required.
The GetWhereClause() function uses the harvest type parameter to determine whether all the records from a particular source are selected, or just those where the import date-time is null or earlier than the download date-time.
When large numbers of records are involved, a version that adds an offset and limit clause to the SQL statement is available.

        ...
        // (in the class constructor)
        sql_file_select_string = "select id, source_id, sd_id, remote_url, last_revised, ";
        sql_file_select_string += " assume_complete, download_status, local_path, last_saf_id, last_downloaded, ";
        sql_file_select_string += " last_harvest_id, last_harvested, last_import_id, last_imported ";
        ...


        // Returns the source data records for the given source that qualify
        // for harvesting under the given harvest type.
        public IEnumerable<StudyFileRecord> FetchStudyFileRecords(int source_id, int harvest_type_id = 1)
        {
            string sql_string = sql_file_select_string;
            sql_string += " from sf.source_data_studies ";
            sql_string += GetWhereClause(source_id, harvest_type_id);
            sql_string += " order by local_path";

            using (NpgsqlConnection Conn = new NpgsqlConnection(connString))
            {
                return Conn.Query<StudyFileRecord>(sql_string);
            }
        }

        // As above, but fetches the records in a batch of a given size,
        // starting at a given offset - used when large numbers of records
        // are involved.
        public IEnumerable<StudyFileRecord> FetchStudyFileRecordsByOffset(int source_id, int offset_num,
                                      int amount, int harvest_type_id = 1)
        {
            string sql_string = sql_file_select_string;
            sql_string += " from sf.source_data_studies ";
            sql_string += GetWhereClause(source_id, harvest_type_id);
            sql_string += " order by local_path ";
            sql_string += " offset " + offset_num.ToString() + " limit " + amount.ToString();

            using (NpgsqlConnection Conn = new NpgsqlConnection(connString))
            {
                return Conn.Query<StudyFileRecord>(sql_string);
            }
        }

        private string GetWhereClause(int source_id, int harvest_type_id)
        {
            // Harvest type 1: all files from the given source are selected.
            string where_clause = " where source_id = " + source_id.ToString();

            if (harvest_type_id == 2)
            {
                // Harvest type 2: only files that have been downloaded since the
                // last import are selected - NOT since the last harvest, as
                // multiple harvests may have been carried out. A file should be
                // harvested for import if it has not yet been imported, or if a
                // new download (possibly a new version) has taken place since
                // the import. So files are needed where the download date-time
                // is >= the import date-time, or where they are new and
                // therefore have a null import date.

                where_clause += " and (last_downloaded >= last_imported or last_imported is null) ";
            }

            // In all cases, ignore records that have no stored file.
            where_clause += " and local_path is not null";

            return where_clause;
        }
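
As an illustration of the offset version in use, a batched harvest might be driven by a loop like the sketch below, in which repo (a logging_repo instance), source_id and the ProcessFile() helper are assumed names, and the batch size is arbitrary.

        // A usage sketch only - repo, source_id and ProcessFile() are
        // assumed names; the batch size is arbitrary. ToList() requires
        // using System.Linq.
        public void HarvestInBatches(int source_id)
        {
            int batch_size = 10000;
            int offset = 0;
            while (true)
            {
                var batch = repo.FetchStudyFileRecordsByOffset(source_id,
                                         offset, batch_size).ToList();
                if (batch.Count == 0) break;   // no more records for this source
                foreach (StudyFileRecord rec in batch)
                {
                    ProcessFile(rec);          // hypothetical per-file harvest step
                }
                offset += batch_size;
            }
        }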

Standard functions for doing the tracking

The source parameters

  • This table holds a central position in the system.
  • Sources are organisations, so the original record for each source is held in the organisation table of the context database, which provides the ids.
  • Processing specific data for each source, however, is held in this table.
  • It is used in all phases of processing to establish what tables to expect to find / process in each database.

The event records

  • The events tables and types
  • Creating and filling event records
  • Statistics linked to aggregation

Extraction notes

  • Purpose and usage
  • Extraction tables
  • Feedback and notes, serialising feedback