Difference between revisions of "Logging and Tracking"

Revision as of 12:02, 3 November 2020

The 'mon' (for monitoring) database contains various tables that hold data to support logging and tracking. Not all these systems are fully developed, but the main elements are in place.
Here 'logging' refers to recording what has happened within the system during any major operation, and 'tracking' refers to maintaining a record of the state of the system, in particular the status of the various source data packages. A 'source data package' is the complete block of data obtained about any study from a study-based source, i.e. the study details, the study attribute data, the linked data object details and the object attribute data. For object-based data sources such as PubMed, each package consists of the object details and the object attribute data. These are distinct packages in the sense that they are downloaded and harvested as a unit, even if they are split into different components before being imported and aggregated. There are currently (November 2020) about 600,000 study data packages tracked by the system, and about 200,000 object packages.
Because the four major processes (download, harvest, import and aggregation) operate independently of each other, they need to use, and update, a common set of status reference data. This allows, for example, the harvest system to locate the files previously downloaded or created during data download, or to identify those that have been downloaded since the last import for any particular source. The logging systems are required to provide feedback on the system's activity, to check that the system is functioning correctly, and to identify apparent errors and anomalies.
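
As an illustration of how shared status data lets the independent processes coordinate, the following is a minimal Python sketch of selecting the packages that have been downloaded since their last import. It is not the system's own code: the class name PackageStatus, its field names, the function name and the sample ids are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class PackageStatus:
    # Hypothetical, cut-down view of one tracked source data package.
    sd_id: str                                   # the package's id in the source data
    last_downloaded: Optional[datetime] = None   # None if never downloaded
    last_imported: Optional[datetime] = None     # None if never imported


def downloaded_since_last_import(packages: List[PackageStatus]) -> List[PackageStatus]:
    """Return the packages whose latest download is more recent than their latest import."""
    selected = []
    for p in packages:
        if p.last_downloaded is None:
            continue  # never downloaded, so there is nothing to process
        if p.last_imported is None or p.last_downloaded > p.last_imported:
            selected.append(p)
    return selected


# Made-up sample data, for illustration only.
packages = [
    PackageStatus("A001", datetime(2020, 10, 1), datetime(2020, 10, 5)),
    PackageStatus("A002", datetime(2020, 10, 20), datetime(2020, 10, 5)),
    PackageStatus("A003", datetime(2020, 10, 21), None),
    PackageStatus("A004"),
]
pending = [p.sd_id for p in downloaded_since_last_import(packages)]
print(pending)
```

Because each process only reads and writes the shared status fields, the download step in this sketch never needs to know whether, or when, an import will run.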

System status: The source data tables

The source data tables store the details about the source data packages, and are therefore the chief stores of the system's state. There are two of them - one for studies (sf.source_data_studies) and one for objects (sf.source_data_objects), though they both have exactly the same structure, as listed below.

an integer id, created as an identity field
the source id, the integer id of the data source
the study's or object's own id in the source data (e.g. a registry identifier or PubMed id)
the URL of its record on the web - if it has one. This applies even to data that is not collected directly from the web, such as from WHO csv files.
the local path where the XML file downloaded or created is stored
the datetime that the record was last revised, if available
a boolean indicating if the record is assumed complete (used when no revision date is available)
the download status - an integer - where 0 indicates found in a search but not yet (re)downloaded, and 2 indicates downloaded.
the id of the fetch / search event in which the data package was last downloaded / created
the date-time of that fetch / search
the id of the harvest event in which it was last harvested
the date-time of that harvest
the id of the import event in which it was last imported
the date-time of that import
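
The structure listed above could be mirrored in code roughly as follows. This is a hedged sketch only: the field names, the helper mark_downloaded and the sample values are assumptions chosen for illustration and are not taken from the actual table definitions. The only values drawn from the description are the two download status codes, 0 and 2.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Download status values as described in the list above.
STATUS_FOUND_NOT_DOWNLOADED = 0
STATUS_DOWNLOADED = 2


@dataclass
class SourceDataRecord:
    # Field names are illustrative assumptions, one per listed column.
    id: int                                      # identity field
    source_id: int                               # integer id of the data source
    sd_id: str                                   # the package's own id in the source data
    remote_url: Optional[str] = None             # URL of the record on the web, if any
    local_path: Optional[str] = None             # where the XML file is stored
    last_revised: Optional[datetime] = None      # when the record was last revised
    assume_complete: Optional[bool] = None       # used when no revision date is available
    download_status: int = STATUS_FOUND_NOT_DOWNLOADED
    last_fetch_id: Optional[int] = None          # fetch / search event id
    last_downloaded: Optional[datetime] = None   # date-time of that fetch / search
    last_harvest_id: Optional[int] = None        # harvest event id
    last_harvested: Optional[datetime] = None    # date-time of that harvest
    last_import_id: Optional[int] = None         # import event id
    last_imported: Optional[datetime] = None     # date-time of that import


def mark_downloaded(rec: SourceDataRecord, fetch_event_id: int,
                    local_path: str, when: datetime) -> SourceDataRecord:
    """Record that a package was downloaded (or created) during a fetch event."""
    rec.download_status = STATUS_DOWNLOADED
    rec.local_path = local_path
    rec.last_fetch_id = fetch_event_id
    rec.last_downloaded = when
    return rec


# Hypothetical values, for illustration only.
rec = SourceDataRecord(id=1, source_id=100, sd_id="EXAMPLE-0001")
mark_downloaded(rec, fetch_event_id=42,
                local_path="/data/example/EXAMPLE-0001.xml",
                when=datetime(2020, 11, 3, 12, 2))
print(rec.download_status, rec.last_fetch_id)
```

Harvest and import helpers would follow the same pattern, each updating only its own event id and date-time so that the other processes' fields are left untouched.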

Figure 1 provides an image of a few of the rows in the source_data_studies table.

Figure 1: Sample lines from the source_data_studies table

These two tables

logging data layer - the logging repo
standard functions for doing the tracking

The source parameters

Holds a central position
Sources are organisations, so the original records for each source are in the contexts organisation table
That gives the ids
But processing specific data is in this table
Used in all phases of processing to know what tables to expect to find / process in each database

The event records

The events tables and types
Creating and filling event records
statistics linked to aggregation

Extraction notes

Purpose and usage
Extraction tables
feedback and notes, serialising feedback
