Difference between revisions of "Logging and Tracking"
Revision as of 11:44, 3 November 2020
The need for logging and tracking
The 'mon' (for monitoring) database contains various tables that hold data to support logging and tracking. Not all these systems are fully developed, but the main elements are in place.
Here 'logging' refers to recording what has happened within the system during any major operation, and 'tracking' refers to maintaining a record of the state of the system, in particular the status of the various source data packages. A 'source data package' is the complete block of data obtained about any study from a study-based source, i.e. the study details, the study attribute data, the linked data object details and the object attribute data. For object-based data sources such as PubMed, each package consists of the object details and the object attribute data. These are distinct packages in the sense that they are downloaded and harvested as a unit, even if they are split into different components before being imported and aggregated. There are currently (November 2020) about 600,000 study data packages tracked by the system, and about 200,000 object packages.
The two main drivers behind developing the mon tables and systems were:
- The need for each major process to be able to see the status of each data package - for example to identify those that have been downloaded since the last import, or those which are new to the system. This is because the 4 major processes (download, harvest, import and aggregation) operate independently of each other, and so need to use, and update, a common set of reference data.
- The need to check the correct functioning of the system and to pick up apparent errors and anomalies when they occur.
System status: The source data tables
The source data tables
<< diagram of table structure>>
study-study links (??? - in nk surely)
logging data layer - the logging repo
standard functions for doing the tracking
Logging of data download is critical because it provides the basis for orchestrating processes later on in the extraction pathway. A record is created for each study that is downloaded (in study-based sources like trial registries) or for each data object downloaded (for object-based resources like PubMed). The **'data source record'** that is established includes:
- the source id,
- the object's own id, in the source data (e.g. a registry identifier or PubMed id),
- the URL of its record on the web - if it has one. This applies even to data that is not collected directly from the web, such as from WHO csv files.
- the local path where the downloaded or created XML file is stored
- the date-time that the record was last revised, if available
- a boolean indicating if the record is assumed complete (used when no revision date is available)
- the download status - an integer - where 0 indicates found in a search but not yet (re)downloaded, and 2 indicates downloaded.
- the id of the fetch / search event in which it was last downloaded / created
- the date-time of that fetch / search
- the id of the harvest event in which it was last harvested
- the date-time of that harvest
- the id of the import event in which it was last imported
- the date-time of that import
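The fields listed above can be sketched as a simple record type, together with the kind of status check that the import process relies on. This is a minimal illustration in Python, not the system's actual code: the class name, every field name and both helper functions are hypothetical, and treating "new to the system" as "downloaded but never imported" is an assumption made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Download status codes as described in the list above.
STATUS_FOUND_NOT_DOWNLOADED = 0
STATUS_DOWNLOADED = 2


@dataclass
class SourceDataRecord:
    """One tracked data package. All names here are hypothetical."""
    source_id: int
    sd_id: str                                  # the object's own id in the source data
    remote_url: Optional[str] = None            # URL of its record on the web, if any
    local_path: Optional[str] = None            # where the XML file is stored
    last_revised: Optional[datetime] = None     # if available from the source
    assume_complete: Optional[bool] = None      # used when no revision date exists
    download_status: int = STATUS_FOUND_NOT_DOWNLOADED
    last_fetch_id: Optional[int] = None         # fetch / search event id
    last_downloaded: Optional[datetime] = None
    last_harvest_id: Optional[int] = None
    last_harvested: Optional[datetime] = None
    last_import_id: Optional[int] = None
    last_imported: Optional[datetime] = None


def never_imported(rec: SourceDataRecord) -> bool:
    """True if the package has been downloaded but has no import recorded."""
    return rec.download_status == STATUS_DOWNLOADED and rec.last_imported is None


def downloaded_since_last_import(rec: SourceDataRecord) -> bool:
    """True if the latest download is more recent than the latest import."""
    if rec.download_status != STATUS_DOWNLOADED or rec.last_downloaded is None:
        return False
    return rec.last_imported is None or rec.last_downloaded > rec.last_imported
```

Because each process only reads and updates its own id and date-time fields, a check like `downloaded_since_last_import` needs nothing beyond the shared record, which is what allows download, harvest and import to run independently of one another.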
The source parameters
Holds a central position
Sources are organisations, so the original record for each source is in the contexts organisation table.
That table provides the source ids.
Processing-specific data, however, is held in this table.
It is used in all phases of processing to know which tables to expect to find / process in each database.
The event records
The events tables and types
Creating and filling event records
statistics linked to aggregation
Extraction notes
Purpose and usage
Extraction tables
feedback and notes, serialising feedback