System Scheduling

From ECRIN-MDR Wiki
Revision as of 18:04, 13 January 2021 by Admin (talk | contribs) (Aggregating data (Sunday))

Introduction

The various data extraction and data aggregation processes are all scheduled on a weekly basis. Each is carried out by a console application, which is called with the parameters needed to run the specified operation against the target source. Windows Task Scheduler is used to control the scheduling. Details of the scheduling will vary as more sources are added, but the situation as of January 2021 is described below. Several days of the week are each assigned to a particular task, with a few spare days available for catching up.
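As an illustration of how one of the weekly jobs below might be registered, the following is a minimal sketch that builds a `schtasks /Create` command line. It assumes the tasks are created from the command line rather than through the Task Scheduler GUI, which this page does not state, and the task name is hypothetical. The executable path keeps the elided `...` prefix used in the tables on this page, so the command is only built here, not run.

```python
# Sketch only: build (but do not run) a schtasks command for one weekly job.
# Assumptions: tasks are registered with schtasks, and the task name is invented.

def weekly_task_command(task_name, day, start_time, call):
    """Return the argument list for creating a weekly scheduled task."""
    return [
        "schtasks", "/Create",
        "/SC", "WEEKLY",     # schedule type: once a week
        "/D", day,           # day of the week, e.g. MON
        "/ST", start_time,   # start time in 24-hour HH:mm format
        "/TN", task_name,    # hypothetical task name
        "/TR", call,         # the command line the task runs
    ]

cmd = weekly_task_command(
    "MDR download Yoda",                          # hypothetical name
    "MON",
    "07:00",
    r"...\DataDownloader.exe -s 101901 -t 102",   # path elided as in the tables
)
print(cmd)
```

Keeping the arguments as a list leaves quoting of the `/TR` value to whatever launches the command.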

Downloads (Monday)

Most of the downloads are scheduled for Monday. Because they run weekly, most complete quite quickly. The main exception is the EU CTR download, because that source does not provide a 'date last revised' field. The current schedule is as follows, with the relevant parameters given:

Time   Target              Call
07:00  Yoda                ...\DataDownloader.exe -s 101901 -t 102
07:30  BioLINCC            ...\DataDownloader.exe -s 101900 -t 102
09:00  ClinicalTrials.gov  ...\DataDownloader.exe -s 100120 -t 111
11:00  ISRCTN              ...\DataDownloader.exe -s 100126 -t 112
13:00  WHO                 ...\DataDownloader.exe -s 100115 -t 113 -f "C:\MDR_sources\WHO\<file name>.csv"
14:00  EUCTR               ...\DataDownloader.exe -s 100123 -t 142

Note that the WHO 'download' involves processing a specific named file (with a different date-stamp each week), and the file name must therefore be manually inserted into the call each week.
The EUCTR download is placed last because it is by far the longest and also the one most prone to errors, because of apparent issues and maintenance work on the web site. It may therefore need to be re-run the following day.
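The manual step of inserting the WHO file name could in principle be assisted by a small helper. The sketch below picks the most recently modified `.csv` file in a folder and builds the WHO download arguments around it. It assumes that the newest file by modification time is the one wanted, which this page does not confirm (the date-stamp format in the file name is not given here), and the function names are hypothetical.

```python
from pathlib import Path

def latest_csv(folder):
    """Return the most recently modified .csv file in folder, or None if there is none."""
    files = sorted(Path(folder).glob("*.csv"), key=lambda p: p.stat().st_mtime)
    return files[-1] if files else None

def who_download_args(csv_path):
    """Build the WHO download arguments from the Monday schedule for a given file."""
    # Source id 100115 and type 113 are taken from the schedule above.
    return ["-s", "100115", "-t", "113", "-f", str(csv_path)]
```

For example, `who_download_args(latest_csv(r"C:\MDR_sources\WHO"))` would give the arguments for the newest file in that folder, provided at least one `.csv` file is present. The result should still be checked by eye before the scheduled call is updated.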

Harvests (Wednesday)

Harvests are relatively straightforward and are scheduled for Wednesdays. The ids for the sources are string arrays and therefore enclosed in quotes, even when there is only one of them. (The Downloader exe, by contrast, only expects a single integer source.)

Time   Target              Call
07:00  BioLINCC            ...\DataHarvester.exe -s "101900" -t 1
07:30  Yoda                ...\DataHarvester.exe -s "101901" -t 1
08:00  ClinicalTrials.gov  ...\DataHarvester.exe -s "100120" -t 2
09:00  ISRCTN              ...\DataHarvester.exe -s "100126" -t 2
10:00  EUCTR               ...\DataHarvester.exe -s "100123" -t 2
13:00  WHO A               ...\DataHarvester.exe -s "100116, 100117, 100118, 100119" -t 2
13:00  WHO B               ...\DataHarvester.exe -s "100121, 100122, 100124, 100125" -t 2
13:00  WHO C               ...\DataHarvester.exe -s "100127, 100128, 100129, 100130, 100131, 100132, 101989" -t 2

The WHO A, B and C harvests refer to different collections of WHO registries. A processes sources 100116 (the Australia / New Zealand registry), 100117 (the Brazilian registry), 100118 (the Chinese registry) and 100119 (the South Korean registry). B processes 100121 (the Indian registry), 100122 (the Cuban registry), 100124 (the German DRKS registry) and 100125 (the Iranian registry). C processes the rest: 100127 through to 100132 (registries in Japan, Africa, Peru, Sri Lanka, Thailand and the Netherlands respectively), plus 101989 (the Lebanese registry).
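The grouping described above can be written down as data, which makes it easier to generate the quoted `-s` value consistently. This is a minimal sketch only: the dictionary and function names are hypothetical, and the ids are those listed in the paragraph above.

```python
# WHO registry source ids, grouped as described in the text (names are illustrative).
WHO_GROUPS = {
    "A": [100116, 100117, 100118, 100119],
    "B": [100121, 100122, 100124, 100125],
    "C": [100127, 100128, 100129, 100130, 100131, 100132, 101989],
}

def source_argument(ids):
    """Join source ids into the single comma-separated string passed after -s."""
    return ", ".join(str(i) for i in ids)

def harvest_args(group):
    """Arguments for harvesting one WHO group, using -t 2 as in the schedule."""
    return ["-s", source_argument(WHO_GROUPS[group]), "-t", "2"]

for name in WHO_GROUPS:
    print(name, harvest_args(name))
```

Because the ids are joined into one string, the whole list reaches the harvester as a single argument, which is what the quotes achieve in the scheduled calls.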

Imports (Thursday)

Imports are also relatively straightforward and take place on Thursdays. The parameters (just the source id(s)) and the organisation of the calls are very similar to those for harvests.

Time   Target              Call
09:00  BioLINCC            ...\DataImporter.exe -s "101900"
09:20  Yoda                ...\DataImporter.exe -s "101901"
09:40  ISRCTN              ...\DataImporter.exe -s "100126"
10:00  EUCTR               ...\DataImporter.exe -s "100123"
11:00  ClinicalTrials.gov  ...\DataImporter.exe -s "100120"
12:00  WHO A               ...\DataImporter.exe -s "100116, 100117, 100118, 100119"
12:30  WHO B               ...\DataImporter.exe -s "100121, 100122, 100124, 100125"
13:00  WHO C               ...\DataImporter.exe -s "100127, 100128, 100129, 100130, 100131, 100132, 101989"

Processing PubMed data (Friday)

Processing of the PubMed data is best done after all the other (study based) sources have been imported, because one of the two mechanisms for identifying relevant PubMed records uses references inside other source databases. It is therefore scheduled for Friday, and runs through all aspects of the extraction process, including two initial downloads, as shown below.

Time   Target  Call
09:00  PubMed  ...\DataDownloader.exe -s 100135 -t 114 -q 10003
10:30  PubMed  ...\DataDownloader.exe -s 100135 -t 114 -q 10004
17:00  PubMed  ...\DataHarvester.exe -s "100135" -t 2
18:00  PubMed  ...\DataImporter.exe -s "100135"

Aggregating data (Sunday)

The final phase is the aggregation of the data, which takes place in the following sequence: a) aggregation from source databases, b) creation of new core tables, c) creation of statistics of the aggregation, d) creation of JSON and JSON files, and e) zipping of JSON files. The detailed schedule is as below:

Time   Target                           Call
10:00  All source specific DBs          ...\DataAggregator.exe -D
13:00  MDR central schema (st, ob, nk)  ...\DataAggregator.exe -C
14:00  All databases                    ...\DataAggregator.exe -S
14:30  MDR central schema (core)        ...\DataAggregator.exe -J -F
19:00  JSON files                       ...\FileZipper.exe -J
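Because each aggregation step depends on the one before it, the sequence above could also be expressed as a chained run instead of fixed start times. The sketch below is an illustration only and not how the schedule is actually implemented: stopping at the first failed step is an assumption, the step labels are matched to the calls simply by their order in the table, and the executable paths keep the elided `...` prefix, so the default runner would not work without real paths.

```python
import subprocess

# (label, argument list) in the order of the Sunday schedule; paths are elided as in the table.
AGGREGATION_STEPS = [
    ("aggregate from source databases", [r"...\DataAggregator.exe", "-D"]),
    ("create new core tables", [r"...\DataAggregator.exe", "-C"]),
    ("create aggregation statistics", [r"...\DataAggregator.exe", "-S"]),
    ("create JSON and JSON files", [r"...\DataAggregator.exe", "-J", "-F"]),
    ("zip JSON files", [r"...\FileZipper.exe", "-J"]),
]

def run_in_sequence(steps, runner=None):
    """Run steps in order and stop at the first non-zero exit code.

    runner takes an argument list and returns an exit code; by default it
    calls subprocess.run. Returns the labels of the steps that completed.
    """
    if runner is None:
        runner = lambda args: subprocess.run(args).returncode
    completed = []
    for label, args in steps:
        if runner(args) != 0:
            break  # assumed policy: later steps need this one's output
        completed.append(label)
    return completed
```

Passing a stand-in `runner` allows the ordering logic to be tried out without calling the real executables.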