System Scheduling
Revision as of 17:59, 13 January 2021
Introduction
The various data extraction and data aggregation processes are all scheduled on a weekly basis. Each is run by a console application, which is invoked with the relevant parameters to carry out the specified operation on the target source. Windows Task Scheduler is used to control the scheduling. Details of the scheduling will vary as more sources are added, but the situation as of January 2021 is described below. Particular days of the week are each assigned to a particular task, with a few spare days available for catch-up.
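As an illustration of how one weekly slot could be registered from a script, the sketch below builds the argument list for the Windows `schtasks /Create` command. The page does not say how the tasks were actually created (they may well have been set up through the Task Scheduler interface), so this is an assumption. The task name is hypothetical, and the executable path is left elided exactly as it appears in the schedule tables.

```python
from typing import List

# Day codes accepted by the schtasks /D switch for weekly schedules.
VALID_DAYS = {"MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN"}


def weekly_task_args(task_name: str, day: str, start_time: str, command: str) -> List[str]:
    """Build the argument list for a weekly 'schtasks /Create' call.

    Nothing is executed here; the caller decides whether to pass the
    list to subprocess.run on a Windows machine.
    """
    day = day.upper()
    if day not in VALID_DAYS:
        raise ValueError(f"unknown day code: {day!r}")
    hours, minutes = (int(part) for part in start_time.split(":"))
    if not (0 <= hours <= 23 and 0 <= minutes <= 59):
        raise ValueError(f"invalid start time: {start_time!r}")
    return [
        "schtasks", "/Create",
        "/TN", task_name,                      # task name (hypothetical)
        "/SC", "WEEKLY",                       # weekly schedule
        "/D", day,                             # day of the week
        "/ST", f"{hours:02d}:{minutes:02d}",   # 24-hour start time
        "/TR", command,                        # command line to run
    ]


if __name__ == "__main__":
    # The "..." in the path is elided in the source and is kept as-is.
    print(weekly_task_args(
        "MDR download Yoda", "MON", "07:00",
        r"...\DataDownloader.exe -s 101901 -t 102",
    ))
```

Building the list separately from running it keeps the sketch testable on any platform and makes it easy to review the generated commands before anything is registered.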
Downloads (Monday)
Most of the downloads are scheduled for Monday. Because they take place on a weekly basis, most can be done quite quickly. The main exception is the EUCTR download, because EUCTR does not provide a 'date last revised' field. The current schedule is as follows, with the relevant parameters given:
Time | Target | Call |
---|---|---|
07:00 | Yoda | ...\DataDownloader.exe -s 101901 -t 102 |
07:30 | BioLINCC | ...\DataDownloader.exe -s 101900 -t 102 |
09:00 | ClinicalTrials.gov | ...\DataDownloader.exe -s 100120 -t 111 |
11:00 | ISRCTN | ...\DataDownloader.exe -s 100126 -t 112 |
13:00 | WHO | ...\DataDownloader.exe -s 100115 -t 113 -f "C:\MDR_sources\WHO\<file name>.csv" |
14:00 | EUCTR | ...\DataDownloader.exe -s 100123 -t 142 |
Note that the WHO 'download' involves processing a specific named file (with a different date-stamp each week), and the file name must therefore be manually inserted into the call each week.
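The manual step for the WHO call could in principle be reduced by picking the file up programmatically. The sketch below is only an illustration: it assumes that this week's file is the most recently modified `.csv` in the WHO folder, which the page does not state, and the function names are hypothetical. The `-s 100115 -t 113 -f` parameters are taken from the schedule above.

```python
from pathlib import Path
from typing import List


def latest_csv(folder: Path) -> Path:
    """Return the most recently modified .csv file in the folder."""
    candidates = [
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".csv"
    ]
    if not candidates:
        raise FileNotFoundError(f"no .csv files found in {folder}")
    # Assumption: the newest file by modification time is this week's file.
    return max(candidates, key=lambda p: p.stat().st_mtime)


def who_download_args(downloader_exe: str, csv_file: Path) -> List[str]:
    """Build the WHO call from the schedule: -s 100115 -t 113 -f <file>."""
    return [downloader_exe, "-s", "100115", "-t", "113", "-f", str(csv_file)]
```

A call such as `who_download_args(exe_path, latest_csv(Path(r"C:\MDR_sources\WHO")))` would then produce the argument list for that week, where `exe_path` stands for the downloader's full path, which is elided on this page.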
The EUCTR download is placed last because it is by far the longest-running and also the one most prone to errors, apparently because of intermittent problems and maintenance work on the web site. It may therefore need to be re-run the following day.
Harvests (Wednesday)
Harvests are relatively straightforward and are scheduled for Wednesdays. The ids for the sources are string arrays and therefore enclosed in quotes, even when there is only one of them. (The Downloader exe, by contrast, only expects a single integer source.)
Time | Target | Call |
---|---|---|
07:00 | BioLINCC | ...\DataHarvester.exe -s "101900" -t 1 |
07:30 | Yoda | ...\DataHarvester.exe -s "101901" -t 1 |
08:00 | ClinicalTrials.gov | ...\DataHarvester.exe -s "100120" -t 2 |
09:00 | ISRCTN | ...\DataHarvester.exe -s "100126" -t 2 |
10:00 | EUCTR | ...\DataHarvester.exe -s "100123" -t 2 |
13:00 | WHO A | ...\DataHarvester.exe -s "100116, 100117, 100118, 100119" -t 2 |
13:00 | WHO B | ...\DataHarvester.exe -s "100121, 100122, 100124, 100125" -t 2 |
13:00 | WHO C | ...\DataHarvester.exe -s "100127, 100128, 100129, 100130, 100131, 100132, 101989" -t 2 |
The WHO A, B and C harvests refer to different collections of WHO registries. A processes sources 100116 (the Australia / New Zealand registry), 100117 (the Brazilian registry), 100118 (the Chinese registry) and 100119 (the South Korean registry). B processes 100121 (the Indian registry), 100122 (the Cuban registry), 100124 (the German DRKS registry) and 100125 (the Iranian registry). C processes the rest: 100127 through to 100132 (the registries in Japan, Africa, Peru, Sri Lanka, Thailand and the Netherlands respectively), plus 101989 (the Lebanese registry).
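Because the source ids are passed as a single quoted, comma-separated string, a mistyped id is easy to miss. The following is a minimal sketch of how such a string might be split and sanity-checked before a call is scheduled. How the harvester itself parses the parameter is not described here, and the six-digit check is only an assumption drawn from the ids listed on this page; both function names are hypothetical.

```python
from typing import List


def parse_source_ids(raw: str) -> List[int]:
    """Split a comma-separated id string such as "100116, 100117" into integers."""
    ids = []
    for part in raw.split(","):
        text = part.strip()
        if not text.isdigit():
            raise ValueError(f"not a numeric source id: {text!r}")
        ids.append(int(text))
    return ids


def suspicious_ids(ids: List[int], expected_digits: int = 6) -> List[int]:
    """Return ids whose digit count differs from expected_digits.

    Assumption: every source id has six digits, as all ids on this page do.
    """
    return [i for i in ids if len(str(i)) != expected_digits]
```

Running `suspicious_ids(parse_source_ids(...))` over each `-s` value before registering the tasks would flag an id with a missing or extra digit.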
Imports (Thursday)
Imports are also relatively straightforward, taking place on Thursdays. The parameters (just the source id(s)) and the organisation of the calls are very similar to those for harvests.
Time | Target | Call |
---|---|---|
09:00 | BioLINCC | ...\DataImporter.exe -s "101900" |
09:20 | Yoda | ...\DataImporter.exe -s "101901" |
09:40 | ISRCTN | ...\DataImporter.exe -s "100126" |
10:00 | EUCTR | ...\DataImporter.exe -s "100123" |
11:00 | ClinicalTrials.gov | ...\DataImporter.exe -s "100120" |
12:00 | WHO A | ...\DataImporter.exe -s "100116, 100117, 100118, 100119" |
12:30 | WHO B | ...\DataImporter.exe -s "100121, 100122, 100124, 100125" |
13:00 | WHO C | ...\DataImporter.exe -s "100127, 100128, 100129, 100130, 100131, 100132, 101989" |
Processing PubMed data (Friday)
Processing of the PubMed data is best done after all the other (study based) sources have been imported, because one of the two mechanisms for identifying relevant PubMed records uses references inside other source databases. It is therefore scheduled for Friday, and runs through all aspects of the extraction process, including two initial downloads, as shown below.
Time | Target | Call |
---|---|---|
09:00 | PubMed | ...\DataDownloader.exe -s 100135 -t 114 -q 10003 |
10:30 | PubMed | ...\DataDownloader.exe -s 100135 -t 114 -q 10004 |
17:00 | PubMed | ...\DataHarvester.exe -s "100135" -t 2 |
18:00 | PubMed | ...\DataImporter.exe -s "100135" |
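
The Friday calls depend on one another: both downloads come before the harvest, and the harvest before the import. The schedule above enforces this with clock times. Purely as an illustration of the ordering, the sketch below runs the same four calls one after another and stops at the first failure. It assumes the executables return a non-zero exit code on failure, which the page does not confirm; `run_in_order` is a hypothetical helper, and the executable paths are left elided as in the table.

```python
import subprocess
from typing import Callable, List, Sequence

# The four Friday calls, in schedule order. The "..." in each path is
# elided in the source and is deliberately kept as-is.
PUBMED_STEPS: List[List[str]] = [
    [r"...\DataDownloader.exe", "-s", "100135", "-t", "114", "-q", "10003"],
    [r"...\DataDownloader.exe", "-s", "100135", "-t", "114", "-q", "10004"],
    [r"...\DataHarvester.exe", "-s", "100135", "-t", "2"],
    [r"...\DataImporter.exe", "-s", "100135"],
]


def run_in_order(
    steps: Sequence[Sequence[str]],
    run: Callable[[Sequence[str]], int] = subprocess.call,
) -> int:
    """Run each step in turn and stop at the first non-zero exit code.

    Returns the number of steps that completed successfully.
    Assumption: a non-zero exit code signals failure.
    """
    completed = 0
    for args in steps:
        if run(args) != 0:
            break
        completed += 1
    return completed
```

Passing the runner in as a parameter means the ordering logic can be exercised with a stand-in function, without launching the real executables.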
Aggregating data (Sunday)
The final phase is the aggregation of the data, which takes place in the following sequence: a) aggregation from source databases, b) creation of new core tables, c) creation of statistics of the aggregation, d) creation of JSON files, and e) zipping of JSON files. The detailed schedule is as below: