Downloading Data
Role of the Downloading process
The functioning of the mdr begins with the creation of a local copy of all the source data. A folder is set up to receive the data, one per source, and the data download process adds files to that folder, or replaces existing files with newer versions. Download events are self-contained and can take place independently of any further processing. The local copy of the source data simply grows, and / or is kept up to date, with successive download events. At any point in time the folder therefore holds *all* the data relevant to the mdr from its source. Because the basic details of each file, including the date and time of its download, are recorded in the monitoring database, later processing stages can select subsets of files from that data store.
N.B. The download code can be found at https://github.com/ecrin-github/DataDownloader
Downloading methods and options
The data sources are trial registries and data repositories, and the mechanisms for obtaining data include
- downloading XML files directly from a source's API (e.g. for ClinicalTrials.gov, PubMed)
- scraping web pages and generating the XML files from the data obtained (e.g. for ISRCTN, EUCTR, Yoda, BioLINCC)
- downloading CSV files and converting the data into XML files (e.g. for WHO ICTRP data).
It is also sometimes useful to
- search for and identify records in a source that meet some criteria, creating source data records but not downloading the data until later.
The format of the XML files created varies from source to source, but the files represent the initial stage in the process of converting the source data into a consistent schema.
Various options are available for each of the three main download methods. For each, the program can download
- all the source data, thus replacing all local files (the case for relatively small sources such as BioLINCC and Yoda)
- all the source data packages that have been revised or added after a certain date (the more common pattern, where a revision date is available)
- all the source data packages that have not been previously marked as 'complete' - which is what happens with sources where the revision date is not exposed (e.g. EUCTR)
- all the source data packages that meet some filtering criteria
- the source data identified by one or more previous searches
- data identified by a combination of two of the methods above, for example data that has been added or revised since a certain date AND which meets certain filter criteria.
An example of the final type of download is provided by PubMed data, where the download is often of data that has been revised or added since the previous download, but which also includes 'bank ids' in the record that have been assigned by a trial registry.
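As an illustration, this kind of combined fetch can be expressed as a single query against the NCBI ESearch API, restricting the search both to a 'last modified' date range and to records that carry ClinicalTrials.gov 'bank ids'. The sketch below is illustrative only - the helper class, the retmax value and the exact query string are assumptions, and the query actually used by the downloader may differ.

```csharp
using System;

public static class PubMedSearchExample
{
    // Illustrative only - builds an NCBI ESearch URL that combines a 'modified since'
    // date range with a filter for records holding ClinicalTrials.gov bank ids.
    public static string BuildSearchUrl(DateTime cutoffDate)
    {
        string mindate = cutoffDate.ToString("yyyy/MM/dd");
        string maxdate = DateTime.Now.ToString("yyyy/MM/dd");
        return "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
             + "?db=pubmed"
             + "&term=clinicaltrials.gov[si]"   // records with a CT.gov databank id
             + "&datetype=mdat"                 // filter on last modification date
             + "&mindate=" + mindate
             + "&maxdate=" + maxdate
             + "&retmax=1000";
    }
}
```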
The combination of the download method and selection criteria generates a matrix of different download types, each of which has an integer id. These are shown in the table below.
Selection Criteria | API | web scraping | file download | Search |
---|---|---|---|---|
All records | 101 | 102 | 103 | 201 |
Revised on or after cut off date | 111 | 112 | 113 | 202 |
Filtered AND >= cut-off date | 114 | - | - | 204 |
Meets filter criteria | 121 | 122 | 123 | 203 |
As identified in prior search(es) | 131 | 132 | 133 | - |
Assumed incomplete | 141 | 142 | 143 | - |
There is an additional type (205), where data is identified by processing data already held in the MDR. This applies at the moment only to the aggregation of 'study references', i.e. lists of PubMed identifiers, which are held by some study databases.
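For reference, the sketch below simply mirrors the matrix above as a lookup from the integer type id to a description. The class and property names are illustrative only and are not part of the downloader itself, which holds these types in its monitoring database.

```csharp
using System.Collections.Generic;

public static class FetchTypes
{
    // Illustrative only - a lookup of the download / search type ids listed above.
    public static readonly Dictionary<int, string> Descriptions = new Dictionary<int, string>
    {
        { 101, "API - all records" },
        { 102, "Web scraping - all records" },
        { 103, "File download - all records" },
        { 111, "API - revised on or after cut-off date" },
        { 112, "Web scraping - revised on or after cut-off date" },
        { 113, "File download - revised on or after cut-off date" },
        { 114, "API - filtered AND revised on or after cut-off date" },
        { 121, "API - meets filter criteria" },
        { 122, "Web scraping - meets filter criteria" },
        { 123, "File download - meets filter criteria" },
        { 131, "API - as identified in prior search(es)" },
        { 132, "Web scraping - as identified in prior search(es)" },
        { 133, "File download - as identified in prior search(es)" },
        { 141, "API - assumed incomplete" },
        { 142, "Web scraping - assumed incomplete" },
        { 143, "File download - assumed incomplete" },
        { 201, "Search - all records" },
        { 202, "Search - revised on or after cut-off date" },
        { 203, "Search - meets filter criteria" },
        { 204, "Search - filtered AND revised on or after cut-off date" },
        { 205, "Search - identified from data already held in the MDR" }
    };
}
```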
Parameters and their processing
The download program, like the other three main modules in the extraction system, is a console app that makes use of a command line interpreter. The interpreter processes the parameters attached to the console command, checks that they are in the correct form and complete - the exact parameters required will depend upon the type parameter - and passes them into the program as properties of the main downloader class.
The system takes the following parameters:
-s, followed by an integer: the id of the source to be downloaded, e.g. 100120 = ClinicalTrials.gov.
-t, followed by an integer: the id of the type of fetch (or sometimes search) - see below for more details.
-f, followed by a file path: the path should be that of the source file for those download types that require it.
-d, followed by a date, in the ISO yyyy-MM-dd format: the date should be the cut-off date for those download data types that require one.
-q, followed by an integer: the integer id of a listed query that filters the selected data against specified criteria
-p, followed by a string of comma delimited integers: the ids of the previous searches that should be used as the basis of this download.
-L: a flag indicating that no logging should take place. Useful in some testing and development scenarios.
Thus, a parameter string such as
-s 100120 -t 111 -d 2020-09-23
will cause the system to download files from ClinicalTrials.gov that have been revised or added since 23 September 2020 (the ClinicalTrials.gov API allows this sort of call to be made). The parameters
-s 100135 -t 114 -d 2020-07-14 -q 100003
would cause the system to download files from PubMed that have been revised since the 14th July and which also contain references to clinical trial registry ids, while the string
-s 100115 -t 113 -f "C:\data\who\update 20200813.csv"
would cause the system to update the WHO linked data sources with data from the named csv file. The parameter strings:
-s 100126 -t 202 -d 2020-06-12
-s 100126 -t 132 -p 100054
would first cause the data in ISRCTN that had been added or revised since the 12th of June to be identified (that search having an id of 100054), and then cause that data to be downloaded, as a separate process. The second process does not need to be run immediately after the first.
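As a sketch of how such parameter strings might be parsed and checked, the snippet below uses the CommandLineParser NuGet package. The option class, property names and the validation rule shown are illustrative assumptions rather than the actual DataDownloader code, but they follow the parameter definitions listed above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using CommandLine;

public class Options
{
    [Option('s', Required = true, HelpText = "Id of the source, e.g. 100120 = ClinicalTrials.gov.")]
    public int SourceId { get; set; }

    [Option('t', Required = true, HelpText = "Id of the type of fetch (or search).")]
    public int FetchTypeId { get; set; }

    [Option('f', Required = false, HelpText = "Path of the source file, for types that need it.")]
    public string FilePath { get; set; }

    [Option('d', Required = false, HelpText = "Cut-off date, ISO yyyy-MM-dd, for types that need it.")]
    public string CutoffDate { get; set; }

    [Option('q', Required = false, HelpText = "Id of a listed filtering query.")]
    public int? FilterId { get; set; }

    [Option('p', Required = false, Separator = ',', HelpText = "Ids of previous searches.")]
    public IEnumerable<int> PreviousSearchIds { get; set; }

    [Option('L', Required = false, HelpText = "Suppress logging.")]
    public bool NoLogging { get; set; }
}

public class Program
{
    public static int Main(string[] args)
    {
        // e.g. args = { "-s", "100120", "-t", "111", "-d", "2020-09-23" }
        return Parser.Default.ParseArguments<Options>(args)
                     .MapResult(opts => RunDownload(opts), errs => -1);
    }

    private static int RunDownload(Options opts)
    {
        // Which parameters are genuinely required depends on the fetch type,
        // e.g. a cut-off date for types 111 - 114.
        DateTime? cutoff = null;
        if (opts.CutoffDate != null
            && DateTime.TryParseExact(opts.CutoffDate, "yyyy-MM-dd",
                   CultureInfo.InvariantCulture, DateTimeStyles.None, out DateTime dt))
        {
            cutoff = dt;
        }
        if (opts.FetchTypeId >= 111 && opts.FetchTypeId <= 114 && cutoff == null)
        {
            Console.WriteLine("A valid cut-off date (-d) is required for this fetch type.");
            return -1;
        }
        // ... pass the checked parameters on to the main downloader class.
        return 0;
    }
}
```

Checking the parameter combinations up front in this way means that an invalid call fails immediately, rather than part-way through a potentially lengthy download.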
Overview of the various download processes
The details of any download process depend not only upon the fetch / search type specified and the other parameters - the process is also highly source dependent.
Direct download using an API
The simplest downloads are those that simply take pre-existing XML files using API calls. This is the case for both ClinicalTrials.gov and PubMed. The data are not transformed in any way at this stage - simply downloaded and stored for later harvesting. The process usually consists of an initial call to a search API, to see how many files will need to be downloaded, followed by successive calls to the same API, each one generating a dataset to be iterated over, with an XML file generated for each study source data package. The relevant code for ClinicalTrials.gov is shown below. The cut-off date is first translated into three parameters - year, month and day - that can be inserted into the query string as cut_off_params. The API call asks for full study data that has a last revision date on or after the cut-off date, and asks for that data 20 studies at a time (using the min_rank and max_rank parameters). An initial call is used simply to find the total number of records to be downloaded (in num_found_string), and this is then used to calculate the number of times the main loop has to be called (loop_count).
```csharp
if (cutoff_date != null)
{
    cutoff_date = (DateTime)cutoff_date;
    string year = cutoff_date.Value.Year.ToString();
    string month = cutoff_date.Value.Month.ToString("00");
    string day = cutoff_date.Value.Day.ToString("00");

    int min_rank = 1;
    int max_rank = 20;
    string start_url = "https://clinicaltrials.gov/api/query/full_studies?expr=AREA%5BLastUpdatePostDate%5DRANGE%5B";
    string cut_off_params = month + "%2F" + day + "%2F" + year;
    string end_url = "%2C+MAX%5D&min_rnk=" + min_rank.ToString() + "&max_rnk=" + max_rank.ToString() + "&fmt=xml";
    string url = start_url + cut_off_params + end_url;

    // Do initial search
    string responseBody = await webClient.GetStringAsync(url);
    XmlDocument xdoc = new XmlDocument();
    xdoc.LoadXml(responseBody);
    var num_found_string = xdoc.GetElementsByTagName("NStudiesFound")[0].InnerText;

    if (Int32.TryParse(num_found_string, out int record_count))
    {
        // Then go through the identified records 20 at a time
        int loop_count = record_count % 20 == 0 ? record_count / 20 : (record_count / 20) + 1;
        ...
```
The program then obtains the XML 20 studies at a time, splits the returned data into the constituent 'full_study' nodes, and writes each full_study to a separate file. Before doing so it checks the data to obtain the last revised date. A record of the download is incorporated into the source_data_studies table, either as a new record or as an edit of an existing record. Note that too many API calls in quick succession can lead to access being blocked (the hosts suspect a denial of service attack), so the download process automatically pauses for 800 ms between calls. This greatly increases the time taken but is necessary to prevent calls being blocked.
```csharp
for (int i = 0; i < loop_count; i++)
{
    System.Threading.Thread.Sleep(800);
    min_rank = (i * 20) + 1;
    max_rank = (i * 20) + 20;
    end_url = "%2C+MAX%5D&min_rnk=" + min_rank.ToString() + "&max_rnk=" + max_rank.ToString() + "&fmt=xml";
    url = start_url + cut_off_params + end_url;

    responseBody = await webClient.GetStringAsync(url);
    xdoc.LoadXml(responseBody);
    XmlNodeList full_studies = xdoc.GetElementsByTagName("FullStudy");

    // Write each record in turn and update table in mon DB.
    foreach (XmlNode full_study in full_studies)
    {
        // Obtain basic information from the file - enough for
        // the details to be filed in source_study_data table.
        res.num_checked++;
        ctg_basics st = processor.ObtainBasicDetails(full_study);

        // Then write out file.
        string folder_path = file_base + st.file_path;
        if (!Directory.Exists(folder_path))
        {
            Directory.CreateDirectory(folder_path);
        }
        string full_path = Path.Combine(folder_path, st.file_name);
        XmlDocument filedoc = new XmlDocument();
        filedoc.LoadXml(full_study.OuterXml);
        filedoc.Save(full_path);

        // Record details of updated or new record in source_study_data.
        bool added = logging_repo.UpdateStudyDownloadLog(source.id, st.sd_sid, st.remote_url,
                                                         saf_id, st.last_updated, full_path);
        res.num_downloaded++;
        if (added) res.num_added++;
    }
    ...
}
```
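The UpdateStudyDownloadLog call in the loop above is what records each file in the monitoring database. A minimal sketch of that kind of 'insert or update' logic is shown below, assuming a PostgreSQL monitoring database accessed via Npgsql and Dapper; the table name, column names and connection handling are assumptions for illustration, not the actual repository code.

```csharp
using System;
using Dapper;
using Npgsql;

public class LoggingRepository
{
    private readonly string _connString;
    public LoggingRepository(string connString) { _connString = connString; }

    // Returns true if a new record was added, false if an existing one was updated.
    // Sketch only - the real table and column names may differ.
    public bool UpdateStudyDownloadLog(int sourceId, string sdSid, string remoteUrl,
                                       int safId, DateTime? lastRevised, string localPath)
    {
        using (var conn = new NpgsqlConnection(_connString))
        {
            string sql = @"SELECT id FROM source_data_studies
                           WHERE source_id = @sourceId AND sd_id = @sdSid";
            int? existingId = conn.QueryFirstOrDefault<int?>(sql, new { sourceId, sdSid });

            if (existingId == null)
            {
                sql = @"INSERT INTO source_data_studies
                        (source_id, sd_id, remote_url, last_saf_id, last_revised, local_path, last_downloaded)
                        VALUES (@sourceId, @sdSid, @remoteUrl, @safId, @lastRevised, @localPath, now())";
                conn.Execute(sql, new { sourceId, sdSid, remoteUrl, safId, lastRevised, localPath });
                return true;
            }
            else
            {
                sql = @"UPDATE source_data_studies
                        SET remote_url = @remoteUrl, last_saf_id = @safId,
                            last_revised = @lastRevised, local_path = @localPath,
                            last_downloaded = now()
                        WHERE id = @existingId";
                conn.Execute(sql, new { remoteUrl, safId, lastRevised, localPath, existingId });
                return false;
            }
        }
    }
}
```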
Using file download
The WHO ICTRP dataset is available as CSV files, with each row corresponding to a study. The download process here consists of downloading the file (usually as weekly updates) and then running through the data to generate the XML files. The XML is structured largely to match the WHO schema; the harvesting process is used later to transform it into the ECRIN schema. The WHO data is relatively 'thin', with a lot of study-related data points but relatively few associated data objects other than the trial registry entry itself. The volume of links to results entries varies enormously from one source registry to another but is in general low (and the format is inconsistent). The structure of the created study class, which is serialised into an XML file, is given below.
```csharp
public class WHORecord
{
    public int source_id { get; set; }
    public string record_date { get; set; }
    public string sd_sid { get; set; }
    public string public_title { get; set; }
    public string scientific_title { get; set; }
    public string remote_url { get; set; }
    public string public_contact_givenname { get; set; }
    public string public_contact_familyname { get; set; }
    public string public_contact_email { get; set; }
    public string public_contact_affiliation { get; set; }
    public string scientific_contact_givenname { get; set; }
    public string scientific_contact_familyname { get; set; }
    public string scientific_contact_email { get; set; }
    public string scientific_contact_affiliation { get; set; }
    public string study_type { get; set; }
    public string date_registration { get; set; }
    public string date_enrolment { get; set; }
    public string target_size { get; set; }
    public string study_status { get; set; }
    public string primary_sponsor { get; set; }
    public string secondary_sponsors { get; set; }
    public string source_support { get; set; }
    public string interventions { get; set; }
    public string agemin { get; set; }
    public string agemin_units { get; set; }
    public string agemax { get; set; }
    public string agemax_units { get; set; }
    public string gender { get; set; }
    public string inclusion_criteria { get; set; }
    public string exclusion_criteria { get; set; }
    public string primary_outcome { get; set; }
    public string secondary_outcomes { get; set; }
    public string bridging_flag { get; set; }
    public string bridged_type { get; set; }
    public string childs { get; set; }
    public string type_enrolment { get; set; }
    public string retrospective_flag { get; set; }
    public string results_actual_enrollment { get; set; }
    public string results_url_link { get; set; }
    public string results_summary { get; set; }
    public string results_date_posted { get; set; }
    public string results_date_first_publication { get; set; }
    public string results_url_protocol { get; set; }
    public string ipd_plan { get; set; }
    public string ipd_description { get; set; }
    public string results_date_completed { get; set; }
    public string results_yes_no { get; set; }
    public string folder_name { get; set; }
    public string design_string { get; set; }
    public string phase_string { get; set; }
    public List<string> country_list { get; set; }
    public List<Secondary_Id> secondary_ids { get; set; }
    public List<StudyFeature> study_features { get; set; }
    public List<StudyCondition> condition_list { get; set; }
}
```
The generated XML files are distributed to different folders according to the ultimate (trial registry) source of the data, so that the mdr sees each registry as a separate source with a separate database. This is partly to help manage a large dataset, but mostly to make it easier to substitute the WHO data (which is not only relatively sparse but also often requires extensive cleaning) with richer data scraped directly from the source registry.
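A minimal sketch of that routing and serialisation step is given below. The registry prefixes, folder names and helper class are illustrative assumptions, though the general approach of inspecting the trial id and serialising the WHORecord into a per-registry folder follows the description above.

```csharp
using System.IO;
using System.Xml.Serialization;

public class WhoFileWriter
{
    private static readonly XmlSerializer serializer = new XmlSerializer(typeof(WHORecord));

    // Sketch only - the source registry can usually be inferred from the id prefix;
    // the folder names here are assumptions.
    public static string GetFolderForRegistry(string trialId)
    {
        if (trialId.StartsWith("NCT")) return "ctg";        // ClinicalTrials.gov
        if (trialId.StartsWith("ISRCTN")) return "isrctn";  // ISRCTN
        if (trialId.StartsWith("EUCTR")) return "euctr";    // EU Clinical Trials Register
        if (trialId.StartsWith("JPRN")) return "jprn";      // Japanese registries
        return "other";                                     // remaining WHO sources
    }

    // Serialises a WHO record as XML into the folder for its source registry.
    public static void WriteRecord(WHORecord record, string fileBase)
    {
        string folder = Path.Combine(fileBase, GetFolderForRegistry(record.sd_sid));
        if (!Directory.Exists(folder)) Directory.CreateDirectory(folder);

        string filePath = Path.Combine(folder, record.sd_sid + ".xml");
        using (var writer = new StreamWriter(filePath))
        {
            serializer.Serialize(writer, record);
        }
    }
}
```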
Using web scraping
A growing group of sources require active web scraping, usually starting from a search page, going to a details page for the main data for each study, and then often to further pages with specific details. Again, the scraping process usually needs to be "throttled down" with the inclusion of automatic pauses. The relatively simple top-level loop for the NIH BioLINCC repository is shown below (there is only a single search page; in other cases the system has to walk through a succession of search pages). Each row in the search results is used to download the study-specific details, which are turned into a study class that is serialised as an XML file.
```csharp
public DownloadResult LoopThroughPages()
{
    // Get list of studies from the Biolincc start page.
    WebPage homePage = browser.NavigateToPage(new Uri("https://biolincc.nhlbi.nih.gov/studies/"));
    var study_list_table = homePage.Find("div", By.Class("table-responsive"));
    HtmlNode[] studyRows = study_list_table.CssSelect("tbody tr").ToArray();

    XmlSerializer writer = new XmlSerializer(typeof(BioLincc_Record));
    DownloadResult res = new DownloadResult();

    // Consider each study in turn.
    foreach (HtmlNode row in studyRows)
    {
        // fetch the constructed study record
        res.num_checked++;
        BioLincc_Record st = processor.GetStudyDetails(browser, biolincc_repo, res.num_checked, row);

        if (st != null)
        {
            // Write out study record as XML.
            string file_name = source.local_file_prefix + st.sd_sid + ".xml";
            string full_path = Path.Combine(file_base, file_name);
            file_writer.WriteBioLINCCFile(writer, st, full_path);

            bool added = logging_repo.UpdateStudyDownloadLog(source_id, st.sd_sid, st.remote_url,
                                                             saf_id, st.last_revised_date, full_path);
            res.num_downloaded++;
            if (added) res.num_added++;

            // Put a pause here if necessary.
            System.Threading.Thread.Sleep(1000);
        }
        ...
    }
    return res;
}
```
The richness of the data varies widely between sources: study registries tend to have mainly study data, with few linked data objects, while data repositories have more object data, linked to relatively thin study data. The structure of the study model for the BioLINCC data repository, the result of the GetStudyDetails call above, is shown below.
```csharp
public class BioLincc_Record
{
    public string sd_sid { get; set; }
    public string remote_url { get; set; }
    public string title { get; set; }
    public string acronym { get; set; }
    public int? study_type_id { get; set; }
    public string study_type { get; set; }
    public string brief_description { get; set; }
    public string study_period { get; set; }
    public string date_prepared { get; set; }
    public DateTime? page_prepared_date { get; set; }
    public string last_updated { get; set; }
    public DateTime? last_revised_date { get; set; }
    public int publication_year { get; set; }
    public string study_website { get; set; }
    public int num_clinical_trial_urls { get; set; }
    public int num_primary_pub_urls { get; set; }
    public int num_associated_papers { get; set; }
    public string resources_available { get; set; }
    public int dataset_consent_type_id { get; set; }
    public string dataset_consent_type { get; set; }
    public string dataset_consent_restrictions { get; set; }
    public List<PrimaryDoc> primary_docs { get; set; }
    public List<RegistryId> registry_ids { get; set; }
    public List<Resource> resources { get; set; }
    public List<AssocDoc> assoc_docs { get; set; }
}
```
The XML files that are created and stored by the scraping process are usually relatively straightforward to harvest into the database, as much of the preparatory work has been done during this download phase. This does mean, however, that the download code for these sources is complex, as well as being very specific to the source.
Download strategies
For large data sources the strategy is generally to use incremental downloads, on a weekly or even nightly basis, to keep the store of XML files as up to date as possible. For smaller sources it is possible, and simpler, to re-download the whole source. Unfortunately not all large sources offer an easy mechanism for identifying new or revised data. The EUCTR, for example, does not publicly display a revision date, and is also a difficult web site to scrape (attempts are frequently blocked even with large gaps between scraping calls). To save re-scraping all the data each time, which can take a few days, the assumption is made that if a study meets certain criteria - e.g. it is marked as 'completed' and has a completed results section - it is unlikely to change further. That is why one of the download options is to re-download only those files classified as 'assumed not yet completed'. The exact criteria for this status are source dependent.
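As an illustration, a completeness test of the kind described might look like the sketch below; the method and parameter names are assumptions, and the real criteria differ from source to source.

```csharp
public static class EuctrCompletionCheck
{
    // Illustrative sketch only - the exact criteria and field names are source
    // dependent; this simply mirrors the EUCTR example described in the text.
    public static bool AssumedComplete(string trialStatus, bool hasCompletedResultsSection)
    {
        // A study marked as completed, with a completed results section, is assumed
        // unlikely to change further, so it is skipped by 'assumed incomplete' downloads.
        return trialStatus != null
               && trialStatus.ToLower().Contains("completed")
               && hasCompletedResultsSection;
    }
}
```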
Even when incremental updating is relatively straightforward, however, the intention is to do a 100% download at least once a year, to ensure that the basic raw material from each registry is regularly 'rebased' to a valid state.