Identifying Links between Studies
Revision as of 16:06, 18 November 2020
Introduction
One of the issues that has to be tackled when aggregating data from different sources is that the same study can be found, and in tens of thousands of cases is found, in more than one study-based source, with a different persistent identifier in each. Partly this is because studies can be registered in more than one trial registry, especially when local regulations mandate registration of any study carried out within a particular country or region. This is especially the case with the EU, which insists that all trials involving medicinal products are registered in the EUCTR. About a third of these studies, however, are also registered in other registries, especially ClinicalTrials.gov. In addition, within a data repository, studies will usually be referenced by a local id rather than a pre-existing registry id.
Study titles cannot be relied upon to identify the same study in different source locations. A study title is often expressed slightly differently in different contexts, and cannot in any case be relied upon to be unique (even within the same source). It may be that further research will indicate how titles could be reframed (e.g. to a smaller number of keywords, expressed in a fixed order) to allow duplicate entries to be discovered using text, but for the moment the only easy way of doing this is by using the 'other identifiers' material found in the source data. Almost all sources contain this material, which usually includes any other trial registry ids (i.e. other than that used in the source registry entry), as well as ids assigned by the sponsor, funder or, sometimes, a regulatory authority.
These 'other registry ids' can be used to build up a table of study-study links which can then be used during the aggregation process to identify when duplicate studies are being added. In fact the creation of this table is always the first stage of any aggregation. This page describes this process in detail.
At the moment, with one exception, the requirement for identifying duplications only extends to studies - not data objects. The exception is provided by PubMed citations for journal papers, which can also be found multiple times in the source data, both within a single source and across multiple sources. The particular complications in processing PubMed data are described in Processing PubMed Data. The assumption for now is that other data objects are not duplicated across data sources, although this may need to be considered in the future. The rules for adding data objects are described more fully in Aggregating Data.
The Preferred Source concept
If details about a study and its attributes can be found in more than one data source, the obvious question is how should this data be merged in the final aggregated MDR database? In fact there are several aspects to this question:
- How should 'single-occurrence' details about the study be merged? (i.e. the data points that appear in the study record itself, such as study display title, study type and status, enrolment target, min and max ages etc.)
- How should study attributes (identifiers, titles, contributors, topics etc.) be merged?
- How should data objects in different sources (trial registry entries, data and document references) be merged?
- How should data object attributes be merged?
The first question is probably the most difficult. If the system allowed these basic study parameters to be merged from a variety of sources, the problem of 'precedence' would immediately arise: how could the system 'know' which source to use for each parameter when it was available in more than one? For instance, how would the system select the enrolment target from one repository rather than another? If the data is edited in one source but not another, should the most recent data always take precedence over the old, even though it may be less complete than what is already there?
Because of these issues it was decided to
- Only take the study single occurrence data items from a single data source
- Use the idea of 'preferred sources' to order the precedence of sources in a consistent fashion.
Each data source is therefore given a number that indicates its place in this ordering. The exact number is not important, only the position it gives the data source when the sources are listed in ascending order of this parameter. The number is stored within the sf.source_parameters table in the mon database, and is available as a property of the Source objects when they are retrieved from this table. During the aggregation process, the data sources are processed in order, most preferred first, down to the 'least preferred'. In general, more preferred sources have richer or at least more consistent data than less preferred sources. The most preferred source is ClinicalTrials.gov, followed (in broad terms) by registries that are extracted individually, then the various trial registries using the WHO dataset, then data repositories (e.g. Yoda, BioLINCC), and finally object-based data sources (PubMed).
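The ordering step can be illustrated with a minimal in-memory sketch. The `SourceInfo` record, its field names and the `SourceOrdering` class are hypothetical stand-ins for the real Source objects, and any ids or rating values passed in are purely illustrative; only the ascending sort reflects the text above.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a Source object read from sf.source_parameters;
// only the fields needed to show the ordering are included.
public record SourceInfo(int Id, string Name, int PreferenceRating);

public static class SourceOrdering
{
    // Lists sources most preferred first, i.e. in ascending order of
    // the preference number. Only relative position matters.
    public static List<SourceInfo> InAggregationOrder(IEnumerable<SourceInfo> sources) =>
        sources.OrderBy(s => s.PreferenceRating).ToList();
}
```

Under these assumptions, sources rated 1, 50 and 100 would be processed in that order whatever order they were read in.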
Although it cannot be guaranteed in every case, this usually means that the richer data is added to the system first, and that data coming from later, less preferred sources augments rather than replaces it. In general (and as described in more detail in Aggregating Data), during aggregation the core databases are recreated and then completely rebuilt by adding data from each source in turn. If a study does not already exist in the core system when it is 'presented' from a data source database, then it, its attributes, its associated data objects and their attributes are all added during the aggregation process. But if a study is added that already exists in the core system, because it has been added from a 'more preferred' source earlier in the process:
- The single-occurrence study details are ignored
- Study attributes are only added if they are definitely different from any that have already been added
- Data objects are added (unless, very rarely, they can be seen to be already in the system)
- Data object attributes are added if the data object records themselves are.
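The first two rules can be sketched with a small in-memory model. This is an illustration only: `CoreStudy`, `CoreMerge.Present` and the string study key are hypothetical, identifiers stand in for study attributes generally, and 'definitely different' is simplified here to an exact string comparison.

```csharp
using System.Collections.Generic;

// Hypothetical in-memory model of a study in the core system (not the real schema).
public class CoreStudy
{
    public string DisplayTitle { get; }
    public HashSet<string> Identifiers { get; } = new HashSet<string>();

    public CoreStudy(string displayTitle) => DisplayTitle = displayTitle;
}

public static class CoreMerge
{
    // Assumes sources are presented most preferred first.
    public static void Present(Dictionary<string, CoreStudy> core, string studyKey,
                               string displayTitle, IEnumerable<string> identifiers)
    {
        if (!core.TryGetValue(studyKey, out var study))
        {
            // First appearance: single-occurrence details are taken from this source.
            study = new CoreStudy(displayTitle);
            core[studyKey] = study;
        }
        // Otherwise the incoming single-occurrence details are ignored.

        // Attributes are added only if not already present (exact match in this sketch).
        foreach (var id in identifiers)
            study.Identifiers.Add(id);
    }
}
```

Presenting the same study key twice keeps the display title from the first call and leaves the identifier set as the union of both calls.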
The Preferred source idea also plays a part in establishing the study-study link data. If a study has multiple source (registry / repository) identifiers in the system, each of those identifiers, other than the 'most preferred', will need to be linked to the identifier in its 'most preferred' source. That way the system can find the first instance of the study in the system (because it will be guaranteed to always be present in the core systems), if and when a 'less preferred' set of data for the study is presented. The study-study link data therefore needs to be ordered - it should relate each less preferred identifier to its most preferred twin. Exactly how this is done is explained below.
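As a rough illustration of what 'ordered' means here, the sketch below arranges a single collected link so that the more preferred source always ends up on the 'preferred' side. The record shapes mirror the columns of the two temporary tables, but `CollectedLink`, `SortedLink`, `LinkSorter` and the rating lookup are assumptions made for this example, not the system's actual code.

```csharp
using System.Collections.Generic;

// Hypothetical row shapes mirroring the collector and sorted link tables.
public record CollectedLink(int Source1, string SdSid1, string SdSid2, int Source2);
public record SortedLink(int SourceId, string SdSid, string PreferredSdSid, int PreferredSourceId);

public static class LinkSorter
{
    // rating maps a source id to its preference number (lower = more preferred).
    public static SortedLink Sort(CollectedLink link, IReadOnlyDictionary<int, int> rating)
    {
        bool firstIsPreferred = rating[link.Source1] < rating[link.Source2];
        return firstIsPreferred
            // Swap sides so the more preferred source sits on the 'preferred' side.
            ? new SortedLink(link.Source2, link.SdSid2, link.SdSid1, link.Source1)
            // Already in the required arrangement.
            : new SortedLink(link.Source1, link.SdSid1, link.SdSid2, link.Source2);
    }
}
```

With this arrangement a link discovered from either direction produces the same sorted row, which is why a later de-duplication step can collapse the two.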
Initial Links Data Collection
public void SetUpTempLinkCollectorTable()
{
    using (var conn = new NpgsqlConnection(connString))
    {
        string sql_string = @"DROP TABLE IF EXISTS nk.temp_study_links_collector;
            CREATE TABLE nk.temp_study_links_collector(
              source_1 int
            , sd_sid_1 varchar
            , sd_sid_2 varchar
            , source_2 int) ";
        conn.Execute(sql_string);
    }
}

public void SetUpTempLinkSortedTable()
{
    using (var conn = new NpgsqlConnection(connString))
    {
        string sql_string = @"DROP TABLE IF EXISTS nk.temp_study_links_sorted;
            CREATE TABLE nk.temp_study_links_sorted(
              source_id int
            , sd_sid varchar
            , preferred_sd_sid varchar
            , preferred_source_id int) ";
        conn.Execute(sql_string);
    }
}

public IEnumerable<StudyLink> FetchLinks(int source_id, string database_name)
{
    string conn_string = repo.GetConnString(database_name);

    using (var conn = new NpgsqlConnection(conn_string))
    {
        // Select the 'other registry id' identifiers (type 11) that point
        // at a different registry from the one being processed.
        string sql_string = @"select " + source_id.ToString() + @" as source_1,
            sd_sid as sd_sid_1,
            identifier_value as sd_sid_2,
            identifier_org_id as source_2
            from ad.study_identifiers
            where identifier_type_id = 11
            and identifier_org_id > 100115
            and (identifier_org_id < 100133 or identifier_org_id = 101989)
            and identifier_org_id <> " + source_id.ToString();
        return conn.Query<StudyLink>(sql_string);
    }
}

public ulong StoreLinksInTempTable(PostgreSQLCopyHelper<StudyLink> copyHelper, IEnumerable<StudyLink> entities)
{
    using (var conn = new NpgsqlConnection(connString))
    {
        conn.Open();
        return copyHelper.SaveAll(conn, entities);
    }
}
public void TidyIds1()
{
    string sql_string = "";
    using (var conn = new NpgsqlConnection(connString))
    {
        // remove universal trial numbers, which are not registry ids
        sql_string = @"DELETE from nk.temp_study_links_collector
            where sd_sid_2 ilike 'U1111%' or sd_sid_2 ilike 'UTRN%'";
        conn.Execute(sql_string);

        // replace n dashes
        sql_string = @"UPDATE nk.temp_study_links_collector
            set sd_sid_2 = replace(sd_sid_2, '–', '-');";
        conn.Execute(sql_string);

        // add the missing 'ACTRN' prefix to bare 14 character ids
        sql_string = @"UPDATE nk.temp_study_links_collector
            SET sd_sid_2 = 'ACTRN' || sd_sid_2
            WHERE source_2 = 100116
            and length(sd_sid_2) = 14";
        conn.Execute(sql_string);

        // trim anything that follows the 19 character id
        sql_string = @"UPDATE nk.temp_study_links_collector
            SET sd_sid_2 = left(sd_sid_2, 19)
            WHERE source_2 = 100116
            and length(sd_sid_2) > 19";
        conn.Execute(sql_string);

        // strip the registry name from the id (replace with an empty string)
        sql_string = @"UPDATE nk.temp_study_links_collector
            set sd_sid_2 = replace(sd_sid_2, 'Chinese Clinical Trial Register', '')
            where source_2 = 100118;";
        conn.Execute(sql_string);
    }
}
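The effect of these tidying statements on an individual identifier can be sketched in memory. The `IdTidier` class below is a hypothetical helper written for illustration; it applies the same rules in the same order as the SQL above, returning null where the SQL would delete the row. It assumes the final `replace` substitutes an empty string.

```csharp
#nullable enable
using System;

public static class IdTidier
{
    // Mirrors the tidying rules on a single sd_sid_2 value.
    // Returns null where the corresponding row would be deleted.
    public static string? Tidy(string id, int source2)
    {
        // ilike 'U1111%' / 'UTRN%': case-insensitive prefix match
        if (id.StartsWith("U1111", StringComparison.OrdinalIgnoreCase)
            || id.StartsWith("UTRN", StringComparison.OrdinalIgnoreCase))
            return null;

        // replace en dashes with ordinary hyphens
        id = id.Replace('\u2013', '-');

        // source 100116: add the prefix to bare 14 character ids
        if (source2 == 100116 && id.Length == 14)
            id = "ACTRN" + id;

        // source 100116: keep only the first 19 characters
        if (source2 == 100116 && id.Length > 19)
            id = id.Substring(0, 19);

        // source 100118: strip the registry name
        if (source2 == 100118)
            id = id.Replace("Chinese Clinical Trial Register", "");

        return id;
    }
}
```

For example, the made-up value `12345678901234` with source 100116 becomes `ACTRN12345678901234`, while any value beginning `U1111` is dropped.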
Data Processing to remove duplicates
public void CreateDistinctSourceLinksTable()
{
    // The nk.temp_study_links_sorted table will have
    // many duplicates... create a distinct version of the data

    using (var conn = new NpgsqlConnection(connString))
    {
        string sql_string = @"DROP TABLE IF EXISTS nk.temp_distinct_links;
            CREATE TABLE nk.temp_distinct_links
            as SELECT distinct source_id, sd_sid, preferred_sd_sid, preferred_source_id
            FROM nk.temp_study_links_sorted";

        conn.Execute(sql_string);
    }
}
Exclusion of One-to-Many links
// One set of relationships are not 'same study in a different registry'
// but multiple studies in a different registry.
// Such studies have a study relationship rather than being straight equivalents.
// There can be multiple studies in the 'preferred' registry
// or in the existing studies registry - each group being equivalent to
// a registry entry that represents a single study, or sometimes a
// single project / programme, or grant

public void IdentifyGroupedStudies()
{
    // Set up a table to hold group definitions (i.e. the list
    // of studies in each group), which can be from the LHS or the RHS
    // of the distinct links table

    using (var conn = new NpgsqlConnection(connString))
    {
        string sql_string = @"DROP TABLE IF EXISTS nk.temp_grouping_studies;
            CREATE TABLE nk.temp_grouping_studies
            (
                source_id INT,
                sd_sid VARCHAR,
                matching_source_id INT,
                side VARCHAR
            );";
        conn.Execute(sql_string);

        // Studies of interest have more than one matching study
        // within the SAME matching source registry.
        // Therefore group on one side, plus the source_id of the other

        sql_string = @"INSERT INTO nk.temp_grouping_studies
            (source_id, sd_sid, matching_source_id, side)
            SELECT source_id, sd_sid, preferred_source_id, 'L'
            FROM nk.temp_distinct_links
            group by source_id, sd_sid, preferred_source_id
            HAVING count(sd_sid) > 1;";
        conn.Execute(sql_string);

        sql_string = @"INSERT INTO nk.temp_grouping_studies
            (source_id, sd_sid, matching_source_id, side)
            SELECT preferred_source_id, preferred_sd_sid, source_id, 'R'
            FROM nk.temp_distinct_links
            group by preferred_source_id, preferred_sd_sid, source_id
            HAVING count(preferred_sd_sid) > 1;";
        conn.Execute(sql_string);
    }
}
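The two GROUP BY ... HAVING queries can be mirrored in memory with LINQ, which may make the 'group on one side, plus the source_id of the other' idea easier to follow. The record types and the `GroupFinder` class are hypothetical, named after the table columns for this sketch only.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical row shapes mirroring temp_distinct_links and temp_grouping_studies.
public record DistinctLink(int SourceId, string SdSid, string PreferredSdSid, int PreferredSourceId);
public record GroupingStudy(int SourceId, string SdSid, int MatchingSourceId, string Side);

public static class GroupFinder
{
    public static List<GroupingStudy> Find(IReadOnlyCollection<DistinctLink> links)
    {
        // 'L': one left-hand study matched by several studies in the same preferred source.
        var left = links
            .GroupBy(k => (k.SourceId, k.SdSid, k.PreferredSourceId))
            .Where(g => g.Count() > 1)
            .Select(g => new GroupingStudy(g.Key.SourceId, g.Key.SdSid, g.Key.PreferredSourceId, "L"));

        // 'R': one preferred study matched by several studies in the same other source.
        var right = links
            .GroupBy(k => (k.PreferredSourceId, k.PreferredSdSid, k.SourceId))
            .Where(g => g.Count() > 1)
            .Select(g => new GroupingStudy(g.Key.PreferredSourceId, g.Key.PreferredSdSid, g.Key.SourceId, "R"));

        return left.Concat(right).ToList();
    }
}
```

A study linked to two different ids in the same matching source is reported as a grouping study, while a study with one link per matching source is not.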
public void ExtractGroupedStudiess()
{
    string sql_string;

    using (var conn = new NpgsqlConnection(connString))
    {
        // create a table that takes the rows from the linked
        // studies table that match the 'L' grouping studies

        // The source_id side is the group and the preferred side
        // is comprised of the grouped studies.

        sql_string = @"DROP TABLE IF EXISTS nk.temp_linked_studies;
            create table nk.temp_linked_studies as
            select k.* from
            nk.temp_distinct_links k
            inner join nk.temp_grouping_studies g
            on k.source_id = g.source_id
            and k.sd_sid = g.sd_sid
            and k.preferred_source_id = g.matching_source_id
            where g.side = 'L';";

        conn.Execute(sql_string);

        // To retain the same arrangement of grouping study on
        // the LHS the input data from the RHS has to be switched around

        sql_string = @"INSERT into nk.temp_linked_studies
            (source_id, sd_sid, preferred_source_id, preferred_sd_sid)
            select k.preferred_source_id, k.preferred_sd_sid, k.source_id, k.sd_sid
            from nk.temp_distinct_links k
            inner join nk.temp_grouping_studies g
            on k.preferred_source_id = g.source_id
            and k.preferred_sd_sid = g.sd_sid
            and k.source_id = g.matching_source_id
            where g.side = 'R'; ";

        conn.Execute(sql_string);

        // Put this data into the permanent linked_study_groups table
        // The study relationships are
        // 25 Includes target as one of a group of non-registered studies
        //    This study includes <the target study>. That study is not registered independently,
        //    but instead shares this registry entry with one or more other non-registered studies.
        // 26 Non registered but included within a registered study group
        //    This study is registered as <the target study>, along with one or more other studies
        //    that share the same registry entry and id.
        // 28 Includes target as one of a group of registered studies
        //    This study includes <the target study>, which is registered elsewhere along with one
        //    or more other registered studies, forming a group that collectively equates to this study.
        // 29 Registered and is included elsewhere in group
        //    This study is also registered, along with one or more other studies that together form an
        //    equivalent group, as <the target study>.

        sql_string = @"INSERT INTO nk.linked_study_groups
            (source_id, sd_sid, relationship_id, target_sd_sid, target_source_id)
            select distinct source_id, sd_sid,
            case when preferred_source_id = 101900
                   or preferred_source_id = 101901 then 25
                 else 28 end,
            preferred_sd_sid, preferred_source_id
            from nk.temp_linked_studies;";
        conn.Execute(sql_string);

        sql_string = @"INSERT INTO nk.linked_study_groups
            (source_id, sd_sid, relationship_id, target_sd_sid, target_source_id)
            select distinct preferred_source_id, preferred_sd_sid,
            case when source_id = 101900
                   or source_id = 101901 then 26
                 else 29 end,
            sd_sid, source_id
            from nk.temp_linked_studies;";
        conn.Execute(sql_string);
    }
}