Difference between revisions of "Identifying Links between Studies"
Revision as of 16:08, 18 November 2020
Introduction
One of the issues that has to be tackled during aggregation of data from different sources is the fact that the same study can be found, and in tens of thousands of cases is found, in more than one study-based source, and that it will have a different persistent identifier in each source. Partly this is because studies can be registered in more than one trial registry, especially when local regulations mandate a registration for any study carried out within a particular country or region. This is especially the case with the EU, which insists all trials involving medicinal products must be registered in the EUCTR. About a third of these studies, however, are also registered in other registries, especially ClinicalTrials.gov. In addition, within a data repository, studies will usually be referenced by a local id rather than a pre-existing registry id.
Study titles cannot be relied upon to identify the same study in different source locations. A study title is often expressed slightly differently in different contexts, and cannot in any case be relied upon to be unique (even within the same source). It may be that further research will indicate how titles could be reframed (e.g. to a smaller number of keywords, expressed in a fixed order) to allow duplicate entries to be discovered using text, but for the moment the only easy way of doing this is by using the 'other identifiers' material found in the source data. Almost all sources contain this material, which usually includes any other trial registry ids (i.e. other than the one used in the source registry entry), as well as ids assigned by the sponsor, funder or, sometimes, a regulatory authority.
These 'other registry ids' can be used to build up a table of study-study links which can then be used during the aggregation process to identify when duplicate studies are being added. In fact the creation of this table is always the first stage of any aggregation. This page describes this process in detail.
At the moment, with one exception, the requirement for identifying duplications only extends to studies - not data objects. The exception is provided by PubMed citations for journal papers, which can also be found multiple times in the source data, both within a single source and across multiple sources. The particular complications in processing PubMed data are described in Processing PubMed Data. The assumption for now is that other data objects are not duplicated across data sources, although this may need to be considered in the future. The rules for adding data objects are described more fully in Aggregating Data.
The Preferred Source concept
If details about a study and its attributes can be found in more than one data source, the obvious question is how should this data be merged in the final aggregated MDR database? In fact there are several aspects to this question:
- How should 'single-occurrence' details about the study be merged? (i.e. the data points that appear in the study record itself, such as study display title, study type and status, enrolment target, min and max ages etc.)
- How should study attributes (identifiers, titles, contributors, topics etc.) be merged?
- How should data objects in different sources (trial registry entries, data and document references) be merged?
- How should data object attributes be merged?
The first question is probably the most difficult. If the system allowed these basic study parameters to be merged from a variety of sources the problem of 'precedence' immediately arises: how could the system 'know' which source to use for each parameter, if they were available in both? For instance how would the system select the enrolment target from one repository rather than another? If the data is edited in one source but not another, should the most recent data always take precedence over the old? (it may be less complete than what is already there).
Because of these issues it was decided to
- Only take the study single occurrence data items from a single data source
- Use the idea of 'preferred sources' to order the precedence of sources in a consistent fashion.
Each data source is therefore given a number that indicates its place in this ordering - the exact number is not important, only the position it gives the data source when the sources are listed in ascending order of this parameter. The number is stored within the sf.source_parameters table in the mon database, and is available as a property of the Source objects when they are retrieved from this table. During the aggregation process, the data sources are processed in order, from the most preferred down to the least preferred. In general, more preferred sources have richer or at least more consistent data than less preferred sources. The most preferred source is ClinicalTrials.gov, followed (in broad terms) by registries that are extracted individually, then the various trial registries using the WHO dataset, then data repositories (e.g. Yoda, BioLINCC), and finally object-based data sources (PubMed).
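As a minimal sketch of this ordering (not the system's own code), the fragment below sorts sources by ascending preference rating. The SourceInfo record, the ids and the rating values are all hypothetical stand-ins for the Source objects read from sf.source_parameters, and the sketch assumes C# 9 or later.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a Source object; only the two fields needed here.
public record SourceInfo(int Id, int PreferenceRating);

public static class SourceOrdering
{
    // A lower rating means 'more preferred', so an ascending sort
    // yields the order in which sources would be aggregated.
    public static List<SourceInfo> InProcessingOrder(IEnumerable<SourceInfo> sources) =>
        sources.OrderBy(s => s.PreferenceRating).ToList();
}
```

Only the relative order of the ratings matters here: the values 10, 20, 30 would give the same processing order as 1, 2, 3.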
Although it cannot be guaranteed in every case, this usually means that the richer data is added to the system first, and that data coming from later, less preferred sources augments rather than replaces it. In general (and as described in more detail in Aggregating Data), during aggregation the core databases are recreated and then completely rebuilt by adding data from each source in turn. If a study does not already exist in the core system when it is 'presented' from a data source database, then it, its attributes, its associated data objects and their attributes are all added during the aggregation process. But if a study is added that already exists in the core system, because it has been added from a 'more preferred' source earlier in the process:
- The single-occurrence study details are ignored
- Study attributes are only added if they are definitely different from any that have already been added
- Data objects are added (unless, very rarely, they can be seen to be already in the system)
- Data object attributes are added if the data object records themselves are.
The Preferred source idea also plays a part in establishing the study-study link data. If a study has multiple source (registry / repository) identifiers in the system, each of those identifiers, other than the 'most preferred', will need to be linked to the identifier in its 'most preferred' source. That way the system can find the first instance of the study in the system (because it will be guaranteed to always be present in the core systems), if and when a 'less preferred' set of data for the study is presented. The study-study link data therefore needs to be ordered - it should relate each less preferred identifier to its most preferred twin. Exactly how this is done is explained below.
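The ordering requirement can be illustrated with a small sketch, again assuming C# 9 or later. Given two identifiers for the same study and a lookup of preference ratings, it returns a link whose right hand side is the identifier from the more preferred source. The StudyRef and OrderedLink records, the Orient method and the ratings dictionary are illustrative names, not part of the system's code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types: a study identifier within a source, and an ordered link.
public record StudyRef(int SourceId, string SdSid);
public record OrderedLink(StudyRef Study, StudyRef Preferred);

public static class LinkOrdering
{
    // ratings maps source id -> preference rating (lower = more preferred).
    // Assumes the two identifiers come from different sources with different ratings.
    public static OrderedLink Orient(StudyRef a, StudyRef b,
                                     IReadOnlyDictionary<int, int> ratings)
    {
        if (a.SourceId == b.SourceId)
            throw new ArgumentException("Identifiers must come from different sources.");

        // The identifier from the lower-rated (more preferred) source goes on the right.
        return ratings[a.SourceId] < ratings[b.SourceId]
            ? new OrderedLink(b, a)
            : new OrderedLink(a, b);
    }
}
```

Whichever way round a pair is discovered in the source data, the result is the same ordered link, which is what allows a later lookup to always go from the less preferred identifier to the more preferred one.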
Initial Links Data Collection
public void SetUpTempLinkCollectorTable()
{
    using (var conn = new NpgsqlConnection(connString))
    {
        string sql_string = @"DROP TABLE IF EXISTS nk.temp_study_links_collector;
            CREATE TABLE nk.temp_study_links_collector(
                source_1 int
              , sd_sid_1 varchar
              , sd_sid_2 varchar
              , source_2 int) ";
        conn.Execute(sql_string);
    }
}

public void SetUpTempLinkSortedTable()
{
    using (var conn = new NpgsqlConnection(connString))
    {
        string sql_string = @"DROP TABLE IF EXISTS nk.temp_study_links_sorted;
            CREATE TABLE nk.temp_study_links_sorted(
                source_id int
              , sd_sid varchar
              , preferred_sd_sid varchar
              , preferred_source_id int) ";
        conn.Execute(sql_string);
    }
}

public IEnumerable<StudyLink> FetchLinks(int source_id, string database_name)
{
    string conn_string = repo.GetConnString(database_name);

    using (var conn = new NpgsqlConnection(conn_string))
    {
        string sql_string = @"select " + source_id.ToString() + @" as source_1,
            sd_sid as sd_sid_1,
            identifier_value as sd_sid_2,
            identifier_org_id as source_2
            from ad.study_identifiers
            where identifier_type_id = 11
            and identifier_org_id > 100115
            and (identifier_org_id < 100133 or identifier_org_id = 101989)
            and identifier_org_id <> " + source_id.ToString();
        return conn.Query<StudyLink>(sql_string);
    }
}

public ulong StoreLinksInTempTable(PostgreSQLCopyHelper<StudyLink> copyHelper, IEnumerable<StudyLink> entities)
{
    using (var conn = new NpgsqlConnection(connString))
    {
        conn.Open();
        return copyHelper.SaveAll(conn, entities);
    }
}
public void TidyIds1()
{
    string sql_string = "";
    using (var conn = new NpgsqlConnection(connString))
    {
        sql_string = @"DELETE from nk.temp_study_links_collector
            where sd_sid_2 ilike 'U1111%' or sd_sid_2 ilike 'UTRN%'";
        conn.Execute(sql_string);

        // replace en dashes with hyphens
        sql_string = @"UPDATE nk.temp_study_links_collector
            set sd_sid_2 = replace(sd_sid_2, '–', '-');";
        conn.Execute(sql_string);

        // add the 'ACTRN' prefix where only the 14 character number is present
        sql_string = @"UPDATE nk.temp_study_links_collector
            SET sd_sid_2 = 'ACTRN' || sd_sid_2
            WHERE source_2 = 100116
            and length(sd_sid_2) = 14";
        conn.Execute(sql_string);

        // trim anything following the 19 character id
        sql_string = @"UPDATE nk.temp_study_links_collector
            SET sd_sid_2 = left(sd_sid_2, 19)
            WHERE source_2 = 100116
            and length(sd_sid_2) > 19";
        conn.Execute(sql_string);

        // remove the registry name from the id
        sql_string = @"UPDATE nk.temp_study_links_collector
            set sd_sid_2 = replace(sd_sid_2, 'Chinese Clinical Trial Register', '')
            where source_2 = 100118;";
        conn.Execute(sql_string);
Data Processing to remove duplicates
public void CreateDistinctSourceLinksTable()
{
    // The nk.temp_study_links_sorted table will have
    // many duplicates... create a distinct version of the data

    using (var conn = new NpgsqlConnection(connString))
    {
        string sql_string = @"DROP TABLE IF EXISTS nk.temp_distinct_links;
            CREATE TABLE nk.temp_distinct_links
            as SELECT distinct source_id, sd_sid, preferred_sd_sid, preferred_source_id
            FROM nk.temp_study_links_sorted";

        conn.Execute(sql_string);
    }
}
public void ManageIncompleteLinks()
{
    // There are a set of links that may be missing, in the sense that
    // Study A is listed as being the same as Study B and Study C, but no
    // link exists between either Study B to C, or Study C to B.
    // The 'link path' is therefore broken and the B to C link needs to be added.
    // These studies have two, occasionally more, 'preferred studies', which
    // does not make sense in the system.

    // First create a table with these 'missing link' records

    // Working from the inside out, this query
    // a) gets the source id/sd_sids of the LHS of the study links table
    //    that has more than one 'preferred' study associated with it (dataset d)
    // b) takes those records and identifies the distinct preferred source ids
    //    that are linked to each RHS study (dataset a)
    // c) Further identifies the records that have more than one
    //    source referenced on the RHS (so all the linked records have the
    //    'impossible' property of having more than one preferred
    //    source / sd_sid study record (dataset agg)
    // d) joins that dataset back to the linked records table, to
    //    identify the source records that meet the criteria of
    //    having a 'missing link'

    string sql_string;
    using (var conn = new NpgsqlConnection(connString))
    {
        // First identify the studies that have more than one 'preferred' option
        // cutting across more than one source.
        // Groups have already been removed, so this should find only those
        // with the 'missing link'.

        sql_string = @"DROP TABLE IF EXISTS nk.temp_studies_with_multiple_links;
            CREATE TABLE nk.temp_studies_with_multiple_links
            as SELECT source_id, sd_sid
            from nk.temp_distinct_links
            group by source_id, sd_sid
            having count(distinct preferred_source_id) > 1;";
        conn.Execute(sql_string);

        sql_string = @"DROP TABLE IF EXISTS nk.temp_missing_links;
            CREATE TABLE nk.temp_missing_links as
            select k.source_id, r1.preference_rating as source_rating,
            k.sd_sid, k.preferred_source_id, r2.preference_rating, k.preferred_sd_sid
            from nk.temp_distinct_links k
            inner join nk.temp_studies_with_multiple_links m
            on k.source_id = m.source_id
            and k.sd_sid = m.sd_sid
            inner join nk.temp_preferences r1
            on k.source_id = r1.id
            inner join nk.temp_preferences r2
            on k.preferred_source_id = r2.id
            order by k.source_id, k.sd_sid, preferred_source_id;";
        conn.Execute(sql_string);

        // Create a further temp table that will hold the links between studies B and C,
        // which are currently both 'preferred' studies (both on the RHS of the table)
        // for any particular source id / sd_sid study.

        // This table has 6 fields - two initially populated with the source id / sd_sid,
        // to identify the record, and the next pair holding the source / sd_sid pair
        // that does NOT have the minimum source rating, i.e. the study that will need
        // to be the 'existing study' in the new link. The last pair is filled in below.

        sql_string = @"DROP TABLE IF EXISTS nk.temp_new_links;
            CREATE TABLE nk.temp_new_links as
            select m.source_id, m.sd_sid,
            m.preferred_source_id as new_source_id,
            m.preferred_sd_sid as new_sd_sid,
            0 as new_preferred_source, '' as new_preferred_sd_sid
            from nk.temp_missing_links m
            inner join
                (select source_id, sd_sid, min(preference_rating) as min_rating
                 from nk.temp_missing_links
                 group by source_id, sd_sid) mins
            on m.source_id = mins.source_id
            and m.sd_sid = mins.sd_sid
            and m.preference_rating <> mins.min_rating
            order by source_id, sd_sid;";
        conn.Execute(sql_string);

        // Update the last pair of fields in the temp_new_links table with the source / sd_sid
        // that represents the study with the minimally rated source id, i.e. the 'correct' preferred option

        sql_string = @"UPDATE nk.temp_new_links k
            SET new_preferred_source = min_set.preferred_source_id
            , new_preferred_sd_sid = min_set.preferred_sd_sid
            FROM
                (select m.* from nk.temp_missing_links m
                 INNER JOIN
                    (select source_id, sd_sid, min(preference_rating) as min_rating
                     from nk.temp_missing_links
                     group by source_id, sd_sid) mins
                 on m.source_id = mins.source_id
                 and m.sd_sid = mins.sd_sid
                 and m.preference_rating = mins.min_rating) min_set
            WHERE k.source_id = min_set.source_id
            AND k.sd_sid = min_set.sd_sid;";
        conn.Execute(sql_string);

        // Insert the new links into the distinct_links table.
        // These links will need re-processing through the
        // CascadeLinksInDistinctLinksTable() function.

        sql_string = @"INSERT INTO nk.temp_distinct_links
            (source_id, sd_sid, preferred_sd_sid, preferred_source_id)
            SELECT new_source_id, new_sd_sid, new_preferred_sd_sid, new_preferred_source
            from nk.temp_new_links;";
        conn.Execute(sql_string);

        // drop the temp tables
        sql_string = @"DROP TABLE IF EXISTS nk.temp_missing_links;
            DROP TABLE IF EXISTS nk.temp_new_links;";
        conn.Execute(sql_string);
    }
}
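The logic of ManageIncompleteLinks() can be sketched in memory, without the database, as follows. This is an illustration under stated assumptions rather than the system's implementation: the Link record, the MissingLinks class and the ratings dictionary (standing in for nk.temp_preferences) are hypothetical names, and C# 9 or later is assumed.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory equivalent of a row in nk.temp_distinct_links.
public record Link(int SourceId, string SdSid, int PreferredSourceId, string PreferredSdSid);

public static class MissingLinks
{
    // ratings maps source id -> preference rating (lower = more preferred).
    public static List<Link> Find(IEnumerable<Link> links,
                                  IReadOnlyDictionary<int, int> ratings)
    {
        var newLinks = new List<Link>();

        foreach (var group in links.GroupBy(l => (l.SourceId, l.SdSid)))
        {
            // Only studies whose 'preferred' studies span more than one source qualify.
            if (group.Select(l => l.PreferredSourceId).Distinct().Count() < 2)
                continue;

            // The preferred study from the lowest-rated source is the 'correct' one.
            var best = group.OrderBy(l => ratings[l.PreferredSourceId]).First();
            int minRating = ratings[best.PreferredSourceId];

            // Every other preferred study gets a new link to the correct one.
            foreach (var l in group.Where(l => ratings[l.PreferredSourceId] != minRating))
                newLinks.Add(new Link(l.PreferredSourceId, l.PreferredSdSid,
                                      best.PreferredSourceId, best.PreferredSdSid));
        }
        return newLinks;
    }
}
```

For a study A linked to both B and C, where C comes from the more preferred source, the sketch returns the single new link B -> C, which is the 'missing link' described in the comments above.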
public void CascadeLinksInDistinctLinksTable()
{
    using (var conn = new NpgsqlConnection(connString))
    {
        // telescope the preferred links to the most preferred
        // i.e. A -> B, B -> C becomes A -> C, B -> C
        // do as long as there remains links to be telescoped
        // (a few have to be done twice)

        string sql_string;
        int match_number = 500;  // arbitrary start number
        while (match_number > 0)
        {
            // get match number as number of link records where the rhs sd_sid
            // appears elsewhere on the left...

            sql_string = @"SELECT count(*)
                FROM nk.temp_distinct_links t1
                inner join nk.temp_distinct_links t2
                on t1.preferred_source_id = t2.source_id
                and t1.preferred_sd_sid = t2.sd_sid";

            match_number = conn.ExecuteScalar<int>(sql_string);

            if (match_number > 0)
            {
                // do the update

                sql_string = @"UPDATE nk.temp_distinct_links t1
                    SET preferred_source_id = t2.preferred_source_id,
                    preferred_sd_sid = t2.preferred_sd_sid
                    FROM nk.temp_distinct_links t2
                    WHERE t1.preferred_source_id = t2.source_id
                    AND t1.preferred_sd_sid = t2.sd_sid";

                conn.Execute(sql_string);
            }
        }

        // but in some cases the telescoped link may have already been
        // present - i.e. a study has two other reg identifiers
        // one of which will be the most preferred
        // Process above will result in duplicates in these cases
        // and these duplicates therefore need to be removed.

        sql_string = @"DROP TABLE IF EXISTS nk.temp_distinct_links2;
            CREATE TABLE nk.temp_distinct_links2
            as SELECT distinct * FROM nk.temp_distinct_links";

        conn.Execute(sql_string);

        sql_string = @"DROP TABLE IF EXISTS nk.temp_distinct_links;
            ALTER TABLE nk.temp_distinct_links2 RENAME TO temp_distinct_links;";

        conn.Execute(sql_string);
    }
}
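The telescoping step can also be sketched in memory, which may make the repeated passes and the final de-duplication easier to follow. The Link record and LinkCascade class are hypothetical names, and the maxPasses guard is an addition of this sketch (the SQL version simply loops until no matches remain, which assumes the links contain no cycles). C# 9 or later is assumed.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory equivalent of a row in nk.temp_distinct_links.
public record Link(int SourceId, string SdSid, int PreferredSourceId, string PreferredSdSid);

public static class LinkCascade
{
    public static HashSet<Link> Telescope(IEnumerable<Link> links, int maxPasses = 100)
    {
        var current = links.ToList();

        for (int pass = 0; pass < maxPasses; pass++)
        {
            // Snapshot: for each left hand study, the study it currently points to.
            // If a study has several preferred studies the last one seen is used.
            var preferredOf =
                new Dictionary<(int SourceId, string SdSid), (int SourceId, string SdSid)>();
            foreach (var l in current)
                preferredOf[(l.SourceId, l.SdSid)] = (l.PreferredSourceId, l.PreferredSdSid);

            bool changed = false;
            for (int i = 0; i < current.Count; i++)
            {
                var l = current[i];
                // If the right hand study itself appears on the left, follow it onward.
                if (preferredOf.TryGetValue((l.PreferredSourceId, l.PreferredSdSid), out var onward))
                {
                    current[i] = l with { PreferredSourceId = onward.SourceId,
                                          PreferredSdSid = onward.SdSid };
                    changed = true;
                }
            }
            if (!changed) break;
        }

        // Records compare by value, so the set drops the duplicates that
        // telescoping creates when the direct link was already present.
        return new HashSet<Link>(current);
    }
}
```

With the links A -> B, B -> C and A -> C as input, the first pass turns A -> B into a second A -> C, and the returned set contains just A -> C and B -> C.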
Exclusion of One-to-Many links
// One set of relationships is not 'same study in a different registry'
// but multiple studies in a different registry.
// Such studies have a study relationship rather than being straight equivalents.
// There can be multiple studies in the 'preferred' registry
// or in the existing studies registry - each group being equivalent to
// a registry entry that represents a single study, or sometimes a
// single project / programme, or grant

public void IdentifyGroupedStudies()
{
    // Set up a table to hold group definitions (i.e. the list
    // of studies in each group, which can be from the LHS or the RHS
    // of the distinct links table)

    using (var conn = new NpgsqlConnection(connString))
    {
        string sql_string = @"DROP TABLE IF EXISTS nk.temp_grouping_studies;
            CREATE TABLE nk.temp_grouping_studies
            (
                source_id INT,
                sd_sid VARCHAR,
                matching_source_id INT,
                side VARCHAR
            );";
        conn.Execute(sql_string);

        // Studies of interest have more than one matching study
        // within the SAME matching source registry.
        // Therefore group on one side, plus the source_id of the other

        sql_string = @"INSERT INTO nk.temp_grouping_studies
            (source_id, sd_sid, matching_source_id, side)
            SELECT source_id, sd_sid, preferred_source_id, 'L'
            FROM nk.temp_distinct_links
            group by source_id, sd_sid, preferred_source_id
            HAVING count(sd_sid) > 1;";
        conn.Execute(sql_string);

        sql_string = @"INSERT INTO nk.temp_grouping_studies
            (source_id, sd_sid, matching_source_id, side)
            SELECT preferred_source_id, preferred_sd_sid, source_id, 'R'
            FROM nk.temp_distinct_links
            group by preferred_source_id, preferred_sd_sid, source_id
            HAVING count(preferred_sd_sid) > 1;";
        conn.Execute(sql_string);
    }
}
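The two GROUP BY queries above can be mirrored in memory as a rough sketch. The Link and GroupingStudy records and the GroupDetection class are hypothetical names used only for illustration; C# 9 or later is assumed.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory equivalents of rows in nk.temp_distinct_links
// and nk.temp_grouping_studies.
public record Link(int SourceId, string SdSid, int PreferredSourceId, string PreferredSdSid);
public record GroupingStudy(int SourceId, string SdSid, int MatchingSourceId, string Side);

public static class GroupDetection
{
    public static List<GroupingStudy> Find(IReadOnlyCollection<Link> links)
    {
        // 'L': a left hand study with more than one match in the same preferred source.
        var left = links
            .GroupBy(l => (l.SourceId, l.SdSid, l.PreferredSourceId))
            .Where(g => g.Count() > 1)
            .Select(g => new GroupingStudy(g.Key.SourceId, g.Key.SdSid,
                                           g.Key.PreferredSourceId, "L"));

        // 'R': a right hand study with more than one match in the same left hand source.
        var right = links
            .GroupBy(l => (l.PreferredSourceId, l.PreferredSdSid, l.SourceId))
            .Where(g => g.Count() > 1)
            .Select(g => new GroupingStudy(g.Key.PreferredSourceId, g.Key.PreferredSdSid,
                                           g.Key.SourceId, "R"));

        return left.Concat(right).ToList();
    }
}
```

Grouping on one side of the link plus only the source id of the other side is what separates a genuine one-to-many relationship from a study that merely has identifiers in several different registries.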
public void ExtractGroupedStudiess()
{
    string sql_string;

    using (var conn = new NpgsqlConnection(connString))
    {
        // create a table that takes the rows from the linked
        // studies table that match the 'L' grouping studies

        // The source_id side is the group and the preferred side
        // is comprised of the grouped studies.

        sql_string = @"DROP TABLE IF EXISTS nk.temp_linked_studies;
            create table nk.temp_linked_studies as
            select k.* from
            nk.temp_distinct_links k
            inner join nk.temp_grouping_studies g
            on k.source_id = g.source_id
            and k.sd_sid = g.sd_sid
            and k.preferred_source_id = g.matching_source_id
            where g.side = 'L';";

        conn.Execute(sql_string);

        // To retain the same arrangement of grouping study on
        // the LHS the input data from the RHS has to be switched around

        sql_string = @"INSERT into nk.temp_linked_studies
            (source_id, sd_sid, preferred_source_id, preferred_sd_sid)
            select k.preferred_source_id, k.preferred_sd_sid, k.source_id, k.sd_sid
            from nk.temp_distinct_links k
            inner join nk.temp_grouping_studies g
            on k.preferred_source_id = g.source_id
            and k.preferred_sd_sid = g.sd_sid
            and k.source_id = g.matching_source_id
            where g.side = 'R'; ";

        conn.Execute(sql_string);

        // Put this data into the permanent linked_study_groups table
        // The study relationships are
        // 25 Includes target as one of a group of non-registered studies
        //    This study includes <the target study>. That study is not registered independently,
        //    but instead shares this registry entry with one or more other non-registered studies.
        // 26 Non registered but included within a registered study group
        //    This study is registered as <the target study>, along with one or more other studies
        //    that share the same registry entry and id.
        // 28 Includes target as one of a group of registered studies
        //    This study includes <the target study>, which is registered elsewhere along with one
        //    or more other registered studies, forming a group that collectively equates to this study.
        // 29 Registered and is included elsewhere in group
        //    This study is also registered, along with one or more other studies that together form an
        //    equivalent group, as <the target study>.

        sql_string = @"INSERT INTO nk.linked_study_groups
            (source_id, sd_sid, relationship_id, target_sd_sid, target_source_id)
            select distinct source_id, sd_sid,
            case when preferred_source_id = 101900
                 or preferred_source_id = 101901 then 25
            else 28 end,
            preferred_sd_sid, preferred_source_id
            from nk.temp_linked_studies;";
        conn.Execute(sql_string);

        sql_string = @"INSERT INTO nk.linked_study_groups
            (source_id, sd_sid, relationship_id, target_sd_sid, target_source_id)
            select distinct preferred_source_id, preferred_sd_sid,
            case when source_id = 101900
                 or source_id = 101901 then 26
            else 29 end,
            sd_sid, source_id
            from nk.temp_linked_studies;";
        conn.Execute(sql_string);
    }
}
public void DeleteGroupedStudyLinkRecords()
{
    string sql_string;
    using (var conn = new NpgsqlConnection(connString))
    {
        // Now need to delete these grouped records from the links table...

        sql_string = @"DELETE FROM nk.temp_distinct_links k
            USING nk.temp_grouping_studies g
            WHERE k.source_id = g.source_id
            and k.sd_sid = g.sd_sid
            and k.preferred_source_id = g.matching_source_id
            and g.side = 'L';";
        conn.Execute(sql_string);

        sql_string = @"DELETE FROM nk.temp_distinct_links k
            USING nk.temp_grouping_studies g
            WHERE k.preferred_source_id = g.source_id
            and k.preferred_sd_sid = g.sd_sid
            and k.source_id = g.matching_source_id
            and g.side = 'R';";
        conn.Execute(sql_string);

        sql_string = @"DROP TABLE IF EXISTS nk.temp_grouping_studies;
            DROP TABLE IF EXISTS nk.temp_linked_studies";
        conn.Execute(sql_string);
    }
}