Identifying Links between Studies
Last updated: 05/03/2022 - edit still in progress
Introduction - the problem of duplicates
One of the issues that has to be tackled when aggregating data from different sources is that the same study can be found, and in tens of thousands of cases is found, in more than one study-based source, with a different identifier in each. This is mostly because studies can be registered in more than one trial registry, especially when local regulations mandate registration of any study carried out within a particular country or region. This is especially the case with the EU, which insists that all trials involving medicinal products that are run in the EU must be registered in the EUCTR. About a third of these studies, however, are also registered in other registries, especially ClinicalTrials.gov. In addition, within a data repository, studies will usually be referenced by a local id rather than a pre-existing registry id.
Study titles cannot be relied upon to identify the same study in different source locations. A study title is often expressed slightly differently in different contexts, and cannot in any case be relied upon to be unique (even within the same source). It may be that further research will indicate how titles could be reframed (e.g. to a smaller number of lexemes, expressed in a fixed order) to allow duplicate entries to be discovered using text, but for the moment the only easy way of identifying duplicate studies is by using the 'other identifiers' material found in the source data. Almost all sources contain this material, which includes 'other trial registry' ids (i.e. ids other than the one used in the source registry entry), as well as ids assigned by the sponsor, funder or, sometimes, a regulatory authority.
These 'other registry ids' can be used to build up a table of study-study links which can then be used during the aggregation process to identify when duplicate studies are being added. In fact the creation of this table is always the first stage of any aggregation. This page describes this process in detail.
This page only describes how duplicate study entries are managed. Duplication between data objects is much rarer, with one exception - the PubMed citations for journal papers. These can also be found multiple times in the source data, both within a single source and across multiple sources. The particular complications in processing PubMed data are described in Processing PubMed Data. The way in which duplications of data objects that are not PubMed citations are managed is described more fully in Aggregating Data.
The Preferred Source concept
If details about a study and its attributes can be found in more than one data source, the obvious question is how should this data be merged in the final aggregated MDR database? In fact there are several aspects to this question:
- How should 'single-occurrence' details about the study be merged? (i.e. the data points that appear once, in the study record itself, such as study display title, study type and status, enrolment target, min and max ages etc.)
- How should study attributes (identifiers, titles, contributors, topics etc.) be merged?
- How should the data objects linked to the same study in different sources (trial registry entries, data and document references) be merged?
- How should data object attributes be merged?
The first question is probably the most difficult. If the system allowed these basic study parameters to be merged from a variety of sources the problem of precedence immediately arises: how could the system 'know' which source to use for each parameter, if it were available in more than one? For instance how would the system select the enrolment target from one repository rather than another? (There is no guarantee, unfortunately, that the same values will be listed). If the data is edited in one source but not another, should the most recent data always take precedence over the old? (It may be less complete than what is already there).
Because of these issues it was decided to:
- Only take the study single occurrence data items from a single data source
- Use the idea of 'preferred sources' to order the precedence of sources in a consistent fashion.
Each data source is therefore given a number that indicates its place in this 'order of preference' - the exact number is not important, only the position it gives the data source when the sources are listed in ascending order of this parameter. The number is stored within the sf.source_parameters table in the mon database, and is available as a property of the Source objects when they are retrieved from this table. During the aggregation process, the data sources are processed in order, most preferred first, down to the 'least preferred'. In general, more preferred sources have richer or at least more consistent data than less preferred sources. The most preferred source is ClinicalTrials.gov, followed (in broad terms) by registries that are extracted individually, then the various trial registries using the WHO dataset, and then data repositories (e.g. Yoda, BioLINCC), and finally object based data sources (PubMed).
Although it cannot be guaranteed in every case, this usually means that the richer data is added to the system first, and that data coming from later, less preferred sources augments rather than replaces it. In general (and as described in more detail in Aggregating Data), during aggregation the core databases are recreated and then completely rebuilt by adding data from each source in turn. If a study does not already exist in the core system when it is 'presented' from a data source database, then it, its attributes, its associated data objects and their attributes are all added during the aggregation process. But if a study is added that already exists in the core system, because it has been added from a 'more preferred' source earlier in the process:
- The single-occurrence study details from the new source are ignored
- Study attributes are only added if they are clearly different from any that have already been added
- Data objects are added (unless, very rarely, they can be seen to be already in the system)
- Data object attributes are added, for all the data object records that are added.
The Preferred Source idea also plays a central role in establishing and using the study-study link data. If a study has multiple source (registry / repository) identifiers in the system, then in general each of those identifiers, other than the 'most preferred', will need to be linked to the 'most preferred' identifier (i.e. the one used by the 'most preferred' source). This is the role of the nk.study_study_links table. When additional 'less preferred' data for a study is added to the system, the first most-preferred instance of the study is guaranteed to already be present in the central mdr study tables, because of the ordering of the addition. The links table is used to match and compare the new data with the existing set, and so determine how the data will be added. How the nk.study_study_links table is constructed is explained in detail below.
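The ordering and 'first source wins' rule described above can be sketched in memory as follows. This is only an illustration under stated assumptions, not the project's code: SourceInfo, StudyRecord and PreferenceSketch are hypothetical names, and the dictionary stands in for the core study tables, keyed on the most preferred identifier of each study.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the real Source and study classes.
public record SourceInfo(int Id, string Name, int PreferenceRating);
public record StudyRecord(int SourceId, string SdSid, string DisplayTitle);

public static class PreferenceSketch
{
    // A lower rating means 'more preferred', so ascending order
    // presents the most preferred source first.
    public static List<SourceInfo> InProcessingOrder(IEnumerable<SourceInfo> sources) =>
        sources.OrderBy(s => s.PreferenceRating).ToList();

    // Stores the single-occurrence details only if no earlier (more preferred)
    // source has already supplied them. Returns true if the record was stored.
    public static bool TryAddCoreDetails(
        Dictionary<string, StudyRecord> core, string preferredKey, StudyRecord incoming)
    {
        return core.TryAdd(preferredKey, incoming);
    }
}
```

Calling TryAddCoreDetails for each source in the order returned by InProcessingOrder means a later, less preferred record for the same key is simply ignored, which mirrors the rule for single-occurrence study details.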
Initial Links Data Collection
The initial step is to collect in all the 'other identifier' information from the source databases.
Two temporary tables are established in the links (nk) schema of the mdr database. One simply collects the link data found, and the second is used to hold it once it has been ordered properly.
public void SetUpTempLinkCollectorTable()
{
    using (var conn = new NpgsqlConnection(connString))
    {
        string sql_string = @"DROP TABLE IF EXISTS nk.temp_study_links_collector;
        CREATE TABLE nk.temp_study_links_collector(
            source_1 int
          , sd_sid_1 varchar
          , sd_sid_2 varchar
          , source_2 int) ";
        conn.Execute(sql_string);
    }
}

public void SetUpTempLinkSortedTable()
{
    using (var conn = new NpgsqlConnection(connString))
    {
        string sql_string = @"DROP TABLE IF EXISTS nk.temp_study_links_sorted;
        CREATE TABLE nk.temp_study_links_sorted(
            source_id int
          , sd_sid varchar
          , preferred_sd_sid varchar
          , preferred_source_id int) ";
        conn.Execute(sql_string);
    }
}
The system then iterates through all the sources, interrogating each of the associated databases in turn to bring back the 'other identifier' data, as sd_sid_2, and the id of the organisation assigning the identifier as source_2. That is matched against the source organisation (source_1) and the study's identifier sd_sid (sd_sid_1). The data is pre-filtered to include only identifiers associated with trial registries (and not those assigned by sponsors or funders, etc.). Repository data sources are included in this process but the repository identifiers themselves are not returned - so Yoda and BioLINCC (for example) identifiers appear as sd_sid_1, with a matching sd_sid_2 from a trial registry, but never as sd_sid_2. The data is brought back as an IEnumerable collection of StudyLink objects, which is then immediately stored in the temp_study_links_collector table using the PostgreSQLCopyHelper nuget package, which allows quick and easy storage of batches of data in one call. As the iteration across the sources proceeds, this table gradually grows with the records found each time.
foreach (Source s in sources)
{
    // Fetch the study-study links and store them
    // in the Collector table
    IEnumerable<StudyLink> links = slh.FetchLinks(s.id, s.database_name);
    slh.StoreLinksInTempTable(CopyHelpers.links_helper, links);
}

public IEnumerable<StudyLink> FetchLinks(int source_id, string database_name)
{
    string conn_string = repo.GetConnString(database_name);
    using (var conn = new NpgsqlConnection(conn_string))
    {
        string sql_string = @"select " + source_id.ToString() + @" as source_1,
            sd_sid as sd_sid_1,
            identifier_value as sd_sid_2,
            identifier_org_id as source_2
            from ad.study_identifiers
            where identifier_type_id = 11
            and identifier_org_id > 100115
            and (identifier_org_id < 100133 or identifier_org_id = 101989)
            and identifier_org_id <> " + source_id.ToString();
        return conn.Query<StudyLink>(sql_string);
    }
}

public ulong StoreLinksInTempTable(PostgreSQLCopyHelper<StudyLink> copyHelper,
                                   IEnumerable<StudyLink> entities)
{
    using (var conn = new NpgsqlConnection(connString))
    {
        conn.Open();
        return copyHelper.SaveAll(conn, entities);
    }
}
Cleaning the 'other Ids'
One of the difficulties in the process is that the other identifier data is rarely constrained in the source systems - it can therefore sometimes be poorly formatted or otherwise mis-entered. It is necessary to try and clean some of the data before proceeding further, using a succession of SQL statements - a few of which are shown below. In the examples shown, the SQL a) deletes WHO Universal trial numbers incorrectly entered as trial registry Ids, b) replaces 'n dashes' with ordinary hyphens, making later processing easier, c) updates entries indicating an Australian ACTRN identifier, that are not prefixed by 'ACTRN', and d) removes the redundant 'Chinese Clinical Trial Register' statement from Ids from that registry which include it. These are, however, just 4 examples from a total of over 30 'cleaning' functions.
public void TidyIds1()
{
    string sql_string = "";
    using (var conn = new NpgsqlConnection(connString))
    {
        sql_string = @"DELETE from nk.temp_study_links_collector
            where sd_sid_2 ilike 'U1111%' or sd_sid_2 ilike 'UTRN%'";
        conn.Execute(sql_string);

        // replace n dashes
        sql_string = @"UPDATE nk.temp_study_links_collector
            set sd_sid_2 = replace(sd_sid_2, '–', '-');";
        conn.Execute(sql_string);

        sql_string = @"UPDATE nk.temp_study_links_collector
            SET sd_sid_2 = 'ACTRN' || sd_sid_2
            WHERE source_2 = 100116
            and length(sd_sid_2) = 14";
        conn.Execute(sql_string);

        sql_string = @"UPDATE nk.temp_study_links_collector
            set sd_sid_2 = replace(sd_sid_2, 'Chinese Clinical Trial Register', '')
            where source_2 = 100118;";
        conn.Execute(sql_string);
        ...
        ...
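To make the effect of the four example statements easier to see, the same rules can be expressed as a single in-memory function. This is a sketch only: OtherIdCleaner and Clean are hypothetical names, the dash being replaced is assumed to be the en dash (U+2013), and the real system applies these rules as SQL updates on the collector table, together with many further cleaning steps not shown here.

```csharp
#nullable enable
using System;

public static class OtherIdCleaner
{
    // Returns the cleaned identifier, or null when the row should be dropped.
    public static string? Clean(int source2, string sdSid2)
    {
        // a) WHO Universal Trial Numbers are not registry ids
        //    (case-insensitive prefix match, like ILIKE 'U1111%' / 'UTRN%').
        if (sdSid2.StartsWith("U1111", StringComparison.OrdinalIgnoreCase) ||
            sdSid2.StartsWith("UTRN", StringComparison.OrdinalIgnoreCase))
            return null;

        // b) Replace en dashes with ordinary hyphens.
        string id = sdSid2.Replace('\u2013', '-');

        // c) 14-character ids attributed to source 100116 gain the 'ACTRN' prefix.
        if (source2 == 100116 && id.Length == 14)
            id = "ACTRN" + id;

        // d) Strip the redundant registry name from ids attributed to source 100118.
        if (source2 == 100118)
            id = id.Replace("Chinese Clinical Trial Register", "");

        return id;
    }
}
```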
Once the data is cleaned it needs to be ordered properly, so that preferred and non-preferred ids are clearly identified. This is done by adding the collected data to the nk.temp_study_links_sorted table using two insert statements. The first takes the data that already has the less preferred source/id (i.e. the one with the higher preference number) as the first two fields, and puts this data straight into the 'sorted' table. The second takes those records where the less preferred source/id is in the last two fields and swaps the pairs round before inserting the data.
public void TransferLinksToSortedTable()
{
    using (var conn = new NpgsqlConnection(_connString))
    {
        // needs to be done twice to keep the ordering of sources correct
        // A lower rating means 'more preferred' - i.e. should be used in preference
        // Therefore lower rated source data should be in the 'preferred' fields
        // and higher rated data should be on the left hand side

        // Original data matches what is required
        string sql_string = @"INSERT INTO nk.temp_study_links_sorted(
              source_id, sd_sid, preferred_sd_sid, preferred_source_id)
              SELECT t.source_1, t.sd_sid_1, t.sd_sid_2, t.source_2
              FROM nk.temp_study_links_collector t
              inner join nk.temp_preferences r1
              on t.source_1 = r1.id
              inner join nk.temp_preferences r2
              on t.source_2 = r2.id
              WHERE r1.preference_rating > r2.preference_rating";
        int res1 = conn.Execute(sql_string);

        // Original data is the opposite of what is required - therefore switch
        sql_string = @"INSERT INTO nk.temp_study_links_sorted(
              source_id, sd_sid, preferred_sd_sid, preferred_source_id)
              SELECT t.source_2, t.sd_sid_2, t.sd_sid_1, t.source_1
              FROM nk.temp_study_links_collector t
              inner join nk.temp_preferences r1
              on t.source_1 = r1.id
              inner join nk.temp_preferences r2
              on t.source_2 = r2.id
              WHERE r1.preference_rating < r2.preference_rating";
        int res2 = conn.Execute(sql_string);

        _logger.Information((res1 + res2).ToString() + " total study-study links found in source data");
    }
}
The sorted table will obviously have many duplicates - data that was entered originally as both source_id_1/sd_sid_1 <=> source_id_2/sd_sid_2, and source_id_2/sd_sid_2 <=> source_id_1/sd_sid_1, will now appear as two identical records: source_id/sd_sid <=> pref_source_id/pref_sd_sid. It is therefore necessary to generate a distinct version of the links data. The nk.temp_distinct_links table is therefore created, containing all the distinct links between studies, with both the more 'preferred' and the less 'preferred' versions of each pair identified. Note that a boolean field called valid is included in the table, with a default value of true. The nk.temp_distinct_links table is used as the main links table during both the final stages of data cleaning and the next stages of processing.
public void CreateDistinctSourceLinksTable()
{
    // The nk.temp_study_links_sorted table will have
    // many duplicates... create a distinct version of the data
    using (var conn = new NpgsqlConnection(_connString))
    {
        string sql_string = @"DROP TABLE IF EXISTS nk.temp_distinct_links;
              CREATE TABLE nk.temp_distinct_links
              as SELECT distinct source_id, sd_sid, preferred_sd_sid,
              preferred_source_id, true as valid
              FROM nk.temp_study_links_sorted";
        conn.Execute(sql_string);

        sql_string = @"SELECT COUNT(*) FROM nk.temp_distinct_links";
        int res = conn.ExecuteScalar<int>(sql_string);
        _logger.Information(res.ToString() + " distinct study-study links found");
    }
}
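The sort-and-deduplicate logic of the two preceding listings can also be followed in a small in-memory sketch. The type and method names here (RawLink, SortedLink, LinkSorter) are hypothetical, and the ratings dictionary stands in for the nk.temp_preferences table; the sketch is meant to illustrate the rule, not to replace the SQL.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public record RawLink(int Source1, string SdSid1, string SdSid2, int Source2);
public record SortedLink(int SourceId, string SdSid, string PreferredSdSid, int PreferredSourceId);

public static class LinkSorter
{
    // ratings maps source id -> preference rating (lower = more preferred).
    public static List<SortedLink> SortAndDeduplicate(
        IEnumerable<RawLink> raw, IReadOnlyDictionary<int, int> ratings)
    {
        var sorted = new List<SortedLink>();
        foreach (var l in raw)
        {
            // Like the inner joins, skip links whose sources have no rating.
            if (!ratings.TryGetValue(l.Source1, out int r1) ||
                !ratings.TryGetValue(l.Source2, out int r2))
                continue;

            if (r1 > r2)        // already less preferred -> more preferred
                sorted.Add(new SortedLink(l.Source1, l.SdSid1, l.SdSid2, l.Source2));
            else if (r1 < r2)   // wrong way round, so swap the pairs
                sorted.Add(new SortedLink(l.Source2, l.SdSid2, l.SdSid1, l.Source1));
            // Equal ratings satisfy neither condition and are dropped.
        }
        // Records compare by value, so Distinct() removes mirrored duplicates.
        return sorted.Distinct().ToList();
    }
}
```

A link recorded in both directions in the source data therefore collapses to a single row with the less preferred study on the left.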
Even though the 'other ids' are now sorted and (mostly) properly formatted they may not all be valid. There may have been errors during data entry, or some other reason why the registry Ids as entered do not correspond to a real study in the system. These invalid ids have to be removed, or the system will try to link data to non-existent studies, causing a variety of problems during the aggregation process. The system therefore loops through each of the sources with study tables, and uses a call to the source database to retrieve a list of the current study ids for that source. These are all guaranteed to be 'real' ids as they have been harvested directly from the sources as the registry ids. They are also the most up to date set of study ids known to the system, being a function of the most recent import processes.
foreach (Source source in sources)
{
    ...
    if (source.has_study_tables)
    {
        string source_conn_string = _credentials.GetConnectionString(source.database_name, _testing);
        slh.ObtainStudyIds(source.id, source_conn_string, CopyHelpers.studyids_checker);
        slh.CheckIdsAgainstSourceStudyIds(source.id);
    }
}
slh.DeleteInvalidLinks();

public void ObtainStudyIds(int source_id, string source_conn_string,
                           PostgreSQLCopyHelper<IdChecker> copyHelper)
{
    using (var conn = new NpgsqlConnection(_connString))
    {
        string sql_string = @"DROP TABLE IF EXISTS nk.temp_id_checker;
              CREATE TABLE nk.temp_id_checker
              (sd_sid VARCHAR) ";
        conn.Execute(sql_string);
    }

    IEnumerable<IdChecker> Ids;
    using (var conn = new NpgsqlConnection(source_conn_string))
    {
        string sql_string = @"select sd_sid from ad.studies";
        Ids = conn.Query<IdChecker>(sql_string);
    }

    using (var conn = new NpgsqlConnection(_connString))
    {
        conn.Open();
        copyHelper.SaveAll(conn, Ids);
    }
    ...
}
Once the study ids have been collected the ids in the nk.temp_distinct_links table can be checked against them. Both sides (preferred and non-preferred) must be checked separately. If a link contains an invalid study id then it is marked as invalid. At the end of the process, after all sources have been checked, all the invalid records are deleted. In most cases the proportion of invalid records is very small. In fact only about 50 records are identified (from 32,000 plus) as invalid across all the sources, with one marked exception. That exception is the European Clinical Trials Registry (EU CTR). In this case over 5000 ids, often called EUDRACT numbers, listed in other trial registries as an 'other registry id', do not correspond to any entry in the EU CTR itself. Checking the relevant titles / key words in several cases indicates that this is not a data entry error issue - the trials simply do not exist in the EU CTR, under any id. This may be a function of sponsors seeking an EU CTR id in anticipation of running a study within the EU, or to allow possible expansion to an EU country, but without proceeding to do so.
public void CheckIdsAgainstSourceStudyIds(int source_id)
{
    using (var conn = new NpgsqlConnection(_connString))
    {
        string sql_string = @"UPDATE nk.temp_distinct_links t
               SET valid = false
               FROM
                  (SELECT k.sd_sid as sd_sid
                   FROM nk.temp_distinct_links k
                   LEFT JOIN nk.temp_id_checker s
                   ON k.sd_sid = s.sd_sid
                   WHERE k.source_id = " + source_id.ToString() + @"
                   AND s.sd_sid is null) invalids
               where t.sd_sid = invalids.sd_sid";
        int res1 = conn.Execute(sql_string);
        _logger.Information(res1.ToString() + " set to invalid on sd_sid");

        sql_string = @"UPDATE nk.temp_distinct_links t
               SET valid = false
               FROM
                  (SELECT k.preferred_sd_sid
                   FROM nk.temp_distinct_links k
                   LEFT JOIN nk.temp_id_checker s
                   ON k.preferred_sd_sid = s.sd_sid
                   WHERE k.preferred_source_id = " + source_id.ToString() + @"
                   AND s.sd_sid is null) invalids
               where t.preferred_sd_sid = invalids.preferred_sd_sid";
        int res2 = conn.Execute(sql_string);
        _logger.Information(res2.ToString() + " set to invalid on preferred_sd_sid");
    }
}
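As an illustration of the check described above, the following sketch filters an in-memory list of links against the known study ids of each source. All names are hypothetical. It simplifies the SQL in one respect: each link is tested on its own (source, id) pairs, whereas the UPDATE statements flag every row that shares an invalid id value. Sources absent from the dictionary (those without study tables) are left unchecked, as in the loop above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public record Link(int SourceId, string SdSid, string PreferredSdSid, int PreferredSourceId);

public static class LinkValidator
{
    // idsBySource: for each checked source, the set of study ids it currently holds.
    public static List<Link> RemoveInvalid(
        IEnumerable<Link> links,
        IReadOnlyDictionary<int, HashSet<string>> idsBySource)
    {
        // An id is accepted if its source is not checked at all,
        // or if the source's current id list contains it.
        bool Known(int source, string id) =>
            !idsBySource.TryGetValue(source, out var ids) || ids.Contains(id);

        // Both sides of the link must refer to a known study.
        return links
            .Where(l => Known(l.SourceId, l.SdSid) &&
                        Known(l.PreferredSourceId, l.PreferredSdSid))
            .ToList();
    }
}
```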
Exclusion of One-to-Many links
An additional issue is that not all of the links identified between studies are 'simple' one-to-one links. In some cases they are one-to-many. In other words, rather than having the same study listed under different identifiers in different contexts, (scenario 1 in figure 1 below) we have a single study in one source listed as two or more studies in another source (scenario 2). In these circumstances the MDR does not attempt to merge the study data. The system cannot know how data objects and attributes listed on the 'One' side could be allocated to the studies listed on the 'Many' side, so it keeps all studies that are in this sort of relationship as individual studies.
In fact the situation can be more complex - for 30 - 40 groups of studies the relationships are many-to-many. In this case a study may be listed as equivalent to two or more studies in another registry or source, one of which is itself listed as two or more studies (see scenario 3 in figure 1). Usually one of the second set will be the original study, but others are also included. In such cases a group of 4, 5 or more (in one case 10!) are listed as equivalent studies in some way. The groupings appear to be a mix of exact equivalents, registered in multiple locations, and closely related studies that are listed as equivalent but are not in fact exactly the same (e.g. one may be a follow up study of another).
Figure 1: Types of linkage between studies
Although the MDR does not attempt to merge the data from one-to-many or many-to-many groupings, it does record the fact that these relationships exist, as a 'study_relationship' record. While some study relationships are between studies listed in the same data source, the ones considered here clearly cut across different sources. The task is therefore to identify this form of linkage and to extract it as study relationship data, removing it from the larger group of 'linked studies'. The first stage is to establish a temporary table that can hold identified study groups, and then identify the groups within nk.temp_distinct_links. A 'group' in this context is two or more studies that are all in the same data source (have the same source id) that are all matched to a particular study. The matching could be at either the preferred or less preferred side of the data so two insert statements are required. Note that at this stage it is just the 'grouping' study and source that are identified (the 'One' side) - not the members of that group (the 'Many' side).
private void Identify1ToNGroupedStudies()
{
    string sql_string;
    using (var conn = new NpgsqlConnection(_connString))
    {
        // create a table that takes the rows that involve
        // studies with multiple matches in the same source
        sql_string = @"DROP TABLE IF EXISTS nk.temp_linked_studies;
        CREATE TABLE nk.temp_linked_studies(
            group_source_id int
          , group_sd_sid varchar
          , member_sd_sid varchar
          , member_source_id int
          , source_side varchar
          , complex int DEFAULT 0
        )";
        conn.Execute(sql_string);

        // The source_id side is the group and the preferred side
        // is comprised of the grouped studies.
        sql_string = @"INSERT into nk.temp_linked_studies
              (group_source_id, group_sd_sid, member_source_id,
              member_sd_sid, source_side)
              SELECT k.source_id, k.sd_sid,
              k.preferred_source_id, k.preferred_sd_sid, 'L'
              from nk.temp_distinct_links k
              inner join
                  (SELECT source_id, sd_sid
                   FROM nk.temp_distinct_links
                   group by source_id, sd_sid, preferred_source_id
                   HAVING count(sd_sid) > 1) lhs_groups
              ON k.source_id = lhs_groups.source_id
              AND k.sd_sid = lhs_groups.sd_sid";
        conn.Execute(sql_string);

        // The preferred_source_id side is the group and the
        // non-preferred side is comprised of grouped studies
        sql_string = @"INSERT into nk.temp_linked_studies
              (group_source_id, group_sd_sid, member_source_id,
              member_sd_sid, source_side)
              SELECT k.preferred_source_id, k.preferred_sd_sid,
              k.source_id, k.sd_sid, 'R'
              from nk.temp_distinct_links k
              inner join
                  (SELECT preferred_source_id, preferred_sd_sid
                   FROM nk.temp_distinct_links
                   group by preferred_source_id, preferred_sd_sid, source_id
                   HAVING count(preferred_sd_sid) > 1) rhs_groups
              ON k.preferred_source_id = rhs_groups.preferred_source_id
              AND k.preferred_sd_sid = rhs_groups.preferred_sd_sid";
        conn.Execute(sql_string);
    }
}
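The grouping test itself - more than one link between a given study and studies in the same other source - can be sketched in memory as below. GroupFinder and its members are hypothetical names, and the input is assumed to be already distinct. Unlike the SQL above, which joins back on source and study id only, this sketch matches on the full three-part key, so only the links that actually form the one-to-many group are returned as grouped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public record Link(int SourceId, string SdSid, string PreferredSdSid, int PreferredSourceId);

public static class GroupFinder
{
    // Splits distinct links into those that take part in a one-to-many group
    // and the remaining simple one-to-one links.
    public static (List<Link> Grouped, List<Link> Simple) Split(IReadOnlyCollection<Link> links)
    {
        // 'L' groups: one left hand study linked to several studies
        // in the same preferred source.
        var leftKeys = links
            .GroupBy(l => (l.SourceId, l.SdSid, l.PreferredSourceId))
            .Where(g => g.Count() > 1)
            .Select(g => g.Key)
            .ToHashSet();

        // 'R' groups: one preferred study linked to several studies
        // in the same non-preferred source.
        var rightKeys = links
            .GroupBy(l => (l.PreferredSourceId, l.PreferredSdSid, l.SourceId))
            .Where(g => g.Count() > 1)
            .Select(g => g.Key)
            .ToHashSet();

        bool IsGrouped(Link l) =>
            leftKeys.Contains((l.SourceId, l.SdSid, l.PreferredSourceId)) ||
            rightKeys.Contains((l.PreferredSourceId, l.PreferredSdSid, l.SourceId));

        return (links.Where(IsGrouped).ToList(),
                links.Where(l => !IsGrouped(l)).ToList());
    }
}
```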
Having identified the grouping studies it is possible to extract the 'group members' - the studies that are linked to each of them. In the listing above the two insert statements record each member alongside its grouping study: the first selects the temp_distinct_links fields in their stored order, for groups where the grouping study is on the left hand side (marked 'L'), and the second reverses the field selection for those groups where the grouping study is on the right hand, preferred side (marked 'R').
The temp_linked_studies table now holds all the grouped studies, with the grouping source/studies on the left hand side, and the grouped source/studies on the right. This data can now be used to create linked study group records with the correct relationship codes. Two complementary pairs of such relationships are possible:
- 25 Includes target as one of a group of non-registered studies: This study includes <the target study>. That study is not registered independently, but instead shares this registry entry with one or more other non-registered studies.
- 26 Non registered but included within a registered study group: This study is registered as <the target study>, along with one or more other studies that share the same registry entry and id.
- 28 Includes target as one of a group of registered studies: This study includes <the target study>, which is registered elsewhere along with one or more other registered studies, forming a group that collectively equates to this study.
- 29 Registered and is included elsewhere in group: This study is also registered, along with one or more other studies that together form an equivalent group, as <the target study>.
BioLINCC (id = 101900) and Yoda (id = 101901) are the only two non registry study sources at the moment, and this is reflected in the SQL below. The relationship codes for these data repositories are different than for the trial registries.
        sql_string = @"INSERT INTO nk.linked_study_groups
              (source_id, sd_sid, relationship_id,
              target_sd_sid, target_source_id)
              select distinct source_id, sd_sid,
              case when preferred_source_id = 101900
                   or preferred_source_id = 101901 then 25
              else 28 end,
              preferred_sd_sid, preferred_source_id
              from nk.temp_linked_studies;";
        conn.Execute(sql_string);

        sql_string = @"INSERT INTO nk.linked_study_groups
              (source_id, sd_sid, relationship_id,
              target_sd_sid, target_source_id)
              select distinct preferred_source_id, preferred_sd_sid,
              case when source_id = 101900
                   or source_id = 101901 then 26
              else 29 end,
              sd_sid, source_id
              from nk.temp_linked_studies;";
        conn.Execute(sql_string);
    }
}
Finally, it is necessary to remove the grouped study data from the main set of nk.temp_distinct_links, as shown below, and drop the temporary tables that have been used.
public void DeleteGroupedStudyLinkRecords()
{
    string sql_string;
    using (var conn = new NpgsqlConnection(connString))
    {
        // Now need to delete these grouped records from the links table...
        sql_string = @"DELETE FROM nk.temp_distinct_links k
              USING nk.temp_grouping_studies g
              WHERE k.source_id = g.source_id
              and k.sd_sid = g.sd_sid
              and k.preferred_source_id = g.matching_source_id
              and g.side = 'L';";
        conn.Execute(sql_string);

        sql_string = @"DELETE FROM nk.temp_distinct_links k
              USING nk.temp_grouping_studies g
              WHERE k.preferred_source_id = g.source_id
              and k.preferred_sd_sid = g.sd_sid
              and k.source_id = g.matching_source_id
              and g.side = 'R';";
        conn.Execute(sql_string);

        sql_string = @"DROP TABLE IF EXISTS nk.temp_grouping_studies;
              DROP TABLE IF EXISTS nk.temp_linked_studies";
        conn.Execute(sql_string);
    }
}
Cascading Links
The linked study data requires further processing to ensure that it is complete. There are two additional problems that can arise when a study is found in three or more sources, that need to be resolved. These occur because there is no guarantee that all possible links, or the 'correct' links, were in the original data.
The first problem is that although the links table currently identifies the more 'preferred' of two linked study ids, it does not necessarily identify the most preferred when there are three or more links. For instance a study could have identifiers in sources A, B and C, where A is less preferred than B, and B less preferred than C. In the source data there are links between A and B, and B and C, but no link exists between A and C, so the data shows A -> B and B -> C. For the aggregation process to work properly the data has to show A -> C and B -> C, i.e. all links should be to the most preferred id. The B -> C link is fine, but the A -> B link has to be 'telescoped' with the 'B -> C' link, and replaced by the resultant A -> C link.
The second issue is that some links may be missing - they are implied by the existing data but not present explicitly. For example study D is linked to both E and F in the source data (D -> E and D -> F), with the more preferred again on the right hand side, but no link exists between E and F. D -> F is fine but D -> E needs to be replaced by the missing E -> F.
Figure 2: Repairing the linkage between 3 or more studies
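The 'telescoping' step described above (A -> B and B -> C becoming A -> C) is not listed on this page, so the following is only a hedged in-memory sketch of the idea, with hypothetical names (Link, LinkCascader); it is not the project's CascadeLinksTable function. It assumes every link points from a less preferred to a strictly more preferred source, which rules out cycles and guarantees that the loop terminates.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public record Link(int SourceId, string SdSid, string PreferredSdSid, int PreferredSourceId);

public static class LinkCascader
{
    // Repeatedly replaces X -> Y with X -> Z wherever a link Y -> Z exists,
    // until every link points at a study that has no onward link.
    public static List<Link> Cascade(IEnumerable<Link> links)
    {
        var current = links.Distinct().ToList();
        while (true)
        {
            // Index links by their left hand (less preferred) study.
            var onward = current.ToLookup(l => (l.SourceId, l.SdSid));
            var next = new List<Link>();
            bool changed = false;

            foreach (var l in current)
            {
                // A missing key yields an empty sequence, not an exception.
                var hops = onward[(l.PreferredSourceId, l.PreferredSdSid)].ToList();
                if (hops.Count == 0)
                {
                    next.Add(l);    // already points at a terminal study
                    continue;
                }
                changed = true;
                foreach (var h in hops)
                    next.Add(l with
                    {
                        PreferredSdSid = h.PreferredSdSid,
                        PreferredSourceId = h.PreferredSourceId
                    });
            }

            current = next.Distinct().ToList();
            if (!changed) return current;
        }
    }
}
```

Given A -> B and B -> C the result is A -> C and B -> C; given D -> E, D -> F and a repaired E -> F, the D -> E link becomes a second D -> F and is removed as a duplicate.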
It is necessary to add any 'missing links' first, otherwise the telescoping process described above cannot be guaranteed to work properly. To do this, studies that have more than one 'preferred' option are first identified - these will always link across to more than one preferred source (e.g. D -> E, D -> F) because the grouped studies (linked to the same source / study) have already been removed. This list of studies (in the example all the studies of type 'D') is inserted into temp_studies_with_multiple_links. The process then creates the temp_missing_links table, adding the right hand side preferred fields, by joining this table with the records in the main links table temp_distinct_links, at the same time incorporating the preference ratings of the sources involved. temp_missing_links therefore holds the 'records of interest' from the original links table - all the D -> E, D -> F pairs - together with the preference ratings for all the Ds, Es and Fs.
public void ManageIncompleteLinks()
{
    string sql_string;
    using (var conn = new NpgsqlConnection(connString))
    {
        sql_string = @"DROP TABLE IF EXISTS nk.temp_studies_with_multiple_links;
        CREATE TABLE nk.temp_studies_with_multiple_links
        as SELECT source_id, sd_sid
        from nk.temp_distinct_links
        group by source_id, sd_sid
        having count(distinct preferred_source_id) > 1;";
        conn.Execute(sql_string);

        sql_string = @"DROP TABLE IF EXISTS nk.temp_missing_links;
        CREATE TABLE nk.temp_missing_links as
        select k.source_id, r1.preference_rating as source_rating,
        k.sd_sid, k.preferred_source_id, r2.preference_rating,
        k.preferred_sd_sid
        from nk.temp_distinct_links k
        inner join nk.temp_studies_with_multiple_links m
        on k.source_id = m.source_id
        and k.sd_sid = m.sd_sid
        inner join nk.temp_preferences r1
        on k.source_id = r1.id
        inner join nk.temp_preferences r2
        on k.preferred_source_id = r2.id
        order by k.source_id, k.sd_sid, preferred_source_id;";
        conn.Execute(sql_string);
        ...
Now a further temp table (temp_new_links) is created to construct the missing links between the E and F studies. This table has 6 fields. The first two are populated with the source id / sd_sid pair that identifies the source records in nk.temp_missing_links (D in the example above). The next pair holds the source / sd_sid pair with the less preferred rating, i.e. the study that will need to be on the left hand side of the new link (E in the example above). Placeholders are put in for the new_preferred_source and new_preferred_sd_sid fields. The table is then updated, this time with the source / sd_sid pair that has the most preferred rating (the minimum preference_rating value), i.e. the study that will need to be on the right hand side of the new link (F in the example above), going into the new_preferred_ fields. Thus, if nk.temp_missing_links includes the fields
D-source id | D-sd_sid | D-preference | E-source id | E-sd_sid | E-preference
D-source id | D-sd_sid | D-preference | F-source id | F-sd_sid | F-preference
The fields in temp_new_links will initially be:
D-source id | D-sd_sid | E-source id | E-sd_sid | 0 | '--'
D-source id | D-sd_sid | E-source id | E-sd_sid | 0 | '--'
and are then updated to:
D-source id | D-sd_sid | E-source id | E-sd_sid | F-source | F-sd_sid
D-source id | D-sd_sid | E-source id | E-sd_sid | F-source | F-sd_sid
The four right hand fields now represent the missing (E -> F) links that were required and as such can be inserted into the temp_distinct_links table (as a distinct selection, as they exist in pairs). Of course in some cases the E -> F links may have already existed, as well as the D -> E, D -> F links. The additional record is then redundant, and will be removed later, but the process has to be carried out on the assumption that the E -> F type links are missing. Finally, the temporary tables are dropped.
        ...
        sql_string = @"DROP TABLE IF EXISTS nk.temp_new_links;
        CREATE TABLE nk.temp_new_links as
        select m.source_id, m.sd_sid,
        m.preferred_source_id as new_source_id,
        m.preferred_sd_sid as new_sd_sid,
        0 as new_preferred_source, '' as new_preferred_sd_sid
        from nk.temp_missing_links m
        inner join
            (select source_id, sd_sid, min(preference_rating) as min_rating
             from nk.temp_missing_links
             group by source_id, sd_sid) mins
        on m.source_id = mins.source_id
        and m.sd_sid = mins.sd_sid
        and m.preference_rating <> mins.min_rating
        order by source_id, sd_sid;";
        conn.Execute(sql_string);

        // Update the last pair of fields in the temp_new_links table with the source / sd_sid
        // that represents the study with the minimally rated source id, i.e. the 'correct' preferred option

        sql_string = @"UPDATE nk.temp_new_links k
        SET new_preferred_source = min_set.preferred_source_id,
        new_preferred_sd_sid = min_set.preferred_sd_sid
        FROM
            (select m.* from nk.temp_missing_links m
             INNER JOIN
                (select source_id, sd_sid, min(preference_rating) as min_rating
                 from nk.temp_missing_links
                 group by source_id, sd_sid) mins
             on m.source_id = mins.source_id
             and m.sd_sid = mins.sd_sid
             and m.preference_rating = mins.min_rating) min_set
        WHERE k.source_id = min_set.source_id
        AND k.sd_sid = min_set.sd_sid;";
        conn.Execute(sql_string);

        // Insert the new links into the distinct_links table.
        // These links will need re-processing through the CascadeLinksTable() function.

        sql_string = @"INSERT INTO nk.temp_distinct_links
        (source_id, sd_sid, preferred_sd_sid, preferred_source_id)
        SELECT distinct new_source_id, new_sd_sid, new_preferred_sd_sid, new_preferred_source
        from nk.temp_new_links;";
        conn.Execute(sql_string);

        // drop the temp tables

        sql_string = @"DROP TABLE IF EXISTS nk.temp_missing_links;
        DROP TABLE IF EXISTS nk.temp_new_links;";
        conn.Execute(sql_string);
    }
}
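The repair step can also be followed on a small in-memory model. The sketch below is illustrative only: it is written in Python rather than the C# / SQL used by the system, the function name add_missing_links and the numeric source ids are hypothetical, and it assumes that a lower preference rating means a more preferred source (as the code comments above suggest) and that no two sources share a rating.

```python
from collections import defaultdict

def add_missing_links(links, rating):
    """Return links plus any implied 'less preferred -> most preferred' links.

    links  : set of (source_id, sd_sid, preferred_source_id, preferred_sd_sid)
    rating : dict of source_id -> preference rating (assumed: lower = more preferred)
    """
    # Group the preferred (right hand) studies by their left hand study.
    targets_by_study = defaultdict(set)
    for source_id, sd_sid, pref_source_id, pref_sd_sid in links:
        targets_by_study[(source_id, sd_sid)].add((pref_source_id, pref_sd_sid))

    new_links = set()
    for targets in targets_by_study.values():
        # Only studies linked to more than one preferred source are of interest.
        if len({src for src, _ in targets}) < 2:
            continue
        # The target with the minimum rating goes on the right hand side (F).
        best = min(targets, key=lambda t: rating[t[0]])
        for target in targets:
            if rating[target[0]] != rating[best[0]]:
                new_links.add(target + best)  # the missing E -> F link
    return set(links) | new_links

# Hypothetical data: D is in source 30, E in source 20, F in source 10.
rating = {10: 1, 20: 2, 30: 3}
links = {(30, "D1", 20, "E1"), (30, "D1", 10, "F1")}
repaired = add_missing_links(links, rating)
```

As in the SQL version, the original D -> E and D -> F links are left in place at this stage; it is the later telescoping step that turns D -> E into D -> F.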
Having filled in any 'missing links', the system can safely tackle the telescoping of links. In this scenario (both A -> B and B -> C exist, but not A -> C), the same study (B in this case) appears both in the left hand side 'less preferred' columns of the temp_distinct_links table and in the more preferred right hand side. The requirement is to replace the B on the right hand side with the 'true' most preferred study id, which is C. A self join on the temp_distinct_links table, linking studies on both 'sides' of the table, can be used first to see if there are any studies that are both 'less' and 'more' preferred, and then to make the switch, replacing the preferred side data on the right of the table with the 'most preferred' values. In the example, A -> B is linked to B -> C by the join, and A -> B becomes A -> C after the update, 'telescoping' the links together.
Because a few studies are registered four or more times, the process needs to be repeated until no studies remain that exist in both the less and more preferred columns. A while loop is therefore used to repeat the action as often as necessary - usually twice.
In some cases the telescoped link may have already been present - i.e. A -> B, B -> C and A -> C were all there. In these cases the process described above (still necessary to remove the A -> B link) will result in duplicates. Any duplicates therefore have to be removed at the end of the process. This is done by a select distinct into a new table, which is then renamed back to temp_distinct_links.
public void CascadeLinksInDistinctLinksTable()
{
    using (var conn = new NpgsqlConnection(connString))
    {
        string sql_string;
        int match_number = 500;  // arbitrary start number
        while (match_number > 0)
        {
            // get match number as number of link records where the rhs sd_sid
            // appears elsewhere on the left...

            sql_string = @"SELECT count(*)
            FROM nk.temp_distinct_links t1
            inner join nk.temp_distinct_links t2
            on t1.preferred_source_id = t2.source_id
            and t1.preferred_sd_sid = t2.sd_sid";
            match_number = conn.ExecuteScalar<int>(sql_string);

            if (match_number > 0)
            {
                // do the update

                sql_string = @"UPDATE nk.temp_distinct_links t1
                SET preferred_source_id = t2.preferred_source_id,
                preferred_sd_sid = t2.preferred_sd_sid
                FROM nk.temp_distinct_links t2
                WHERE t1.preferred_source_id = t2.source_id
                AND t1.preferred_sd_sid = t2.sd_sid";
                conn.Execute(sql_string);
            }
        }

        sql_string = @"DROP TABLE IF EXISTS nk.temp_distinct_links2;
        CREATE TABLE nk.temp_distinct_links2
        as SELECT distinct * FROM nk.temp_distinct_links";
        conn.Execute(sql_string);

        sql_string = @"DROP TABLE IF EXISTS nk.temp_distinct_links;
        ALTER TABLE nk.temp_distinct_links2 RENAME TO temp_distinct_links;";
        conn.Execute(sql_string);
    }
}
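The telescoping loop can be illustrated with a small in-memory model. This is a sketch only, in Python rather than the C# / SQL above; the function name cascade_links and the sample ids are hypothetical, and it assumes the links contain no cycles (which the preference ordering of sources should ensure). Holding the links in a set means the final de-duplication happens implicitly.

```python
def cascade_links(links):
    """Repeatedly replace each right hand study with its own preferred study.

    links: set of (source_id, sd_sid, preferred_source_id, preferred_sd_sid).
    Stops when no right hand study also appears on the left hand side.
    """
    links = set(links)
    while True:
        # Map every left hand study to one of its preferred studies.
        preferred_of = {(s, sid): (ps, psid) for s, sid, ps, psid in links}
        updated = set()
        for s, sid, ps, psid in links:
            # If the right hand study is itself 'less preferred' elsewhere,
            # step one link further along the chain (A -> B becomes A -> C).
            new_ps, new_psid = preferred_of.get((ps, psid), (ps, psid))
            updated.add((s, sid, new_ps, new_psid))
        if updated == links:
            return links
        links = updated

# Hypothetical chain A -> B -> C -> D, with A -> C already present.
chain = {
    (40, "A", 30, "B"),
    (30, "B", 20, "C"),
    (20, "C", 10, "D"),
    (40, "A", 20, "C"),
}
telescoped = cascade_links(chain)
```

In this example every study ends up pointing directly at D, and the two links that started from A collapse into a single record.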
There is one final stage to this process. The processing described above can lead to a small number of new 'groupings' appearing, when more than one study in a single source is linked to a single study elsewhere. The system therefore re-runs the 'exclusion of one-many' routine, to remove these from the nk.temp_distinct_links table.
Updating links with Study_Id
The data is then transferred from the temporary table to the study_study_links table - a table that is created anew at the beginning of each aggregation process. Again a select distinct is used, as a final check to ensure that any duplicates are removed.
public void TransferNewLinksToDataTable()
{
    // A distinct selection is required because the most recent
    // link cascade may have generated duplicates

    using (var conn = new NpgsqlConnection(connString))
    {
        string sql_string = @"Insert into nk.study_study_links
        (source_id, sd_sid, preferred_sd_sid, preferred_source_id)
        select distinct source_id, sd_sid, preferred_sd_sid, preferred_source_id
        from nk.temp_distinct_links";
        conn.Execute(sql_string);
    }
}
The study links table is then updated with the study Id to be used for each non-preferred / preferred pair, as described below, and is then available to play two critical roles during the aggregation process.
- To ensure that a study Id is allocated consistently across all aggregations, and
- To check if a study is a 'non-preferred' version of a study that has been added earlier, and thus modify the details of the data aggregation process
These two tasks are distinct because a study's non-preferred entry may be added to the MDR before its preferred set of data; for example, the EUCTR entry may be added a week before the same study appears in ClinicalTrials.gov. In that example, the persistent study Id within the system (the ECRIN ID) will be allocated when the EUCTR data is first added, and will initially be linked only to that EUCTR data. After the ClinicalTrials.gov (CTG) record is added the following week, the EUCTR data becomes 'non-preferred', and the CTG data will be aggregated before the EUCTR data, but the ECRIN study Id must not change - it must remain as originally allocated. The study Id allocated to the CTG record must therefore be that generated by the EUCTR record.
To make that happen, the study_id field of the study_study_links table is first updated using the data in the preferred side of the table, i.e. the preferred source and sd_sid are matched against the study_ids table. This provides a study_id for all the study pairs in the table where the preferred data has previously been aggregated, whether or not the non-preferred data is new in this aggregation process. If, however, it is the preferred data that is new, and the non-preferred data that is already referenced in the study_ids table, the study_id must be provided by a second update, applied where the links table study_id is still null, which matches the non-preferred source id and sd_sid to retrieve the study id. Thus, if a study appears to be new and the links table is checked to see if this is the case (see below), a study Id can be retrieved.
There is another scenario, where the two (or more) equivalent identifiers in the links table are all new to the system, i.e. have appeared in the system within the same week. In that case neither of the updates described above will bring back a study id. This is not a problem for the 'preferred' version of the study data, which will be added first. It will be seen as completely new to the system and be allocated a study Id automatically, as all new studies are. Ensuring that the non-preferred duplicated study is always allocated the correct study_id, however, requires that the study links table is updated after each addition of new study_ids to the system. Specifically, once all study_ids have been allocated for a source's studies, the links table should again be updated, so that any record with a null study_id that matches one of the new 'preferred' study ids has that study id added to the table.
Note that for future aggregations the system automatically ensures that the correct id is used. In the example above, for instance, the study id for the CTG and EUCTR data will be found by a simple matching against the study_ids table. But if another new version of the study data appears, this time in the Peruvian trial registry, the first update described above will match the study links record with the CTG 'preferred' source id and source study identifier, and the Peruvian data will be matched to that study Id as returned from the CTG entry in the study_ids table. Because of what happened the first time this entry appeared, that value will be the one that was originally matched with the EUCTR entry. In other words, the study_ids table preserves the original Id allocation, however it was created.
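The three updates of the links table described above can be sketched on in-memory data. This is an illustration under stated assumptions, not the system's code: it is written in Python, the function names, source labels and id values are hypothetical, the study_ids table is modelled as a dictionary keyed on (source id, sd_sid), and each links record is a dictionary whose study_id starts as None.

```python
def fill_link_study_ids(links, study_ids):
    """First and second updates, run when the links table has just been built."""
    for link in links:
        preferred = (link["preferred_source_id"], link["preferred_sd_sid"])
        non_preferred = (link["source_id"], link["sd_sid"])
        # First update: the preferred side has been aggregated previously.
        link["study_id"] = study_ids.get(preferred)
        # Second update: only where still null, try the non-preferred side.
        if link["study_id"] is None:
            link["study_id"] = study_ids.get(non_preferred)

def refresh_after_allocation(links, study_ids):
    """Third update, run after new study ids are allocated for a source."""
    for link in links:
        if link["study_id"] is None:
            preferred = (link["preferred_source_id"], link["preferred_sd_sid"])
            link["study_id"] = study_ids.get(preferred)

# Hypothetical state: one EUCTR study is already known, one pair is entirely new.
study_ids = {("euctr", "E-1"): 2000001}
links = [
    {"source_id": "euctr", "sd_sid": "E-1",
     "preferred_source_id": "ctg", "preferred_sd_sid": "N-1", "study_id": None},
    {"source_id": "euctr", "sd_sid": "E-2",
     "preferred_source_id": "ctg", "preferred_sd_sid": "N-2", "study_id": None},
]
fill_link_study_ids(links, study_ids)
id_after_fill = links[1]["study_id"]   # still None: both sides are new

study_ids[("ctg", "N-2")] = 2000002    # the preferred study is added first
refresh_after_allocation(links, study_ids)
```

The first pair picks up the id originally allocated to the EUCTR record, even though CTG is now the preferred source. The second pair only acquires an id once the preferred CTG study has been allocated one.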
During the study aggregation process
- The source ids and source study ids are first checked against the study_ids table. Those with a direct match are given the study_id from that table.
- If not a direct match, the study may be a new less-preferred or new more-preferred version of an existing study. Both sides of the link table are checked, preferred then non-preferred. If either matches with a source id / source study id pair the ECRIN study id can be obtained directly from the table, and the study ids table updated accordingly.
- If there is still no match the study must be completely new to the system. It will be allocated a new study id using the study_ids table.
- If the new study is part of a linked pair, it will be the preferred side of that pair (by definition, as preferred data is added first) and the preferred source and source study ids can be used to update the study link record with the new study id.
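The four steps above can be sketched as a single lookup function. Again this is only an in-memory Python illustration: resolve_study_id, the source labels and the id values are hypothetical, study_ids is modelled as a dictionary, and new ids are drawn from a simple counter rather than from the study_ids table.

```python
from itertools import count

def resolve_study_id(source_id, sd_sid, study_ids, links, new_ids):
    """Return the study id for a study, allocating a new one if required."""
    key = (source_id, sd_sid)
    # 1. Direct match against the study_ids table.
    if key in study_ids:
        return study_ids[key]
    # 2. Check both sides of the links table, preferred then non-preferred.
    for src_col, sid_col in (("preferred_source_id", "preferred_sd_sid"),
                             ("source_id", "sd_sid")):
        for link in links:
            if (link[src_col], link[sid_col]) == key and link["study_id"] is not None:
                study_ids[key] = link["study_id"]
                return link["study_id"]
    # 3. Completely new to the system: allocate a new id.
    study_id = next(new_ids)
    study_ids[key] = study_id
    # 4. If it is the preferred side of a linked pair, record the id on the link.
    for link in links:
        if (link["preferred_source_id"], link["preferred_sd_sid"]) == key:
            link["study_id"] = study_id
    return study_id

# Hypothetical state: E-1 was aggregated earlier; E-2 / N-2 are both new.
study_ids = {("euctr", "E-1"): 2000001}
links = [
    {"source_id": "euctr", "sd_sid": "E-1",
     "preferred_source_id": "ctg", "preferred_sd_sid": "N-1", "study_id": 2000001},
    {"source_id": "euctr", "sd_sid": "E-2",
     "preferred_source_id": "ctg", "preferred_sd_sid": "N-2", "study_id": None},
]
new_ids = count(2000002)
id_n1 = resolve_study_id("ctg", "N-1", study_ids, links, new_ids)    # via link
id_n2 = resolve_study_id("ctg", "N-2", study_ids, links, new_ids)    # new id
id_e2 = resolve_study_id("euctr", "E-2", study_ids, links, new_ids)  # via link
```

Because preferred sources are processed first, the new CTG study N-2 is allocated its id before the matching EUCTR record E-2 is examined, so E-2 then finds that id through the links table.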
As a separate process, (i.e. independent of the allocation of the study id) the new study id is marked as belonging to a preferred or non-preferred data source. If the data is designated as being from a non-preferred source then, as described above, different rules are applied in adding the data. In the example given, when the EUCTR data is added it will be recognised as being the non-preferred version of the data already aggregated into the system from CTG, so only new or additional data will be added to the study record and its attributes, even though the study id within the MDR is as originally allocated to the EUCTR data.
Final Steps
The various temporary tables that remain are removed.
public void DropTempTables()
{
    using (var conn = new NpgsqlConnection(connString))
    {
        string sql_string = @"DROP TABLE IF EXISTS nk.temp_preferences;
        DROP TABLE IF EXISTS nk.temp_study_links_collector;
        DROP TABLE IF EXISTS nk.temp_distinct_links;";
        conn.Execute(sql_string);
    }
}
Note that at a later stage in the aggregation process, after all study data has been aggregated, the one-to-many linked study data are added from the nk schema to the study_relationships table in the study (st) schema.
public void AddStudyStudyRelationshipRecords()
{
    // Use nk.all_ids_studies to insert the study Ids
    // for the linked sources / sd_sids, using
    // nk.linked_study_groups as the source

    using (var conn = new NpgsqlConnection(connString))
    {
        string sql_string = @"Insert into st.study_relationships
        (study_id, relationship_type_id, target_study_id)
        select s1.study_id, g.relationship_id, s2.study_id
        from nk.linked_study_groups g
        inner join nk.all_ids_studies s1
        on g.source_id = s1.source_id
        and g.sd_sid = s1.sd_sid
        inner join nk.all_ids_studies s2
        on g.target_source_id = s2.source_id
        and g.target_sd_sid = s2.sd_sid";
        conn.Execute(sql_string);
    }
}
}