Identifying Links between Studies

From ECRIN-MDR Wiki

Introduction

One of the issues that has to be tackled during aggregation of data from different sources is the fact that the same study can be found, and in tens of thousands of cases is found, in more than one study-based source, and that it will have a different persistent identifier in each source. Partly this is because studies can be registered in more than one trial registry, especially when local regulations mandate a registration for any study carried out within a particular country or region. This is especially the case with the EU, which insists that all trials involving medicinal products must be registered in the EUCTR. About a third of these studies, however, are also registered in other registries, especially ClinicalTrials.gov. In addition, within a data repository, studies will usually be referenced by a local id rather than a pre-existing registry id.
Study titles cannot be relied upon to identify the same study in different source locations. A study title is often expressed slightly differently in different contexts, and cannot in any case be relied upon to be unique (even within the same source). It may be that further research will indicate how titles could be reframed (e.g. to a smaller number of keywords, expressed in a fixed order) to allow duplicate entries to be discovered using text, but for the moment the only easy way of doing this is by using the 'other identifiers' material found in the source data. Almost all sources contain this material, which usually includes any other trial registry ids (i.e. other than that used in the source registry entry), as well as ids assigned by the sponsor, funder or, sometimes, a regulatory authority.
These 'other registry ids' can be used to build up a table of study-study links which can then be used during the aggregation process to identify when duplicate studies are being added. In fact the creation of this table is always the first stage of any aggregation. This page describes this process in detail.
At the moment, with one exception, the requirement for identifying duplications only extends to studies - not data objects. The exception is provided by PubMed citations for journal papers, which can also be found multiple times in the source data, both within a single source and across multiple sources. The particular complications in processing PubMed data are described in Processing PubMed Data. The assumption for now is that other data objects are not duplicated across data sources, although this may need to be considered in the future. The rules for adding data objects are described more fully in Aggregating Data.

The Preferred Source concept

If details about a study and its attributes can be found in more than one data source, the obvious question is how should this data be merged in the final aggregated MDR database? In fact there are several aspects to this question:

  • How should 'single-occurrence' details about the study be merged? (i.e. the data points that appear in the study record itself, such as study display title, study type and status, enrolment target, min and max ages etc.)
  • How should study attributes (identifiers, titles, contributors, topics etc.) be merged?
  • How should data objects in different sources (trial registry entries, data and document references) be merged?
  • How should data object attributes be merged?

The first question is probably the most difficult. If the system allowed these basic study parameters to be merged from a variety of sources the problem of 'precedence' immediately arises: how could the system 'know' which source to use for each parameter, if it were available in more than one? For instance how would the system select the enrolment target from one repository rather than another? If the data is edited in one source but not another, should the most recent data always take precedence over the old, even though it may be less complete than what is already there?
Because of these issues it was decided to

  • Only take the study single occurrence data items from a single data source
  • Use the idea of 'preferred sources' to order the precedence of sources in a consistent fashion.

Each data source is therefore given a number that indicates its place in this ordering - the exact number is not important, only the place it gives to the data source when the sources are listed, in ascending order, using this parameter. The number is stored within the sf.source_parameters table in the mon database, and is available as a property of the Source objects when they are retrieved from this table. During the aggregation process, the data sources are processed in order, most preferred first, down to the 'least preferred'. In general, more preferred sources have richer or at least more consistent data than less preferred sources. The most preferred source is ClinicalTrials.gov, followed (in broad terms) by registries that are extracted individually, then the various trial registries using the WHO dataset, and then data repositories (e.g. Yoda, BioLINCC), and finally object based data sources (PubMed).
Although it cannot be guaranteed in every case, this usually means that the richer data is added to the system first, and that data coming from later, less preferred sources augments rather than replaces it. In general (and as described in more detail in Aggregating Data), during aggregation the core databases are recreated and then completely rebuilt by adding data from each source in turn. If a study does not already exist in the core system when it is 'presented' from a data source database, then it, its attributes, its associated data objects and their attributes are all added during the aggregation process. But if a study is added that already exists in the core system, because it has been added from a 'more preferred' source earlier in the process:

  • The single-occurrence study details are ignored
  • Study attributes are only added if they are definitely different from any that have already been added
  • Data objects are added (unless, very rarely, they can be seen to be already in the system)
  • Data object attributes are added if the data object records themselves are.

The Preferred source idea also plays a part in establishing the study-study link data. If a study has multiple source (registry / repository) identifiers in the system, each of those identifiers, other than the 'most preferred', will need to be linked to the identifier in its 'most preferred' source. That way the system can find the first instance of the study in the system (because it will be guaranteed to always be present in the core systems), if and when a 'less preferred' set of data for the study is presented. The study-study link data therefore needs to be ordered - it should relate each less preferred identifier to its most preferred twin. Exactly how this is done is explained below.
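
The effect of processing sources in preference order can be illustrated with a small in-memory sketch. It is not the MDR aggregation code: the SourceInfo and StudyRef types, the PreferredSourceSketch class and the lookup of links are all hypothetical names introduced for the example, and it only covers the decision about whether a presented study is treated as new.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical types for illustration only; they are not the MDR classes.
public record SourceInfo(int Id, int PreferenceRating);
public record StudyRef(int SourceId, string SdSid);

public static class PreferredSourceSketch
{
    // Returns the studies that would be added in full, in processing order.
    // A study that is linked to a more preferred twin is skipped here, because
    // its single-occurrence details would be ignored during aggregation.
    public static List<StudyRef> StudiesAddedInFull(
        IEnumerable<SourceInfo> sources,
        IEnumerable<StudyRef> studies,
        IReadOnlyDictionary<StudyRef, StudyRef> linkToPreferred)
    {
        var studyList = studies.ToList();
        var added = new List<StudyRef>();

        // A lower rating means 'more preferred', so process in ascending order.
        foreach (var source in sources.OrderBy(s => s.PreferenceRating))
        {
            foreach (var study in studyList.Where(st => st.SourceId == source.Id))
            {
                if (!linkToPreferred.ContainsKey(study))
                {
                    added.Add(study);
                }
            }
        }
        return added;
    }
}
```

Because the records have value equality, a StudyRef can be used directly as the dictionary key. In a fuller version the skipped studies would still contribute attributes and data objects, following the rules listed above.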

Initial Links Data Collection

The initial step is to collect in all the 'other identifier' information from the source databases.
Two temporary tables are established in the links (nk) schema of the mdr database. One simply collects the link data found, and the second is used to hold it once it has been ordered properly.

        public void SetUpTempLinkCollectorTable()
        {
            using (var conn = new NpgsqlConnection(connString))
            {
                string sql_string = @"DROP TABLE IF EXISTS nk.temp_study_links_collector;
                        CREATE TABLE nk.temp_study_links_collector(
                        source_1 int
                      , sd_sid_1 varchar
                      , sd_sid_2 varchar
                      , source_2 int) ";
                conn.Execute(sql_string);
            }
        }

        public void SetUpTempLinkSortedTable()
        {
            using (var conn = new NpgsqlConnection(connString))
            {
                string sql_string = @"DROP TABLE IF EXISTS nk.temp_study_links_sorted;
                        CREATE TABLE nk.temp_study_links_sorted(
                        source_id int
                      , sd_sid varchar
                      , preferred_sd_sid varchar
                      , preferred_source_id int) ";
                conn.Execute(sql_string);
            }
        }

The system then iterates through all the sources, interrogating each of the associated databases in turn to bring back the 'other identifier' data, as sd_sid_2, and the id of the organisation assigning the identifier as source_2. That is matched against the source organisation (source_1) and the study's identifier sd_sid (sd_sid_1). The data is pre-filtered to include only identifiers associated with trial registries (and not those assigned by sponsors or funders, etc.). Repository data sources are included in this process but the repository identifiers themselves are not returned - so Yoda and BioLINCC (for example) identifiers appear as sd_sid_1, with a matching sd_sid_2 from a trial registry, but never as sd_sid_2. The data is brought back as an IEnumerable collection of StudyLink objects, which is then immediately stored in the temp_study_links_collector table using the PostgreSQLCopyHelper NuGet package, which allows quick and easy storage of batches of data in one call. As the iteration across the sources proceeds, this table gradually grows with the records found each time.

        foreach (Source s in sources)
        {
            // Fetch the study-study links and store them
            // in the Collector table
            IEnumerable<StudyLink> links = slh.FetchLinks(s.id, s.database_name);
            slh.StoreLinksInTempTable(CopyHelpers.links_helper, links);
        }

        public IEnumerable<StudyLink> FetchLinks(int source_id, string database_name)
        {
            string conn_string = repo.GetConnString(database_name);

            using (var conn = new NpgsqlConnection(conn_string))
            {
                string sql_string = @"select " + source_id.ToString() + @" as source_1, 
                    sd_sid as sd_sid_1, 
                    identifier_value as sd_sid_2, identifier_org_id as source_2
                    from ad.study_identifiers
                    where identifier_type_id = 11
                    and identifier_org_id > 100115
                    and (identifier_org_id < 100133 or identifier_org_id = 101989)
                    and identifier_org_id <> " + source_id.ToString();
                return conn.Query<StudyLink>(sql_string);
            }
        }

        public ulong StoreLinksInTempTable(PostgreSQLCopyHelper<StudyLink> copyHelper, IEnumerable<StudyLink> entities)
        {
            using (var conn = new NpgsqlConnection(connString))
            {
                conn.Open();
                return copyHelper.SaveAll(conn, entities);
            }
        }

One of the difficulties in the process is that the other identifier data is rarely constrained in the source systems - it can therefore sometimes be poorly formatted or otherwise mis-entered. It is necessary to try to clean some of the data before proceeding further, using a succession of SQL statements - a few of which are shown below. In the examples shown, the SQL a) deletes WHO Universal Trial Numbers incorrectly entered as trial registry ids, b) replaces en dashes with ordinary hyphens, making later processing easier, c) updates entries indicating an Australian ACTRN identifier that are not prefixed by 'ACTRN', and d) removes the redundant 'Chinese Clinical Trial Register' statement from ids from that registry which include it. These are, however, just 4 examples from a total of over 30.

        public void TidyIds1()
        {
            string sql_string = "";
            using (var conn = new NpgsqlConnection(connString))
            {
                // a) delete WHO universal trial numbers entered as registry ids
                sql_string = @"DELETE from nk.temp_study_links_collector
                where sd_sid_2 ilike 'U1111%' or sd_sid_2 ilike 'UTRN%'";
                conn.Execute(sql_string);

                // b) replace en dashes with ordinary hyphens
                sql_string = @"UPDATE nk.temp_study_links_collector
                set sd_sid_2 = replace(sd_sid_2, '–', '-');";
                conn.Execute(sql_string);

                // c) add the missing 'ACTRN' prefix to 14 character Australian ids
                sql_string = @"UPDATE nk.temp_study_links_collector
                SET sd_sid_2 = 'ACTRN' || sd_sid_2
                WHERE source_2 = 100116
                and length(sd_sid_2) = 14";
                conn.Execute(sql_string);

                // d) remove the redundant registry name from Chinese registry ids
                sql_string = @"UPDATE nk.temp_study_links_collector
                set sd_sid_2 = replace(sd_sid_2, 'Chinese Clinical Trial Register', '')
                where source_2 = 100118;";
                conn.Execute(sql_string);
    
                ...
                ...
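
For readers who find the rules easier to follow outside SQL, the four examples can be restated as a plain string function. This is only an illustrative sketch, not part of the MDR code base: the IdTidySketch class and TryTidy method are hypothetical names, and only the four rules shown above are covered, applied in the same order as the SQL statements.

```csharp
using System;

public static class IdTidySketch
{
    // Returns false if the link should be discarded; otherwise 'cleaned'
    // holds the tidied identifier. sourceId2 is the id of the registry
    // that assigned the identifier (source_2 in the collector table).
    public static bool TryTidy(string id, int sourceId2, out string cleaned)
    {
        cleaned = id;

        // a) WHO universal trial numbers are not registry ids, so drop them
        //    (a case-insensitive prefix test, like the ILIKE in the SQL)
        if (id.StartsWith("U1111", StringComparison.OrdinalIgnoreCase)
            || id.StartsWith("UTRN", StringComparison.OrdinalIgnoreCase))
        {
            return false;
        }

        // b) replace en dashes (U+2013) with ordinary hyphens
        cleaned = cleaned.Replace('\u2013', '-');

        // c) 14 character ids from source 100116 lack the 'ACTRN' prefix
        if (sourceId2 == 100116 && cleaned.Length == 14)
        {
            cleaned = "ACTRN" + cleaned;
        }

        // d) remove the redundant registry name from source 100118 ids
        //    (no trimming is done, which mirrors the SQL replace)
        if (sourceId2 == 100118)
        {
            cleaned = cleaned.Replace("Chinese Clinical Trial Register", "");
        }

        return true;
    }
}
```

Keeping each rule as a separate step, in a fixed order, matters because later rules (such as the length test in c) see the output of earlier ones.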

Once the data is cleaner it needs to be ordered properly, so that preferred and non-preferred ids are clearly identified. This is done by adding the collected data to the nk.temp_study_links_sorted table using two insert statements. The first takes the data that already has the less preferred source/id (i.e. the one with the higher preference rating) as the first two fields, and puts this data straight into the 'sorted' table. The second takes those where the less preferred source/id are in the last two fields and swaps them round before inserting the data.

        public void TransferLinksToSortedTable()
        {
            using (var conn = new NpgsqlConnection(connString))
            {
                // needs to be done twice to keep the ordering of sources correct
                // A lower rating means 'more preferred' - i.e. should be used in preference
                // Therefore lower rated source data should be in the 'preferred' fields
                // and higher rated data should be on the left hand side

                // Original data matches what is required

                string sql_string = @"INSERT INTO nk.temp_study_links_sorted(
                          source_id, sd_sid, preferred_sd_sid, preferred_source_id) 
                          SELECT t.source_1, t.sd_sid_1, t.sd_sid_2, t.source_2
                          FROM nk.temp_study_links_collector t
                          inner join nk.temp_preferences r1
                          on t.source_1 = r1.id
                          inner join nk.temp_preferences r2
                          on t.source_2 = r2.id
                          WHERE r1.preference_rating > r2.preference_rating";

                conn.Execute(sql_string);

                // Original data is the opposite of what is required - therefore switch

                sql_string = @"INSERT INTO nk.temp_study_links_sorted(
                          source_id, sd_sid, preferred_sd_sid, preferred_source_id) 
                          SELECT t.source_2, t.sd_sid_2, t.sd_sid_1, t.source_1
                          FROM nk.temp_study_links_collector t
                          inner join nk.temp_preferences r1
                          on t.source_1 = r1.id
                          inner join nk.temp_preferences r2
                          on t.source_2 = r2.id
                          WHERE r1.preference_rating < r2.preference_rating";

                conn.Execute(sql_string);
            }
        }

The sorted table will obviously have many duplicates - data that was entered originally as both source_1/sd_sid_1 <=> source_2/sd_sid_2, and source_2/sd_sid_2 <=> source_1/sd_sid_1, will now appear as two identical records: source_id/sd_sid <=> preferred_source_id/preferred_sd_sid. It is therefore necessary to generate a distinct version of the links data, as the basis for the next stage in processing. The nk.temp_distinct_links table will therefore contain all the definite links known between studies, with both the more 'preferred' and the less 'preferred' versions of each pair identified.

        public void CreateDistinctSourceLinksTable()
        {
            // The nk.temp_study_links_sorted table will have 
            // many duplicates... create a distinct version of the data

            using (var conn = new NpgsqlConnection(connString))
            {
                string sql_string = @"DROP TABLE IF EXISTS nk.temp_distinct_links;
                           CREATE TABLE nk.temp_distinct_links 
                           as SELECT distinct source_id, sd_sid, preferred_sd_sid, preferred_source_id
                           FROM nk.temp_study_links_sorted";

                conn.Execute(sql_string);
            }
        }
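
The ordering and de-duplication steps can also be pictured as a single in-memory operation. The sketch below is an illustration under stated assumptions rather than the MDR implementation: RawLink, SortedLink and LinkSortSketch are hypothetical names, and the ratings dictionary stands in for the nk.temp_preferences table (source id mapped to preference rating).

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory equivalents of the collector and sorted tables.
public record RawLink(int Source1, string SdSid1, string SdSid2, int Source2);
public record SortedLink(int SourceId, string SdSid, string PreferredSdSid, int PreferredSourceId);

public static class LinkSortSketch
{
    public static List<SortedLink> SortAndDeduplicate(
        IEnumerable<RawLink> rawLinks,
        IReadOnlyDictionary<int, int> ratings)
    {
        var sorted = new List<SortedLink>();
        foreach (var r in rawLinks)
        {
            // Like the inner joins in the SQL, skip links whose sources
            // have no preference rating.
            if (!ratings.TryGetValue(r.Source1, out int rating1)
                || !ratings.TryGetValue(r.Source2, out int rating2))
            {
                continue;
            }

            if (rating1 > rating2)
            {
                // Already the right way round: less preferred study on the left.
                sorted.Add(new SortedLink(r.Source1, r.SdSid1, r.SdSid2, r.Source2));
            }
            else if (rating1 < rating2)
            {
                // Opposite of what is required, so switch the two sides.
                sorted.Add(new SortedLink(r.Source2, r.SdSid2, r.SdSid1, r.Source1));
            }
            // Equal ratings match neither insert statement, so nothing is added.
        }

        // Records compare by value, so Distinct() removes identical links.
        return sorted.Distinct().ToList();
    }
}
```

A link recorded in both registries produces the same SortedLink twice, which is why the final Distinct() call is needed.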

Exclusion of One-to-Many links

An additional issue is that not all of the links identified are 'simple' One-to-One links. In some cases they are One-to-Many. In other words, rather than being the same study listed under different identifiers in different contexts, we have a single study in one source listed as two or more studies in another source. In these circumstances the MDR does not attempt to merge the study data. The system cannot know how data objects and attributes listed on the 'One' side could be allocated to the studies listed on the 'Many' side, so it keeps all studies that are in this sort of relationship as individual studies.
The MDR does record, however, the fact that these relationships exist, as a 'study_relationship' record. While some study relationships are between studies listed in the same data source, these are clearly relationships that cut across different sources. The task is therefore to identify this form of linkage and to extract it as study relationship data, removing it from the larger group of linked studies. The first stage is to establish a temporary table that can hold identified study groups, and then identify those groups. A 'group' in this context is two or more studies that are all in the same data source (have the same source id) that are all matched to a particular study. The matching could be at either the preferred or less preferred side of the data so two insert statements are required. Note that at this stage it is just the 'grouping' study and source that are identified (the 'One' side) - not the members of that group (the 'Many' side).

        public void IdentifyGroupedStudies()
        {
            // Set up a table to hold the grouping studies (the 'One' side 
            // of each group), which can come from either the LHS or the 
            // RHS of the distinct links table

            using (var conn = new NpgsqlConnection(connString))
            {
                string sql_string = @"DROP TABLE IF EXISTS nk.temp_grouping_studies;
                    CREATE TABLE nk.temp_grouping_studies 
                    (  
                        source_id   INT,
                        sd_sid      VARCHAR,
                        matching_source_id  INT,
                        side        VARCHAR
                    );";
                conn.Execute(sql_string);

                // Studies of interest have more than one matching study
                // within the SAME matching source registry.
                // Therefore group on one side, plus the source_id of the other

                sql_string = @"INSERT INTO nk.temp_grouping_studies
                             (source_id, sd_sid, matching_source_id, side)
                             SELECT source_id, sd_sid, 
                             preferred_source_id, 'L'
                             FROM nk.temp_distinct_links 
                             group by source_id, sd_sid, preferred_source_id
                             HAVING count(sd_sid) > 1;";
                conn.Execute(sql_string);

                sql_string = @"INSERT INTO nk.temp_grouping_studies
                             (source_id, sd_sid, matching_source_id, side)
                             SELECT preferred_source_id, preferred_sd_sid, 
                             source_id, 'R'
                             FROM nk.temp_distinct_links 
                             group by preferred_source_id, preferred_sd_sid, source_id
                             HAVING count(preferred_sd_sid) > 1;";
                conn.Execute(sql_string);
            }
        }

Having identified the grouping studies it is possible to extract the 'group members' - the studies that are linked to each of them. The code below does that, initially selecting the temp_distinct_links records to create the table, and then reversing the field selection for those groups where the grouping study is from the right hand preferred side.

        public void ExtractGroupedStudiess()
        {
            string sql_string;
            using (var conn = new NpgsqlConnection(connString))
            {
                // create a table that takes the rows from the linked 
                // studies table that match the 'L' grouping studies

                // The source_id side is the group and the preferred side 
                // is comprised of the grouped studies.

                sql_string = @"DROP TABLE IF EXISTS nk.temp_linked_studies;
                create table nk.temp_linked_studies as
                select k.* from
                nk.temp_distinct_links k 
                inner join nk.temp_grouping_studies g
                on k.source_id = g.source_id
                and k.sd_sid = g.sd_sid
                and k.preferred_source_id = g.matching_source_id
                where g.side = 'L';";

                conn.Execute(sql_string);

                // To retain the same arrangement of grouping study on 
                // the LHS the input data from the RHS has to be switched around

                sql_string = @"INSERT into nk.temp_linked_studies
                (source_id, sd_sid, preferred_source_id, preferred_sd_sid)
                select k.preferred_source_id, k.preferred_sd_sid, k.source_id, k.sd_sid from
                nk.temp_distinct_links k
                inner join nk.temp_grouping_studies g
                on k.preferred_source_id = g.source_id
                and k.preferred_sd_sid = g.sd_sid
                and k.source_id = g.matching_source_id
                where g.side = 'R'; ";

                conn.Execute(sql_string);

The temp_linked_studies table now holds all the grouped studies, with the grouping source/studies on the left hand side, and the grouped source/studies on the right. This data can now be used to create linked study group records with the correct relationship codes. Two complementary pairs of such relationships are possible:

  •   25 Includes target as one of a group of non-registered studies: This study includes <the target study>. That study is not registered independently, but instead shares this registry entry with one or more other non-registered studies.
  •   26 Non registered but included within a registered study group: This study is registered as <the target study>, along with one or more other studies that share the same registry entry and id.
  •   28 Includes target as one of a group of registered studies: This study includes <the target study>, which is registered elsewhere along with one or more other registered studies, forming a group that collectively equates to this study.
  •   29 Registered and is included elsewhere in group: This study is also registered, along with one or more other studies that together form an equivalent group, as <the target study>.

BioLINCC (id = 101900) and Yoda (id = 101901) are the only two non-registry study sources at the moment, and this is reflected in the SQL below. The relationship codes for these data repositories are different from those used for the trial registries.

                sql_string = @"INSERT INTO nk.linked_study_groups 
                             (source_id, sd_sid, relationship_id, 
                             target_sd_sid, target_source_id)
                             select distinct source_id, sd_sid, 
                             case when preferred_source_id = 101900 
                                  or preferred_source_id = 101901 then 25 
                             else 28 end, 
                             preferred_sd_sid, preferred_source_id 
                             from nk.temp_linked_studies;";
                conn.Execute(sql_string);

                sql_string = @"INSERT INTO nk.linked_study_groups 
                             (source_id, sd_sid, relationship_id, 
                             target_sd_sid, target_source_id)
                             select distinct preferred_source_id, preferred_sd_sid, 
                             case when source_id = 101900 
                                  or source_id = 101901 then 26 
                             else 29 end, 
                             sd_sid, source_id
                             from nk.temp_linked_studies;";
                conn.Execute(sql_string);
            }
        }

Finally, it is necessary to remove the grouped study data from the main set of temp_distinct_links, as shown below, and drop the temporary tables that have been used.

        public void DeleteGroupedStudyLinkRecords()
        {
            string sql_string; 
            using (var conn = new NpgsqlConnection(connString))
            {
                // Now need to delete these grouped records from the links table...

                sql_string = @"DELETE FROM nk.temp_distinct_links k
                               USING nk.temp_grouping_studies g
                               WHERE k.source_id = g.source_id
                               and k.sd_sid = g.sd_sid
                               and k.preferred_source_id = g.matching_source_id
                               and g.side = 'L';";
                conn.Execute(sql_string);

                sql_string = @"DELETE FROM nk.temp_distinct_links k
                               USING nk.temp_grouping_studies g
                               WHERE k.preferred_source_id = g.source_id
                               and k.preferred_sd_sid = g.sd_sid
                               and k.source_id = g.matching_source_id
                               and g.side = 'R';";
                conn.Execute(sql_string);

                sql_string = @"DROP TABLE IF EXISTS nk.temp_grouping_studies;
                DROP TABLE IF EXISTS nk.temp_linked_studies";
                conn.Execute(sql_string);

            }
        }
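
The separation of one-to-many links from simple one-to-one links can be summarised in a short in-memory sketch. It is a hedged illustration only, not the MDR code: the Link record and GroupingSketch class are hypothetical, and the sketch stops at splitting the links, without generating the relationship records.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory equivalent of a row in nk.temp_distinct_links.
public record Link(int SourceId, string SdSid, string PreferredSdSid, int PreferredSourceId);

public static class GroupingSketch
{
    // Splits the links into simple one-to-one links and links that
    // belong to a one-to-many group.
    public static (List<Link> Simple, List<Link> Grouped) Split(IReadOnlyList<Link> links)
    {
        // 'L' groups: a left hand study matched to more than one study
        // within the same preferred source.
        var leftKeys = links
            .GroupBy(k => (k.SourceId, k.SdSid, k.PreferredSourceId))
            .Where(g => g.Count() > 1)
            .Select(g => g.Key)
            .ToHashSet();

        // 'R' groups: a preferred study matched to more than one study
        // within the same less preferred source.
        var rightKeys = links
            .GroupBy(k => (k.PreferredSourceId, k.PreferredSdSid, k.SourceId))
            .Where(g => g.Count() > 1)
            .Select(g => g.Key)
            .ToHashSet();

        bool IsGrouped(Link k) =>
            leftKeys.Contains((k.SourceId, k.SdSid, k.PreferredSourceId))
            || rightKeys.Contains((k.PreferredSourceId, k.PreferredSdSid, k.SourceId));

        return (links.Where(k => !IsGrouped(k)).ToList(),
                links.Where(k => IsGrouped(k)).ToList());
    }
}
```

The Simple list corresponds to what remains in temp_distinct_links after the deletions, and the Grouped list to the rows that feed the study relationship records.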

Cascading Links

The linked study data requires further processing to ensure that it is complete. There are two types of problem that can arise when a study is found in 3 or more places, because there is no guarantee that all possible links were in the original data. Whilst the table identifies the more 'preferred' of two study ids, it does not necessarily identify the most preferred when there are three or more links. For instance a study could have identifiers in sources A, B and C, where A is less preferred than B, and B less preferred than C. In the source data there are links between A and B, and B and C, but no link exists between A and C, so the data shows A -> B and B -> C. For the aggregation process to work properly the data has to show A -> C and B -> C, i.e. all links should be to the most preferred id. The links have to be 'telescoped' together.
Before that problem can be tackled, however, there is the need to repair any 'missing' links. These occur if - in the situation above - A is linked to B and C, but no link exists between B and C. The data therefore has A -> B and A -> C but no B -> C, so the telescoping cannot take place. The first stage is therefore to make the set of links as comprehensive as possible, at the risk of creating some duplicates (but they can be easily removed) and then do the telescoping to ensure all links go all the way to the end of the chain, to show the most preferred option.
The first job is to identify the studies that have more than one 'preferred' option, this time cutting across more than one preferred source (e.g. A -> B, A -> C). The grouped studies have already been removed, so this should find only those with a potential 'missing link'. This list of studies is inserted into temp_studies_with_multiple_links. The process then creates the temp_missing_links table, adding the right hand side preferred fields, by joining this table with the records in temp_distinct_links, at the same time incorporating the preference ratings of the sources involved.

        public void ManageIncompleteLinks()
        {
            string sql_string;
            using (var conn = new NpgsqlConnection(connString))
            {
                sql_string = @"DROP TABLE IF EXISTS nk.temp_studies_with_multiple_links;
                       CREATE TABLE nk.temp_studies_with_multiple_links
                       as SELECT source_id, sd_sid
                       from nk.temp_distinct_links
                       group by source_id, sd_sid
                       having count(distinct preferred_source_id) > 1;";
                conn.Execute(sql_string);

                sql_string = @"DROP TABLE IF EXISTS nk.temp_missing_links;
                    CREATE TABLE nk.temp_missing_links as
                    select k.source_id, r1.preference_rating as source_rating, 
                    k.sd_sid, k.preferred_source_id, r2.preference_rating, k.preferred_sd_sid 
                    from nk.temp_distinct_links k
                    inner join nk.temp_studies_with_multiple_links m
                    on k.source_id = m.source_id
                    and k.sd_sid = m.sd_sid
                    inner join nk.temp_preferences r1
                    on k.source_id = r1.id
                    inner join nk.temp_preferences r2
                    on k.preferred_source_id = r2.id
                    order by k.source_id, k.sd_sid, preferred_source_id;";
                conn.Execute(sql_string);

                ...

Now a further temp table is created that will hold the links between studies B and C, which are currently both 'preferred' studies (both on the RHS of the table) for a particular source id / sd_sid study. This table (temp_new_links) has 6 fields. The first two are populated with the source id / sd_sid, to identify the record. The next pair holds the source / sd_sid pair that does NOT have the minimum source rating, i.e. the study that will need to be on the left hand side of the new link (B in the example above). Placeholders are put in for the new_preferred_source and new_preferred_sd_sid fields. The table is then updated, this time using the source / sd_sid pair that does have the minimum source rating, i.e. the study that will need to be on the right hand side of the new link (C in the example above), which goes into the new_preferred_ fields.
The four right hand fields now represent the missing links that were required (B -> C) and as such can be inserted into the temp_distinct_links table. The temporary tables can then be dropped.

                ...

                sql_string = @"DROP TABLE IF EXISTS nk.temp_new_links;
                CREATE TABLE nk.temp_new_links as
                select m.source_id, m.sd_sid, m.preferred_source_id as new_source_id, 
                m.preferred_sd_sid as new_sd_sid, 0 as new_preferred_source, '' as new_preferred_sd_sid from
                nk.temp_missing_links m
                inner join
                    (select source_id, sd_sid, min(preference_rating) as min_rating
                     from nk.temp_missing_links
                     group by source_id, sd_sid) mins
                on m.source_id = mins.source_id
                and m.sd_sid = mins.sd_sid
                and m.preference_rating <> mins.min_rating
                order by source_id, sd_sid;";
                conn.Execute(sql_string);

                // Update the last pair of fields in the temp_new_links table with the source / sd_sid 
                // that represents the study with the minimally rated source id, i.e. the 'correct' preferred option

                sql_string = @"UPDATE nk.temp_new_links k
                      SET new_preferred_source = min_set.preferred_source_id
                      , new_preferred_sd_sid = min_set.preferred_sd_sid
                      FROM
                          (select m.* from
                          nk.temp_missing_links m
                          INNER JOIN 
                                (select source_id, sd_sid, min(preference_rating) as min_rating
                                from nk.temp_missing_links
                                group by source_id, sd_sid) mins
                          on m.source_id = mins.source_id
                          and m.sd_sid = mins.sd_sid
                          and m.preference_rating = mins.min_rating) min_set
                      WHERE k.source_id = min_set.source_id
                      AND k.sd_sid = min_set.sd_sid;";
                conn.Execute(sql_string);

                // Insert the new links into the distinct_links table.
                // These links will need re-processing through the CascadeLinksInDistinctLinksTable() function.

                sql_string = @"INSERT INTO nk.temp_distinct_links
                     (source_id, sd_sid, preferred_sd_sid, preferred_source_id)
                     SELECT new_source_id, new_sd_sid, new_preferred_sd_sid, new_preferred_source from
                     nk.temp_new_links;";
                conn.Execute(sql_string);

                // drop the temp tables 
                sql_string = @"DROP TABLE IF EXISTS nk.temp_missing_links;
                DROP TABLE IF EXISTS nk.temp_new_links;";
                conn.Execute(sql_string);
            }
        }
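
The derivation of a missing link can also be illustrated outside the database. The following is a minimal in-memory sketch, not the system's own code: the source ids (101 to 103), the sd_sid values and the DeriveNewLinks helper are all hypothetical, and the tuples simply stand in for rows of temp_missing_links.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical rows of temp_missing_links: study X is linked to two 'preferred'
// studies, B and C. A lower rating means a more preferred source.
var missingLinks = new List<(int SourceId, string SdSid, int PrefSourceId, string PrefSdSid, int Rating)>
{
    (101, "X-001", 102, "B-001", 2),   // X -> B
    (101, "X-001", 103, "C-001", 1),   // X -> C (C has the minimum rating)
};

var newLinks = DeriveNewLinks(missingLinks);
foreach (var l in newLinks)
    Console.WriteLine($"{l.SourceId}/{l.SdSid} -> {l.PrefSourceId}/{l.PrefSdSid}");
// prints "102/B-001 -> 103/C-001"

// For each left hand study, the row with the minimum rating supplies the right hand
// side of the new link; every row with a higher rating supplies a left hand side.
List<(int SourceId, string SdSid, int PrefSourceId, string PrefSdSid)> DeriveNewLinks(
    List<(int SourceId, string SdSid, int PrefSourceId, string PrefSdSid, int Rating)> rows)
{
    return rows
        .GroupBy(r => (r.SourceId, r.SdSid))
        .SelectMany(g =>
        {
            var best = g.OrderBy(r => r.Rating).First();
            return g.Where(r => r.Rating != best.Rating)
                    .Select(r => (SourceId: r.PrefSourceId, SdSid: r.PrefSdSid,
                                  PrefSourceId: best.PrefSourceId, PrefSdSid: best.PrefSdSid));
        })
        .ToList();
}
```

The grouping and the comparison against the minimum rating correspond to the two SQL statements above; the sketch just collapses the 'create' and 'update' steps into a single pass.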

Having filled in any 'missing links', the system can tackle the telescoping of links. In this scenario (both A -> B and B -> C exist, but not A -> C) the same study (B in this case) appears both in the left hand side 'less preferred' columns of the temp_distinct_links table and in the 'more preferred' right hand side columns. The requirement is to replace the B on the right hand side with the 'true' most preferred study id, which is C. A self join on the temp_distinct_links table, linking studies on both 'sides' of the table, is used first to see whether any studies are both 'less' and 'more' preferred, and then to make the switch, replacing the preferred side data on the right of the table with the 'most preferred' values. In the example, A -> B is joined to B -> C, and A -> B becomes A -> C, 'telescoping' the links together.
Because a few studies are registered four times, a single pass is not always enough: the process has to be repeated until no study appears in both the less preferred and the more preferred columns. A while loop is therefore used to repeat the action as often as necessary, usually twice.
In some cases the telescoped link may already have been present, i.e. A -> B, B -> C and A -> C were all there. The process described above is still necessary to remove the A -> B link, but in these cases it results in duplicate records. Any duplicates therefore have to be removed at the end of the process.

        public void CascadeLinksInDistinctLinksTable()
        {
            using (var conn = new NpgsqlConnection(connString))
            {
                string sql_string;
                int match_number = 500;  // arbitrary start number
                while (match_number > 0)
                {
                    // get match number as number of link records where the rhs sd_sid
                    // appears elsewhere on the left...

                    sql_string = @"SELECT count(*)
                          FROM nk.temp_distinct_links t1
                          inner join nk.temp_distinct_links t2
                          on t1.preferred_source_id = t2.source_id
                          and t1.preferred_sd_sid = t2.sd_sid";

                    match_number = conn.ExecuteScalar<int>(sql_string);

                    if (match_number > 0)
                    {
                        // do the update

                        sql_string = @"UPDATE nk.temp_distinct_links t1
                          SET preferred_source_id = t2.preferred_source_id,
                          preferred_sd_sid = t2.preferred_sd_sid
                          FROM nk.temp_distinct_links t2
                          WHERE t1.preferred_source_id = t2.source_id
                          AND t1.preferred_sd_sid = t2.sd_sid";

                        conn.Execute(sql_string);
                    }
                }

                sql_string = @"DROP TABLE IF EXISTS nk.temp_distinct_links2;
                           CREATE TABLE nk.temp_distinct_links2 
                           as SELECT distinct * FROM nk.temp_distinct_links";

                conn.Execute(sql_string);

                sql_string = @"DROP TABLE IF EXISTS nk.temp_distinct_links;
                ALTER TABLE nk.temp_distinct_links2 RENAME TO temp_distinct_links;";

                conn.Execute(sql_string);
            }
        }
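
The effect of the loop can be seen on a small example. The sketch below is an illustrative in-memory analogue of the cascade, not the system's own code: study ids are reduced to single letters, the Telescope helper is hypothetical, and it assumes (as the loop above does) that the links contain no cycles such as A -> B together with B -> A.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical links, written as (study, preferred study):
// A -> B and B -> C, plus an A -> C link that is already present.
var links = new List<(string Study, string Preferred)>
{
    ("A", "B"),
    ("B", "C"),
    ("A", "C"),
};

var result = Telescope(links);
foreach (var l in result)
    Console.WriteLine($"{l.Study} -> {l.Preferred}");
// prints "A -> C" then "B -> C"

List<(string Study, string Preferred)> Telescope(List<(string Study, string Preferred)> input)
{
    var current = input.ToList();
    while (true)
    {
        // Snapshot of the left hand side, playing the role of t2 in the self join.
        var byStudy = current.ToLookup(l => l.Study, l => l.Preferred);

        // Count the links whose right hand study also appears on the left.
        int matchNumber = current.Count(l => byStudy.Contains(l.Preferred));
        if (matchNumber == 0) break;

        // Replace each such right hand study with that study's own preferred study.
        current = current
            .Select(l => byStudy.Contains(l.Preferred)
                ? (Study: l.Study, Preferred: byStudy[l.Preferred].First())
                : l)
            .ToList();
    }

    // The cascade can produce duplicates (here A -> C twice), so remove them.
    return current.Distinct().ToList();
}
```

For a chain of four registrations (A -> B, B -> C, C -> D) the same helper needs two updating passes before every study points at D, which is consistent with the loop usually running twice.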

Final Steps

The data is then transferred from the temporary table to the permanent study_study_links table. Again a select distinct is used, as a final check that any duplicates are removed (in case the preceding code is changed in the future). The nk.study_study_links table is then available to play a critical role during the aggregation process. Each study that is added is first checked against this table to see if it has a 'more preferred' form, in which case that form will already have been added to the core system. In that case, as described above, different rules are applied in adding the data.

        public void TransferNewLinksToDataTable()
        {
            // A distinct selection is required because the most recent
            // link cascade may have generated duplicates

            using (var conn = new NpgsqlConnection(connString))
            {
                string sql_string = @"Insert into nk.study_study_links
                      (source_id, sd_sid, preferred_sd_sid, preferred_source_id)
                      select distinct source_id, sd_sid, preferred_sd_sid, preferred_source_id
                      from nk.temp_distinct_links";

                conn.Execute(sql_string);
            }
        }
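
The check made against this table during aggregation amounts to a keyed lookup on the 'less preferred' study. The following sketch shows the idea in memory only; the dictionary contents, the ids and the TryGetPreferred helper are hypothetical, and the real system queries nk.study_study_links rather than holding the links in a dictionary.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical contents of nk.study_study_links, keyed on the less preferred study.
var studyLinks = new Dictionary<(int SourceId, string SdSid), (int SourceId, string SdSid)>
{
    [(102, "B-001")] = (103, "C-001"),
};

// Returns true, together with the preferred study, when the study being added
// is a duplicate of a study from a more preferred source.
bool TryGetPreferred((int SourceId, string SdSid) study, out (int SourceId, string SdSid) preferred)
    => studyLinks.TryGetValue(study, out preferred);

if (TryGetPreferred((102, "B-001"), out var p))
    Console.WriteLine($"Duplicate of {p.SourceId}/{p.SdSid}");
// prints "Duplicate of 103/C-001"
```

A study with no entry in the table is treated as new, and is added in the normal way.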

The various temporary tables that remain are removed.

        public void DropTempTables()
        {
            using (var conn = new NpgsqlConnection(connString))
            {
                string sql_string = @"DROP TABLE IF EXISTS nk.temp_preferences;
                DROP TABLE IF EXISTS nk.temp_study_links_collector;
                DROP TABLE IF EXISTS nk.temp_distinct_links;";
                conn.Execute(sql_string);
            }
        }

Note that at a later stage in the aggregation process, after all study data has been aggregated, the linked study data are added from the nk schema to the study_relationships table in the study (st) schema.

        public void AddStudyStudyRelationshipRecords()
        {
            // Use nk.all_ids_studies to look up the study Ids
            // for the linked sources / sd_sids, using
            // nk.linked_study_groups as the source

            using (var conn = new NpgsqlConnection(connString))
            {
                string sql_string = @"Insert into st.study_relationships
                      (study_id, relationship_type_id, target_study_id)
                      select s1.study_id, g.relationship_id, s2.study_id
                      from nk.linked_study_groups g
                      inner join nk.all_ids_studies s1
                      on g.source_id = s1.source_id
                      and g.sd_sid = s1.sd_sid
                      inner join nk.all_ids_studies s2
                      on g.target_source_id = s2.source_id
                      and g.target_sd_sid = s2.sd_sid";

                conn.Execute(sql_string);
            }
        }
    }