Difference between revisions of "Identifying Links between Studies"

Revision as of 17:29, 18 November 2020

One of the issues that has to be tackled during aggregation of data from different sources is the fact that the same study can be found, and in tens of thousands of cases is found, in more than one study based source, and that it will have a different persistent identifier in each source. Partly this is because studies can be registered in more than one trial registry, especially when local regulations mandate a registration for any study carried out within a particular country or region. This is especially the case with the EU, which insists all trials involving medicinal products must be registered in the EUCTR. About a third of these studies, however, are also registered in other registries, especially Clinicaltrials.gov. In addition, within a data repository, studies will usually be referenced by a local id rather than a pre-existing registry id.
Study titles cannot be relied upon to identify the same study in different source locations. A study title is often expressed slightly differently in different contexts, and cannot in any case be relied upon to be unique (even within the same source). It may be that further research will indicate how titles could be reframed (e.g. to a smaller number of keywords, expressed in a fixed order) to allow duplicate entries to be discovered using text, but for the moment the only easy way of doing this is by using the 'other identifiers' material found in the source data. Almost all sources contain this material, which usually include any other trial registry ids, (i.e. other than that used in the source registry entry), as well as ids assigned by the sponsor, funder or, sometimes, a regulatory authority.
These 'other registry ids' can be used to build up a table of study-study links which can then be used during the aggregation process to identify when duplicate studies are being added. In fact the creation of this table is always the first stage of any aggregation. This page describes this process in detail.
At the moment, with one exception, the requirement for identifying duplications only extends to studies - not data objects. The exception is provided by PubMed citations for journal papers, which can also be found multiple times in the source data, both within a single source and across multiple sources. The particular complications in processing PubMed data are described in Processing PubMed Data. The assumption for now is that other data objects are not duplicated across data sources, although this may need to be considered in the future. The rules for adding data objects are described more fully in Aggregating Data.

The Preferred Source concept

If details about a study and its attributes can be found in more than one data source, the obvious question is how should this data be merged in the final aggregated MDR database? In fact there are several aspects to this question:

How should 'single-occurrence' details about the study be merged? (i.e. the data points that appear in the study record itself, such as study display title, study type and status, enrolment target, min and max ages etc.)
How should study attributes (identifiers, titles, contributors, topics etc.) be merged?
How should data objects in different sources (trial registry entries, data and document references) be merged?
How should data object attributes be merged?

The first question is probably the most difficult. If the system allowed these basic study parameters to be merged from a variety of sources the problem of 'precedence' immediately arises: how could the system 'know' which source to use for each parameter, if they were available in both? For instance how would the system select the enrolment target from one repository rather than another? If the data is edited in one source but not another, should the most recent data always take precedence over the old? (it may be less complete than what is already there).
Because of these issues it was decided to

Only take the study single occurrence data items from a single data source
Use the idea of 'preferred sources' to order the precedence of sources in a consistent fashion.

All data sources are therefore given a number that indicate its place in this ordering - the exact number is not important, only the place it gives to the data source when they are listed, in ascending order, using this parameter. The number is stored within the sf.source_parameters table in the mon database, and is available as a property of the Source objects when they are retrieved from this table. During the aggregation process, the data sources are processed in order, most preferred first, down to the 'least preferred'. In general, more preferred sources have richer or at least more consistent data than less preferred sources. The most preferred source is ClincalTrials.gov, followed (in broad terms) by registries that are extracted individually, then the various trial registries using the WHO dataset, and then data repositories (e.g. Yoda, BioLINCC), and finally object based data sources (PubMed).
Although it cannot be guaranteed in every case, this usually means that the richer data is added to the system first, and that data coming from later, less preferred sources augments rather than replaces it. In general (and as described in more detail in Aggregating Data), during aggregation the core databases are recreated and then completely rebuilt by adding data from each source in turn. If a study does not already exist in the core system when it is 'presented' from a data source database, then it, its attributes, its associated data objects and their attributes are all added during the aggregation process. But if a study is added that already exists in the core system, because it has been added from a 'more preferred' source earlier in the process:

The single-occurrence study details are ignored
Study attributes are only added if they are definitely different from any that have already been added
Data objects are added (unless, very rarely, they can be seen to be already in the system)
Data object attributes are added if the data object records themselves are.

The Preferred source idea also plays a part in establishing the study-study link data. If a study has multiple source (registry / repository) identifiers in the system, each of those identifiers, other than the 'most preferred', will need to be linked to the identifier in its 'most preferred' source. That way the system can find the first instance of the study in the system (because it will be guaranteed to always be present in the core systems), if and when a 'less preferred' set of data for the study is presented. The study-study link data therefore needs to be ordered - it should relate each less preferred identifier to its most preferred twin. Exactly how this is done is explained below.

Initial Links Data Collection

The initial step is to collect in all the 'other identifier' information from the source databases.
Two temporary tables are established in the links (nk) schema of the mdr database. One simply collects the link data found, and the second is used to hold it once it has been ordered properly.

        public void SetUpTempLinkCollectorTable()
        {
            using (var conn = new NpgsqlConnection(connString))
            {
                string sql_string = @"DROP TABLE IF EXISTS nk.temp_study_links_collector;
                        CREATE TABLE nk.temp_study_links_collector(
                        source_1 int
                      , sd_sid_1 varchar
                      , sd_sid_2 varchar
                      , source_2 int) ";
                conn.Execute(sql_string);
            }
        }

        public void SetUpTempLinkSortedTable()
        {
            using (var conn = new NpgsqlConnection(connString))
            {
                string sql_string = @"DROP TABLE IF EXISTS nk.temp_study_links_sorted;
                        CREATE TABLE nk.temp_study_links_sorted(
                        source_id int
                      , sd_sid varchar
                      , preferred_sd_sid varchar
                      , preferred_source_id int) ";
                conn.Execute(sql_string);
            }
        }

The system then iterates through all the source, interrogating each of the associated databases in turn to bring back the 'other identifier' data, as sd_sid_2, and the id of the organisation assigning the identifier as source_2. That is matched against the source organisation (source_1) and the study's identifier sd_sid (sd_sid_1). The data is pre-filtered to include only identifiers associated with trial registries (and not those assigned from sponsors or funders, etc.). Repository data sources are included in this process but the repository identifiers themselves are not returned - so Yoda and BioLINCC (for example) identifiers appear as sd_sid_1, with a matching sd_sid_2 from a trial registry, but never as sd_sid_2. The data is brought back as an IEnumerable collection of StudyLink objects, which is then immediately stored in the temp_study_links_collector table using the PostgreSQLCopyHelper nuget package, which allows quick and easy storage of batches of data in one call. As the iteration across the sources proceeds, so this table gradually grows with the records found each time.

       foreach (Source s in sources)
       {
            // Fetch the study-study links and store them
            // in the Collector table 
            IEnumerable<StudyLink> links = slh.FetchLinks(s.id, s.database_name);
            slh.StoreLinksInTempTable(CopyHelpers.links_helper, links);
        }

        public IEnumerable<StudyLink> FetchLinks(int source_id, string database_name)
        {
            string conn_string = repo.GetConnString(database_name);

            using (var conn = new NpgsqlConnection(conn_string))
            {
                string sql_string = @"select " + source_id.ToString() + @" as source_1, 
                    sd_sid as sd_sid_1, 
                    identifier_value as sd_sid_2, identifier_org_id as source_2
                    from ad.study_identifiers
                    where identifier_type_id = 11
                    and identifier_org_id > 100115
                    and (identifier_org_id < 100133 or identifier_org_id = 101989)
                    and identifier_org_id <> " + source_id.ToString();
                return conn.Query<StudyLink>(sql_string);
            }
        }

        public ulong StoreLinksInTempTable(PostgreSQLCopyHelper<StudyLink> copyHelper, IEnumerable<StudyLink> entities)
        {
            using (var conn = new NpgsqlConnection(connString))
            {
                conn.Open();
                return copyHelper.SaveAll(conn, entities);
            }
        }

One of the difficulties in the process is that the other identifier data is rarely constrained in the source systems - it can therefore sometimes be poorly formatted or otherwise mis-entered. It is necessary to try and clean some of the data before proceeding further, using a succession of SQL statements - a few of which are shown below. In the examples shown, the SQL a) deletes WHO Universal trial numbers incorrectly entered as trial registry Ids, b) replaces 'n dashes' with ordinary hyphens, making later processing easier, c) updates entries indicating an Australian ACTRN identifier, that are not prefixed by 'ACTRN', and d) removes the redundant 'Chinese Clinical Trial Register' statement from Ids from that registry which include it. These are, however, just 4 examples from a total of over 30.

        public void TidyIds1()
        {
            string sql_string = "";
            using (var conn = new NpgsqlConnection(connString))
            {
                sql_string = @"DELETE from nk.temp_study_links_collector
                where sd_sid_2 ilike 'U1111%' or sd_sid_2 ilike 'UTRN%'";
                conn.Execute(sql_string);

                // replace n dashes
                sql_string = @"UPDATE nk.temp_study_links_collector
                set sd_sid_2 = replace(sd_sid_2, '–', '-');";
                conn.Execute(sql_string);

                sql_string = @"UPDATE nk.temp_study_links_collector
                SET sd_sid_2 = 'ACTRN' || sd_sid_2
                WHERE source_2 = 100116
                and length(sd_sid_2) = 14";
                conn.Execute(sql_string);

                sql_string = @"UPDATE nk.temp_study_links_collector
                set sd_sid_2 = replace(sd_sid_2, 'Chinese Clinical Trial Register', '')
                where source_2 = 100118;";
                conn.Execute(sql_string);
    
                ...
                ...

Once the data is cleaner it needs to be ordered properly, so that preferred and non-preferred ids are clearly identified. This is done by adding the collected data to the nk.temp_study_links_sorted table using two insert statements. The first takes the data that already has the less preferred source/id (i.e. has the higher preference rating) as the first two fields, and puts this data straight into the the 'sorted' table. The second takes those where the less preferred source/id are in the last two fields and swaps them round before inserting the data.

        public void TransferLinksToSortedTable()
        {
            using (var conn = new NpgsqlConnection(connString))
            {
                // needs to be done twice to keep the ordering of sources correct
                // A lower rating means 'more preferred' - i.e. should be used in preference
                // Therefore lower rated source data should be in the 'preferred' fields
                // and higher rated data should be on the left hand side

                // Original data matches what is required

                string sql_string = @"INSERT INTO nk.temp_study_links_sorted(
                          source_id, sd_sid, preferred_sd_sid, preferred_source_id) 
                          SELECT t.source_1, t.sd_sid_1, t.sd_sid_2, t.source_2
                          FROM nk.temp_study_links_collector t
                          inner join nk.temp_preferences r1
                          on t.source_1 = r1.id
                          inner join nk.temp_preferences r2
                          on t.source_2 = r2.id
                          WHERE r1.preference_rating > r2.preference_rating";

                conn.Execute(sql_string);

                // Original data is the opposite of what is required - therefore switch

                sql_string = @"INSERT INTO nk.temp_study_links_sorted(
                          source_id, sd_sid, preferred_sd_sid, preferred_source_id) 
                          SELECT t.source_2, t.sd_sid_2, t.sd_sid_1, t.source_1
                          FROM nk.temp_study_links_collector t
                          inner join nk.temp_preferences r1
                          on t.source_1 = r1.id
                          inner join nk.temp_preferences r2
                          on t.source_2 = r2.id
                          WHERE r1.preference_rating < r2.preference_rating";

                conn.Execute(sql_string);
            }
        }

The sorted table will obviously have many duplicates - data that was entered originally as both source_id_1/sd_sid_1 <=> source_id_2/sd_sid_2, and source_id_2/sd_sid_2 <=> source_id_1/sd_sid_1, will now appear as two identical records: source_id/sd_sid <=> pref_source_id/pref_sd_sid. It is therefore necessary to generate a distinct version of the links data, as the basis for the next stage in processing. The nk.temp_distinct_links table will therefore contain all the definite links known between studies, with both the more 'preferred' and the less 'preferred' versions of each pair identified.

        public void CreateDistinctSourceLinksTable()
        {
            // The nk.temp_study_links_sorted table will have 
            // many duplicates... create a distinct version of the data

            using (var conn = new NpgsqlConnection(connString))
            {
                string sql_string = @"DROP TABLE IF EXISTS nk.temp_distinct_links;
                           CREATE TABLE nk.temp_distinct_links 
                           as SELECT distinct source_id, sd_sid, preferred_sd_sid, preferred_source_id
                           FROM nk.temp_study_links_sorted";

                conn.Execute(sql_string);
            }
        }

Exclusion of One-to-Many links

// One set of relationships are not 'same study in a different registry'
        // but multiple studies in a different registry.
        // Such studies have a study relationship rather than being straight equivalents.
        // There can be multiple studies in the 'preferred' registry
        // or in the existing studies registry - each group being equivalent to
        // a registry entry that represents a single study, or sometimes a 
        // single project / programme, or grant

        public void IdentifyGroupedStudies()
        {
            // Set up a table to hold group definitions (i.e. the list 
            // of studies in each group, can be from the LHS or the RHS 
            // of the distinct links table

            using (var conn = new NpgsqlConnection(connString))
            {
                string sql_string = @"DROP TABLE IF EXISTS nk.temp_grouping_studies;
                    CREATE TABLE nk.temp_grouping_studies 
                    (  
                        source_id   INT,
                        sd_sid      VARCHAR,
                        matching_source_id  INT,
                        side        VARCHAR
                    );";
                conn.Execute(sql_string);

                // Studies of interest have more than one matching study
                // within the SAME matching source registry.
                // Therefore group on one side, plus the source_id of the other

                sql_string = @"INSERT INTO nk.temp_grouping_studies
                             (source_id, sd_sid, matching_source_id, side)
                             SELECT source_id, sd_sid, 
                             preferred_source_id, 'L'
                             FROM nk.temp_distinct_links 
                             group by source_id, sd_sid, preferred_source_id
                             HAVING count(sd_sid) > 1;";
                conn.Execute(sql_string);

                sql_string = @"INSERT INTO nk.temp_grouping_studies
                             (source_id, sd_sid, matching_source_id, side)
                             SELECT preferred_source_id, preferred_sd_sid, 
                             source_id, 'R'
                             FROM nk.temp_distinct_links 
                             group by preferred_source_id, preferred_sd_sid, source_id
                             HAVING count(preferred_sd_sid) > 1;";
                conn.Execute(sql_string);
            }
        }

        public void ExtractGroupedStudiess()
        {
            string sql_string;

            using (var conn = new NpgsqlConnection(connString))
            {
                // create a table that takes the rows from the linked 
                // studies table that match the 'L' grouping studies

                // The source_id side is the group and the preferred side 
                // is comprised of the grouped studies.

                sql_string = @"DROP TABLE IF EXISTS nk.temp_linked_studies;
                create table nk.temp_linked_studies as
                select k.* from
                nk.temp_distinct_links k 
                inner join nk.temp_grouping_studies g
                on k.source_id = g.source_id
                and k.sd_sid = g.sd_sid
                and k.preferred_source_id = g.matching_source_id
                where g.side = 'L';";

                conn.Execute(sql_string);

                // To retain the same arrangement of grouping study on 
                // the LHS the input data from the RHS has to be switched around

                sql_string = @"INSERT into nk.temp_linked_studies
                (source_id, sd_sid, preferred_source_id, preferred_sd_sid)
                select k.preferred_source_id, k.preferred_sd_sid, k.source_id, k.sd_sid from
                nk.temp_distinct_links k
                inner join nk.temp_grouping_studies g
                on k.preferred_source_id = g.source_id
                and k.preferred_sd_sid = g.sd_sid
                and k.source_id = g.matching_source_id
                where g.side = 'R'; ";

                conn.Execute(sql_string);

                // Put this data into the permanent linked_study_groups table
                // The study relationships are 
                // 25	Includes target as one of a group of non-registered studies 
                // This study includes<the target study>.That study is not registered independently, 
                // but instead shares this registry entry with one or more other non-registered studies.
                // 26	Non registered but included within a registered study group 
                // This study is registered as <the target study>, along with one or more other studies
                // that share the same registry entry and id.
                // 28	Includes target as one of a group of registered studies 
                // This study includes <the target study>, which is registered elsewhere along with one 
                // or more other registered studies, forming a group that collectively equates to this study.
                // 29	Registered and is included elsewhere in group 
                // This study is also registered, along with one or more other studies that together form an
                // equivalent group, as <the target study>.

                sql_string = @"INSERT INTO nk.linked_study_groups 
                             (source_id, sd_sid, relationship_id, 
                             target_sd_sid, target_source_id)
                             select distinct source_id, sd_sid, 
                             case when preferred_source_id = 101900 
                                  or preferred_source_id = 101901 then 25 
                             else 28 end, 
                             preferred_sd_sid, preferred_source_id 
                             from nk.temp_linked_studies;";
                conn.Execute(sql_string);

                sql_string = @"INSERT INTO nk.linked_study_groups 
                             (source_id, sd_sid, relationship_id, 
                             target_sd_sid, target_source_id)
                             select distinct preferred_source_id, preferred_sd_sid, 
                             case when source_id = 101900 
                                  or source_id = 101901 then 26 
                             else 29 end, 
                             sd_sid, source_id
                             from nk.temp_linked_studies;";
                conn.Execute(sql_string);
            }
        }

        public void DeleteGroupedStudyLinkRecords()
        {
            string sql_string; 
            using (var conn = new NpgsqlConnection(connString))
            {
                // Now need to delete these grouped records from the links table...

                sql_string = @"DELETE FROM nk.temp_distinct_links k
                               USING nk.temp_grouping_studies g
                               WHERE k.source_id = g.source_id
                               and k.sd_sid = g.sd_sid
                               and k.preferred_source_id = g.matching_source_id
                               and g.side = 'L';";
                conn.Execute(sql_string);

                sql_string = @"DELETE FROM nk.temp_distinct_links k
                               USING nk.temp_grouping_studies g
                               WHERE k.preferred_source_id = g.source_id
                               and k.preferred_sd_sid = g.sd_sid
                               and k.source_id = g.matching_source_id
                               and g.side = 'R';";
                conn.Execute(sql_string);

                sql_string = @"DROP TABLE IF EXISTS nk.temp_grouping_studies;
                DROP TABLE IF EXISTS nk.temp_linked_studies";
                conn.Execute(sql_string);

            }
        }

Cascading Links

public void ManageIncompleteLinks()
        {

            // There are a set if links that may be missing, in the sense that
            // Study A is listed as being the same as Study B and Study C, but no
            // link exists beteween either Study B to C, or Study C to B.
            // The 'link path' is therefore broken and the B to C link needs to be added.
            // These studies have two, occasionally more, 'preferred studies', which
            // does not make sense in the system.

            // First create a table with these 'missing link' records

            // Working from the inside out, this query
            // a) gets the source id/sd_sids of the LHS of the study links table
            //    that has more than one 'preferred' study associated with it (dataset d)
            // b) takes those records and identifies the distinct preferred source ids
            //    that are linked to each RHS study (dataset a)
            // c) Further identifies the records that have more than one 
            //    source referenced on the RHS (so all the linked records have the 
            //    'impossible' property of having more than one preferred 
            //    source / sd_sid study record (dataset agg)
            // d) joins that dataset back to the linked records table, to 
            //    identify the source records that meet the criteria of 
            //    having a 'missing link'

            string sql_string;
            using (var conn = new NpgsqlConnection(connString))
            {
                // First identify the studies that have more than one 'preferred' option
                // cutting across more than one source. 
                // Groups have already been removed, so this should find only those
                // with the 'missing link'.

                sql_string = @"DROP TABLE IF EXISTS nk.temp_studies_with_multiple_links;
                       CREATE TABLE nk.temp_studies_with_multiple_links
                       as SELECT source_id, sd_sid
                       from nk.temp_distinct_links
                       group by source_id, sd_sid
                       having count(distinct preferred_source_id) > 1;";
                conn.Execute(sql_string);

                sql_string = @"DROP TABLE IF EXISTS nk.temp_missing_links;
                    CREATE TABLE nk.temp_missing_links as
                    select k.source_id, r1.preference_rating as source_rating, 
                    k.sd_sid, k.preferred_source_id, r2.preference_rating, k.preferred_sd_sid 
                    from nk.temp_distinct_links k
                    inner join nk.temp_studies_with_multiple_links m
                    on k.source_id = m.source_id
                    and k.sd_sid = m.sd_sid
                    inner join nk.temp_preferences r1
                    on k.source_id = r1.id
                    inner join nk.temp_preferences r2
                    on k.preferred_source_id = r2.id
                    order by k.source_id, k.sd_sid, preferred_source_id;";
                conn.Execute(sql_string);

                // Create a further temp table that will hold the links between studies B and C,
                // which are currently both 'preferred' studies (both on the RHS of the table)
                // for any particular source id / sd_sid study.

                // This table has 6 fields - two initially populated with the source id / sd_sid, 
                // to identify the record, and the next pair that have the source / sd_sid pair
                // that does NOT have the minimum source rating, i.e. is the study that will need 
                // to be 'existing studies' in the new link

                sql_string = @"DROP TABLE IF EXISTS nk.temp_new_links;
                CREATE TABLE nk.temp_new_links as
                select m.source_id, m.sd_sid, m.preferred_source_id as new_source_id, 
                m.preferred_sd_sid as new_sd_sid, 0 as new_preferred_source, '' as new_preferred_sd_sid from
                nk.temp_missing_links m
                inner join
                    (select source_id, sd_sid, min(preference_rating) as min_rating
                     from nk.temp_missing_links
                     group by source_id, sd_sid) mins
                on m.source_id = mins.source_id
                and m.sd_sid = mins.sd_sid
                and m.preference_rating <> mins.min_rating
                order by source_id, sd_sid;";
                conn.Execute(sql_string);

                // Update the last pair of tables in ther temp_new_links table with the source / sd_sid 
                // that represents the study with the minimally rated source id, i.e. the 'correct' preferred option

                sql_string = @"UPDATE nk.temp_new_links k
                      SET new_preferred_source = min_set.preferred_source_id
                      , new_preferred_sd_sid = min_set.preferred_sd_sid
                      FROM
                          (select m.* from
                          nk.temp_missing_links m
                          INNER JOIN 
                                (select source_id, sd_sid, min(preference_rating) as min_rating
                                from nk.temp_missing_links
                                group by source_id, sd_sid) mins
                          on m.source_id = mins.source_id
                          and m.sd_sid = mins.sd_sid
                          and m.preference_rating = mins.min_rating) min_set
                      WHERE k.source_id = min_set.source_id
                      AND k.sd_sid = min_set.sd_sid;";
                conn.Execute(sql_string);

                // Insert the new links into the distinct_links table.
                // These links will need re-processing through the CascadeLinksTable() function.

                sql_string = @"INSERT INTO nk.temp_distinct_links
                     (source_id, sd_sid, preferred_sd_sid, preferred_source_id)
                     SELECT new_source_id, new_sd_sid, new_preferred_sd_sid, new_preferred_source from
                     nk.temp_new_links;";
                conn.Execute(sql_string);

                // drop the temp tables 
                sql_string = @"DROP TABLE IF EXISTS nk.temp_missing_links;
                DROP TABLE IF EXISTS nk.temp_new_links;";
                conn.Execute(sql_string);
            }
        }

        public void CascadeLinksInDistinctLinksTable()
        {
            using (var conn = new NpgsqlConnection(connString))
            {
                // telescope the preferred links to the most preferred
                // i.e. A -> B, B -> C becomes A -> C, B -> C
                // do as long as there remains links to be telescoped
                // (a few have to be done twice)

                string sql_string;
                int match_number = 500;  // arbitrary start number
                while (match_number > 0)
                {
                    // get match number as number of link records where the rhs sd_sid
                    // appears elsewhere on the left...

                      sql_string = @"SELECT count(*) 
                          FROM nk.temp_distinct_links t1
                          inner join nk.temp_distinct_links t2
                          on t1.preferred_source_id = t2.source_id
                          and t1.preferred_sd_sid = t2.sd_sid";

                    match_number = conn.ExecuteScalar<int>(sql_string);

                    if (match_number > 0)
                    {
                        // do the update

                        sql_string = @"UPDATE nk.temp_distinct_links t1
                          SET preferred_source_id = t2.preferred_source_id,
                          preferred_sd_sid = t2.preferred_sd_sid
                          FROM nk.temp_distinct_links t2
                          WHERE t1.preferred_source_id = t2.source_id
                          AND t1.preferred_sd_sid = t2.sd_sid";

                        conn.Execute(sql_string);
                    }
                }

                // but in some cases the telescoped link may have already been 
                // present - i.e. a study has two other reg identifiers
                // one of which will be the most preferred
                // Process above will result in duplicates in these cases
                // and these duplicates therefore need to be removed.

                sql_string = @"DROP TABLE IF EXISTS nk.temp_distinct_links2;
                           CREATE TABLE nk.temp_distinct_links2 
                           as SELECT distinct * FROM nk.temp_distinct_links";

                conn.Execute(sql_string);

                sql_string = @"DROP TABLE IF EXISTS nk.temp_distinct_links;
                ALTER TABLE nk.temp_distinct_links2 RENAME TO temp_distinct_links;";

                conn.Execute(sql_string);
            }
        }

Final Steps

public void TransferNewLinksToDataTable()
        {
            // A distinct selection is required because the most recent
            // link cascade may have generated duplicates

            using (var conn = new NpgsqlConnection(connString))
            {
                string sql_string = @"Insert into nk.study_study_links
                      (source_id, sd_sid, preferred_sd_sid, preferred_source_id)
                      select distinct source_id, sd_sid, preferred_sd_sid, preferred_source_id
                      from nk.temp_distinct_links";

                conn.Execute(sql_string);
            }
        }

        public void DropTempTables()
        {
            using (var conn = new NpgsqlConnection(connString))
            {
                string sql_string = @"DROP TABLE IF EXISTS nk.temp_preferences;
                DROP TABLE IF EXISTS nk.temp_study_links_collector;
                DROP TABLE IF EXISTS nk.temp_distinct_links;";
                conn.Execute(sql_string);
            }
        }

        public void AddStudyStudyRelationshipRecords()
        {
            // Use the study_all_ids to insert the study Ids
            // for the linked sources / sd_sids, using 
            // nk.linked_study_groups as the source

            using (var conn = new NpgsqlConnection(connString))
            {
                string sql_string = @"Insert into st.study_relationships
                      (study_id, relationship_type_id, target_study_id)
                      select s1.study_id, g.relationship_id, s2.study_id
                      from nk.linked_study_groups g
                      inner join nk.all_ids_studies s1
                      on g.source_id = s1.source_id
                      and g.sd_sid = s1.sd_sid
                      inner join nk.all_ids_studies s2
                      on g.target_source_id = s2.source_id
                      and g.target_sd_sid = s2.sd_sid";

                conn.Execute(sql_string);
            }
        }

    }

@@ Line 170: / Line 170: @@
 </pre>
 </div>
-The sorted table will obviously have many duplicates - data that was entered originally as both source_id_1/sd_sid_1<==>source_id_2/sd_sid_2, ''and'' source_id_2/sd_sid_2<==>source_id_1/sd_sid_1, will now appear as two identical records: source_id/sd_sid<==>pref_source_id/pref_sd_sid. It is therefore necessary to generate a distinct version of the links data, as the basis for the next stage in processing. The nk.temp_distinct_links table will therefore contain all the definite links known between studies, with both the more 'preferred' and the less 'preferred' versions of each pair identified.
+The sorted table will obviously have many duplicates - data that was entered originally as both source_id_1/sd_sid_1 <=> source_id_2/sd_sid_2, ''and'' source_id_2/sd_sid_2 <=> source_id_1/sd_sid_1, will now appear as two identical records: source_id/sd_sid <=> pref_source_id/pref_sd_sid. It is therefore necessary to generate a distinct version of the links data, as the basis for the next stage in processing. The nk.temp_distinct_links table will therefore contain all the definite links known between studies, with both the more 'preferred' and the less 'preferred' versions of each pair identified.
 <div style="font-family: monospace; font-size: 13px" >
 <pre>

Difference between revisions of "Identifying Links between Studies"

Revision as of 17:29, 18 November 2020

Contents

Introduction

The Preferred Source concept

Initial Links Data Collection

Exclusion of One-to-Many links

Cascading Links

Final Steps

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

The Project

Metadata schemas

Data Structures

Data Extraction

The Portal

Help and F.A.Q.