Missing PIDs and Hashing

From ECRIN-MDR Wiki
Jump to navigation Jump to search

The problems

There are two fundamental problems: the lack of simple persistent identifiers (PIDs) for most data objects, and the lack of any identifiers for both study and object attributes.
Most but not all data objects in the system have an associated URL, but that link cannot be guaranteed to be permanent. It is also true that most (but not all) journal articles have a Pubmed id and / or a doi, but journal articles are handled by the system differently from how most other data objects are processed. They are linked to studies after extraction, while most other objects are linked, if only implicitly, to studies in the sources (trial registries, data repositories) where they are found during extraction. Although the object type could be used - in most cases - to distinguish the objects themselves, and so match objects when importing and comparing data, the type cannot be absolutely guaranteed to be unique, for a particular study, and a composite (parent study + object type) identifier mechanism would be a relatively clumsy two-field mechanism for identifying objects. The need is for a relatively simple mechanism for generating data object identifiers that can be guaranteed to be unique and persistent, or at least as persistent as possible.
Study and object attributes - identifiers, titles, instances, dates, topics etc. - have no associated intrinsic identifiers, and here the problem is worse. The problem, especially when web scraping is used, is that there is no guarantee that attributes will be processed in the same order as before, or will be present in the same numbers, which means comparing one unordered set of records with another to identify the new, edited and (more rarely but not impossible) deleted items.

Imagine a study with 3 identifier attributes a, b and c. a is a registry identifier and b and c are sponsor ids. On the next download a funder identifier entry ‘d’ is found, which is the second in the list, while at the same time identifer c has been edited, to become 'e' – the list is now a, d, b and e. The two sets of study identifiers can be compared and a match can be found between the two versions of a and b. But d and e are new to the system and c can no longer be matched. One of d or e corresponds to the pre-edited version of c (see figure 1), but which one?

Figure 1: Data collection data flows
Figure 1: Two sets of attributes need to be compared, but without identifiers there is no way of matching records for comparison.
Records which are identical in each set can be matched, but for objects that are not matched
it is impossible, in general, to determine their origin or relationship with objects in the previous set.

A human could look at the data and, in most cases, seeing that c and e are ‘similar’ can deduce that c has been edited into e, and d is completely new. But a machine cannot do that without a complex scoring / weighting algorithm, that would need to take into account every field in the record, and even then could only make an estimate rather than being 100% certain (the more extensive the editing, the closer the record that results looks like a completely new addition). The result is that the system has to
a) Mark both d and e as new attributes and add them both to the system
b) Mark c as no longer relevant, and ‘retire’ it from the system.
In other words edits have to be handled explicitly as a combination of insert and delete. It is true that for some types of attribute sets (e.g. object dates), you could make assumptions about the data and say, for example, that a data object would only ever have one date of a certain type. But this mechanism is only possible for some types of attribute (for example studies may have multiple identifiers of the same type) and assumptions about uniqueness of type are dangerous to make with very large data sets, often with poorly managed data. The need is therefore to find an efficient method for identifying changes in sets of attributes, and for actioning the changes that are required.

Using hashes for Object PIDs

The object PID problem, for objects in study based data sources can be solved by hashing together, using the common MD5 hashing algorithm, a few key fields of the data object, with those fields representing a unique combination of values. The fields selected are

  • the source data identifier of the 'parent' study - usually a trial registry id. Because different registries use different formats and prefixes for their ids these can, at the moment, be guaranteed to be globally unique - though the requirement is only that they are unique within a single source.
  • the display title of the object. Very few data objects have explicit titles, and even when they do exist it is not possible to guarantee that they are unique. The approach to creating an object's display title is therefore to concatenate the study title (the short, public title) with the object's name or, more normally, its type, with ' :: ' used to separate the two components. Where more than one object of the same type exists, linked to the same study - which is relatively rare but which can occur, for example, if different versions of a protocol have been uploaded, or multiple datasets are made available, it is necessary to use some mechanism - often by using a file name - to ensure that the display name remains unique. The display titles are created as part of the initial data harvesting, so that process needs to include the necessary code to ensure title uniqueness. As an extra check, if data objects are added to an existing study from a different source, because they 'belong' to the same study listed in another registry or repository (with the exception of journal papers and registry entries) they are not added if they have the same display title as an existing object.

For example, during the harvesting of XML files from Clinicaltrials.gov, one possible linked data object is a registry based results summary. The system first creates the object's display title using the study title (title_base) and the object type, and then hashes this together with the study id (sid):

       object_display_title = title_base + " :: CTG Results entry";
       sd_oid = HashHelpers.CreateMD5(sid + object_display_title);

While the CreateMD5 function itself (a static helper function) is:

        public static string CreateMD5(string input)
        {
            // Use input string to calculate MD5 hash
            using (MD5 md5 = MD5.Create())
            {
                byte[] inputBytes = System.Text.Encoding.ASCII.GetBytes(input);
                byte[] hashBytes = md5.ComputeHash(inputBytes);

                // return as base64 string
                // 16 bytes = (5*4) characters + XX==, 
                // 24 rather than 32 hex characters
                return Convert.ToBase64String(hashBytes);
            }
        }

Note that while the MD5 function itself returns 32 hex characters the functions shortens this a little to 24 characters (the last always being '==' becasue only 22 are actually required), as a base 64 string. The function therefore produces a relatively short single-field identifier for all study-linked data objects, which for all practical purposes is unique.

Using Hashes for comparison I

Hashes can also be used to efficently compare both the main entities (studies and data objects), and study and object attributes, between successive data imports. It is first necessary to be clear about what ‘comparing’ two records means, and how equality (or its converse, an edit) of records is defined. The system could compare every field in each record, but in fact only some fields are relevant to the record’s 'essential content' and need to be included in the comparison. Id fields, whether from the source data or generated as described above, a foreign id that link an attribute record to its parent entity, and audit fields that date the record, do not need to be compared, just the fields that make up the record's 'data payload' (see figure 2). Because the ad records only have the codes for categorised questions, and not the text decode (whereas the sd equivalent tables have both), only the code needs to be included for comparison purposes for these question types.

Data fields in record.png
Figure 2: Most records in a database have a core 'payload' of data fields, surrounded by identifiers and audit fields.
If an id is not available or not constant, then the payload fields need to be used to identify and match the record.

Rather than comparing each field in turn, however, a process that would be different for each record type, it is easier to first ‘roll up’ the relevant values into a single string and then compare only that field. If the values are equal the records are equal, in the sense that their content-containing fields are the same. If they are unequal there is a difference in one or more of those fields, and the new record needs to replace the old. The easiest way of ‘rolling up’ a set of values is to create a hash value from those values. Each table in the sd and ad schemas therefore includes a hash_record field. Again using the md5 algorithm, a string of 32 hex characters (=128 bits) is created whatever the length of the source string. The hash value is almost certain to be unique – there are 2^128 or 3.4 X 10^38 possible values. Using the text itself could result in very long strings that would take much longer to compare.

This process is most easily carried out in postgres itself, first using the json_build_array function to turn a subset of field values in a record into a json array, then casting that array into a string, and then hashing that string using the built in md5 function. For instance, to produce a hash for the study identifiers records and store it in a field called record_hash, the command below can be used:

        public void create_study_identifier_hashes()
        {
            string sql_string = @"Update sd.study_identifiers
              set record_hash = md5(json_build_array(identifier_value, identifier_type_id, identifier_org_id,
              identifier_org, identifier_date, identifier_link)::varchar)";

            h.ExecuteHashSQL(sql_string, "study_identifiers");
        }

This provides a very simple way of creating a summary record_hash field for every record in the sd tables, later to be transferred into the ad tables. The main difficulty is that hashing is computationally expensive and with very large tables the system sometimes 'times out'. Along with the need to provide some periodic feedback to the calling program and / or user, it is therefore useful to split up the hashing process. Hence the call to ExecuteHashSQL. This function - for very large tables - splits the records to be hashed up into batches of 500,000, and constructs the freedback message(s) accordingly. It does so by adding an additional where condition to the sql string passed into it. The function is shown below...

        public void ExecuteHashSQL(string sql_string, string table_name)
        {
            try
            {
                int rec_count = GetRecordCount(table_name);
                int rec_batch = 500000;
                // int rec_batch = 10000;  // for testing 
                if (rec_count > rec_batch)
                {
                    for (int r = 1; r <= rec_count; r += rec_batch)
                    {
                        string batch_sql_string = sql_string + " where id >= " + r.ToString() + " and id < " + (r + rec_batch).ToString();
                        ExecuteSQL(batch_sql_string);
                        string feedback = "Creating " + table_name + " hash codes, " + r.ToString() + " to ";
                        feedback += (r + rec_batch < rec_count) ? (r + rec_batch - 1).ToString() : rec_count.ToString();
                        StringHelpers.SendFeedback(feedback);
                    }
                }
                else
                {
                    ExecuteSQL(sql_string);
                    StringHelpers.SendFeedback("Creating " + table_name + " hash codes - as a single batch");
                }
            }
            catch (Exception e)
            {
                string res = e.Message;
                StringHelpers.SendError("In hash creation (" + table_name + "): " + res);
            }
        }

Using Hashes for comparison II

In addition, for a group of related records, the hash values can themselves be rolled up to form a hash that corresponds to the total content of that group. This allows a group of records to be easily compared. For example, to create a hash for the whole group of study identifiers listed for each study, the following code can be used (sd_sid is the source data study id, usually the registry id, the hash_type_id and hash type are 11 and 'identifiers' respectively, in this instance, and the table_name would be 'study_identifiers'):

       public void create_composite_study_hashes(int hash_type_id, string hash_type, string table_name)
        {
            string top_sql_string = @"Insert into sd.study_hashes 
                    (sd_sid, hash_type_id, hash_type, composite_hash)
                    select t.sd_sid, " + hash_type_id.ToString() + ", '" + hash_type + @"',  
                    md5(to_json(array_agg(t.record_hash ORDER BY t.record_hash))::varchar)
                    from sd." + table_name;

            h.CreateCompositeStudyHashes(top_sql_string, hash_type_id, hash_type);
        }

This function can be called for all study attribute types by changing the parameters, and a similar function exists for data object attribute tables. Again it is necessary to split up very large tables because the process is time consuming - the CreateCompositeStudyHashes function is shown below, and again uses the ids in the source to split up records into batches (this time of only 10,000) where necessary.

       public void CreateCompositeStudyHashes(string top_sql_string, int hash_type_id, string hash_type)
        {
            try
            {
                int rec_count = GetRecordCount("studies");
                int rec_batch = 10000;
                //int rec_batch = 1000;  // for testing 
                if (rec_count > rec_batch)
                {
                    for (int r = 1; r <= rec_count; r += rec_batch)
                    {
                        string where_sql_string = " where s.id >= " + r.ToString() + " and s.id < " + (r + rec_batch).ToString();

                        string batch_sql_string = top_sql_string + @" t 
                                 inner join sd.studies s 
                                 on s.sd_sid = t.sd_sid 
                                 " + where_sql_string + " group by t.sd_sid;";
                        ExecuteSQL(batch_sql_string);
                        string feedback = "Creating composite study hash codes (" + hash_type + "), " + r.ToString() + " to ";
                        feedback += (r + rec_batch < rec_count) ? (r + rec_batch - 1).ToString() : rec_count.ToString();
                        StringHelpers.SendFeedback(feedback);
                    }
                }
                else
                {
                    string sql_string = top_sql_string + @" t group by t.sd_sid;";
                    ExecuteSQL(sql_string);
                    StringHelpers.SendFeedback("Creating composite study hash codes (" + hash_type + ") as a single batch");
                }

            }
            catch (Exception e)
            {
                string res = e.Message;
                StringHelpers.SendError("In study composite hash creation: " + res);
            }
        }

The table that results is illustarated below, for the first few studies listed in the Iranian Trial Registry

Study hashes table.png
Figure 3: Initial records of the study_hashes table, for IRCT.


With this hash it is now possible to use a single comparison to see if any of the identifier data for a study has changed. The same is true of all other attributes, for studies and data objects. The process can be extended once more, to roll up all of the composite hashes for each study or data object, plus the record hash for the study or data object record itself, to produce a 'full_hash' for the whole entity. Comparing these full hashes allows an edit in any component of the study or data object data to be detected, because the full hash only remains the same if no changes at all have taken place in the 'essential data' carried by each record. If the full hashes are different, then the entity record hashes can be compared, and then the attribute composite hashes, to see where the editg(s) have occurred. This multiple hash roll-up process is shown in figure 4.

Hash roll ups3.png
Figure 4: The role of hashed values in the MDR system, for supporting detection of changes. Hashes are constructed as ids for data objects
– but hashes are also constructed for whole records and used to detect changes in those records, and groups of related records, and finally in the entity (study or data object) as a whole.


In summary, at the end of each data extraction for each source, the system

  • Creates a hash equivalent of the (non-id, non-audit) content of all study, study attribute, data object and data object attribute records, in all tables. In all cases this hash is stored in a field called ‘record_hash'.
  • Creates a hash for each group of multiple attributes, e.g. for each study – a hash representing all identifiers, a hash for all titles, a hash for all study features, etc. For each data object: a hash for all titles, a hash for all instances, a hash for all dates, etc.
  • The ‘composite hashes’ are stored in separate tables (one for study hashes, the other for data object hashes) that reference the parent study or data object, indicate the type of composite hash, and the hash value itself.
  • The composite hash values for each study are rolled up with the hash for the study record itself to form a composite hash for the study and all of its components, a ‘study full hash’. It is stored in the Studies table.
  • Similarly, the composite hash values for each data object are rolled up with the hash for the data object itself to form a composite hash for the data object and all of its components, a ‘data object full hash’. It is stored in the data objects table.

All of the hashes need to be transferred with the data records from the sd (session data) tables to the ad (accumulated data) tables whenever that transfer takes place, i.e. after the comparison process has identified the new or revised data. The details of the way in which these has values are compared and used in the import process is described in more detail on the Importing Data page