Identifying Links between Studies
Contents
Introduction
One of the issues that has to be tackled during aggregation of data from different sources is the fact that the same study can be found, and in tens of thousands of cases is found, in more than one study based source, and that it will have a different persistent identifier in each source. Partly this is because studies can be registered in more than one trial registry, especially when local regulations mandate a registration for any study carried out within a particular country or region. This is especially the case with the EU, which insists all trials involving medicinal products must be registered in the EUCTR. About a third of these studies, however, are also registered in other registries, especially Clinicaltrials.gov. In addition, within a data repository, studies will usually be referenced by a local id rather than a pre-existing registry id.
Study titles cannot be relied upon to identify the same study in different source locations. A study title is often expressed slightly differently in different contexts, and cannot in any case be relied upon to be unique (even within the same source). It may be that further research will indicate how titles could be reframed (e.g. to a smaller number of keywords, expressed in a fixed order) to allow duplicate entries to be discovered using text, but for the moment the only easy way of doing this is by using the 'other identifiers' material found in the source data. Almost all sources contain this material, which usually include any other trial registry ids, (i.e. other than that used in the source registry entry), as well as ids assigned by the sponsor, funder or, sometimes, a regulatory authority.
These 'other registry ids' can be used to build up a table of study-study links which can then be used during the aggregation process to identify when duplicate studies are being added. In fact the creation of this table is always the first stage of any aggregation. This page describes this process in detail.
At the moment, with one exception, the requirement for identifying duplications only extends to studies - not data objects. The exception is provided by PubMed citations for journal papers, which can also be found multiple times in the source data, both within a single source and across multiple sources. The particular complications in processing PubMed data are described in Processing PubMed Data. The assumption for now is that other data objects are not duplicated across data sources, although this may need to be considered in the future. The rules for adding data objects are described more fully in Aggregating Data.