Related Initiatives

To do...

Rearrange and supplement...

This section discusses the issues involved in linking a clinical study to its associated data objects (e.g. study protocol, consent forms and participant information sheets, individual participant datasets, data management plan (DMP), statistical analysis plan (SAP)).


One of the major problems in developing the MDR is providing a correct link between a clinical study and the data objects belonging to it.
In the PID forum, a user story has been formulated for this problem: “As a clinical researcher and trialist I want to discover and locate relevant data objects belonging to a specific clinical study.”


Clinical studies usually have to be registered, which yields a registration number that can be used as a study identifier (e.g. the NCT number in ClinicalTrials.gov or the registry number in the ISRCTN registry). Unfortunately, a clinical study may be registered in several registries or not be registered at all. The WHO has promoted a Universal Clinical Trial Number (UTN) to facilitate the unambiguous identification of clinical trials; it is, however, not yet widely used.
Where available, the trial registration number can serve as a direct link between a data object and its clinical study. This may be the case for publications, where the trial registration number may be included in the PubMed metadata field “secondary source ID”. Conversely, a trial registration record may contain references to publications, e.g. via the PubMed identifier, and links to other sources, e.g. a web page belonging to the study. In addition, services like Crossref/Crossmark may be used to provide a link between a registered study and its publications. The linkage between clinical trial registries and publications, as well as the linkage of multiple publications from a single clinical trial, is discussed in the literature [1] [2].
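As an illustration of identifier-based linkage, the following minimal sketch scans free text (for example an abstract or a metadata field) for registry identifiers. It assumes that NCT numbers consist of "NCT" followed by eight digits and ISRCTN numbers of "ISRCTN" followed by eight digits; the function name, the pattern table and the sample identifiers are hypothetical and not part of any registry API.

```python
import re

# Assumed identifier formats: "NCT" or "ISRCTN" followed by eight digits.
REGISTRY_PATTERNS = {
    "ClinicalTrials.gov": re.compile(r"\bNCT\d{8}\b"),
    "ISRCTN": re.compile(r"\bISRCTN\d{8}\b"),
}


def find_registry_ids(text):
    """Return a dict mapping registry name to the sorted unique IDs found in text."""
    found = {}
    for registry, pattern in REGISTRY_PATTERNS.items():
        ids = sorted(set(pattern.findall(text)))
        if ids:
            found[registry] = ids
    return found


if __name__ == "__main__":
    # Made-up identifiers, used only to exercise the patterns.
    sample = "Trial registration: ClinicalTrials.gov NCT01234567; ISRCTN12345678."
    print(find_registry_ids(sample))
```

A match only shows that an identifier is mentioned in the text. Whether the data object actually belongs to that study still has to be checked, since a publication may cite trials other than its own.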

In summary, a clinical study can sometimes be linked directly to its associated data objects via identifiers. Such links, however, are not always unique and are far from complete. If a direct link is not possible, other techniques need to be applied. A typical example would be a repository containing individual participant data from clinical trials but no registration number of the related trial. Here, for example, techniques that map study titles to document titles via similarity measures may help.
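A minimal sketch of such title matching is given below. It uses the character-based ratio from Python's standard `difflib.SequenceMatcher` as one possible similarity measure; the helper names, the study identifiers and the threshold of 0.8 are assumptions chosen for illustration, and a real system would need to tune the measure and threshold on known study/object pairs.

```python
from difflib import SequenceMatcher


def normalise(title):
    """Lowercase the title and replace punctuation by spaces, collapsing whitespace."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in title)
    return " ".join(cleaned.split())


def title_similarity(a, b):
    """Similarity in [0, 1] between two titles after normalisation."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def best_match(object_title, study_titles, threshold=0.8):
    """Return (study_id, score) for the most similar study title,
    or None if no title reaches the (assumed) threshold."""
    best = None
    for study_id, title in study_titles.items():
        score = title_similarity(object_title, title)
        if best is None or score > best[1]:
            best = (study_id, score)
    if best is not None and best[1] >= threshold:
        return best
    return None


if __name__ == "__main__":
    # Hypothetical study identifiers and titles.
    studies = {
        "STUDY-A": "Aspirin for Primary Prevention of Stroke",
        "STUDY-B": "Cognitive therapy for adolescent depression",
    }
    print(best_match("aspirin for primary prevention of stroke: dataset", studies))
```

Returning `None` below the threshold keeps weak candidates out of the automatic links, so that they can be passed to manual review instead.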

Recently, a solution was proposed. To connect the use of a data set with its originators, both data sets and individual researchers must have PIDs (see Supplementary Information, Data tracking process). Ideally, each individual scientist would obtain a unique ORCID identifier and associate it with every data set they deposit. Repositories would issue PIDs for the data sets (such as digital object identifiers, or DOIs) and connect those to one or more ORCID identifiers. Journals would require the data set PIDs to be cited in every submitted manuscript (for both primary and subsequent analyses). Such a system would allow data generators, academic leaders, funders, scientometricians and others to track the data in searchable databases. The processes for generating and recording each of these PIDs have been well defined, but they are not yet connected [3].
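The tracking idea can be sketched as a small in-memory index that connects data set PIDs, researcher PIDs and citing manuscripts. This is only an illustration of the proposed linkage, not an existing service: the class and method names are hypothetical, and the DOIs and ORCID identifier used in the example are placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DataSet:
    doi: str                                       # PID issued by the repository
    creator_orcids: list                           # ORCID iDs of the depositing researchers
    cited_by: list = field(default_factory=list)   # PIDs of manuscripts citing the data set


class PidIndex:
    """Hypothetical index linking data set DOIs, ORCID iDs and citing manuscripts."""

    def __init__(self):
        self.datasets = {}
        self.by_orcid = defaultdict(set)

    def deposit(self, doi, creator_orcids):
        """Register a data set and connect it to its originators."""
        self.datasets[doi] = DataSet(doi, list(creator_orcids))
        for orcid in creator_orcids:
            self.by_orcid[orcid].add(doi)

    def cite(self, manuscript_doi, dataset_doi):
        """Record that a manuscript cites a data set PID."""
        self.datasets[dataset_doi].cited_by.append(manuscript_doi)

    def uses_of_researcher_data(self, orcid):
        """Map each data set of a researcher to the manuscripts that cite it."""
        dois = sorted(self.by_orcid.get(orcid, ()))
        return {doi: list(self.datasets[doi].cited_by) for doi in dois}


if __name__ == "__main__":
    index = PidIndex()
    # Placeholder identifiers, not real DOIs or ORCID iDs.
    index.deposit("10.1234/dataset.1", ["0000-0000-0000-0001"])
    index.cite("10.1234/paper.1", "10.1234/dataset.1")
    print(index.uses_of_researcher_data("0000-0000-0000-0001"))
```

The lookup by ORCID identifier corresponds to the tracking use case: given a researcher, list their data sets and the manuscripts that reused them.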

In 2017, a systematic review of the processes used to link clinical trial registrations to their published results was published [4]. Forty-three studies examined links from registry entries to published articles; the median proportion of registry entries for which published articles were found was 47%. Thirty-nine studies considered cohorts of publications and identified associated registry entries in one or more of the WHO ICTRP clinical trial registries; the median proportion of registry entries that were identified from cohorts of published articles was 54%. The review distinguishes three search strategies: automatic (via a unique registry identifier, e.g. the NCT number), inferred (manual processes that search for matches across databases using characteristics of the trial) and inquired (any manual process in which investigators or authors were contacted). The results of the review indicate that automatic links alone are a useful but not sufficient means of linking trial registrations with associated publications. In addition, it was found that automatic linkage has not increased over time.

A recent study [5] addressed this problem with state-of-the-art deep learning and information retrieval techniques, automatically learning a Deep Highway Network (DHN) that estimates the likelihood that a Medline article reports the results of a trial (NCT Link). The experimental results indicate that NCT Link obtains 30%-58% improved performance over previously reported automatic systems, suggesting that NCT Link could become a valuable tool for linkage. The method is quite complicated and based on neural networks. Interestingly, the authors use a standardised representation of clinical trials built from eight key aspects of each trial: (1) the set of investigators associated with the trial, (2) the set of unique institutions associated with any investigators, (3) the NCT ID of the trial, (4) the set of interventions studied in the trial, (5) the set of conditions studied in the trial, (6) the set of keywords provided to the registry, (7) the set of Medical Subject Headings (MeSH) terms provided to the registry, and (8) the completion date of the trial. The study applied four commonly used relevance models as similarity measures between an aspect of a trial and an article.
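The eight-aspect representation can be written down as a simple record, as in the sketch below. The field names are hypothetical, and the Jaccard overlap used here is only a simple stand-in for illustration; it is not one of the relevance models used in [5], nor does it reproduce the neural network that combines them.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

# Set-valued aspects that can be compared by term overlap.
SET_ASPECTS = ("investigators", "institutions", "interventions",
               "conditions", "keywords", "mesh_terms")


@dataclass
class TrialRepresentation:
    """The eight aspects of a trial listed above (field names are assumptions)."""
    nct_id: str
    investigators: set = field(default_factory=set)
    institutions: set = field(default_factory=set)
    interventions: set = field(default_factory=set)
    conditions: set = field(default_factory=set)
    keywords: set = field(default_factory=set)
    mesh_terms: set = field(default_factory=set)
    completion_date: Optional[date] = None


def jaccard(a, b):
    """Case-insensitive Jaccard overlap of two term sets; 0.0 if both are empty."""
    a = {term.lower() for term in a}
    b = {term.lower() for term in b}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def aspect_scores(trial, article_terms):
    """Score each set-valued aspect of a trial against terms extracted from an article.

    article_terms maps an aspect name to the set of terms found in the article.
    """
    return {
        name: jaccard(getattr(trial, name), article_terms.get(name, set()))
        for name in SET_ASPECTS
    }


if __name__ == "__main__":
    # Placeholder trial; the NCT ID is not a real registration.
    trial = TrialRepresentation(nct_id="NCT00000000",
                                conditions={"Stroke", "Hypertension"})
    print(aspect_scores(trial, {"conditions": {"stroke"}}))
```

Per-aspect scores of this kind could serve as input features for a downstream model that estimates whether an article reports the results of the trial.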