Difference between revisions of "Identifying Links between Studies"
Revision as of 15:47, 18 November 2020
Introduction
One of the issues that has to be tackled during aggregation of data from different sources is that the same study can be found, and in tens of thousands of cases is found, in more than one study-based source, with a different persistent identifier in each. Partly this is because studies can be registered in more than one trial registry, especially when local regulations mandate registration of any study carried out within a particular country or region. This is especially the case with the EU, which insists that all trials involving medicinal products must be registered in the EUCTR. About a third of these studies, however, are also registered in other registries, especially ClinicalTrials.gov. In addition, within a data repository, studies will usually be referenced by a local id rather than a pre-existing registry id.
Study titles cannot be relied upon to identify the same study in different source locations. A study title is often expressed slightly differently in different contexts, and cannot in any case be relied upon to be unique (even within the same source). It may be that further research will indicate how titles could be reframed (e.g. to a smaller number of keywords, expressed in a fixed order) to allow duplicate entries to be discovered using text, but for the moment the only easy way of doing this is by using the 'other identifiers' material found in the source data. Almost all sources contain this material, which usually includes any other trial registry ids (i.e. other than the one used in the source registry entry), as well as ids assigned by the sponsor, funder or, sometimes, a regulatory authority.
These 'other registry ids' can be used to build up a table of study-study links which can then be used during the aggregation process to identify when duplicate studies are being added. In fact the creation of this table is always the first stage of any aggregation. This page describes this process in detail.
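The idea of turning 'other registry ids' into a table of study-study links can be illustrated with a minimal Python sketch. It is not the system's actual implementation: the StudyRef type, the build_links function, the short source names and the sample identifiers are all hypothetical, chosen only to show reciprocal links being collapsed into a single entry.

```python
from typing import NamedTuple

class StudyRef(NamedTuple):
    source: str    # hypothetical short source name, e.g. "ctg" or "euctr"
    study_id: str  # the study's identifier within that source

def build_links(other_ids):
    """Collect unique study-study links from (study, other registry id) pairs."""
    links = set()
    for study, other in other_ids:
        if study == other:
            continue  # a study that lists its own id does not create a link
        # Store each pair in a canonical order so that A-B and B-A count once.
        links.add(tuple(sorted((study, other))))
    return links

# Made-up identifiers, for illustration only.
other_ids = [
    (StudyRef("euctr", "2010-000001-01"), StudyRef("ctg", "NCT00000001")),
    (StudyRef("ctg", "NCT00000001"), StudyRef("euctr", "2010-000001-01")),
    (StudyRef("ctg", "NCT00000002"), StudyRef("ctg", "NCT00000002")),
]
links = build_links(other_ids)
```

In this sketch the two reciprocal references yield one link, and the self-reference is discarded.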
At the moment, with one exception, the requirement for identifying duplications only extends to studies - not data objects. The exception is provided by PubMed citations for journal papers, which can also be found multiple times in the source data, both within a single source and across multiple sources. The particular complications in processing PubMed data are described in Processing PubMed Data. The assumption for now is that other data objects are not duplicated across data sources, although this may need to be considered in the future. The rules for adding data objects are described more fully in Aggregating Data.
The Preferred Source concept
If details about a study and its attributes can be found in more than one data source, the obvious question is how should this data be merged in the final aggregated MDR database? In fact there are several aspects to this question:
- How should 'single-occurrence' details about the study be merged? (i.e. the data points that appear in the study record itself, such as study display title, study type and status, enrolment target, min and max ages etc.)
- How should study attributes (identifiers, titles, contributors, topics etc.) be merged?
- How should data objects in different sources (trial registry entries, data and document references) be merged?
- How should data object attributes be merged?
The first question is probably the most difficult. If the system allowed these basic study parameters to be merged from a variety of sources, the problem of 'precedence' immediately arises: how could the system 'know' which source to use for each parameter when it is available in more than one? For instance, how would the system select the enrolment target from one repository rather than another? If the data is edited in one source but not another, should the most recent data always take precedence over the old, even though it may be less complete than what is already there?
Because of these issues it was decided to:
- Only take the study single occurrence data items from a single data source
- Use the idea of 'preferred sources' to order the precedence of sources in a consistent fashion.
Each data source is therefore given a number that indicates its place in this ordering. The exact number is not important, only the position it gives the data source when the sources are listed in ascending order of this parameter. The number is stored within the sf.source_parameters table in the mon database, and is available as a property of the Source objects when they are retrieved from this table. During the aggregation process, the data sources are processed in order, from the most preferred down to the least preferred. In general, more preferred sources have richer or at least more consistent data than less preferred sources. The most preferred source is ClinicalTrials.gov, followed (in broad terms) by registries that are extracted individually, then the various trial registries using the WHO dataset, then data repositories (e.g. Yoda, BioLINCC), and finally object-based data sources (PubMed).
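As a rough illustration of this ordering, the following Python sketch sorts a handful of sources by their number. The Source class, the preference_rating property name and the numeric values are assumptions made for the example; only the relative order reflects the description above.

```python
from dataclasses import dataclass

@dataclass
class Source:
    name: str
    preference_rating: int  # hypothetical property name; lower means more preferred

# Illustrative values only: the exact numbers do not matter, just their order.
sources = [
    Source("PubMed", 100),
    Source("WHO-derived registry", 30),
    Source("ClinicalTrials.gov", 1),
    Source("Yoda", 50),
    Source("Individually extracted registry", 10),
]

def in_processing_order(sources):
    """Return the sources sorted from most preferred to least preferred."""
    return sorted(sources, key=lambda s: s.preference_rating)

ordered_names = [s.name for s in in_processing_order(sources)]
```

Because only the relative position matters, a new source could be slotted in by giving it any unused number between those of its neighbours.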
Although it cannot be guaranteed in every case, this usually means that the richer data is added to the system first, and that data coming from later, less preferred sources augments rather than replaces it. In general (and as described in more detail in Aggregating Data), during aggregation the core databases are recreated and then completely rebuilt by adding data from each source in turn. If a study does not already exist in the core system when it is 'presented' from a data source database, then it, its attributes, its associated data objects and their attributes are all added during the aggregation process. But if a study is added that already exists in the core system, because it has been added from a 'more preferred' source earlier in the process:
- The single-occurrence study details are ignored
- Study attributes are only added if they are definitely different from any that have already been added
- Data objects are added (unless, very rarely, they can be seen to be already there)
- Data object attributes are added if the data object records themselves are.
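The four rules above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the aggregation code itself: the add_study function, the in-memory core dictionary, the linked_to lookup and the sample records are all hypothetical, and simple equality stands in for the real tests of whether an attribute or data object is already present.

```python
def add_study(core, linked_to, key, details, attributes, data_objects):
    """Add one study presented by a source to the (hypothetical) core store.

    linked_to maps a study key to the key of an equivalent study held by a
    more preferred source, as derived from the study-study links.
    """
    target = linked_to.get(key, key)
    if target not in core:
        # New study: add it with its attributes and data objects.
        core[target] = {
            "details": details,
            "attributes": set(attributes),
            "data_objects": list(data_objects),
        }
        return
    existing = core[target]
    # Rule 1: the single-occurrence details from this source are ignored.
    # Rule 2: only attributes not already present are added.
    existing["attributes"] |= set(attributes)
    # Rules 3 and 4: data objects are added, with their attributes,
    # unless an identical object is already there.
    for obj in data_objects:
        if obj not in existing["data_objects"]:
            existing["data_objects"].append(obj)

core = {}
# Made-up identifiers, for illustration only.
linked_to = {("euctr", "2010-000001-01"): ("ctg", "NCT00000001")}

# The more preferred source is processed first.
add_study(core, linked_to, ("ctg", "NCT00000001"),
          {"enrolment": 120},
          [("identifier", "NCT00000001")],
          [("CTG registry entry",)])
add_study(core, linked_to, ("euctr", "2010-000001-01"),
          {"enrolment": 100},
          [("identifier", "NCT00000001"), ("identifier", "2010-000001-01")],
          [("EUCTR registry entry",)])

study = core[("ctg", "NCT00000001")]
```

After both calls the store holds a single study whose details come from the first, more preferred source, while the second source contributes one new identifier and one new data object.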