Issues
This page describes issues encountered during the project's development and maintenance, for review within meetings and TCs.
It is divided into various sections according to issue type.
Note that resolved issues are described on Resolved Issues.
Important: Adding and editing comments
1) Please prefix all new topics or issues with a title header labelled as 'H5' (this places 5 equals signs on each side of the header in the source), followed by a line break ("<br/>"). This not only marks the start of each issue clearly, but also ensures that each is included in the Contents list at the top of the page.
2) Please prefix all proposals or comments with a date, and end each with your initials in brackets. For example:
11/10/2019: We should consider removal of the person's email attribute from the display metadata schema. The attribute ... ... in different systems. (SC)
3) Please also remember to add two line breaks ("<br/><br/>") after each entered comment, so that they are then separated by a blank line.
Contents
The ECRIN Metadata scheme
This section is for proposals for changing the ECRIN metadata structure (and associated JSON file structures), or descriptions of possible problems with the current metadata structure.
#28: De-emphasise role of the doi
11/11/2019: The current policy about using the doi as the ‘default identifier’ may not be sustainable. It is clear that only a relatively small proportion of the resources we list will have a doi. Perhaps we should just make it one identifier amongst many, though ensure we display it where it exists. It would make the database and the extractions a little simpler, even if it does move us further away from DataCite. (SC)
Individual Data Sources
Issues related to specific data sources, and their data filtering, extraction, linking, mapping etc. should be described here.
#13: BioLINCC data source
13/10/2019: In the past, data has been available from BioLINCC in the form of Excel spreadsheets sent by BioLINCC, and from direct web scraping of the BioLINCC web site. Scraping appears to give a richer dataset, and in particular a longer list of related data objects for each study. I would therefore suggest that web scraping be the normal data source. It is not clear (to me) whether BioLINCC records have been added to the data sent to OneData, and if so which data source was used. It would be useful to clarify this and if necessary resend. (SC)
#14: Checking individual data source extraction quality
13/10/2019: It is important that data extractions and processing are checked for quality, and we need to build a mechanism that can provide us with confidence that the extractions are correct and complete. The proposal is that - for the time being - two alternative methods of data extraction are used on each source -
- one using Microsoft technology, with data extraction routines written as C# console apps, carried out by SC, and
- the other using open source technology, with Python scripts running on an Apple machine, carried out by SG.
Both sets of scripts should be banked on the same GitHub repo.
To be able to compare the results of extractions, however, it is important to ensure that the various processing decisions are the same in both extractions, for instance:
- that the same data sources are used
- that the same decisions are taken about which fields are extracted
- of the fields extracted, the same decisions are made about which are mapped, and to which MDR data points.
- the same category matching is used
- the same contextual data is available
Intermediate database structures need not necessarily be exactly the same, though one would expect them to be very similar.
Thus, to allow some basic QA by comparing two different and independent extraction mechanisms, there has to be some initial work to ensure methodologies are the same. I am proposing that those methodologies are based on current methods and structures, updated to reflect the proposals on this page. (SC)
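For illustration, a minimal sketch of the kind of comparison intended is given below. It assumes that both pipelines export their results as CSV files keyed on a study id; the function, file and field names are hypothetical.
<pre>
import csv

def load_extraction(path, key="study_id"):
    """Load one pipeline's exported records, keyed by study id."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row[key]: row for row in csv.DictReader(f)}

def compare_extractions(csharp_csv, python_csv):
    """Report records and field values that differ between the two extractions."""
    a = load_extraction(csharp_csv)
    b = load_extraction(python_csv)
    print("only in C# extraction:", sorted(set(a) - set(b)))
    print("only in Python extraction:", sorted(set(b) - set(a)))
    # For shared records, report fields whose values differ
    for study_id in sorted(set(a) & set(b)):
        for field in a[study_id]:
            if a[study_id].get(field) != b[study_id].get(field):
                print(study_id, field, a[study_id].get(field), "|", b[study_id].get(field))

# Example usage (hypothetical file names):
# compare_extractions("biolincc_csharp.csv", "biolincc_python.csv")
</pre>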
Contextual Data
For problems and proposals related to the collection, editing and use of contextual data: in particular the system's look up (category) tables, and data relating to people, organisations, languages and geographical entities.
#15: Approaches to generating category data
13/10/2019: We need to be clear about, and agree, a consistent methodology for creating and extending the categories used within the system. At present there are 23 lookup tables with categories (see Controlled_Terminology), and their contents will increase as new types of data are encountered. The proposal here is that these categories need active management. It is not sufficient for the extraction system simply to identify the distinct types as found in the source data and add them to the relevant lookup table. If we do that, the lookup tables risk being filled with ambiguous and overlapping terms, and searching using them will become source system specific. Instead it is proposed that when categorised data is mapped from the source to the MDR system:
- The meaning of the source data categories is first clearly understood. This may involve detailed study of the source's site documentation, or even communication with the site.
- The source categories are matched, wherever possible, to existing categories, even though they may be named differently.
- Agreed new categories are added with short descriptions, after discussion has confirmed the need for them.
(SC)
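As an illustration of the kind of managed mapping intended, the sketch below maps source-specific study type terms onto MDR lookup ids; the terms, ids and table contents are hypothetical examples, not the actual lookup tables.
<pre>
# Hypothetical extract of an MDR 'study type' lookup table
MDR_STUDY_TYPES = {11: "Interventional", 12: "Observational"}

# Curated mapping, agreed in advance, from source-specific terms to MDR ids;
# differently named source terms can map to the same MDR category
SOURCE_TERM_TO_MDR_ID = {
    "interventional": 11,
    "clinical trial": 11,
    "observational study": 12,
}

def map_study_type(source_term):
    mdr_id = SOURCE_TERM_TO_MDR_ID.get(source_term.strip().lower())
    if mdr_id is None:
        # Do not silently add a new category - flag the term for discussion instead
        raise ValueError("Unmapped source category: " + source_term)
    return mdr_id, MDR_STUDY_TYPES[mdr_id]
</pre>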
#16: Approaches to using organisation data
13/10/2019: We need to develop and then use a list of organisations that allows us to identify each organisation cited (e.g. as a sponsor), despite it appearing in the source data under a variety of different names, spellings and / or initials. This is a non-trivial task and one that is likely to take some time, but without such a list it will not be possible to easily filter a search on organisation, e.g. within a particular contribution type. It also means that, however it is presented in the source data, within the MDR the organisation has a consistent name. Some key organisation types include universities, university hospitals, trials units, pharmaceutical companies, government and supra-government agencies and publishers. Various lists of such organisations exist on the web and can be - gradually - assimilated into an organisation master list.
It will be necessary, however, to consider machine based ways of matching names found in source data to the entities in this list - there are far too many for this to be done manually. The use of organisational identifiers is poorly developed, so name matching will be the main mechanism for identifying organisations. This will require a process for simplifying names to more 'standard' forms, so that the number of alternatives required to be stored can be reduced, giving a greater chance that a novel name will match one already in the system. This process probably needs to involve:
- Removal of common words, especially articles and prepositions (e.g. the, of, and, le, la, les, der, de, das, etc.)
- Removal or replacement of punctuation (e.g. hyphens to spaces, removal of full stops in abbreviations)
- Replacement of accented characters with the plain letter or a combination commonly used to represent the letter, (e.g. à with a, ü with ue)
- Application of lower case to all words
- Translation of common words to their English equivalent (e.g. Universidad, Université, Universität, etc. to university)
- Ordering of the remaining words alphabetically.
In such a scheme both 'The University of Düsseldorf' and 'Universität Düsseldorf' become 'duesseldorf university', although additional names are still required for acronyms such as HHU ('hhu'), or for Heinrich-Heine Universität Düsseldorf (which is matched to 'duesseldorf heine heinrich university'). These simplified names are generated only for matching purposes - they are not displayed. Instead one of the organisation names is designated the default - normally the name most commonly used in the language of the institution.
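A minimal sketch of this simplification, assuming the steps listed above, is shown below; the stop words, translations and accent substitutions are illustrative and would need to be considerably extended.
<pre>
STOP_WORDS = {"the", "of", "and", "le", "la", "les", "der", "die", "das", "de"}
TRANSLATE = {"universitaet": "university", "universite": "university", "universidad": "university"}
ACCENTS = {"ä": "ae", "ö": "oe", "ü": "ue", "à": "a", "é": "e", "è": "e"}

def standardise(name):
    """Reduce an organisation name to a simplified form used only for matching."""
    name = name.lower()
    for accented, plain in ACCENTS.items():
        name = name.replace(accented, plain)
    name = name.replace("-", " ").replace(".", "")
    words = []
    for w in name.split():
        if w in STOP_WORDS:
            continue                       # drop articles and prepositions
        words.append(TRANSLATE.get(w, w))  # translate common words to English
    return " ".join(sorted(words))         # alphabetical order

# All of these forms reduce to consistent matching keys
assert standardise("The University of Düsseldorf") == "duesseldorf university"
assert standardise("Universität Düsseldorf") == "duesseldorf university"
assert standardise("Heinrich-Heine Universität Düsseldorf") == "duesseldorf heine heinrich university"
</pre>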
It will be necessary to develop this list and deploy it periodically to a store of contextual data, and then incorporate it within the exported data. This may lead to the need to re-export revised data. A core set of the key organisation types listed above, likely to be several tens of thousands of records, is required in the next few months.
(SC)
#17: Approaches to using people data
13/10/2019: The volume of names in the source data, as authors or contributors to data object production, is potentially enormous, and likely to include several million records. Attempting to identify duplicates in these records so that individuals can be identified unambiguously will be very difficult in the general case, not only because names are given at different levels of detail, but also because one person may change their name, and because several people may have the same name. In general, therefore, names will have to be transferred as found in the source data.
The exception relates to ORCID identifiers, which are increasingly being used to identify authors, at least in some source systems. It may therefore be worth identifying a subset of names for whom the system has matching ORCID ids, and including those in the system as a core set of 'known people'. Even in these cases, however, it is difficult to identify matches with these people using names alone - the ORCID id will be required for unambiguous identification. It may be that algorithms can be developed using affiliation data, where it exists, along with the name data, to help identify at least a subset of the people in the system against the known 'ORCID people'. This requires further exploration. (SC)
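As a starting point for that exploration, the sketch below illustrates one possible matching rule, assuming a table of 'known people' keyed by ORCID id is available; the data structure and the matching criteria are illustrative only.
<pre>
from dataclasses import dataclass

@dataclass
class KnownPerson:
    orcid: str
    full_name: str
    affiliation: str

def match_contributor(name, affiliation, orcid, known_people):
    """Return the ORCID id of an unambiguously matched known person, or None."""
    if orcid:
        # An explicit ORCID id in the source data is taken as definitive
        return orcid if any(p.orcid == orcid for p in known_people) else None
    # Without an ORCID id, require both name and affiliation to agree,
    # and reject the match if it is not unique
    candidates = [p for p in known_people
                  if p.full_name.lower() == name.lower()
                  and affiliation and p.affiliation.lower() == affiliation.lower()]
    return candidates[0].orcid if len(candidates) == 1 else None
</pre>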
#18: Approaches to using language data
13/10/2019: Language codes in source data come in a variety of systems, though most are 2 or 3 character ISO codes. The MDR is designed to use - consistently - 2 character codes. It will therefore be important to ensure that mapping is done before the data is transferred to the MDR system. (SC)
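A minimal sketch of that mapping step is shown below; the code table is a small illustrative extract, not a complete ISO 639 mapping.
<pre>
# Illustrative extract mapping 3-character codes (including bibliographic variants)
# to the 2-character codes used by the MDR
ISO_639_2_TO_1 = {"eng": "en", "fra": "fr", "fre": "fr", "deu": "de", "ger": "de", "spa": "es"}

def normalise_language_code(code):
    """Normalise a source language code to 2-character form before transfer to the MDR."""
    code = code.strip().lower()
    if len(code) == 2:
        return code
    # Unknown 3-character codes are returned unchanged, for manual review
    return ISO_639_2_TO_1.get(code, code)

assert normalise_language_code("ENG") == "en"
assert normalise_language_code("fr") == "fr"
</pre>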
ECRIN Methods and Processes
For any issues concerned with general ECRIN processes for data processing, mapping, linking etc., and the longer term maintenance of the system and its data.
#19: Using JSON Schema for describing schemas
13/10/2019: One of the problems we have had is that it has been difficult to precisely describe the structure of the JSON files required, but I did come across JSON schema (see http://json-schema.org/), described as “a vocabulary that allows you to annotate and validate JSON documents.”
This seems very useful. It is designed to work like XSD does for XML. In the same way as an XSD file defines an XML schema, is written in XML itself and can be used to validate XML files against a specific schema, so JSON schema is designed to define a JSON file structure, is written in JSON itself and can be used to validate a JSON file against a specific structure.
The web site has formal definitions, at https://json-schema.org/latest/json-schema-core.html and http://json-schema.org/latest/json-schema-validation.html, but probably more useful are the examples listed at http://json-schema.org/learn/
and the guide / tutorial at https://json-schema.org/understanding-json-schema/index.html.
I used it to create schemas for the Study and Data Object JSON files, as described in JSON_Schemas. In these schemas...
- Element types are defined as one of the basic JSON types (string, boolean, etc. but also including integer) or an object (i.e. a composite type) or an array, in which case ‘items’ defines the array members.
- Object types have ‘properties’ which are the elements they contain (which may be a nested object, an array or a basic type).
- A description can be added (and normally is) to each element.
- The elements that are mandatory for any type can be listed (‘required’).
- It is possible to define (in ‘definitions’) elements that are used multiple times, and then refer to them by placing a ‘$ref’ element where the definition is required.
JSON schema allows a much more precise and rigorous description of JSON files, and in the process helps to identify errors and inconsistencies in those files. I propose that we use this system in the future for descriptions of the JSON structures used within the system. (SC)
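As an illustration of how such a schema can be used programmatically, the sketch below validates a record against a small schema fragment using the Python jsonschema package; the element names are simplified examples, not the actual MDR study schema.
<pre>
from jsonschema import validate, ValidationError

study_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "required": ["id", "scientific_title"],          # mandatory elements
    "properties": {
        "id": {"type": "integer", "description": "Internal study id"},
        "scientific_title": {"type": "string", "description": "Full scientific title"},
        "identifiers": {
            "type": "array",                          # array type: 'items' defines the members
            "items": {"$ref": "#/definitions/identifier"}
        }
    },
    "definitions": {                                  # reusable element, referenced via $ref
        "identifier": {
            "type": "object",
            "properties": {
                "value": {"type": "string"},
                "identifier_type": {"type": "string"}
            }
        }
    }
}

record = {"id": 1001, "scientific_title": "Example study", "identifiers": []}
try:
    validate(instance=record, schema=study_schema)
    print("record conforms to the schema")
except ValidationError as err:
    print("validation failed:", err.message)
</pre>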
#20: Reviewing the documentation organisation for individual sources
13/10/2019: Each data source has a main page describing it under the 'Individual data sources' heading. Although there are considerable differences between data sources, there should be a 'core' set of headings, each with some text and usually with linked pages, dealing with the main aspects of the data source and how its data is processed. Some data sources may require additional headings, but having a core set guarantees that the main aspects of the processing have been covered and are described.
A set of standard headings has already been applied, but I do not think this is the clearest way of organising the material. In particular...
- The current References and External Links headings are never used, and should be removed. References and External Links should be integrated with the text, so that users can more easily follow them when required.
- The Description section should be, as now, the initial section. It should include a link to the source's main web page, a description of the material included, including the types of data objects that are described, and some indication of the volume of material of interest to the MDR. This section should also describe the presence of APIs or other data export services, if any.
- A short section on Terms and Conditions would allow us to clarify the legal position with regard to using the data, so far as we understand it. For sites that make an API or equivalent data source public, it would be assumed that there is permission to use the data, though in some cases not all of the data may be covered. For sites where there is no public API or available files, and web scraping is required, it may be that explicit permission needs to be sought and obtained.
- The sections on the Study Id, Metadata and Original metadata structure should all be combined into a description of the Source Metadata (or at least the metadata used in the data extraction process, as some sources have different options available). In most cases this will require a link to a page describing the metadata in detail, element by element. The paragraph here gives an overview and highlights particular features, including the Ids for studies and / or data objects used within the source system. The detailed material on the linked page should indicate, for each element, its meaning, whether or not it is transferred to the MDR, and if so to which element. Particular categorisations used within the source material and their implications for the data extraction process should also be identified.
- A section on Data extraction should describe the 'data pipeline' between data source and its eventual inclusion in the MDR. This will probably need to include (though the emphasis will vary between different sources):
- Filtering mechanisms, for systems where only a subset of the data is relevant to the MDR.
- Update mechanisms, i.e. how new and revised data is identified.
- Linkage mechanisms, between studies and data objects, if the extracted data is not already linked.
- Extraction and Mapping processes, describing how extracted data is transformed into MDR data points. The extraction may take place in several stages and the description should reflect this.
- Deduplication mechanisms, to identify data already in the system from other sources.
- The extraction scripts - very brief descriptions, usually referring to material in GitHub.
- The databases used - very brief descriptions, usually referring to database scripts in GitHub.
In most cases each paragraph will make reference to a linked page with the details of each of these aspects of the data extraction.
- A Metadata Mapping section should then be used to summarise the extraction process described above, i.e. the matching of original data source elements with MDR metadata elements. This should link to a mapping table.
N.B. There is no expectation that all this documentation will be completed immediately for any one data source - it will need to be added as the processes themselves are developed.
Issues and proposals relating to the extraction should go into the relevant section in this Issues page.
External Systems
Problems and issues related to external partners (e.g. OneData, INFN) and the systems and support offered by them should be listed here.
#22: Clarification of the role of INFN and OneData
13/10/2019: The exact way in which OneData and INFN services and infrastructures interact and combine to generate the portal user interface is still not completely clear. As far as we understand it, INFN access the data via a OneData interface, even though the data is on the INFN infrastructure. INFN use ElasticSearch to interface with the data and to create a search interface on top of it. OneData have created a portal which then plugs into that search interface. It would be good to have confirmation of this and a clearer description of the responsibilities and the key personnel associated with each. This is work that is already under way, but the item is included here as a reminder and as a place to include the responses from INFN and / or OneData. (SC)
#23: Clarification of how categories (e.g. for search parameters) are created
13/10/2019: It is still not clear how categories, if and when used in the interface to add parameters to a user's search, are generated within the system. Do INFN and / or OneData require separately uploaded lists (so that ECRIN in effect defines the categories to be used), or will they simply scan the data and identify the distinct types themselves? It would be useful to be clear about this! (SC)
#24: Flexibility of partners to changes in metadata structure
13/10/2019: Changes to the metadata structure are inevitable, as we learn more about the source data (and indeed are an important part of the research project). The categories used within the system (the look up tables) will also increase in number as the different data sources come on line. We need to be clear, with INFN and / or OneData, what the impact of such changes would be, e.g. would it be necessary to re-index the data within the Elastic Search system, and what constraints exist on making changes.
The difficulty here is that if there is too much inflexibility on the part of OneData / INFN, so that we are stuck with the initial guesses for metadata, the interface being developed by them will be much less useful, and no longer a true reflection of the system. We should therefore work with them (as best we can) to ensure that their system reflects the changes in the files created by ECRIN.
It will also be important that all changes are agreed and incorporated into the third data package for the system, though this will inevitably mean a complete, rather than incremental, reload of the data.
(SC)
The Wiki and Repo
Issues related to the Wiki itself, or the organisation / use of the GitHub repository.
#26: Making file material more consistent
13/10/2019: At the moment some file based materials are available via file links (see for instance the 'JSON' page, which links to http://ecrin-mdr.online/index.php/File:Study_example.json). These links have the advantage of automatically including a file history table, but they are not very user friendly: it is not immediately clear where the file download link is, and they force the user into additional actions to obtain the data.
An alternative is simply to include the file material in the page. As that material is usually structured, e.g. a JSON file, it is necessary to enclose it in <pre> </pre> tags. It also helps to insert the whole file within a styling div: <div style="font-family: monospace; font-size: 13px" > ... </div>, as below...
<div style="font-family: monospace; font-size: 13px" >
<pre>
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "http://ecrin.org/json_schemas/mdrstudy/v2.json",
  "title": "XDC Study definition",
  "description": "ECRIN Metadata Repository for clinical research objects, Study JSON definition, version 2 February 2019",
  "type": "object",
  "required": ["id", "scientific_title"],
  "additionalProperties": false,
  "properties": {
    "id": {
      ... ...
</pre>
</div>
This approach has the advantage of simply and immediately presenting the material to the user - see for instance http://ecrin-mdr.online/index.php/Study_JSON_v2. Different versions have to be handled manually but that is usually a good thing, if slightly more work, as it allows the significant changes to be highlighted and not lost in a succession of minor changes (e.g. typo corrections).
The proposal is to replace links to file based material (other than references to external papers etc.) with direct inclusion of the material on the page.
(SC).