Resolved Issues
This page records the resolution of issues raised previously.
Contents
- 1 #1: Person email attribute
- 2 #2: Date formats, partial and string dates
- 3 #3: Multiple languages of Data Objects
- 4 #4: Better Person data.
- 5 #5: Distinguish contributor type
- 6 #6: Change Title Storage
- 7 #7: Add Study Relationships
- 8 #8: Simplifying the related objects / studies
- 9 #9: Remove the use of IT key words
- 10 #10: Simplify a few element names
- 11 #11: Making ‘organisation’ consistent
- 12 #12: ClinicalTrials.gov data source
- 13 #21: Study Data Sharing Statement attribute
- 14 #25: Remove empty pages?
- 15 #27: Study Data Brief Description attribute
#1: Person email attribute
11/10/2019: We should consider removal of the person's email attribute from the display metadata schema. The attribute was added because it is in the DataCite schema. Though email addresses should only be added if they are publicly available (e.g. as listed for a first author) their display in a different context may generate privacy concerns, and expose us to allegations of data misuse under GDPR. I think we should still collect and store such data internally, however, because email addresses are globally unique and can therefore be useful indicators of a person's identity, even when their name is expressed in different ways in different systems. (SC)
09/11/2019: Agreed, email attribute removed from metadata schema v3
#2: Date formats, partial and string dates
13/10/2019: The current format of dates in the schema is as a standard full date representation, e.g. "2019-10-13", the normal way of representing full dates in JSON and one which translates easily to and from database date fields. This format is sometimes used in source data, but in several cases dates are provided as separate year, month and day fields, though the format of these may also vary (2 digit or 4 digit years, numeric or string months, etc.). The problem is that dates may be partial - usually missing the day - or strings listing time periods such as seasons (e.g. "Summer 2008"). Dates of publication, in particular, often use these non-standard date formats.
One way around this would be to have an additional "date as string" representation. This would be applicable to all dates in the system, and allow all dates to be displayed accurately. The PubMed format (yyyy MMM dd, e.g. "2017 Sep 23") seems to be a reasonable starting point for this, with variations (e.g. "2016 Dec" or "Spring, Summer 2015") only when the source date was partial or in a non-standard format.
The problem is that dates can also be used to filter records and we should support that filtering (e.g. "records published after March 2015", "records before 2018"). Most systems can filter on full dates, because they are stored internally as numbers, but have problems with partial dates and other representations that are strings. We therefore need a way of handling dates that can handle both non-standard formats and filtering. The proposals here are:
- Include a 'date as string' representation for all dates in the system, to ensure that every date in the source data can be displayed. This field would be the normal source of date data in the interface.
- Allow filtering by year and month/year, but not by individual dates or seasons.
- Turn both of the current start and end date fields into separate year / month / day triplets, each of three separate integers. Years should always be 4 digit.
- For distinct dates the start and end date fields should be made the same. For dates that are given as a range the start and end dates should represent the inclusive limits of the range (which should normally be the case in the source data as well).
- Both start and end date fields should therefore be populated for all dates, though partial dates may not populate the full triplet of fields in each case.
- Rules should be developed for translating dates given as seasons or other periods into start and end years and months, e.g. 'Summer' is June - August inclusive, 'Winter' is December - February inclusive. Complete automation may not be possible immediately, but a collection of such rules can be developed over time.
- Filtering before a certain month / year would make use of the start date year and month fields, filtering after a month/year would compare against the end date year and month data. This ensures objects dated with a date range are included in the appropriate filter.
(SC)
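The proposals above could be sketched roughly as follows. This is only an illustration, not the agreed implementation: the names PartialDate, SEASONS, season_to_range, is_before and is_after are hypothetical, the spring and autumn months are assumed (only summer and winter are given above), and it is assumed that a winter named for a year starts in the December of that year. A missing month is treated as January for a start date and December for an end date, which is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PartialDate:
    """Year / month / day triplet; month and day may be missing."""
    year: int
    month: Optional[int] = None
    day: Optional[int] = None


# Hypothetical rule table: season -> (start month, end month, year offset of end month).
# Summer and winter follow the text above; spring and autumn are assumptions.
SEASONS = {
    "spring": (3, 5, 0),
    "summer": (6, 8, 0),
    "autumn": (9, 11, 0),
    "winter": (12, 2, 1),  # assumed to run into the following year
}


def season_to_range(season: str, year: int) -> Tuple[PartialDate, PartialDate]:
    """Translate e.g. ("Summer", 2008) into inclusive start and end dates."""
    start_month, end_month, year_offset = SEASONS[season.lower()]
    return PartialDate(year, start_month), PartialDate(year + year_offset, end_month)


def _year_month(d: PartialDate, default_month: int) -> Tuple[int, int]:
    # Fall back to a default month when only the year is known.
    return (d.year, d.month if d.month is not None else default_month)


def is_before(start: PartialDate, year: int, month: int) -> bool:
    """'Before' filters compare against the start date's year and month."""
    return _year_month(start, 1) < (year, month)


def is_after(end: PartialDate, year: int, month: int) -> bool:
    """'After' filters compare against the end date's year and month."""
    return _year_month(end, 12) > (year, month)
```

With these rules "Summer 2008" becomes a start of 2008 / 6 and an end of 2008 / 8, and a distinct date simply has identical start and end triplets.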
03/11/2019: When a date is provided simply for information and is not likely to be used for filtering it can be stored and displayed simply as a date. But it should be expressed to the user in a single consistent string format.
The PubMed format (yyyy MMM dd) should be used, which means that some string dates (e.g. presented as dd MMMM yyyy) will need to be converted to and stored as a real date and then converted back to the standard string format when used on screen.
An example is the 'date url last accessed', which is a useful date to have available to a user but is unlikely ever to be required in a filter operation.
If such 'information-only' dates are often or always partial (e.g. the 'last verified' date for a trial registry, which is usually month year) they can simply be retained in string format.
(SC)
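As a rough sketch of the conversion described above (a string date such as dd MMMM yyyy stored as a real date, then shown in the yyyy MMM dd form), something like the following could be used. The function names are hypothetical, English month names are assumed in the source string, and the day is zero-padded on the assumption that "dd" means two digits.

```python
from datetime import date

MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def parse_long_date(text: str) -> date:
    """Parse a 'dd MMMM yyyy' string, e.g. '23 September 2017', into a real date."""
    day, month_name, year = text.split()
    month = MONTH_NAMES.index(month_name.capitalize()) + 1
    return date(int(year), month, int(day))


def to_pubmed_string(d: date) -> str:
    """Format a date as 'yyyy MMM dd', using the first three letters of the month."""
    return f"{d.year:04d} {MONTH_NAMES[d.month - 1][:3]} {d.day:02d}"


print(to_pubmed_string(parse_long_date("23 September 2017")))  # prints "2017 Sep 23"
```

Using a fixed list of month names rather than the system locale keeps the output the same on every machine.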
09/11/2019: Agreed, dates for objects changed to scheme proposed above in metadata schema v3
#3: Multiple languages of Data Objects
13/10/2019: Most data objects are in one language, but it appears that some can use multiple languages – for instance a small proportion of articles in PubMed are listed as being in 2 or 3 languages. In general, there is nothing to stop a data object being in more than one language. The language code element for the Data Object should therefore really be an array of language codes, to match a database structure where ‘object_languages’ will need to be factored out as a separate table. (SC)
09/11/2019: Agreed, language codes made an array in object schema within metadata schema v3
#4: Better Person data.
13/10/2019: The current list of fields for a person’s data, in the context of object contributors, is neither very clear nor comprehensive. Unfortunately this stems from problems with the original ECRIN metadata definition. The proposal is to:
- Change first_name to given_name (more accurate and in line with DataCite, as well as the original ECRIN metadata specification),
- Change last_name to family_name (more accurate and in line with DataCite and the metadata specification),
- Remove (or deprecate the use of) the email field from the JSON object (see #1 above).
- Add identifier and identifier type fields, mostly for ORCID identifiers, though other identifier schemes may be used in some cases. This would match DataCite and should have been included from the beginning.
- Retain full_name and affiliation as they are. In theory affiliation should include the option of using an organisational identifier and scheme ID, but use of this seems very limited at the moment, so may not be required.
These changes should have occurred before the current structure was agreed, but were missed at the time. They would make the person data match the DataCite specification much more closely. (SC)
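To illustrate the proposed change, a person record in the old shape could be mapped to the new one along these lines. This is a sketch only: the function name migrate_person is hypothetical, and the key names "identifier" and "identifier_type" are assumed, since the proposal does not fix the exact names of the two new fields.

```python
def migrate_person(old: dict) -> dict:
    """Map an old-style person record onto the proposed field names."""
    return {
        "given_name": old.get("first_name"),    # renamed from first_name
        "family_name": old.get("last_name"),    # renamed from last_name
        "full_name": old.get("full_name"),      # retained as is
        "affiliation": old.get("affiliation"),  # retained as is
        # New fields, mostly for ORCID identifiers; key names are assumptions.
        "identifier": old.get("identifier"),
        "identifier_type": old.get("identifier_type"),
        # The email field is deliberately not copied across.
    }
```

Any email value in the source record is simply left out of the exported JSON; whether it is still kept internally is a separate storage decision.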
09/11/2019: Agreed, person data structure changed as proposed within metadata schema v3
#5: Distinguish contributor type
13/10/2019: The original metadata specification included a field in the object contributor structure that indicated if the contributor was an individual or an organisation / group. This seems to have become lost. It would be useful to have this explicit indicator, either as a boolean (e.g. is_individual) or a string or type (e.g. “individual” | “research group” | “organisation”). The latter is probably more flexible. (SC)
09/11/2019: Agreed, is_individual boolean field incorporated within metadata schema v3
#6: Change Title Storage
13/10/2019: Titles, of studies and / or data objects, can come in a variety of forms: a 'full' scientific title, a public title, an acronym, a translated title, etc. One of these - the original full scientific title in the case of studies - needs to be indicated as the default title for display purposes. In the case of data objects, a full title may need to be constructed from the data object type and the study name, so that a protocol becomes <study name> protocol, though in many display contexts the study name is redundant and will be omitted. Information about a title includes its type, its language, whether or not it has been translated, whether or not it contains embedded html (titles may include, for instance, italic text, super and subscripts, that need to be recognised in a display context for the original meaning to be retained), any associated comment about the title - e.g. about its production, present in some source data - and also a simplified version that can be used for comparison purposes, e.g. to see if two study references are the same.
There needs to be a consistent mechanism for constructing this simplified version - e.g. removal of common words, removal or replacement of punctuation, replacement of accented characters, switching to uniform lower case, ordering of remaining words. Putting titles through this process allows common variations in presentation to be eliminated before any comparison takes place.
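One possible shape for such a mechanism is sketched below. It is not an agreed algorithm: the function name simplify_title and the COMMON_WORDS list are hypothetical, the list is English only, and the sketch does not strip embedded HTML or handle titles in non-Latin scripts (any character outside a-z and 0-9 is treated as a separator).

```python
import re
import unicodedata

# Hypothetical list of common words to drop; a real list would need agreeing.
COMMON_WORDS = {"a", "an", "and", "the", "of", "in", "for", "on", "to", "with"}


def simplify_title(title: str) -> str:
    """Reduce a title to a normalised form for comparison purposes."""
    # Replace accented characters: decompose, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", title)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Switch to uniform lower case.
    lowered = unaccented.lower()
    # Replace punctuation (anything that is not a letter or digit) with spaces.
    words = re.sub(r"[^a-z0-9]+", " ", lowered).split()
    # Remove common words, then order the remaining words.
    kept = [w for w in words if w not in COMMON_WORDS]
    return " ".join(sorted(kept))


print(simplify_title("The Étude of Aspirin: a Pilot Study"))  # prints "aspirin etude pilot study"
```

Two titles would then be treated as candidates for the same study when their simplified forms are equal.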
These various attributes need to be available to all titles, including the default one. It is therefore important to store all titles together, rather than having the default stored within the main study or data object record. This means that the title record also needs an 'is_default' field, so that the default title can be easily identified and placed in the JSON string as the study or object title. This means that although titles need to be stored differently in the DB, i.e. all together and with the additional fields listed, the impact on the JSON structure is slight. One possible implication, however, is that any comment may need to be added to the 'other titles' in the JSON file, in an additional field.
This change affects both study and data object titles. (SC)
09/11/2019: Agreed, title storage and metadata structure changed as proposed in metadata schema v3
#7: Add Study Relationships
13/10/2019: Study inter-relationships data should be introduced into the Study JSON file (e.g. "is a pilot study for...", "is a later phase continuation of ...", "uses a subset of the population of ..."). The data may be sparse for the moment but this is still potentially useful information that should be exported to any MDR. (SC)
09/11/2019: Agreed, study relationships included in metadata schema v3
#8: Simplifying the related objects / studies
13/10/2019: At the moment these are provided as an array of an object with a single integer field (Id). It would be simpler to make this element just an array of integers, the accession numbers of either the related studies or the related objects.
Conversely, if it was felt useful to explicitly identify the nature of the numbers, it would be better to label the integer type in the element as “study_id” or “data_object_id”, as appropriate. (SC)
09/11/2019: Agreed, metadata schema v3 incorporates these changes
#9: Remove the use of IT key words
13/10/2019: With the benefit of hindsight, it appears that we have used too many IT system keywords for element names. Thus we have names such as ‘class’, ‘type’, ‘value’, or ‘end’, which are also used in other systems as keywords or reserved words, and which therefore might cause errors in those systems. The only one which is an immediate problem is ‘class’, which cannot be transformed to a property of that name in C# (and I suspect most other languages) because ‘class’ is a reserved word. The suggested changes are:
- In study_identifiers and data_object_other_identifiers: “value” to “identifier_value”, “type” to “identifier_type”, “date” to “identifier_date”.
- In study_topics and object_topics: “value” to “topic_value” (or “topic_name”)
- In data object, “class” to “object_class”, “type” to “object_type”
- In data_object_dates, “start” to “start_date”, “end” to “end_date”
- In data_object_instances, “size” to “resource_size”
(SC)
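If existing data object JSON needs converting to the new names, a simple key-renaming pass along the following lines might be enough. This is a sketch under assumptions: the function names are hypothetical, each listed element is assumed to be an array of objects, and "topic_value" is used where the proposal also offers "topic_name". The study_identifiers and study_topics renames would be applied to study records in the same way.

```python
IDENTIFIER_RENAMES = {
    "value": "identifier_value",
    "type": "identifier_type",
    "date": "identifier_date",
}

# Renames applied inside each array element of a data object, keyed by element name.
ELEMENT_RENAMES = {
    "data_object_other_identifiers": IDENTIFIER_RENAMES,
    "object_topics": {"value": "topic_value"},
    "data_object_dates": {"start": "start_date", "end": "end_date"},
    "data_object_instances": {"size": "resource_size"},
}

# Renames applied to the data object's own top-level fields.
OBJECT_RENAMES = {"class": "object_class", "type": "object_type"}


def rename_keys(record: dict, renames: dict) -> dict:
    """Return a copy of record with any listed keys renamed."""
    return {renames.get(key, key): value for key, value in record.items()}


def migrate_data_object(obj: dict) -> dict:
    """Apply the proposed renames to a data object and its child elements."""
    migrated = rename_keys(obj, OBJECT_RENAMES)
    for element, renames in ELEMENT_RENAMES.items():
        if element in migrated:
            migrated[element] = [rename_keys(item, renames) for item in migrated[element]]
    return migrated
```

Keys that are not listed are passed through unchanged, so the function can be run safely on records that have already been converted.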
09/11/2019: Agreed, changes made as proposed in metadata schema v3
#10: Simplify a few element names
13/10/2019: Some of the data object names would be easier to use if they were shortened. This would be a minor but still useful change. Specifically:
- “data_object_other_identifiers” to “other_identifiers”
- “data_object_other_titles” to “other_titles”
- “data_object_contributors” to “object_contributors”
- “data_object_dates” to “object_dates”
- “data_object_descriptions” to “object_descriptions”
- “data_object_instances” to “object_instances”
- “data_object_rights” to “object_rights”
(SC)
09/11/2019: Agreed, changes made as proposed in metadata schema v3
#11: Making ‘organisation’ consistent
13/10/2019: A minor point, but for the most part organisation is spelt using the UK English form (which is also the same as the French and German). In the Study JSON definition, however, within the other_identifiers element, the word is still listed using the US spelling. It would be more consistent, and more importantly would avoid future errors, if the spelling was made consistently ‘organisation’ everywhere. (SC)
09/11/2019: Agreed, 'organization' removed from metadata schema v3 - changed to identifier_org or organisation
#12: ClinicalTrials.gov data source
13/10/2019: The initial download of the ClinicalTrials.gov (CTG) data used a direct download of a Postgres database created within the Clinical Trials Transformation Initiative (CTTI)'s Database for Aggregate Analysis of ClinicalTrials.gov (AACT, see https://aact.ctti-clinicaltrials.org/). This was a very useful and very quick way of downloading the CTG data, already formatted as a relational database. In the longer term, however, this may not be optimal, and downloading XML files, and in particular identifying new or revised XML files for download, may be much more efficient. In addition, using the AACT download means we are dependent on the design decisions taken in building this database, which might not always be the best match for our purposes in constructing the MDR. We therefore need to explore this alternative method of data retrieval from CTG, as well as working out a detailed data extraction process from the source XML to the MDR. This should be a priority amongst the data extraction tasks going forward.
(SC)
09/11/2019: Agreed, CTG data now downloaded via XML files rather than the AACT database
#21: Study Data Sharing Statement attribute
03/11/2019: Trial registries are now asking study managers to indicate their future plans for possible IPD sharing, with such information usually presented as a series of semi-structured text statements. This is important information for the MDR to display. It does not usually include references to specific data objects or detailed access instructions, offering instead a more general statement of policy and future intentions. It is proposed that an additional attribute be added to the study object to capture this information, as a single text field, so that it can be displayed (along with the available data object information) when a study is located in the MDR.
(SC)
09/11/2019: Agreed, study_data_sharing_statement included in metadata schema v3
#25: Remove empty pages?
13/10/2019: At the moment several of the pages in the wiki are empty or almost empty, e.g. 'Versioning', 'Requirements', 'Platform', 'Search engine', 'Meetings', 'TCs'.
I would suggest that such pages are removed, to be re-inserted if and when they have some content.
Alternatively, such pages should make their purpose clear, so that everyone knows what content should go there.
For instance, the purpose of the 'Versioning' page is unclear. Versioning of what? The database? The metadata schema? The mapping tool?
I would suggest any versioning descriptions should be part of the pages connected with each system component, as required.
There is a similar problem with 'Requirements'. For whom? End users? INFN? ECRIN? Again, descriptions of requirements should be attached to pages dealing with the relevant aspects or components of the system, although a summary page could then point to those more detailed pages.
I would suggest that we go through the Wiki and remove the empty or near empty pages, or make it clear how they are intended to be used. Having multiple empty pages is not a good look! (SC)
09/11/2019: Agreed, most empty pages removed and wiki contents reviewed and being supplemented
#27: Study Data Brief Description attribute
03/11/2019: It is much easier to assess the relevance of any found study to the current review or search task if a brief description is included. Most trial registries include this field.
(SC)
09/11/2019: Agreed, brief_description included in metadata schema v3