PubMed Data Structure

From ECRIN-MDR Wiki

Introduction

The PubMed data is available for retrieval as XML, with entries for both articles and books available through the E-utilities search mechanism. This section describes the structure, and thus the data points available, in an Article record. At the moment only Article metadata is retrieved into the MDR.
An Article XML record has two elements: MedlineCitation and PubmedData. The MedlineCitation element is by far the larger and contains the key Article element.
The elements contained within each of the two top level elements, and the Article element, are shown in the table below. Elements that contain data that is extracted are in bold.
The symbols used are: ? = optional, 0 or 1 occurrence; * = may be multiple, 0, 1 or more occurrences; + = at least 1 but may be multiple occurrences. No symbol indicates a single mandatory occurrence.

MedlineCitation
    PMID
    DateCreated?
    DateRevised?
    DateCompleted?
    Article
        Journal
        ArticleTitle
        Pagination / ELocationID*
        Abstract?
        AuthorList?
        Language+
        DataBankList?
        GrantList?
        PublicationTypeList
        VernacularTitle?
        ArticleDate*
    MedlineJournalInfo
    ChemicalList?
    SupplMeshList?
    CitationSubset*
    CommentsCorrectionsList?
    GeneSymbolList?
    NumberOfReferences?
    PersonalNameSubjectList?
    OtherID*
    OtherAbstract*
    KeywordList*
    CoiStatement?
    SpaceFlightMission*
    InvestigatorList?
    GeneralNote*
PubmedData
    History?
    PublicationStatus
    ArticleIdList
    ObjectList?
    ReferenceList*

Principal Elements within the PubMed structure

Data Elements

Each of the headings in the list is described below, with the emphasis on those that are actually used in the extraction. If an element is not used the reason for this is given. The text is a heavily abridged form of that available at https://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html/. Elements are referred to within angle brackets, e.g. <PMID>, while attributes have a @ prefix.

Identifiers and Dates

<MedlineCitation> Attributes
The top level <MedlineCitation> element has five attributes: @Owner, @Status, @IndexingMethod, @VersionID, and @VersionDate.
@Owner references the organisation that creates the citation, in almost all cases the National Library of Medicine. It is not extracted.
@Status indicates the stage of a citation, and has seven possible values:

  • In-Data-Review
  • In-Process
  • MEDLINE
  • PubMed-not-MEDLINE
  • Publisher
  • OLDMEDLINE
  • Completed

In-Data-Review status means the record has been submitted by a publisher but has not yet been checked by NLM for completeness and accuracy.
Once that check is completed most records move to the In-Process stage, when individual citation data is checked, some funding data may be added, and MeSH headings are added.
Once this is completed the record is elevated to MEDLINE status and the citation is complete. (The Completed status is an old status and has not been used since 2005.)
Articles outside the life sciences are judged out of scope for MEDLINE and are therefore categorised as PubMed-not-MEDLINE.
A small percentage of older records have the status OLDMEDLINE; these are records whose original MeSH headings, which reside in <KeywordList>, have not yet all been mapped to current MeSH. The Publisher status is used for some older records that do not fall within the scope of MEDLINE, and may also be used for new records submitted by a publisher where errors prevent the normal progression to In-Data-Review.
The @Status attribute therefore indicates the stage a record has reached, and hence whether it is likely to be revised in future. It should be extracted, even though it is not currently mapped to the MDR database.
The @IndexingMethod attribute was added in 2017, and refers to the method used to apply MeSH terms. It is not extracted.
@VersionID and @VersionDate were not used until February 2012. Only one journal, PLoS Currents, is currently using the versioning model. VersionID is also available more consistently from elsewhere in the record. These attributes are therefore not extracted, but a note can be generated when they exist, for possible future investigation.

<PMID>
This element contains the PubMed unique identifier, which is a 1 to 8-digit accession number with no leading zeros (though most records now have 8 digits). The element has one attribute, @Version, added with the 2011 DTD, though in almost all cases it is “1”.
Both the PMID value and the @Version attribute are extracted. Because the PMID value is used to identify the record in the first place it is already known to the system during the extraction process – extracting it does not therefore add any additional information.
Examples are:

     <PMID Version="1">10097079</PMID>
     <PMID Version="1">6012557</PMID>
     <PMID Version="2">20029614</PMID>
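As a minimal sketch of how these values might be read with Python's standard xml.etree.ElementTree module (the function name read_citation_ids and the sample fragment are illustrative assumptions, not part of the MDR code):

```python
import xml.etree.ElementTree as ET


def read_citation_ids(citation_xml: str) -> dict:
    """Read @Status, the PMID and its @Version from a <MedlineCitation> fragment."""
    citation = ET.fromstring(citation_xml)
    pmid_el = citation.find("PMID")
    if pmid_el is None or not (pmid_el.text or "").strip():
        raise ValueError("record has no <PMID>")
    return {
        "pmid": int(pmid_el.text.strip()),
        # Assumption: a missing @Version is treated as version 1.
        "version": int(pmid_el.get("Version", "1")),
        "status": citation.get("Status"),
    }


# Hypothetical, heavily trimmed record used only for illustration.
sample = (
    '<MedlineCitation Status="MEDLINE" Owner="NLM">'
    '<PMID Version="2">20029614</PMID>'
    "</MedlineCitation>"
)
print(read_citation_ids(sample))
```

Storing the PMID as an integer is one possible choice; since PMIDs have no leading zeros, a string would serve equally well.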


<DateCreated>
This was the date processing of the record began. The 2018 DTD removed the <DateCreated> element so this is no longer in the returned XML.

<DateCompleted>
<DateCompleted> is the date processing of the record ends; i.e., MeSH Headings have been added, quality assurance validations are completed, and the completed record subsequently is distributed to PubMed. In-Data-Review, In-Process and Publisher records lack <DateCompleted>. This element is extracted, as the date the citation (not the article!) was completed. The element has three numeric sub elements, as below, for the date’s year, month and day respectively. During extraction a date must therefore be constructed from these elements.

<DateCompleted>
    <Year>2002</Year>
    <Month>02</Month>
    <Day>07</Day>
</DateCompleted>
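Constructing the date from the three sub elements could look roughly like the sketch below. The helper name read_numeric_date is an assumption, and returning None for missing or malformed parts is one possible policy rather than the documented behaviour of the harvester. The same helper could serve <DateRevised>, which shares this structure.

```python
import xml.etree.ElementTree as ET
from datetime import date
from typing import Optional


def read_numeric_date(parent: ET.Element, tag: str) -> Optional[date]:
    """Build a date from the <Year>, <Month> and <Day> children of `tag`."""
    el = parent.find(tag)
    if el is None:
        # For example, In-Process records lack <DateCompleted>.
        return None
    try:
        return date(
            int(el.findtext("Year")),
            int(el.findtext("Month")),
            int(el.findtext("Day")),
        )
    except (TypeError, ValueError):
        # A missing or non-numeric part means no valid date can be built.
        return None


citation = ET.fromstring(
    "<MedlineCitation><DateCompleted>"
    "<Year>2002</Year><Month>02</Month><Day>07</Day>"
    "</DateCompleted></MedlineCitation>"
)
print(read_numeric_date(citation, "DateCompleted"))
```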


<DateRevised>
<DateRevised> is on all records. It identifies the date a change was last made to a record, though the record gives no indication of what the change was. It also requires extraction, as the date of the last citation revision, and has the same three element structure as <DateCompleted>.

The Article Element

<Article>
<Article> contains various elements describing the article cited; e.g., article title and author name(s). It has a single attribute, @PubModel, which is used to identify the medium/media in which the cited article is published. There are five possible values for @PubModel:

  • Print: the journal is published in print format only
  • Print-Electronic: the journal is published in both print and electronic format
  • Electronic: the journal is published in electronic format only
  • Electronic-Print: the journal is published first in electronic format followed by print (this value is currently used for just one journal, Nucleic Acids Research)
  • Electronic-eCollection: used for electronic-only journals that publish individual articles first and then later collect them into an “issue” date that is typically called an eCollection.

The @PubModel attribute is extracted, even though it is not directly mapped to a variable in the ECRIN dataset, because in combination with other data points it indicates the most appropriate choice of publication date in a citation.

<Article> / <Journal>
This element does not contain data itself, but it does hold various elements describing the journal cited; i.e., ISSN, Volume, Issue, and publication date.

<Article> / <Journal> / <ISSN>
<ISSN> (International Standard Serial Number) is an eight-character value that uniquely identifies the cited journal. An @IssnType attribute (value = 'Print' or 'Electronic') distinguishes ISSNs for print publication from those for electronic publications. All ISSNs present are extracted because they are used to try to identify the journal publisher.

<Article> / <Journal> / <JournalIssue>
This element contains information about the specific issue in which the article cited resides. It has a single attribute, @CitedMedium, which indicates whether a citation is processed/indexed at NLM from the online or the print version of the journal. The two valid attribute values are Internet and Print. The attribute is not extracted.

<Article> / <Journal> / <JournalIssue> / <Volume>
The volume number of the journal in which the article was published is recorded here, e.g. <Volume>7</Volume>, <Volume>5 Spec No</Volume>. This data is extracted because it forms part of the article’s ‘journal source string’, the part of a bibliographic citation that comes after the authors and title.

<Article> / <Journal> / <JournalIssue> / <Issue>
<Issue> identifies the issue, part or supplement of the journal in which the article was published, e.g. <Issue>Pt 1</Issue>, <Issue>3 Suppl</Issue>. This data is extracted because it forms part of the article’s ‘journal source string’.

Article Publication Date

<Article> / <Journal> / <JournalIssue> / <PubDate>
<PubDate> contains the full date on which the issue of the journal was published. The standardized format consists of elements for a 4-digit year, a 3-character abbreviated month, and a 1 or 2-digit day, as below.

<PubDate>
    <Year>2001</Year>
    <Month>Apr</Month>
    <Day>15</Day>
</PubDate>

Not every record contains all of these elements; the data are taken as they are published in the journal issue, with minor alterations by NLM such as abbreviating months. Some records may be partial, e.g.

<PubDate>
    <Year>2001</Year>
</PubDate>

The publication date for most records will reside in the separate date-related elements within <PubDate> as shown above and in these cases the record will not contain <MedlineDate>. The date of publication of the article will be found in <MedlineDate> when parsing for the separate fields is not possible; i.e., cases where dates do not fit the Year, Month, or Day pattern, for instance:

<PubDate>
   <MedlineDate>1998 Dec-1999 Jan</MedlineDate>
</PubDate>
<PubDate>
   <MedlineDate>2000 Spring-Summer</MedlineDate>
</PubDate>

However the date is presented, the harvest process has to:

  • Extract the publication year, as the value of the Year element or the first four characters of the MedlineDate value
  • Extract the date as a string, whether it is a full, partial or MedlineDate date type. This ensures that all dates can be extracted.
  • Where a full date exists, extract a date type constructed from the Year, Month and Day data.
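The three steps above can be sketched as follows. The function name read_pub_date and the result keys are hypothetical, and accepting a numeric <Month> as well as a three-letter abbreviation is a defensive assumption rather than something the description above requires.

```python
import re
import xml.etree.ElementTree as ET
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}


def read_pub_date(pub_date: ET.Element) -> dict:
    """Return the year, a string form and (where possible) a full date."""
    medline_date = (pub_date.findtext("MedlineDate") or "").strip()
    if medline_date:
        # Year = first four characters, provided they are digits.
        match = re.match(r"\d{4}", medline_date)
        return {
            "year": int(match.group()) if match else None,
            "date_string": medline_date,
            "full_date": None,
        }

    year = (pub_date.findtext("Year") or "").strip()
    month = (pub_date.findtext("Month") or "").strip()
    day = (pub_date.findtext("Day") or "").strip()
    result = {
        "year": int(year) if year.isdigit() else None,
        "date_string": " ".join(part for part in (year, month, day) if part),
        "full_date": None,
    }
    month_number = MONTHS.get(month) or (int(month) if month.isdigit() else None)
    if year.isdigit() and month_number and day.isdigit():
        try:
            result["full_date"] = date(int(year), month_number, int(day))
        except ValueError:
            pass  # e.g. an impossible day for the month; keep the string only
    return result


samples = [
    "<PubDate><Year>2001</Year><Month>Apr</Month><Day>15</Day></PubDate>",
    "<PubDate><Year>2001</Year></PubDate>",
    "<PubDate><MedlineDate>1998 Dec-1999 Jan</MedlineDate></PubDate>",
]
for xml_text in samples:
    print(read_pub_date(ET.fromstring(xml_text)))
```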


Article Titles

<Article> / <Journal> / <Title>
The full journal title. This does not need to be extracted as it is the journal abbreviation (in the <MedlineTA> element) that is required for the journal source string.

<Article> / <Journal> / <ISOAbbreviation>
This element is a journal abbreviation constructed at NLM to assist NCBI in linking from GenBank to PubMed. It is not extracted.

<Article> / <ArticleTitle>
<ArticleTitle> contains the entire title of the journal article. <ArticleTitle> is always in English. Examples are:

    <ArticleTitle>The Kleine-Levin syndrome as a neuropsychiatric disorder: a case report.</ArticleTitle>
    <ArticleTitle>Why is xenon not more widely used for anaesthesia?</ArticleTitle>

Those titles originally published in a non-English language and translated for <ArticleTitle> are enclosed in square brackets. Explanatory information about the title itself may be enclosed in parentheses, e.g.: (author's transl). Records distributed with [In Process Citation] in <ArticleTitle> are non-English language citations in In-Process <MedlineCitation> status that do not yet have the article title translated into English. Examples are:

    <ArticleTitle>[Biological rhythms and human disease]</ArticleTitle>
    <ArticleTitle>[In Process Citation]</ArticleTitle>
    <ArticleTitle>[The effect of anti-arrhythmic drugs on myocardial function (author's transl)]</ArticleTitle>

<ArticleTitle> must be extracted. Further processing may be necessary to remove square brackets and any following information.
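One possible form of that further processing is sketched below. The function name clean_article_title is hypothetical, and treating [In Process Citation] as "no title yet" is an assumption about how such records might be handled; parenthetical notes such as (author's transl) are deliberately left untouched here.

```python
from typing import Optional, Tuple


def clean_article_title(raw: str) -> Tuple[Optional[str], bool]:
    """Return (title, was_translated) for the text of an <ArticleTitle>."""
    title = raw.strip()
    if title == "[In Process Citation]":
        # Placeholder: the translated title is not yet available.
        return None, False
    if title.startswith("["):
        end = title.rfind("]")
        if end > 0:
            # Drop the enclosing brackets and anything after the closing one.
            return title[1:end].strip(), True
    return title, False


for text in (
    "[Biological rhythms and human disease]",
    "[In Process Citation]",
    "Why is xenon not more widely used for anaesthesia?",
):
    print(clean_article_title(text))
```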

Article Pagination

<Article> / <Pagination> and <Article> / <ELocationID>
These two elements can be used together, or <ELocationID> can be used on its own, to indicate the exact physical and / or electronic location of the article.
<Pagination> indicates the inclusive pages for the article cited, within an inner <MedlinePgn> element. The pagination can be entirely non-digit data and redundant digits are omitted. Document numbers for electronic articles are found here. Examples are:

    <MedlinePgn>304- 10</MedlinePgn>
    <MedlinePgn>1199-201</MedlinePgn>
    <MedlinePgn>34, 72, 84 passim</MedlinePgn>
    <MedlinePgn>suppl 111-2</MedlinePgn>
    <MedlinePgn>925; author reply 925- 6</MedlinePgn>

<ELocationID> was defined for use in 2008 and may reside on records either in lieu of Pagination or, for items with both print and electronic locations, in addition to the Pagination element. Its purpose is to provide an electronic location for items which lack standard page numbers. It houses Digital Object Identifiers (DOIs) or Publisher Item Identifiers (PIIs) that are provided by publishers for new citations submitted to NLM for inclusion in MEDLINE/PubMed.
The element has two attributes, EIdType and ValidYN. EIdType indicates the type of ELocation data, DOI or PII. It is anticipated that a DOI will be supplied far more frequently by publishers than a PII. The default ValidYN value is “Y”. If corrected ELocation data is supplied by publishers to NLM, the revised DOI will be tagged ValidYN=Y and the original DOI will be retained with the ValidYN value “N”. Examples are:

    <ELocationID EIdType="doi" ValidYN="Y">10.1021/cr068126n</ELocationID>
    <ELocationID EIdType="doi" ValidYN="N">10.1001/jama.298.18.216</ELocationID>
    <ELocationID EIdType="pii" ValidYN="Y">18829</ELocationID>

The pagination data is extracted as it forms part of the journal source string descriptive element. The <ELocationID> data is also extracted, with any DOI entries mapped to the DOI identifier for the citation data object, and any ‘pii’ entries being included within ‘other identifiers’.
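The mapping of <ELocationID> entries might be sketched as below. The function name read_elocation_ids is hypothetical, and skipping entries flagged ValidYN="N" is an assumption made for this example; the text above does not say how superseded identifiers are treated.

```python
import xml.etree.ElementTree as ET


def read_elocation_ids(article: ET.Element):
    """Return (doi, other_identifiers) from the <ELocationID> children."""
    doi = None
    other_identifiers = []
    for el in article.findall("ELocationID"):
        value = (el.text or "").strip()
        # Assumption: superseded entries (ValidYN="N") are ignored.
        if not value or el.get("ValidYN", "Y") != "Y":
            continue
        id_type = (el.get("EIdType") or "").lower()
        if id_type == "doi" and doi is None:
            doi = value
        elif id_type == "pii":
            other_identifiers.append(("pii", value))
    return doi, other_identifiers


article = ET.fromstring(
    "<Article>"
    '<ELocationID EIdType="doi" ValidYN="Y">10.1021/cr068126n</ELocationID>'
    '<ELocationID EIdType="doi" ValidYN="N">10.1001/jama.298.18.216</ELocationID>'
    '<ELocationID EIdType="pii" ValidYN="Y">18829</ELocationID>'
    "</Article>"
)
print(read_elocation_ids(article))
```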

Article Abstracts

<Article> / <Abstract>
Abstracts are stored as a series of <AbstractText> elements.
All abstracts are in English (N.B. some records may contain <OtherAbstract> in addition to or instead of <Abstract>). Increasingly abstracts are also structured, with the <AbstractText> element containing, from 2011, two attributes: @Label and @NlmCategory. @Label is the category or subheading provided by the article’s authors or publishers, while @NlmCategory is one of a smaller set of categories as defined by NLM.
In summer 2014, NLM added Dryad and figshare data repositories to the <DataBankList> elements. Data availability information may also reside in the <AbstractText> element labeled DATA AVAILABILITY, although this is not yet extracted (to be investigated further).
Although publishers have given the National Library of Medicine permission to use abstracts for which they claim copyright, NLM does not hold copyright on the abstracts in MEDLINE. Given that, to avoid copyright issues, the abstract text is not extracted to the MDR, but a link is always present to the PubMed entry so that users can read the abstract there if necessary.
<CopyrightInformation> associated with <AbstractText> was introduced in 1999, and appears on a limited but increasing number of records. It is not extracted.

Authors

<Article> / <AuthorList>
Personal and collective (corporate) author names published with an article are found in <AuthorList>, which contains one or more <Author> elements. Anonymous articles (including those with pseudonyms) are identified by the absence of <AuthorList>.
If an article has more authors than were entered into the record, then <AuthorList CompleteYN="N"> indicates the list is not complete. This attribute, when set to "N" for No, should be translated into 'et al.' for display purposes.
The attribute ValidYN is used on each Author occurrence to indicate the true spelling of the name (some published author names are subsequently corrected by the publishers and NLM retains both versions in the MEDLINE/PubMed record). Only names with ValidYN=Y are extracted.
Personal <Author> data resides in the following elements:

  • <LastName> contains the surname or the single name used by an individual, even if that single name is not considered to be a surname
  • <ForeName> contains the remainder of name except for suffix
  • <Suffix> contains a valid MEDLINE suffix (e.g., 2nd, or 3rd, etc., Jr or Sr). Honorifics (e.g., PhD, MD, etc.) are not carried in the data.
  • <Initials> contains up to two initials
  • <Identifier> was added to <AuthorList> with the 2010 DTD, but was not used until 2013. It is defined to contain a unique identifier associated with the name. The value in the Identifier attribute Source designates the organizational authority that established the unique identifier. For example, <Identifier Source="ORCID">0000000179841889</Identifier>.
  • <AffiliationInfo> was added to <AuthorList> with the 2015 DTD. The <AffiliationInfo> envelope element includes <Affiliation> and <Identifier>.
  • <EqualContrib> was added to <Author> with the 2017 DTD.

Full first and middle names, if published, are entered in <ForeName> beginning with items published in 2002. Examples are:

    <Author ValidYN="Y">
        <LastName>Melosh</LastName>
        <ForeName>H J</ForeName>
        <Suffix>3rd</Suffix>
        <Initials>HJ</Initials>
    </Author>

    <Author ValidYN="Y">
        <LastName>Abrams</LastName>
        <ForeName>Judith</ForeName>
        <Initials>J</Initials>
    </Author>

The information contained within <AffiliationInfo> varies slightly over time but generally includes the <Affiliation> itself, and optionally an <Identifier> element, for example:

    <AffiliationInfo>
        <Affiliation>Departamento de Farmacologia, Facultad de Medicina, Universidad Complutense de Madrid (UCM), 28040 Madrid, Spain.</Affiliation>
    </AffiliationInfo>

    <AffiliationInfo>
        <Affiliation>Beth Israel Deaconess Medical</Affiliation>
        <Identifier Source="Ringgold">678922</Identifier>
    </AffiliationInfo>

N.B. Personal names of individuals (e.g., collaborators and investigators) who are listed in the paper as members of a collective/corporate group that is an author of the paper reside in <InvestigatorList>.
Collective or corporate name <Author> data resides in <CollectiveName>. These names enter MEDLINE exactly as they appear in the journal; NLM will not edit the names to standardize them or translate them into English. NLM enters the Roman alphabet words (e.g., German, French) into <CollectiveName>. Transliterated Russian or other Cyrillic names are also entered into <CollectiveName> but for Japanese, Chinese, Hebrew, and Arabic NLM puts the English translation of the name into the <CollectiveName>. An example is:

    <Author>
        <CollectiveName>SBU-group. Swedish Council of Technology Assessment in Health Care</CollectiveName>
    </Author>

All of the information in <AuthorList> (apart from <EqualContrib>) is extracted.
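A rough sketch of reading an <AuthorList> along these lines is given below. The function name read_authors and the dictionary keys are hypothetical, and the misspelt "Melsoh" entry is invented purely to show the ValidYN filter.

```python
import xml.etree.ElementTree as ET


def read_authors(author_list: ET.Element):
    """Return (authors, list_is_incomplete) for an <AuthorList> element."""
    authors = []
    for author in author_list.findall("Author"):
        # Only correctly spelt names (ValidYN="Y") are kept.
        if author.get("ValidYN", "Y") != "Y":
            continue
        authors.append({
            "collective_name": author.findtext("CollectiveName"),
            "last_name": author.findtext("LastName"),
            "fore_name": author.findtext("ForeName"),
            "suffix": author.findtext("Suffix"),
            "initials": author.findtext("Initials"),
            "identifiers": [
                (i.get("Source"), (i.text or "").strip())
                for i in author.findall("Identifier")
            ],
            "affiliations": [
                (a.findtext("Affiliation") or "").strip()
                for a in author.findall("AffiliationInfo")
            ],
        })
    # CompleteYN="N" means more authors exist: display with 'et al.'
    return authors, author_list.get("CompleteYN", "Y") == "N"


author_list = ET.fromstring("""
<AuthorList CompleteYN="N">
  <Author ValidYN="Y">
    <LastName>Melosh</LastName><ForeName>H J</ForeName>
    <Suffix>3rd</Suffix><Initials>HJ</Initials>
  </Author>
  <Author ValidYN="N">
    <LastName>Melsoh</LastName><ForeName>H J</ForeName><Initials>HJ</Initials>
  </Author>
  <Author>
    <CollectiveName>SBU-group. Swedish Council of Technology Assessment in Health Care</CollectiveName>
  </Author>
</AuthorList>
""")
authors, incomplete = read_authors(author_list)
print(len(authors), incomplete)
```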

Languages

<Article> / <Language>
The language in which an article was published is recorded in <Language>. All entries are three letter abbreviations stored in lower case, such as eng, fre, ger, jpn, etc. When a single record contains more than one language value the XML export program extracts the languages in alphabetic order by the 3-letter language value. Examples are:

    <Language>eng</Language>
    <Language>rus</Language>

This data is extracted, with the three-letter codes later being converted to two-letter ISO codes.
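The code conversion could be as simple as a lookup table. The sketch below covers only a handful of languages for illustration; the table name, the function name and the choice to return None for unmapped codes are all assumptions, and a real harvest would need the complete code list.

```python
from typing import Optional

# Illustrative subset only: three-letter code -> two-letter ISO 639-1 code.
THREE_TO_TWO_LETTER = {
    "eng": "en",
    "fre": "fr",
    "ger": "de",
    "jpn": "ja",
    "rus": "ru",
    "spa": "es",
    "ita": "it",
    "chi": "zh",
}


def to_two_letter(code: str) -> Optional[str]:
    """Map a <Language> value to a two-letter code, or None if unmapped."""
    return THREE_TO_TWO_LETTER.get(code.strip().lower())


print([to_two_letter(c) for c in ("eng", "rus", "xxx")])
```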

Databank references

<Article> / <DataBankList>
This element contains information about several types of data associated with a journal article:

  • molecular sequence data (beginning in 1988 and expanded in 2014);
  • clinical trial numbers (beginning summer 2005 and expanded in 2006 and 2014);
  • gene expression/molecular abundance data (beginning February 2006);
  • PubChem identifiers (beginning in January 2007 and expanded in 2014);
  • two general research databanks, the Dryad Digital Repository and figshare (beginning in 2014); and
  • BioProject identifiers (beginning in 2014).

The complete list of databanks is available at //www.nlm.nih.gov/bsd/medline_databank_source.html.
The clinical trial numbers are especially important in establishing links between the PubMed citation and an associated study or studies, and are also used to select records for extraction.
The <DataBankList> element contains one or more <DataBank> elements, each of which includes a <DataBankName> element and an <AccessionNumberList>, which itself contains one or more <AccessionNumber> elements. For example:

     <DataBankList CompleteYN="Y">
         <DataBank>
             <DataBankName>ClinicalTrials.gov</DataBankName>
             <AccessionNumberList>
                 <AccessionNumber>NCT00000161</AccessionNumber>
             </AccessionNumberList>
         </DataBank>
         <DataBank>
             <DataBankName>Dryad</DataBankName>
             <AccessionNumberList>
                 <AccessionNumber>z8a11</AccessionNumber>
             </AccessionNumberList>
         </DataBank>
     </DataBankList>

All of the data in <DataBankList> is extracted. Where the <DataBankName> refers to a clinical trial registry the accession numbers are used to match the PubMed record with Study records (using the Ids within the MDR system). Other accession number types are unlikely to be mapped but are extracted for further investigation and potential future use.
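As a sketch of how the accession numbers might be flattened and the trial registry entries flagged: the function name read_databank_refs is hypothetical, and TRIAL_REGISTRY_NAMES is an assumed, deliberately incomplete set of registry names used only for this example.

```python
import xml.etree.ElementTree as ET

# Assumption: an illustrative, incomplete set of trial registry names.
TRIAL_REGISTRY_NAMES = {"ClinicalTrials.gov", "ISRCTN"}


def read_databank_refs(databank_list: ET.Element) -> list:
    """Flatten <DataBankList> into one entry per accession number."""
    refs = []
    for bank in databank_list.findall("DataBank"):
        name = (bank.findtext("DataBankName") or "").strip()
        for acc in bank.findall("AccessionNumberList/AccessionNumber"):
            number = (acc.text or "").strip()
            if number:
                refs.append({
                    "databank": name,
                    "accession_number": number,
                    "is_trial_registry": name in TRIAL_REGISTRY_NAMES,
                })
    return refs


databank_list = ET.fromstring("""
<DataBankList CompleteYN="Y">
  <DataBank>
    <DataBankName>ClinicalTrials.gov</DataBankName>
    <AccessionNumberList>
      <AccessionNumber>NCT00000161</AccessionNumber>
    </AccessionNumberList>
  </DataBank>
  <DataBank>
    <DataBankName>Dryad</DataBankName>
    <AccessionNumberList>
      <AccessionNumber>z8a11</AccessionNumber>
    </AccessionNumberList>
  </DataBank>
</DataBankList>
""")
refs = read_databank_refs(databank_list)
for ref in refs:
    print(ref)
```

Entries with is_trial_registry set would then be the candidates for matching against study identifiers; the others are simply retained.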

Other Article Elements

<Article> / <GrantList>
This element was introduced in 1981 and contains the elements dealing with research grants or contracts. It is, however, not relevant to the MDR and so is not extracted.

<Article> / <PublicationTypeList>
This element is used to identify the type of article indexed; it characterizes the nature of the information or the manner in which it is conveyed as well as the type of research support received (e.g., Review, Letter, Retracted Publication, Clinical Conference, Research Support, N.I.H., Extramural). Records may contain more than one <PublicationType>; these are listed in alphabetical order. Defined for <PublicationType> with the 2015 DTD, the @UI attribute carries the MeSH unique identifiers for publication types. An example is:

     <PublicationTypeList>
         <PublicationType UI="D016428">Journal Article</PublicationType>
         <PublicationType UI="D052061">Research Support, N.I.H., Extramural</PublicationType>
         <PublicationType UI="D016441">Retracted Publication</PublicationType>
         <PublicationType UI="D016454">Review</PublicationType>
     </PublicationTypeList>

This data is extracted, for further exploration and identification of particularly useful publication types (for example identifying reviews), but is not currently incorporated into the MDR schema.

<Article> / <VernacularTitle>
<VernacularTitle> is used for articles published in non-English languages and contains the original, untranslated title. Non-Roman alphabet language titles are transliterated. The translated titles are in <ArticleTitle> and enclosed in brackets. This data is extracted and mapped to ‘other_titles’.

<Article> / <ArticleDate>
<ArticleDate> contains the date the publisher made an electronic version of the article available, with the month represented as a 2-digit number rather than the alphabetic abbreviation used for the month in <PubDate>.
A record includes <ArticleDate> only if that data is included in the publisher's electronic submission to NLM, and it may be present on records with <Article> PubModel attribute values of Electronic, Print-Electronic, Electronic-Print or Electronic-eCollection.
The attribute @DateType is always used with <ArticleDate>. It represents the media of the article published on the date in that element; the only valid value is "Electronic."
This data is extracted as the electronic publication date. Various combinations of the <Article> PubModel attributes and the data in <ArticleDate> control which dates display in the journal source string. Information on how to interpret these data to indicate print and/or electronic publication dates is provided on the NLM website.

<MedlineJournalInfo>
This element contains further information about the source journal.
<MedlineTA> contains the standard abbreviation for the title of the journal in which the article appeared. Examples are:

    <MedlineTA>JAMA</MedlineTA>
    <MedlineTA>J Comp Physiol B</MedlineTA>
    <MedlineTA>Ann Biol Clin (Paris)</MedlineTA>

This data is extracted as it is used in constructing the journal source string.
<Country>, which carries the place of publication of the journal, is extracted as it can help with identifying title languages.
<NlmUniqueID>, which is the accession number for the journal's record assigned in the NLM online catalog, and <ISSNLinking>, which enables co-location or linking among the different media versions of a continuing resource (separate ISSNs are assigned for each media type in which a resource is issued), are not extracted.

Topic Elements

<MeshHeadingList>
The NLM controlled vocabulary, Medical Subject Headings (MeSH), is used to characterize the content of the articles represented by MEDLINE citations.
Of the various MeSH headings assigned to a record, those representing the most significant points are identified with the MajorTopicYN attribute set to Y for Yes. The remaining descriptors are used to identify concepts which have also been discussed in the item, but that are not the primary topics.
Each <MeshHeading> in <MeshHeadingList> contains <DescriptorName> and often <QualifierName>. The MajorTopicYN attribute for <DescriptorName> is set to Y (for Yes) when the MeSH Heading alone is a central concept of the article (without a QualifierName).
Defined for <DescriptorName> with the 2011 DTD, the Type attribute with its valid value Geographic, is used to distinguish MeSH geographic subject terms in Category Z of MeSH from other subject terms. Defined for <DescriptorName> and <QualifierName> with the 2015 DTD, the UI attribute carries the MeSH unique identifiers for descriptors and qualifiers. An example is:

     <MeshHeadingList>
         <MeshHeading>
             <DescriptorName MajorTopicYN="N" UI="D004740">English Abstract</DescriptorName>
         </MeshHeading>
         <MeshHeading>
             <DescriptorName MajorTopicYN="N" UI="D005317">Fetal Growth Retardation</DescriptorName>
             <QualifierName MajorTopicYN="N" UI="Q000150">complications</QualifierName>
             <QualifierName MajorTopicYN="Y" UI="Q000503">physiopathology</QualifierName>
         </MeshHeading>
         <MeshHeading>
             <DescriptorName MajorTopicYN="N" UI="D006801">Humans</DescriptorName>
         </MeshHeading>
         <MeshHeading>
             <DescriptorName MajorTopicYN="N" Type="Geographic" UI="D014481">United States</DescriptorName>
         </MeshHeading>
     </MeshHeadingList>

This data is extracted as it forms part of the object topic data.
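A minimal sketch of turning a <MeshHeadingList> into topic entries is shown below. The function name read_mesh_headings and the shape of the returned dictionaries are assumptions for illustration, not the MDR's actual topic schema.

```python
import xml.etree.ElementTree as ET


def read_mesh_headings(mesh_list: ET.Element) -> list:
    """Return one entry per <MeshHeading>, with any qualifiers attached."""
    topics = []
    for heading in mesh_list.findall("MeshHeading"):
        descriptor = heading.find("DescriptorName")
        if descriptor is None:
            continue
        topics.append({
            "ui": descriptor.get("UI"),
            "term": (descriptor.text or "").strip(),
            "major_topic": descriptor.get("MajorTopicYN") == "Y",
            "geographic": descriptor.get("Type") == "Geographic",
            # Each qualifier: (UI, term, is_major_topic)
            "qualifiers": [
                (q.get("UI"), (q.text or "").strip(), q.get("MajorTopicYN") == "Y")
                for q in heading.findall("QualifierName")
            ],
        })
    return topics


mesh_list = ET.fromstring("""
<MeshHeadingList>
  <MeshHeading>
    <DescriptorName MajorTopicYN="N" UI="D005317">Fetal Growth Retardation</DescriptorName>
    <QualifierName MajorTopicYN="N" UI="Q000150">complications</QualifierName>
    <QualifierName MajorTopicYN="Y" UI="Q000503">physiopathology</QualifierName>
  </MeshHeading>
  <MeshHeading>
    <DescriptorName MajorTopicYN="N" Type="Geographic" UI="D014481">United States</DescriptorName>
  </MeshHeading>
</MeshHeadingList>
""")
topics = read_mesh_headings(mesh_list)
for topic in topics:
    print(topic)
```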

<SupplMeshList>
<SupplMeshList> and the <SupplMeshName> elements it contains (containing the attribute @Type) are used to house (chemotherapy) Protocol Class 2 Supplementary Concept Record (SCR) terms and Disease Class 3 SCR terms. The Type attribute distinguishes Class 2 from Class 3 terms. Defined for <SupplMeshName> with the 2015 DTD, the UI attribute carries the MeSH unique identifiers for supplemental protocols and diseases. The 2018 DTD introduced the SCRs Class 4 for Organism terms.
Examples are:

    <SupplMeshList>
        <SupplMeshName Type="Disease" UI="C538248">Amyloid angiopathy</SupplMeshName>
    </SupplMeshList>
    <SupplMeshList>
        <SupplMeshName Type="Organism" UI="C000623891">Tomato yellow leaf curl virus</SupplMeshName>
    </SupplMeshList>

The values, types and MeSH UIs in this data are extracted. They are converted into object topics with a source category determined by the @Type attribute.

<KeywordList>
<KeywordList> contains controlled terms in <Keyword> that describe the content of the article. Keywords are assigned by a collaborating data producer. Not all MEDLINE data producers supply Keywords; those that do use their own list of specialized terms which may change during the year.
Beginning in January 2013, the <KeywordList> with Owner attribute NOTNLM contains author-written keywords in <Keyword>. Author-written keywords describe the content of the article and are supplied by publishers.
Keywords, when present, appear in addition to MeSH Headings. The same Keyword may exist in more than one Keyword List. The element <KeywordList> can have one or more of the @Owner attributes listed in <MedlineCitation>, which identifies the organization that assigned the subject terms. An example is:

    <KeywordList Owner="NOTNLM">
        <Keyword MajorTopicYN="N">apnea syndrome</Keyword>
        <Keyword MajorTopicYN="N">cardiovascular disease</Keyword>
        <Keyword MajorTopicYN="N">self management</Keyword>
    </KeywordList>

This information should be extracted, as it forms part of the object topics dataset.

<ChemicalList>
This element contains one or more <Chemical> elements that, in turn, contain <RegistryNumber> and <NameOfSubstance>.
<RegistryNumber> contains the unique 5 to 9 digit number in hyphenated format assigned by the Chemical Abstracts Service to specific chemical substances; for enzymes, the E.C. number derived from Enzyme Nomenclature is placed in this element.
<NameOfSubstance> is the name of the substance that the registry number or the E.C. number identifies. Defined for <NameOfSubstance> with the 2015 DTD, the UI attribute carries the MeSH unique identifiers for names of the substances. An example of a chemical list is:

    <ChemicalList>
        <Chemical>
            <RegistryNumber>69-93-2</RegistryNumber>
            <NameOfSubstance UI="D014527">Uric Acid</NameOfSubstance>
        </Chemical>
        <Chemical>
            <RegistryNumber>6964-20-1</RegistryNumber>
            <NameOfSubstance UI="C004568">tiadenol</NameOfSubstance>
        </Chemical>
        <Chemical>
            <RegistryNumber>EC 3.1.1.34</RegistryNumber>
            <NameOfSubstance UI="D008071">Lipoprotein Lipase</NameOfSubstance>
        </Chemical>
    </ChemicalList>

The names and MeSH UIs in this data are extracted (registry numbers may possibly be useful in the future but are not extracted for the moment). They are converted into object topics with source category ‘Chemical’.

Other Citation Elements

<CitationSubset>
<CitationSubset> identifies the subset for which MEDLINE records from certain journal lists or records on specialised topics were created. The information is not relevant to the MDR and is not extracted.

<CommentsCorrectionsList>
This contains one or more <CommentsCorrections> elements that, in turn, contain an associated <RefSource>, usually the associated <PMID>, and possibly a clarifying <Note>. These data pertain to and contain citations to associated journal publications, e.g., comments, errata, retractions, or cited references, and enable outside links between the record at hand and its associated citation(s).
The attribute @RefType is used with <CommentsCorrections>; one or more of the following RefType valid values may reside on a record (introduced at different times and with different levels of coverage):

  • AssociatedDataset cites the reference to a dataset description.
  • AssociatedPublication cites the reference to a scientific paper reporting on or utilising a dataset.
  • Cites lists items in the bibliography or list of references at the end of an article. Cites data currently resides on records citing articles deposited in PMC and whose citation record is in the NLM DCMS.
  • CommentOn cites the reference upon which the article comments.
  • CommentIn cites the reference containing a comment about the article.
  • ErratumIn cites the reference containing a published erratum to the article.
  • ErratumFor cites the original article for which there is a published erratum.
  • ExpressionOfConcernIn cites the expression of concern (on citation for original article).
  • ExpressionOfConcernFor cites the original article for which there is an expression of concern.
  • RepublishedFrom cites the original article subsequently corrected and republished.
  • RepublishedIn cites the final, correct version of a corrected and republished article.
  • RetractionOf cites the article being retracted.
  • RetractionIn cites the reference containing a retraction of the article.
  • UpdateIn cites the reference containing an update to the article.
  • UpdateOf cites the article being updated; limited use.
  • SummaryForPatientsIn cites the reference containing a patient summary article.
  • OriginalReportIn cites a scientific article associated with a patient summary.
  • ReprintIn cites the subsequent (and possibly abridged) version of a republished article.
  • ReprintOf cites the first, originally published article.

<RefSource> contains the citation of the associated record. <PMID> contains the PMID of the associated record (if available) thus providing a link between a citation and the citation of its related @RefType, such as a comment, erratum, retraction, or item in its bibliography. <Note> clarifies the data in <CommentsCorrections> but is used infrequently.
This data can sometimes be used to establish relationships between data objects (mostly other articles but sometimes datasets), and it should therefore be extracted. Examples are:

    <CommentsCorrectionsList>
        <CommentsCorrections RefType="ErratumIn">
            <RefSource>J Infect Dis 1998 Aug;178(2):601</RefSource>
            <Note>Whitely RJ [corrected to Whitley RJ]</Note>
        </CommentsCorrections>
        <CommentsCorrections RefType="RetractionOf">
            <RefSource>Dunkel EC, de Freitas D, Scheer DI, Siegel ML, Zhu Q, Whitley RJ, Schaffer PA, Pavan-Langston D. J Infect Dis. 1993 Aug;168(2):336-44</RefSource>
            <PMID VersionID = "1">8393056</PMID>
        </CommentsCorrections>
    </CommentsCorrectionsList>

    <CommentsCorrectionsList>
        <CommentsCorrections RefType="ErratumIn">
            <RefSource>HIV Clin Trials. 2009 Mar-Apr;10(2):vi</RefSource>
            <Note>Dosage error in published abstract; MEDLINE/PubMed abstract corrected</Note>
        </CommentsCorrections>
    </CommentsCorrectionsList>

    <CommentsCorrectionsList>
        <CommentsCorrections RefType="ErratumIn">
            <RefSource>Adv Chronic Kidney Dis. 2006 Oct;13(4):433</RefSource>
            <Note>Dosage error in article text</Note>
        </CommentsCorrections>
    </CommentsCorrectionsList>
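
A minimal Python sketch of how such entries might be read is given below. It assumes that only entries carrying a <PMID> can be linked automatically to another record; the function name and dictionary keys are hypothetical, and the sample is abridged from the first example above (the author list in the second <RefSource> is omitted).

```python
import xml.etree.ElementTree as ET

# Abridged from the first example above.
SAMPLE = """
<CommentsCorrectionsList>
    <CommentsCorrections RefType="ErratumIn">
        <RefSource>J Infect Dis 1998 Aug;178(2):601</RefSource>
        <Note>Whitely RJ [corrected to Whitley RJ]</Note>
    </CommentsCorrections>
    <CommentsCorrections RefType="RetractionOf">
        <RefSource>J Infect Dis. 1993 Aug;168(2):336-44</RefSource>
        <PMID>8393056</PMID>
    </CommentsCorrections>
</CommentsCorrectionsList>
"""


def extract_related_records(cc_list):
    """Return one dict per <CommentsCorrections> entry."""
    related = []
    for cc in cc_list.findall("CommentsCorrections"):
        related.append({
            "ref_type": cc.get("RefType"),
            "ref_source": (cc.findtext("RefSource") or "").strip(),
            "pmid": cc.findtext("PMID"),  # None when no PMID is supplied
            "note": cc.findtext("Note"),  # None when no Note is supplied
        })
    return related


related = extract_related_records(ET.fromstring(SAMPLE))
# Assumption: only entries with a PMID can be linked to another record.
linkable = [r for r in related if r["pmid"]]
print(linkable)
```

In this sketch the erratum entry is kept in the full list but is not linkable, because it has a citation string and no PMID.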


<GeneSymbolList>
This is not currently used and does not require extraction.

<NumberOfReferences>
Use of this element is now discontinued and it does not need to be extracted.

<PersonalNameSubjectList>
This relates to individuals' names for citations that contain a biographical note or obituary, or that are entirely about the life or work of one or more individuals. In general this data does not seem relevant to the MDR and therefore is not extracted.

<OtherID>
<OtherID> may occur on a record owned by a collaborating partner or on an NLM-owned record to which a collaborating partner added additional information not originally included by NLM on the record, or where there are PMC or NIH Manuscript System identifiers present.
<OtherID> and its @Source attribute identify the organization responsible for the information on the citation or the document where the information originated, and a unique number for that citation or document. The field may occur multiple times. In practice the @Source attribute is mostly ‘NLM’, although a few other values (e.g. ‘KIE’, ‘NRCBL’) are currently actively used. Examples include:

    <OtherID Source="KIE">101133</OtherID>
    <OtherID Source="NRCBL">14.1</OtherID>
    <OtherID Source="NLM">PMC373290</OtherID>
    <OtherID Source="NLM">PMC2442205 [Available on 12/30/08]</OtherID>
    <OtherID Source="NLM">PMC2762775.2</OtherID>

PubMed Central identifiers can be identified from the PMC prefix.
This data should be extracted in the future, because it can be mapped to other_identifiers.
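
A possible way of picking out the PubMed Central identifiers is sketched below in Python. Dropping the trailing availability note and the version suffix (as in ‘PMC2762775.2’) is an assumption made for this example, as are the function name and dictionary keys; the original value is kept alongside the cleaned one.

```python
import re
import xml.etree.ElementTree as ET

# The wrapper element is used only to hold the sample <OtherID> elements.
SAMPLE = """
<MedlineCitation>
    <OtherID Source="KIE">101133</OtherID>
    <OtherID Source="NLM">PMC373290</OtherID>
    <OtherID Source="NLM">PMC2442205 [Available on 12/30/08]</OtherID>
    <OtherID Source="NLM">PMC2762775.2</OtherID>
</MedlineCitation>
"""

# Matches the PMC prefix and the digits that follow; anything after
# them (version suffix, availability note) is left out of the match.
PMC_PATTERN = re.compile(r"^(PMC\d+)")


def extract_other_ids(citation):
    """Return every <OtherID>, flagging PMC identifiers by their prefix."""
    ids = []
    for element in citation.findall("OtherID"):
        value = (element.text or "").strip()
        match = PMC_PATTERN.match(value)
        ids.append({
            "source": element.get("Source"),
            "value": value,
            "pmc_id": match.group(1) if match else None,
        })
    return ids


other_ids = extract_other_ids(ET.fromstring(SAMPLE))
for entry in other_ids:
    print(entry)
```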

<OtherAbstract>
Whether or not there is an <Abstract>, a collaborating partner or other entity (identified in the <OtherAbstract> @Type attribute) may create an <OtherAbstract> for that record.
With the 2013 DTD, the Language attribute was added to <OtherAbstract> so that NLM can indicate on behalf of publishers that there are additional abstracts available at the publishers' Web sites or elsewhere. The MEDLINE/PubMed record will not carry the abstract. Instead, <AbstractText> of <OtherAbstract> will include a standard phrase such as "Abstract available in Spanish from the publisher." The phrase will be supplied by publishers and they will use the <OtherAbstract> Language attribute to indicate the language of the abstract available at their Web site.
This information is currently not extracted.

<CoiStatement>
The <CoiStatement> element contains a conflict of interest statement as provided by the publisher. This field was introduced in 2017. It is not extracted.

<SpaceFlightMission>
<SpaceFlightMission> exists on earlier MEDLINE citations created by the National Aeronautics and Space Administration (NASA) but is no longer used. This element contains the space flight mission name and/or number when results of research conducted in space are covered in a publication. It is not extracted.

<InvestigatorList>
Beginning with the 2008 production year, InvestigatorList is used to contain personal names of individuals (e.g., collaborators and investigators) who are not authors of a paper but rather are listed in the paper as members of a collective/corporate group that is an author of the paper.
For records containing more than one collective/corporate group author, InvestigatorList does not indicate to which group author each personal name belongs. In this context, the names are entered in the order that they are published; the same name listed multiple times is repeated because NLM cannot make assumptions as to whether those names are the same person.
Data is entered in the same format as author names in <Author> including <LastName>, <ForeName>, <Initials>, <Suffix>, <Identifier>, and <AffiliationInfo>. <Identifier> was added to <InvestigatorList> with the 2010 DTD, but is not yet in use. It is defined to contain a unique identifier associated with the name.
This data is currently not extracted, partly because it would seem of limited use to users searching for resources, and also because it would increase even further what is already a very large dataset of object contributors.

<GeneralNote>
<GeneralNote> contains supplemental or descriptive information related to the document cited in the MEDLINE record. It is a 'catchall' for various types of information included by NLM collaborating producers or by NLM. It can have one or more @Owner attributes (such as ‘NLM’ or ‘KIE’). Because the information contained is so diverse it is difficult to filter and use, and is therefore not extracted.

The PubmedData element

<PubmedData>
<PubmedData> includes a set of citation data elements not included in the <MedlineCitation> section. <PubmedData> is a container element that contains the <History>, <PublicationStatus>, and <ArticleIdList> elements. It has no attributes.

<History>
<History> contains the dates associated with the published article and its PubMed citation's history. Each <PubMedPubDate> element in <History> has a @PubStatus attribute to indicate the significance of the date. The date types allowed include (though not all of these appear to be in active use):

  • received
  • accepted
  • epublish
  • ppublish
  • revised
  • aheadofprint
  • retracted
  • ecollection
  • pmc
  • pmcr
  • pubmed
  • pubmedr
  • premedline
  • medline
  • medliner
  • entrez
  • pmc-release

The <PubMedPubDate> element also includes <Year>, <Month>, and <Day> elements containing the actual date. Some may also have <Hour>, <Minute>, and <Second> elements, though these are not extracted.
This data is extracted as it is used within object_dates. Because some of the dates may have been supplied in other parts of the citation record it is necessary to check for duplicates of some date types.
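
The duplicate check could be sketched in Python as follows. The sample dates are invented for illustration, and the function name, the dictionary keys, and the idea of passing in a set of (date type, year, month, day) tuples already taken from elsewhere in the record are all assumptions rather than the MDR's actual approach.

```python
import xml.etree.ElementTree as ET

# Invented sample dates, for illustration only.
SAMPLE = """
<History>
    <PubMedPubDate PubStatus="received">
        <Year>2019</Year><Month>1</Month><Day>15</Day>
    </PubMedPubDate>
    <PubMedPubDate PubStatus="accepted">
        <Year>2019</Year><Month>3</Month><Day>2</Day>
    </PubMedPubDate>
    <PubMedPubDate PubStatus="pubmed">
        <Year>2019</Year><Month>3</Month><Day>20</Day>
        <Hour>6</Hour><Minute>0</Minute>
    </PubMedPubDate>
</History>
"""


def extract_history_dates(history, already_seen=()):
    """Return the <History> dates, skipping any already held elsewhere."""
    seen = set(already_seen)
    dates = []
    for pub_date in history.findall("PubMedPubDate"):
        status = pub_date.get("PubStatus")
        try:
            year = int(pub_date.findtext("Year"))
            month = int(pub_date.findtext("Month"))
            day = int(pub_date.findtext("Day"))
        except (TypeError, ValueError):
            continue  # missing or non-numeric date part: skip the entry
        key = (status, year, month, day)
        if key in seen:
            continue  # duplicate of a date extracted earlier
        seen.add(key)
        # <Hour>, <Minute> and <Second> are deliberately ignored.
        dates.append({"date_type": status, "year": year,
                      "month": month, "day": day})
    return dates


# Pretend the 'accepted' date was already found elsewhere in the record.
dates = extract_history_dates(
    ET.fromstring(SAMPLE), already_seen={("accepted", 2019, 3, 2)}
)
print(dates)
```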

<PublicationStatus>
<PublicationStatus> indicates the publication status of the article, i.e. whether the article is ppublish, epublish, or ahead of print, as determined by the article's primary publication date. It is not extracted.

<ArticleIdList>
<ArticleIdList> contains <ArticleId> elements, each specifying an identification number significant to either the article's history or the citation’s processing. Each <ArticleId> has one attribute, @IdType, which can be one of the following:

  • doi
  • pii
  • pmcpid
  • pmpid
  • pmc
  • mid
  • sici
  • pubmed
  • medline
  • pmcid

This data needs extracting, as it can be mapped into ‘other_identifiers’.
Because some identifiers may have already been extracted from other parts of the record, however, it is necessary to check for possible duplicates.
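
One way of carrying out this check is sketched below in Python. The identifier values in the sample are invented, and the function name, the dictionary keys, and the representation of previously extracted identifiers as a set of (type, value) pairs are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Invented identifier values, for illustration only.
SAMPLE = """
<ArticleIdList>
    <ArticleId IdType="pubmed">12345678</ArticleId>
    <ArticleId IdType="doi">10.1000/example.doi</ArticleId>
    <ArticleId IdType="pmc">PMC0000001</ArticleId>
</ArticleIdList>
"""


def extract_article_ids(article_id_list, already_extracted=()):
    """Return <ArticleId> values not already held as (type, value) pairs."""
    # Values are compared case-insensitively (an assumption of this sketch).
    seen = {(t, v.lower()) for t, v in already_extracted}
    identifiers = []
    for article_id in article_id_list.findall("ArticleId"):
        id_type = article_id.get("IdType")
        value = (article_id.text or "").strip()
        key = (id_type, value.lower())
        if not value or key in seen:
            continue  # empty, or a duplicate of an earlier identifier
        seen.add(key)
        identifiers.append({"id_type": id_type, "value": value})
    return identifiers


# Pretend the PMC identifier was already found in an <OtherID> element.
identifiers = extract_article_ids(
    ET.fromstring(SAMPLE), already_extracted={("pmc", "PMC0000001")}
)
print(identifiers)
```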