As of version 5.0.0, the articles of the CRAFT Corpus have been semantically annotated with classes based on specific Open Biomedical Ontologies (OBOs), organized into 11 modules:
Chemical Entities of Biological Interest (CHEBI): compositionally defined chemical entities (atoms, chemical substances, chemical groups, and molecular entities), subatomic particles, and role-defined chemical entities (i.e., defined in terms of their use by humans, or by their biological and/or chemical behavior)
Example proper CHEBI classes used in CHEBI annotation set: methylglyoxal, 3-isobutyl-1-methylxanthine, iodide, water, streptomycin, ammonium chloride, bicarbonate, ozone, sodium(1+), histone, polysaccharide, mineral, mixture, solution, anion, atom, carbon-14 atom, radical, amino group, disulfanediyl group, gelatin, electron, dye, detergent, insecticide, anti-inflammatory agent, mitogen, chromophore, acid, PPAR modulator, carcinogen, toxin, analgesic
Example additional CHEBI extension classes used in CHEBI+extensions annotation set: buffering, buffer solution, acrylic, ADP, 2-oxoglutarate, aluminum, adenine/adenosine, biochemical, carboxy/carboxylato group, chemical substance, EDTA, phosphate, eosin, metal, glycerol 1-phosphate, phosphatidylcholine, purine, residue, radiolabeling, antioxidant, saline solution, agonist, cyclin-dependent kinase inhibitor, high-density lipoprotein, hormone, adiponectin, protein, angiotensin, ribonucleoprotein, enzyme, peptidase/protease/proteinase, beta-amyloid protein/aggregate, glucagon, alanine, amino acid, polypeptide, nucleobase, biomarker, DNA, epitope, molecular probe, oligonucleotide
Cell Ontology (CL): cells (excluding types of cell line cells)
Example proper CL classes used in CL annotation set: platelet, enterocyte, osteoblast, endothelial cell, peritoneal macrophage, megakaryocyte, brown fat cell, slow muscle myoblast, inner phalangeal cell, lens epithelial cell, spiral ganglion neuron, M cell of gut, Sertoli cell, mesangial cell, chromaffin cell of adrenal gland, CD4-positive alpha-beta T-cell, kidney cell, fungal spore, sperm, embryonic stem cell, diploid cell, circulating cell, cultured cell, apoptosis-fated cell
Example additional CL extension classes used in CL+extensions annotation set: cell, liver cell, neural crest cell
Gene Ontology Biological Process (GO_BP): biological processes, including genetic, biochemical/molecular-biological, cellular and subcellular, organ- and organ-system-level, organismal, and multiorganismal processes
Example proper GO_BP classes used in GO_BP annotation set: proprioception, excretion, long-term memory, anaphase, cell adhesion, menopause, biological regulation, macromolecular complex assembly, autophagy, drug metabolism, DNA repair, cellular response to platelet-derived growth factor, protein folding, translational initiation, immune response, sexual reproduction, BMP signaling pathway, locomotion, asymmetric stem cell division, pharynx development, neuron projection morphogenesis, DNA-mediated transformation, saliva secretion, amide transport, cell proliferation, death
Example additional GO_BP extension classes used in GO_BP+extensions annotation set: biogenesis/biosynthesis, cell communication/signaling, biological behavior, biological movement/translocation, biological mover/transporter, biological recruitment, detection/sensing of stimulus, cytokine biosynthesis/production, negative regulation, negative regulator, neurotransmission, biological reaction/response, transcription, phosphorylation, biological pigmentation process/quality, post-translational modification entity/process, biological localization process/quality, behavioral conditioning
Gene Ontology Cellular Component (GO_CC): cellular and extracellular components and regions; species-nonspecific macromolecular complexes
Example proper GO_CC classes used in GO_CC annotation set: vesicle, nucleolus, caveola, actin cytoskeleton, cell-cell junction, cell projection, nuclear envelope, cytoplasm, cis-Golgi network, excitatory synapse, chromatin, endoplasmic reticulum, actin filament, mitochondrial membrane, extracellular matrix, photoreceptor outer segment, extrinsic component of membrane, ribosome, DNA repair complex, protein complex, protein phosphatase type 2A complex
Example additional GO_CC extension classes used in GO_CC+extensions annotation set: hemoglobin protein/complex, high-density lipoprotein, cell, basal body, dimer, flagellum, tight junction, calcineurin complex, integrin protein/complex, nerve ending/terminal, centromere, chromosome, chromosomal location/part/region, basal lamina, cell component/part
Gene Ontology Molecular Function (GO_MF): molecular functionalities possessed by genes/gene products, as well as the molecular bearers of these functionalities
Example proper GO_MF classes used in GO_MF annotation set: annealing/hybridization, dimerization, protein anchoring
Example additional GO_MF extension classes used in GO_MF+extensions annotation set: binding, agonist, antioxidant, DNA ligase, ATPase, transposase, carbonate dehydratase, NAD-dependent histone deacetylase, chemoattractant, nitric oxide synthase, metallochaperone, calcium channel inhibitor, protein kinase activator, receptor, 9-cis-retinoic acid receptor, MAP kinase, morphogen, transcription factor, transcription corepressor, enzyme, enzyme inhibitor, hormone, cytochrome-c oxidase, peptidase/protease/proteinase, biological mover/transporter, nucleotide exchange factor
MONDO Disease Ontology (MONDO): diseases, disorders, and their characteristics
Example proper MONDO classes used in MONDO annotation sets: disease/disorder, congenital, diabetes, Huntington disease, blindness, obesity, albinism, scarring, movement disorder, retinal degeneration, pancreatic neoplasm, agenesis of corpus callosum, mucolipidosis type IV, late-infantile neuronal ceroid lipofuscinosis
Molecular Process Ontology (MOP): chemical reactions and other molecular processes
Example proper MOP classes used in MOP annotation set: acetylation, deacetylation, butylation, myristoylation, biotinylation, N-glycosylation, isomerization, oxidation, reduction, dehydrogenation, polymerization, depolymerization, hydrolysis, chain reaction, electron transfer, covalent bond formation
Additional MOP extension classes used in MOP+extensions annotation set: catalysis, glycosylation, metal chelation, methylation, oxidant entity/process
NCBI Taxonomy (NCBITaxon): biological taxa and their corresponding organisms; taxon levels
Example proper NCBITaxon classes used in NCBITaxon annotation set: organism, Archaea, Bartonellaceae, Crithidia luciliae, Chlamydomonas, Rhizobiales/rhizobacteria, Vertebrata/vertebrate, Magnoliophyta/angiosperm, Phix174 virus, Deuterostomia/deuterostome, Tetraodontidae/pufferfish, Oryctolagus cuniculus/rabbit, Enterobacteria phage T7, Homo sapiens/human, Saccharomyces cerevisiae/baker's yeast, Escherichia coli K-12, kingdom, phylum, species, subspecies
Additional NCBITaxon extension classes used in NCBITaxon+extensions annotation set: carp, fish, ground squirrel, HIV, HSV, invertebrate, monkey, quail, reptile, ungulate, worm, yeast, calf, chick, child, girl, man, mare, pup, woman
Protein Ontology (PR): proteins, which are also used to annotate their corresponding genes and transcripts
Example proper PR classes used in PR annotation set: cadherin, AKT kinase (protein kinase B/PKB), G-protein coupled receptor (GPCR), annexin A1 (ANX1/ANXA1/LPC1), 2'-5'-oligoadenylate synthetase 1A (Oas1a), bile acid receptor (farnesoid X receptor/nuclear receptor subfamily 1 group H member 4/BAR/FXR/HRR1/NR1H4/PFIC5/RIP14), 40S ribosomal protein S16 (RPS16/S16), placental alkaline phosphatase (ALPP/PALP/PLAP/PLAP1), achaete-scute homolog 1 (achaete-scute family bHLH transcription factor 1/ASCL1/ASH1/bHLHa46/HASH1/MASH1)
Example additional PR extension classes used in PR+extensions annotation set: protein, growth hormone, clathrin, adiponectin, beta-amyloid protein/aggregate, beta-amyloid protein 40, angiotensin, glucagon, 7-dehydrocholesterol reductase (DHCR7), activin receptor type II, acetylcholinesterase, alkaline phosphatase, alpha-catenin, alpha-crystallin, annexin, arginine vasopressin, arrestin, BMP type I receptor, calcitonin gene-related peptide, CCAAT/enhancer-binding protein (CEBP), carbonic anhydrase 2, CLIP-associating protein (CLASP), collagen, gastrin, leptin, myoglobin
Sequence Ontology (SO): biomacromolecular entities, sequence features, and their associated attributes and processes
Example proper SO classes used in SO annotation set: mature transcript, genome, plasmid, clone, targeting vector, PCR product, exon, allele, haplotype, gene, pseudogene, QTL, SNP, open chromatin region, origin of replication, DNA-binding site, chromosome arm, poly(A) signal sequence, nuclear export signal, loxP site, proximal promoter element, base pair, transposable element, enhancer, open reading frame, splice junction, chromosome breakpoint, insertion site, mating-type region, polypeptide domain, contig, read, homologous, cryptic, floxed, antisense, in-frame
Example additional SO extension classes used in SO+extensions annotation set: biological sequence, enzyme, alanine, amino acid, biomarker, DNA, double-stranded DNA, epitope, molecular probe, nucleobase, oligonucleotide, centromere, chromosome, chromosome location/part/region, insertion entity/process, post-translational modification entity/process, sequence assembly entity/process, sequence repeat unit/region, sequence variant, mRNA, nucleic acid, nucleotide
Uberon (UBERON): anatomical entities; multicellular organisms and life-cycle stages defined in terms of developmental and sexual characteristics
Example proper UBERON classes used in UBERON annotation set: tongue, nerve, mouth, brain, brain ventricle, feather, skeleton, shell, vasculature, mesentery, iris, pituitary gland, venous plexus, basilar membrane of cochlea, endolymphatic duct, aortic sac, trophoblast, coat of hair, abdominal cavity, bladder lumen, cardiovascular system, corticospinal tract, granular layer of cerebellar cortex, dorsal root of spinal cord, mammary gland mesenchyme, neural tube, hair inner root sheath, apical ectodermal ridge, zona pellucida, aqueous humor, bile, blood clot, embryo, adult, life, 2-cell stage
Example additional UBERON extension classes used in UBERON+extensions annotation set: fat, lens fiber, basal lamina, cell component/part, lumen, nerve ending/terminal, bone, muscle, hippocampus, lens, skin, chick, child, male organism
For each of these, the concept annotations are modularly distributed in two sets, one using only proper classes from the given ontology, and another additionally using extension classes created by us but defined in terms of proper OBO classes. These extension classes were created for various reasons: Some were created to unify classes from different ontologies that were semantically equivalent or very close so that there would not be multiple concept annotations for the same text spans if the disparate annotation sets were aggregated. (This also adds to the semantic integration of the ontologies.) Others were created to unify multiple classes that were difficult to use consistently for text annotation; that is, we found it difficult to reliably differentiate among the concepts for observed textual mentions. Still others were created to represent new concepts in terms of existing classes. (These cases are not mutually exclusive, so a given extension class may have had multiple motivations.) In any case, an extension class was only created if we were able to create a formal logical definition for it in terms of existing OBO classes. We have not yet implemented these as formal logical definitions, as some require logical expressions beyond what can be represented in OBO format (which is the format that we have been using for these ontologies), but the logical definitions can be seen, in Manchester OWL syntax, in the text definition fields of the extension classes’ OBO stanzas. Note that since there are no logical definitions implemented and no other explicit superclass assertions, the extension classes are situated at the level of the root(s) of the ontologies. In the future, we intend to distribute the ontologies in OWL rather than OBO format and will implement these logical definitions, which will also allow the extension classes to be automatically classified within the ontologies by OWL reasoners.
Note that an extension class may be an extension of more than one ontology (and many are).
Along with the annotations for each ontology, we have included two .obo ontologies. One is the ontology as distributed by its developers, and the other is this ontology plus the OBO stanzas for that ontology’s extension classes that we have created. (The ontologies provided for the GO_MF annotations are somewhat different; see below.) Both are valid OBO-format files. Note that obsolete classes have not been used for any of the concept annotations, so when parsing an ontology file to extract information for dictionary construction, these should be skipped by checking for an “is_obsolete: true” line within each class’s OBO stanza. Additional filtering is recommended when parsing the GO_BP, GO_CC, MOP, NCBITaxon, and PR ontology files, as detailed in their respective subsections below.
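As an illustration of the obsolete-class filtering described above, here is a minimal Python sketch. It assumes the standard OBO layout of "[Term]" stanzas with "tag: value" lines; the function names and the sample stanzas (including the ID CL:9999999) are made up for the example and are not part of the CRAFT distribution.

```python
def parse_obo_terms(obo_text):
    """Return one dict per [Term] stanza, mapping each tag to a list of values."""
    terms = []
    current = None  # dict for the stanza being read, or None outside a [Term]
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            # A new stanza starts here; only [Term] stanzas are collected.
            current = {} if line == "[Term]" else None
            if current is not None:
                terms.append(current)
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            current.setdefault(tag, []).append(value)
    return terms


def is_obsolete(term):
    """True if the stanza carries an 'is_obsolete: true' line."""
    return any(v.split("!")[0].strip() == "true"
               for v in term.get("is_obsolete", []))


def usable_terms(obo_text):
    """All [Term] stanzas that are not marked obsolete."""
    return [t for t in parse_obo_terms(obo_text) if not is_obsolete(t)]


# Illustrative input only; CL:9999999 is a made-up obsolete class.
SAMPLE_OBO = """format-version: 1.2

[Term]
id: CL:0000000
name: cell

[Term]
id: CL:9999999
name: hypothetical obsolete cell
is_obsolete: true

[Typedef]
id: part_of
name: part of
"""

usable_ids = [t["id"][0] for t in usable_terms(SAMPLE_OBO)]
```

Header lines and non-Term stanzas (such as "[Typedef]") are ignored, so only class stanzas feed the dictionary.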
We recommend that the annotation sets created partly with extension classes be used, as they contain large numbers of additional annotations compared to the annotation sets created without extension classes, including many that are likely to be of interest to users. For example, annotations of the amino acid classes of the SO (e.g., SO:alanine, SO:amino_acid, SO:arginine) do not appear in the SO annotation set created without extension classes, as extension classes have been created for these in the form of unifications with corresponding CHEBI classes; analogously, annotations of the corresponding amino acid classes of CHEBI do not appear in the CHEBI annotation set created without extension classes. However, these amino acid annotations appear in both the CHEBI+extensions and SO+extensions annotation sets since these extension classes are extensions of both ontologies.
Included along with each ontology and its annotations is a file named X_extension_classes_and_related_X_classes.txt, where X is one of {CHEBI, CL, GO_BP, GO_CC, GO_MF, MOP, NCBITaxon, PR, SO, UBERON}, which is, as its name indicates, a mapping of each extension class of the given ontology to semantically related classes in the base ontology. This is a basic tab-delimited text file, with each line corresponding to one extension class, and every extension class of the given ontology appears in the file and has a mapping to at least one class of the given ontology. Though the large majority of the extension classes are mapped to only one class of a given ontology, some map to several classes of a given ontology (and a very small number map to a large number of classes). For example, CHEBI_EXT:calcium is an extension class of the CHEBI ontology and is mapped to two CHEBI classes (CHEBI:’calcium atom’ (CHEBI:22984) and CHEBI:’elemental calcium’ (CHEBI:35155)), as neither of these classes subsumes the other, and a given mention of calcium may be referring to either (or both); thus, to make annotations of calcium mentions more straightforward, we have unified them into one calcium extension class (defined as the union of the two CHEBI classes). The first entry of each line in these files is the ID of the extension class, and any subsequent (tab-delimited) column value is the ID of a mapped class. Note that, at least for now, manually created extension classes have not been given numeric IDs, but rather more human-readable ones; however, the CHEBI, GO_MF, PR, and SO extension ontologies additionally contain large numbers of automatically created parallel extension classes that have parallel numeric IDs (detailed in their respective subsections below). In the case of multiple mapped classes in a given ontology, each is delimited by a tab character, so this can be easily automatically parsed. 
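The mapping files described above can be read with a few lines of Python. The following is only a sketch under the stated layout (first column is the extension class ID, remaining tab-delimited columns are mapped class IDs). The function name is hypothetical, and the second sample line is illustrative rather than copied from a distributed file.

```python
def parse_extension_mapping(text):
    """Map each extension class ID to the list of base-ontology IDs on its line."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        ext_id, *mapped_ids = line.split("\t")
        mapping[ext_id.strip()] = [m.strip() for m in mapped_ids if m.strip()]
    return mapping


# Illustrative content; the calcium line follows the example in the text,
# the second line is an assumed one-to-one mapping.
SAMPLE_MAPPING = (
    "CHEBI_EXT:calcium\tCHEBI:22984\tCHEBI:35155\n"
    "CHEBI_EXT:22586\tCHEBI:22586\n"
)

mapping = parse_extension_mapping(SAMPLE_MAPPING)

# Extension classes mapped to more than one base class need special care
# if the mappings are ever used for substitution.
multiply_mapped = sorted(e for e, ids in mapping.items() if len(ids) > 1)
```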
There are two additional important points to note: First, the mapped OBO classes are NOT guaranteed to be semantically equivalent to the extension class. Though in many cases they are semantically equivalent, many are not equivalent but otherwise semantically related (which is an admittedly vague phrase); with these mappings, we wanted to at least specify the OBO class(es) to which a given extension class is most closely semantically related. Secondly, an extension class that is an extension of more than one ontology will have different mappings in the corresponding mapping files; for example, CL_GO_EXT:cell, which is an extension class that unifies the cell classes of the CL and GO_CC ontologies, is mapped to CL:cell (CL:0000000) in the corresponding CL mapping file and to GO:cell (GO:0005623) in the corresponding GO_CC mapping file. This structuring was purposely done in the interest of allowing users to work with the ontologies and their corresponding concept annotations modularly.
In addition to simply working with all of the extension classes and their annotations as we have created them, the user has at least two additional options of working with them. The first is to use the annotation sets using the extension classes, but replace the extension classes used in the annotations with corresponding proper OBO classes using the analogous extension-classes-to-OBO-classes mapping file. However, since the mappings are not guaranteed to be semantically equivalent, the user would have to tolerate some semantic inexactness (and even error) introduced into the annotations. Additionally, for those cases in which a given extension class is mapped to more than one class of a given ontology, the user would have to decide which of the mapped classes to use for replacement, an issue for which there is no straightforward solution. Thus, we do not recommend this strategy, but it can be pursued.
A second option is to work with only some of the extension classes and their annotations: There may be specific extension classes that a given user may want to use but not others. For example, a user may or may not be interested in some of the extension classes that are related to the proper OBO classes on which they are based but represent different types of concepts; e.g., some extension classes of the SO represent processes corresponding to proper SO sequence feature classes (e.g., SO_EXT:insertion_process, defined in terms of SO:insertion (SO:0000667)). In such a case, a user may remove any extension class that she wishes not to use from the .obo file (simply by deleting the OBO stanza for that extension class, either manually or computationally), and then computationally search for and remove any annotations created using that extension class from her local copy of the annotation set. This is guaranteed not to introduce any semantic errors as long as the removals are done properly, so we can present this as a viable alternative.
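A rough sketch of this removal step follows. It assumes that every stanza in the .obo file begins with a bracketed header line such as "[Term]", and it represents annotations as simple in-memory dicts, since the handling of the distributed annotation formats is outside the scope of this example. The function names, the sample stanza names, and the ID SO_EXT:sequence_variant are assumptions made for illustration.

```python
def drop_classes_from_obo(obo_text, unwanted_ids):
    """Return the OBO text without the stanzas whose id is in unwanted_ids."""
    blocks, current = [], []
    for line in obo_text.splitlines(keepends=True):
        if line.startswith("[") and current:
            # A bracketed header closes the previous block (header or stanza).
            blocks.append(current)
            current = []
        current.append(line)
    if current:
        blocks.append(current)

    kept = []
    for block in blocks:
        ids = [l[len("id: "):].strip() for l in block if l.startswith("id: ")]
        if block[0].startswith("[") and ids and ids[0] in unwanted_ids:
            continue  # skip the whole stanza of an unwanted class
        kept.append("".join(block))
    return "".join(kept)


def drop_annotations(annotations, unwanted_ids):
    """Remove annotations that use one of the removed extension classes."""
    return [a for a in annotations if a["class_id"] not in unwanted_ids]


# Illustrative input only.
SAMPLE_OBO = """format-version: 1.2

[Term]
id: SO:0000667
name: insertion

[Term]
id: SO_EXT:insertion_process
name: insertion process

[Term]
id: SO_EXT:sequence_variant
name: sequence variant
"""

SAMPLE_ANNOTATIONS = [
    {"class_id": "SO:0000667", "start": 10, "end": 19},
    {"class_id": "SO_EXT:insertion_process", "start": 40, "end": 49},
]

unwanted = {"SO_EXT:insertion_process"}
cleaned_obo = drop_classes_from_obo(SAMPLE_OBO, unwanted)
cleaned_annotations = drop_annotations(SAMPLE_ANNOTATIONS, unwanted)
```

Applying both functions with the same set of IDs keeps the local ontology file and the local annotation set consistent with each other.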
There are two more important text files included within each module (except for the GO_MF module; see below). One is named unused_classes_for_X_annotations.txt, where once again X is one of {CHEBI, CL, GO_BP, GO_CC, GO_MF, MOP, NCBITaxon, PR, SO, UBERON}. This is simply a text file of IDs of certain classes of the given ontology that were not used at all in the corresponding annotations not making use of the extension classes. This is NOT an exhaustive listing of every class of the ontology that was not used in these annotations; rather, it lists certain classes that were not used either because we thought they were difficult to reliably use for annotation and/or for which extension classes were alternatively created and used. This is an important file to use when working with annotation sets that don’t make use of extension classes, as the user should remove any automatically created annotation that uses one of these classes before comparing to the CRAFT gold standard, as they are guaranteed to be false positives on the part of the user. This list also gives the user an idea of the kinds of concepts that are missing from the annotation set created without the use of the extension classes.
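The filtering step for annotation sets without extension classes might look like the sketch below. It assumes the unused-classes file lists one class ID per line (the text above says only that it is a text file of IDs) and models predicted annotations as (class ID, start, end) tuples; both function names and the second prediction's class ID are arbitrary choices for the example.

```python
def load_unused_ids(text):
    """Collect the class IDs listed in an unused-classes file (one per line assumed)."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def drop_unused(predictions, unused_ids):
    """Remove predicted annotations whose class can never match the gold standard."""
    return [p for p in predictions if p[0] not in unused_ids]


# CL:0000000 is the one unused class specified for the CL annotations;
# the second prediction uses an arbitrary ID for illustration.
UNUSED_TEXT = "CL:0000000\n"
PREDICTIONS = [
    ("CL:0000000", 0, 4),
    ("CL:0000001", 10, 16),
]

filtered = drop_unused(PREDICTIONS, load_unused_ids(UNUSED_TEXT))
```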
The corresponding file that is important to use when working with an X+extensions annotation set is named unused_classes_and_substitute_extension_classes_for_X+extensions_annotations.txt, where X is again one of the aforementioned ontology namespaces. This tab-delimited file contains all of the classes of the corresponding unused_classes_for_X_annotations.txt file, and for the CHEBI, GO_MF, PR, and SO annotations, many more classes as well (explained in their corresponding sections below). Like that corresponding file, it is a listing of certain classes that were not used at all in the X+extensions annotations. Also like the corresponding file, it is NOT an exhaustive listing of every class of the original ontology that was not used. In addition to this listing of unused classes, the overwhelming majority of these classes are mapped to extension classes that were used instead. There are two additional points to note: First, a very small number of these classes have no mapping; these are simply classes that were not used at all and for which no extension classes were used instead, so the user should automatically remove any annotation using one of these classes before comparing to the CRAFT gold standard. Secondly, there are a small number of OBO classes that are mapped to more than one extension class; these are cases in which the OBO class was not used but several different extension classes were instead used in different contexts. In these cases, it cannot be determined from the mapping file alone which extension class to use instead, but the user can at least automatically remove the annotations making use of these classes, thus guaranteeing removal of false positives.
Otherwise, for the overwhelming majority of classes listed in this file that are mapped to one and only one extension class, the user can simply replace any annotation class that is specified as unused in the mapping file with the mapped extension class before comparing to the CRAFT gold standard. In some cases, an automatic annotator may match a piece of text both to an OBO class and to the OBO extension class to which that OBO class is mapped. In these cases, the user should be careful not to duplicate the extension class annotation; that is, the OBO class annotation should be deleted outright rather than kept with the OBO extension class substituted, as substitution would result in two identical annotations with the OBO extension class.
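The replacement, deletion, and de-duplication rules above can be sketched as follows. This is one possible reading of the procedure, not a reference implementation: predicted annotations are modeled as (class ID, spans) tuples, the function names are hypothetical, and the EX: identifiers in the sample are invented to exercise the no-mapping and multiple-mapping cases.

```python
def parse_substitutes(text):
    """Map each unused class ID to its list of substitute extension class IDs."""
    subs = {}
    for line in text.splitlines():
        if line.strip():
            unused_id, *ext_ids = line.split("\t")
            subs[unused_id.strip()] = [e.strip() for e in ext_ids if e.strip()]
    return subs


def normalize(annotations, subs):
    """Apply the unused-class rules to (class_id, spans) annotations."""
    untouched = [a for a in annotations if a[0] not in subs]
    seen = set(untouched)
    result = list(untouched)
    for class_id, spans in annotations:
        if class_id not in subs:
            continue
        targets = subs[class_id]
        if len(targets) != 1:
            continue  # no mapping, or an ambiguous one: delete the annotation
        replaced = (targets[0], spans)
        if replaced in seen:
            continue  # same extension class already on these spans: avoid a duplicate
        seen.add(replaced)
        result.append(replaced)
    return result


# Illustrative mapping content; the EX: lines are invented.
SUBSTITUTES_TEXT = (
    "CL:0000000\tCL_GO_EXT:cell\n"
    "EX:0000001\n"
    "EX:0000002\tEX_EXT:a\tEX_EXT:b\n"
)

PREDICTIONS = [
    ("CL:0000000", ((0, 4),)),       # duplicate of the next annotation after mapping
    ("CL_GO_EXT:cell", ((0, 4),)),
    ("CL:0000000", ((10, 15),)),     # replaced with CL_GO_EXT:cell
    ("EX:0000001", ((20, 25),)),     # unused, no mapping: deleted
    ("EX:0000002", ((30, 35),)),     # unused, ambiguous mapping: deleted
    ("EX:0000003", ((40, 46),)),     # not listed as unused: kept as-is
]

normalized = normalize(PREDICTIONS, parse_substitutes(SUBSTITUTES_TEXT))
```

Spans are stored as tuples of (start, end) pairs so that discontinuous annotations compare correctly when checking for duplicates.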
In addition to the changes regarding extension classes, there are two changes to the text span annotation guidelines for the concept annotations. First, in addition to the previously specified rule that any boundary at either side of any whitespace or punctuation character can serve as an annotation span boundary, the boundary of a mention of any of a specified set of affixes (the prefixes ante-/anti-, auto-, co-, de-, dis-/dys-, extra-, hetero-, homo-, hyper-, hypo-, inter-, intra-, juxta-, mal-, mid-, mis-, non-, over-, peri-, post-, pre-, pro-, re-, sub-, super-, trans-, un-, and under-, and the suffix -less) can serve as an annotation boundary if there is not a whitespace or punctuation character delimiting the affix from the word root; thus, the “chromatin” of both “anti-chromatin” and “antichromatin” can be annotated. Additionally, the boundary at which there is a change in text formatting (e.g., regular text, italic text, superscript text, subscript text) can serve as an annotation span boundary; for example, for “BRAFV600E” in which “BRAF” is regular text and “V600E” is superscripted, the boundary between “BRAF” and “V600E” can serve as an annotation boundary, and “BRAF” can be annotated. Note that although the regular text files of the annotated articles do not contain any of these kinds of text formatting, the CRAFT distribution includes annotations of such formatting, so these can be used to determine potential annotation span boundaries.
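For users who generate candidate spans automatically, the affix rule can be approximated as in the sketch below. This is a purely string-based check, so it over-generates: it cannot tell whether a leading "re" or "un" is really a prefix, and any such candidate still has to be validated (for example, against a dictionary). The function name is hypothetical, and boundaries from formatting changes are not covered here.

```python
# Affixes listed in the span guidelines (paired forms split into separate entries).
PREFIXES = (
    "ante", "anti", "auto", "co", "de", "dis", "dys", "extra", "hetero",
    "homo", "hyper", "hypo", "inter", "intra", "juxta", "mal", "mid", "mis",
    "non", "over", "peri", "post", "pre", "pro", "re", "sub", "super",
    "trans", "un", "under",
)
SUFFIXES = ("less",)


def affix_boundaries(word):
    """Return character offsets inside word that may serve as span boundaries."""
    lower = word.lower()
    offsets = set()
    for prefix in PREFIXES:
        if lower.startswith(prefix) and len(lower) > len(prefix):
            offsets.add(len(prefix))  # boundary right after the prefix
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) > len(suffix):
            offsets.add(len(lower) - len(suffix))  # boundary right before the suffix
    return sorted(offsets)


word = "antichromatin"
remainders = [word[offset:] for offset in affix_boundaries(word)]
```

For "underexpressed" the function returns two candidate offsets (after "un" and after "under"), which shows why the candidates need a later validation step.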
As in previous versions, the concept annotations are distributed in multiple formats, including one in RDF and several more in various forms of XML and XMI. Note that, as in previous versions, because discontinuous annotations (i.e., annotations composed of two or more disconnected annotated text spans) cannot be unambiguously represented in GPML (the Genia XML format), we have excluded all such annotations from their corresponding annotation sets in this format; thus, the concept annotations in the Genia XML format are to be regarded as incomplete.
In summary, recommended and not-so-recommended ways of using the concept annotations of the CRAFT Corpus v 3.0:
If making use of at least some of the annotations created with extension classes (X+extensions annotations), there are at least three options:
1. Use all of the extension classes and their annotations as they have been created.
2. Replace the extension classes used in the annotations with mapped OBO classes from X_extension_classes_and_related_X_classes.txt. However, some semantic inexactness and/or error may be introduced, and for extension classes that are mapped to multiple related OBO classes, the user would have to decide which class to use for replacement. (Not recommended, but possible)
3. Remove specific extension classes from the X+extensions.obo file by (manually or automatically) deleting those classes’ OBO stanzas and automatically removing any annotations using these removed extension classes from the X+extensions annotations.
Some combination of 2 and 3 may also be attempted. For any of the options above, the user should use X+extensions.obo (possibly modified in option 3) as the ontology/term source. Prior to comparing to the CRAFT gold standard, the user should check his automatically generated annotations against the list of unused classes in unused_classes_and_substitute_extension_classes_for_X+extensions_annotations.txt. For an annotation using one of the overwhelming majority of these unused classes that have one and only one mapped extension class, replace the unused class with the singly mapped extension class. The annotation should be kept but with the OBO class replaced with the OBO extension class UNLESS an annotation using this mapped extension class has also been created for the same text span(s), in which case the user should delete the OBO class annotation and keep the OBO extension class annotation so as to avoid duplicate annotations with the extension class. If the unused class has no mapping, delete the corresponding annotation, and for an unused class that is mapped to more than one OBO extension class, the user can try to determine the correct mapped class to use, or he can simply delete the annotation, at least guaranteeing the removal of a false positive.
If instead making use of an annotation set not created with any extension classes, the user should use X.obo, which is the original OBO ontology (except for the GO_MF ontology provided; see below). Prior to comparing to the CRAFT gold standard, the user should remove any automatically generated annotation made with an unused class as specified in unused_classes_for_X_annotations.txt. We emphasize that though the annotation sets created without extension classes use only proper OBO classes, they lack large numbers of annotations likely of interest to the user that do appear in the corresponding annotations created with extension classes.
Also, the OBO classes in the ontology files contain four types of synonyms (exact, broad, narrow, and related). In general (except for the PR annotations; see below), the only synonyms we recommend using are the exact ones.
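A dictionary builder that keeps only primary names and exact synonyms could be sketched as below. It assumes the OBO 1.2 synonym line form, in which the quoted text is followed by a scope keyword (EXACT, BROAD, NARROW, or RELATED). The function name and the sample stanza are invented for illustration.

```python
import re

# Matches lines such as:  synonym: "some text" EXACT [xrefs]
SYNONYM_RE = re.compile(
    r'^synonym:\s+"((?:[^"\\]|\\.)*)"\s+(EXACT|BROAD|NARROW|RELATED)\b'
)


def labels_for_dictionary(stanza_lines, scopes=("EXACT",)):
    """Return the primary name plus the synonyms whose scope is in scopes."""
    labels = []
    for line in stanza_lines:
        if line.startswith("name: "):
            labels.append(line[len("name: "):].strip())
        else:
            match = SYNONYM_RE.match(line)
            if match and match.group(2) in scopes:
                labels.append(match.group(1))
    return labels


# Invented stanza content for illustration.
SAMPLE_STANZA = [
    "id: EX:0000001",
    "name: example entity",
    'synonym: "example thing" EXACT []',
    'synonym: "thing" BROAD []',
    'synonym: "related thing" RELATED [EX:ref]',
]

exact_labels = labels_for_dictionary(SAMPLE_STANZA)
```

Passing a wider tuple for scopes, for example ("EXACT", "RELATED"), allows experimentation with the other synonym types where an ontology-specific section suggests it.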
Finally, we recommend that the user take note of the ontology-specific comments below.
Although all of the subclasses of the role hierarchy have names and definitions that refer to material chemical entities, these classes actually represent the functionalities inherent in the material chemical entities that can be realized. However, we have more straightforwardly used them to annotate material chemical entities in the CRAFT Corpus. Since we have altered the semantics of the role classes (even though we have not changed the original primary names, synonyms, or textual definitions), we have created a set of parallel CHEBI extension classes that instead represent the material chemical entities that possess these inherent functionalities. These parallel role extension classes have the same numeric CHEBI IDs but use the CHEBI_EXT prefix instead; e.g., for CHEBI:solvent (CHEBI:22586), representing the inherent potential to act as a solvent, we have created CHEBI_EXT:solvent (CHEBI_EXT:22586), representing the material chemical entities with the inherent potential to act as a solvent. The annotations created without CHEBI extension classes are simpler in that they use the original CHEBI classes (i.e., instead of the parallel CHEBI_EXT classes), while we were more ontologically rigorous with the CHEBI+extensions annotations, which use the parallel CHEBI_EXT classes. In CHEBI+extensions.obo, the original CHEBI classes appear rather than the CHEBI_EXT classes used in the annotations, as we sought to minimize changes to the ontology. However, the user can simply use CHEBI+extensions.obo as it is for CHEBI concept recognition. 
Then, prior to comparing any automatically generated concept annotations to the CRAFT gold standard, the mappings in the included unused_classes_and_substitute_extension_classes_for_CHEBI+extensions_annotations.txt (which includes mappings to both the automatically created parallel CHEBI_EXT role classes as well as to additional manually created CHEBI extension classes) should be used to check if any classes used for the annotations should be replaced with their mapped extension classes or if any annotations should be deleted.
There is only one specified unused class for the CL annotations, the top-level CL:cell (CL:0000000), as the aforementioned unified extension class CL_GO_EXT:cell is instead used. However, there are thousands of mentions of this concept in the CRAFT Corpus, and the user should be aware that the thousands of annotations of cells (i.e., those specifically annotated with a generic cell class) are correspondingly absent from the CL annotations created without extension classes.
Note that the Gene Ontology is distributed as one file with all of its classes in the biological_process, cellular_component, and molecular_function namespaces together, and this is the version that is included for the GO_BP annotations. Thus, when parsing either GO.obo or GO+GO_BP_extensions.obo to build a dictionary for GO_BP concept recognition, the user should ignore the GO_CC and GO_MF classes by inspecting the namespace fields within the OBO stanzas of the GO classes and only extracting information from the classes in the biological_process namespace. (However, if parsing GO+GO_BP_extensions.obo, be sure not to ignore the extension classes, all of whose ID namespaces end in "_EXT".)
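The namespace filtering described above might be implemented along these lines. The sketch assumes the usual "namespace:" tag within each GO stanza and treats any class whose ID prefix (the part before the colon) ends in "_EXT" as an extension class to keep. The function names and the extension class ID in the sample are assumptions; the same function can be called with "cellular_component" for the GO_CC files.

```python
def parse_terms(obo_text):
    """Return one dict per [Term] stanza, mapping each tag to a list of values."""
    terms, current = [], None
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            current = {} if line == "[Term]" else None
            if current is not None:
                terms.append(current)
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            current.setdefault(tag, []).append(value)
    return terms


def select_term_ids(obo_text, namespace):
    """Keep classes in the given GO namespace, plus all extension classes."""
    selected = []
    for term in parse_terms(obo_text):
        term_id = term.get("id", [""])[0]
        id_prefix = term_id.split(":", 1)[0]
        if id_prefix.endswith("_EXT") or namespace in term.get("namespace", []):
            selected.append(term_id)
    return selected


# Illustrative stanzas; the extension class ID is made up.
SAMPLE_GO = """[Term]
id: GO:0006281
name: DNA repair
namespace: biological_process

[Term]
id: GO:0005623
name: cell
namespace: cellular_component

[Term]
id: GO:0016209
name: antioxidant activity
namespace: molecular_function

[Term]
id: GO_EXT:negative_regulation
name: negative regulation
"""

go_bp_ids = select_term_ids(SAMPLE_GO, "biological_process")
```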
Note also that whereas the GO_BP and GO_MF annotations were packaged together in previous versions of the corpus, the GO_BP annotations are properly modularized in this version.
Note that the Gene Ontology is distributed as one file with all of its classes in the biological_process, cellular_component, and molecular_function namespaces together, and this is the version that is included for the GO_CC annotations. Thus, when parsing either GO.obo or GO+GO_CC_extensions.obo to build a dictionary for GO_CC concept recognition, the user should ignore the GO_BP and GO_MF classes by inspecting the namespace fields within the OBO stanzas of the GO classes and only extracting information from the classes in the cellular_component namespace. (However, if parsing GO+GO_CC_extensions.obo, be sure not to ignore the extension classes, all of whose ID namespaces end in "_EXT".)
One of the specified unused classes for the GO_CC annotations is GO:cell (GO:0005623), for which the aforementioned unified extension class CL_GO_EXT:cell is instead used. However, there are thousands of mentions of this concept in the CRAFT Corpus, and the user should be aware that the thousands of annotations of cells (i.e., those specifically annotated with a generic cell class) are correspondingly absent from the GO_CC annotations created without extension classes.
Analogous to the CHEBI role hierarchy, the GO_MF classes represent molecular functionalities inherent in genes and gene products that can be realized. However, for the overwhelming majority of the GO_MF annotations, we have more straightforwardly annotated the material molecular entities that possess these functionalities. (Additionally, we have also used them to annotate molecular entities that possess these functionalities but are not necessarily genes and gene products, e.g., hormones.) Since we have similarly altered the semantics of these classes, we have created a set of parallel GO_MF extension classes that represent the material molecular entities rather than the inherent functionalities. Analogous to the parallel CHEBI_EXT role classes, these parallel GO_MF extension classes have the same numeric GO IDs but use the GO_EXT prefix instead; e.g., for GO:0016209, which represents antioxidant activity, we have created and used the parallel GO_EXT:0016209, which represents molecular entities that possess this inherent antioxidant activity. However, we have changed the textual names of these extension classes by prepending the original GO name with “bearer of” to reflect the fact that these represent material entities; for example, the textual name for GO_EXT:0016209 is “bearer of antioxidant activity”. An important change we have made for these extension classes is in the modification of synonyms. Specifically, we have replaced every synonym of the original GO_MF classes that ends in “activity” with a synonym without this terminal word, and if the original GO_MF name also ends in “activity”, we have created an additional synonym for the GO_MF extension class without this terminal word, e.g., “antioxidant” for GO_EXT:0016209. We believe that these synonyms are more intuitive names for these classes and that they will be extremely useful to employ for automatic annotation of these concepts in text.
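To make the ID, name, and synonym conventions above concrete, the following sketch derives them for a single class. It is only an illustration of the rules as stated in this section, not the script that produced the extension ontology; the function names are our own, and the second synonym in the usage example is hypothetical.

```python
ACTIVITY_SUFFIX = " activity"


def to_extension_id(go_id):
    """GO:0016209 -> GO_EXT:0016209 (same numeric ID, GO_EXT prefix)."""
    prefix, number = go_id.split(":", 1)
    return f"{prefix}_EXT:{number}"


def strip_activity(label):
    """Drop a terminal ' activity' if present; otherwise return the label unchanged."""
    if label.endswith(ACTIVITY_SUFFIX):
        return label[: -len(ACTIVITY_SUFFIX)]
    return label


def extension_labels(name, synonyms):
    """Return (extension class name, extension class synonyms) for a GO_MF class."""
    ext_name = f"bearer of {name}"
    ext_synonyms = [strip_activity(s) for s in synonyms]
    if name.endswith(ACTIVITY_SUFFIX):
        # Extra synonym derived from the original name, e.g. "antioxidant".
        ext_synonyms.append(strip_activity(name))
    # Remove duplicates while preserving order.
    return ext_name, list(dict.fromkeys(ext_synonyms))


ext_id = to_extension_id("GO:0016209")
# "hypothetical radical scavenger activity" is an invented synonym for illustration.
ext_name, ext_synonyms = extension_labels(
    "antioxidant activity", ["hypothetical radical scavenger activity"]
)
print(ext_id, ext_name, ext_synonyms)
```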
We have used a very small number (5) of original GO_MF classes directly to annotate certain molecular process concepts, e.g., annealing, dimerization. For this, we have created an extremely small subset of the original GO ontology that only contains these 5 classes, called GO_MF_stub.obo, which should be used for the GO_MF annotation sets created without extension classes. These annotations, which number only in the hundreds and are very limited in semantic richness, are the only ones that appear in the GO_MF annotation sets created without extension classes; thus, we strongly recommend that the GO_MF+extensions annotations be used. Since we have specifically created the stub ontology containing only the original GO_MF classes that were used for these annotations, there is no need for an unused_classes_for_GO_MF_annotations.txt file, so we have not created one. Also, since GO_MF_stub.obo and GO_MF_stub+GO_MF_extensions.obo only contain GO_MF classes (and, in the latter, GO_MF extension classes), there is no need to look for non-GO_MF classes to ignore in these ontology files.
Note also that whereas the GO_BP and GO_MF annotations were packaged together in previous versions of the corpus, the GO_MF annotations are properly modularized in this version. Additionally, note that there are no "continuant" annotations as in previous versions of the GO_MF annotations; instead, the parallel GO_EXT extension classes of GO_MF classes are used to annotate bearers of molecular functionalities in this version of the corpus.
Note that the OWL version of the MONDO Disease Ontology is provided rather than an OBO-format file, as the former was needed for the Knowtator v2 annotation tool we used to create the MONDO annotations. Also note that MONDO imports classes from many external ontologies and terminologies, including BFO, CARO, ECTO, GO, IAO, PATO, and UBERON, among others. Since these non-MONDO classes were not used for the MONDO annotations, these should be ignored when parsing mondo.owl to extract information for dictionary construction; this can be done by examining the class IDs and only using those whose abbreviated forms begin with "MONDO_", e.g., MONDO_0001328.
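One possible way to apply this filter is sketched below with Python's standard XML parser. It assumes that mondo.owl is serialized as RDF/XML with named classes given as owl:Class elements carrying an rdf:about IRI; the function name `mondo_classes` and the labels in the inline sample are placeholders of our own, not content taken from MONDO.

```python
import xml.etree.ElementTree as ET

OWL_NS = "http://www.w3.org/2002/07/owl#"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"


def mondo_classes(owl_xml):
    """Map abbreviated MONDO class IDs to their rdfs:label (or None if absent)."""
    root = ET.fromstring(owl_xml)
    result = {}
    for cls in root.iter(f"{{{OWL_NS}}}Class"):
        iri = cls.get(f"{{{RDF_NS}}}about")
        if iri is None:
            continue  # anonymous class expression, not a named class
        short_id = iri.rsplit("/", 1)[-1]
        if not short_id.startswith("MONDO_"):
            continue  # imported class (e.g. UBERON, GO), not used for annotation
        result[short_id] = cls.findtext(f"{{{RDFS_NS}}}label")
    return result


# Minimal stand-in for mondo.owl; the labels are placeholders.
SAMPLE_OWL = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/MONDO_0001328">
    <rdfs:label>placeholder disease label</rdfs:label>
  </owl:Class>
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/UBERON_0000062">
    <rdfs:label>placeholder imported label</rdfs:label>
  </owl:Class>
</rdf:RDF>"""

kept = mondo_classes(SAMPLE_OWL)
print(kept)
```

For the full ontology file, a streaming parser such as `ET.iterparse` may be preferable to loading the whole document at once.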
The MONDO concept annotations are currently provided in two sets, one including annotations of mentions of MONDO concepts within genotype specifications, and the other without these. An example of such a MONDO genotype annotation is "TTD" within "XpdTTD" (in which "TTD" is typeset as a superscript), denoting the allele of the Xpd gene associated with trichothiodystrophy (TTD). Though we believe these are actual mentions of MONDO disease concepts, we have decided to also provide a set of MONDO annotations without them because such mentions appear in the plain-text article files without the helpful formatting one would see, e.g., on a Web page; for instance, the aforementioned example appears as the unformatted string "XpdTTD" in the corresponding plain-text article file. Without the associated formatting these mentions may be much more difficult to reliably detect, and there are several articles within the corpus that contain many such annotations, potentially significantly affecting performance if not detected. (Note, however, that the CRAFT distribution does provide typographic annotations for the articles of the corpus (in CRAFT/structural-annotation/sections-and-typography/knowtator), which would be useful to locate typographic discontinuities.) If not using the typographic annotations, we recommend using the MONDO set excluding the annotations within genotypes.
Finally, note that no extension classes are yet used for the MONDO annotations.
Although there are only several hundred MOP annotations appearing in the corpus, we nevertheless believe this can be a useful ontology for concept recognition in text, particularly for chemically oriented text.
Note that the MOP ontology imports classes from external ontologies, specifically the BFO and CHEBI ontologies. Since these non-MOP classes were not used for the MOP annotations, these should be ignored when parsing either MOP.obo or MOP+extensions.obo to extract information for dictionary construction; this can be done by examining the class prefixes in the ID fields. However, if parsing the latter, the purposely created MOP extension classes (all of which have namespaces ending in "_EXT") should not be ignored in the process.
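The prefix check described above could look like the following sketch. The helper name `keep_class` is our own, and the class IDs in the sample list are illustrative only (the MOP_EXT ID in particular is invented).

```python
def keep_class(class_id, native_prefix):
    """Keep classes native to the ontology and any purposely created extension class."""
    prefix = class_id.split(":", 1)[0]
    return prefix == native_prefix or prefix.endswith("_EXT")


# Illustrative IDs only; MOP_EXT:hypothetical_class is not a real class.
sample_ids = [
    "MOP:0000568",
    "CHEBI:24431",
    "BFO:0000001",
    "MOP_EXT:hypothetical_class",
]
kept_mop_ids = [cid for cid in sample_ids if keep_class(cid, "MOP")]
print(kept_mop_ids)
```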
The structure of the NCBITaxon annotations has been changed for v3.0 of the corpus: Previously, all NCBITaxon annotations were annotated using a generic organism class, with the NCBITaxon ID specified as an attribute of this class. (That was only done for unimportant engineering reasons.) Now, each annotation is directly annotated with the appropriate NCBITaxon ID, just as in the concept annotations for all of the other ontologies. We believe this new consistency will make these annotations more convenient to use and process.
Also, there are two large subhierarchies directly under the root, "other sequences" (NCBITaxon:28384) and "unclassified sequences" (NCBITaxon:12908), that were purposely not used for concept annotation in the CRAFT Corpus. Therefore, users are advised to ignore/filter out these classes and all of their (direct and indirect) subclasses when making use of the NCBITaxon annotations.
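Filtering out a class together with all of its direct and indirect subclasses amounts to a graph traversal over the is_a links. The sketch below assumes the parent links have already been extracted from the ontology into a dictionary mapping each class ID to the set of its direct superclasses; the function name and the toy IDs in the sample are hypothetical and do not reflect real taxonomy edges.

```python
from collections import defaultdict

# The two roots named in the text above.
EXCLUDED_ROOTS = {"NCBITaxon:28384", "NCBITaxon:12908"}


def with_descendants(parent_of, roots):
    """Return the roots plus every class reachable from them via child links."""
    children = defaultdict(set)
    for child, parents in parent_of.items():
        for parent in parents:
            children[parent].add(child)
    seen = set(roots)
    stack = list(roots)
    while stack:
        node = stack.pop()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


# Toy hierarchy with invented IDs, for demonstration only.
toy_parent_of = {
    "NCBITaxon:28384": {"NCBITaxon:1"},
    "NCBITaxon:12908": {"NCBITaxon:1"},
    "NCBITaxon:toy_vector": {"NCBITaxon:28384"},
    "NCBITaxon:toy_plasmid": {"NCBITaxon:toy_vector"},
    "NCBITaxon:toy_organism": {"NCBITaxon:1"},
}

excluded = with_descendants(toy_parent_of, EXCLUDED_ROOTS)
usable = sorted(set(toy_parent_of) - excluded)
print(usable)
```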
The overwhelming majority of the classes of the PR ontology represent types of proteins; however, as in v1.0, we have also used them to mark up their corresponding genes and transcripts, as differentiation among these types of sequences is a long-known issue in biomedical text annotation (which we are cognizantly sidestepping). Thus, since we have expanded the semantics of these PR classes, we have analogously created parallel PR_EXT classes for the entire protein (PR:000000001) hierarchy. As for the parallel hierarchies of CHEBI, GO_MF, and SO extension classes, these have the same numeric IDs as their corresponding PR classes but a different prefix; e.g., while the original PR:000005403 represents B subunits of chromatin assembly factor 1, PR_EXT:000005403 represents these proteins as well as the corresponding genes and transcripts that code for them. The annotations created without PR extension classes are simpler in that they use the original PR classes (i.e., instead of the parallel PR_EXT classes), while we were more ontologically rigorous with the PR+extensions annotations, which use the parallel PR_EXT classes. In PR+extensions.obo, the original PR classes appear rather than the PR_EXT classes used in the annotations, as we sought to minimize changes to the ontology. However, the user can simply use PR+extensions.obo as it is for PR concept recognition. Then, prior to comparing any automatically generated concept annotations to the CRAFT gold standard, the mappings in the included unused_classes_and_substitute_extension_classes_for_PR+extensions_annotations.txt (which includes mappings to both the automatically created parallel PR_EXT classes as well as to additional manually created PR extension classes) should be used to check if any classes used for the annotations should be replaced with their mapped extension classes or if any annotations should be deleted.
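The post-processing step described above might be sketched as follows. Because the layout of the mapping file is not described here, the sketch does not parse it; it assumes the file's contents have already been loaded into a dictionary of substitutions and a set of unused classes with no substitute. The function name, the annotation tuples, and the two "hypothetical" class IDs are our own inventions; only the PR:000005403 mapping comes from the text.

```python
def reconcile(annotations, substitutes, unused):
    """Replace mapped classes with their extension classes; drop unused ones.

    annotations: iterable of (start, end, class_id) tuples
    substitutes: dict mapping an original class ID to its extension class ID
    unused:      set of class IDs whose annotations should be deleted
    """
    reconciled = []
    for start, end, class_id in annotations:
        if class_id in substitutes:
            class_id = substitutes[class_id]
        elif class_id in unused:
            continue  # no substitute class, so the annotation is deleted
        reconciled.append((start, end, class_id))
    return reconciled


substitutes = {"PR:000005403": "PR_EXT:000005403"}
unused = {"PR:hypothetical_unused_class"}  # invented ID for illustration

predicted = [
    (0, 6, "PR:000005403"),
    (10, 15, "PR:hypothetical_unused_class"),
    (20, 24, "PR_EXT:hypothetical_manual_class"),
]
reconciled = reconcile(predicted, substitutes, unused)
print(reconciled)
```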
Also, note that the PR ontology imports many classes from a number of external terminologies, identifiable by their different class namespace prefixes (e.g., BFO, CGNC, EcoGene, FlyBase). Since these non-PR classes were not used for the PR annotations, these should be ignored when parsing either PR.obo or PR+extensions.obo to extract information for dictionary construction; this can be done by examining the class prefixes in the ID fields. However, if parsing the latter, the purposely created extension PR classes (all of which have namespaces ending in "_EXT") should not be ignored in the process.
Note also that nearly all of the PR annotations use taxon-nonspecific PR protein classes. There are only two types of uses of taxon-specific PR protein classes, one being for names that are unique to a particular taxon (as opposed to names that are shared among a diverse set of homologs), e.g., doublesex (fruit fly), Flp (yeast), BamHI (B. amyloliquefaciens). Taxon-specific PR classes are also used for mentions in which the taxon is attached to the protein mention in abbreviated form, e.g., hPLAP, which is annotated with PR:'alkaline phosphatase, placental type (human)'. Additionally, nearly all of the PR annotations use isoform-nonspecific PR protein classes. (In the PR, all isoform-specific protein classes and most taxon-specific protein classes are subclasses of their corresponding nonspecific protein classes.)
Many of the protein classes of the PR have their corresponding gene names and acronyms stored as related synonyms. Since we have used the PR protein classes to also mark up corresponding genes and transcripts, these related synonyms are very likely to be useful; therefore, unlike the other ontologies, we recommend also using the related synonyms for PR concept recognition in text.
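As a rough illustration of including related synonyms when building a PR dictionary, the sketch below collects the name plus the synonyms of selected scopes from the lines of one OBO stanza. It assumes the standard OBO synonym line layout (quoted text followed by a scope keyword); the function name, the default choice of scopes, and the sample stanza lines are our own assumptions and are not copied from PR.obo.

```python
import re

# synonym: "text" SCOPE [optional type] [xrefs]
SYNONYM_RE = re.compile(
    r'^synonym: "((?:[^"\\]|\\.)*)" (EXACT|BROAD|NARROW|RELATED)\b'
)


def dictionary_terms(stanza_lines, scopes=("EXACT", "RELATED")):
    """Collect the class name and the synonyms whose scope is in `scopes`."""
    terms = []
    for line in stanza_lines:
        if line.startswith("name: "):
            terms.append(line[len("name: "):])
            continue
        match = SYNONYM_RE.match(line)
        if match and match.group(2) in scopes:
            terms.append(match.group(1))
    return terms


# Illustrative stanza lines; the synonyms are examples, not quoted from PR.obo.
sample_stanza = [
    "id: PR:000005403",
    "name: chromatin assembly factor 1 subunit B",
    'synonym: "CHAF1B" RELATED []',
    'synonym: "hypothetical broad name" BROAD []',
]

pr_terms = dictionary_terms(sample_stanza)
print(pr_terms)
```

Restricting `scopes` to `("EXACT",)` would reproduce a dictionary without the gene names and acronyms.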
Analogous to the PR annotations, SO classes that represent RNA sequence molecules and sequence features are also used to mark up the corresponding DNA that codes for them (and vice versa, in the case of reverse transcription), and SO classes that represent peptide sequence molecules and sequence features are also used to mark up the corresponding DNA and RNA that code for them. Thus, since we have expanded the semantics of these classes, we have analogously created parallel SO_EXT classes for all of the classes of the sequence_collection (SO:0001260), sequence_feature (SO:0000110), and sequence_variant (SO:0001060) hierarchies. As for the parallel hierarchies of CHEBI, GO_MF, and PR extension classes, these have the same numeric IDs as their corresponding SO classes but a different prefix; e.g., while the original SO:0001528 represents nuclear localization signals in peptides, SO_EXT:0001528 represents these signals as well as corresponding DNA and RNA sequences that code for them. The annotations created without SO extension classes are simpler in that they use the original SO classes (i.e., instead of the parallel SO_EXT classes), while we were more ontologically rigorous with the SO+extensions annotations, which use the parallel SO_EXT classes. In SO+extensions.obo, the original SO classes appear rather than the SO_EXT classes used in the annotations, as we sought to minimize changes to the ontology. However, the user can simply use SO+extensions.obo as it is for SO concept recognition. 
Then, prior to comparing any automatically generated concept annotations to the CRAFT gold standard, the mappings in the included unused_classes_and_substitute_extension_classes_for_SO+extensions_annotations.txt (which includes mappings to both the automatically created parallel SO_EXT classes as well as to additional manually created SO extension classes) should be used to check if any classes used for the annotations should be replaced with their mapped extension classes or if any annotations should be deleted.
This is a new set of concept annotations in the CRAFT Corpus, as compared to versions 1.0 and 2.0.