Selecting Vocabulary Terms
==========================

Ontologies: what and why?
-------------------------

Tripal 3 requires all bundles and fields to be associated with a Controlled Vocabulary (CV). CVs are dictionaries of defined terms (CV terms) that make data machine-accessible, ensuring uniform terms are used across experiments, organisms and websites. Without CV terms, our scientific knowledge might be split by "dialects". Plant biologists might study temperature stress, while animal biologists study heat shock. Each group might benefit from the knowledge of the other, but they use a different vocabulary to describe the same thing, creating challenges for data discovery and exchange. CV terms make this easier not just for people, but especially for machines. Ontologies take this a step further. Where CVs are controlled lists of CV terms, ontologies are a controlled language that includes hierarchical relationships between terms.

Tripal leverages vocabularies to make use of the `Semantic Web <https://en.wikipedia.org/wiki/Semantic_Web>`_. Every bundle and field defined in Tripal will be associated with a CV term. Therefore, it is important to find community-developed terms. The `EMBL EBI Ontology Lookup Service <http://www.ebi.ac.uk/ols/index>`_ provides an easy location to search for and identify terms. When choosing terms for new Bundles and Fields, think carefully about the terms you will use to describe your objects. Selecting the CV term that best describes the data may be the most challenging part of creating custom Bundles and Fields!

Before you can create a new Bundle or Field, the vocabulary term must be present in your local Tripal site. You can check whether a term exists by using the Tripal lookup service on your local site at the URL path ``cv/lookup`` (e.g. http://your-site/cv/lookup). If the term is not present then you'll need to add it. You can do so manually by using Tripal's controlled vocabulary admin pages. For creating new bundles this is all you need to do. However, when creating Fields you will want to add the term programmatically. This is important because Fields are meant to be shared. If you create an awesome field that you want to share with others, the terms it depends on must be added by code so that they exist on every site that installs it. The following sections describe how terms are stored in Chado and how you can add them using Tripal API calls.

Storage of Terms in Chado
-------------------------

In Chado, CVs are stored in two tables: the **db** and **cv** tables. Chado was designed to store a record for the online database that a vocabulary lives at in the **db** table, and the namespace of a vocabulary in the **cv** table. For example, the Sequence Ontology uses the namespace ``sequence``, which is stored in the **cv** table, but uses the short name ``SO``, which is stored in the **db** table. As we'll see later, the distinction between what gets stored in the **cv** and **db** tables can get a bit fuzzy with some vocabularies. The terms themselves are stored in the **cvterm** table. The **cvterm** table has a foreign key to the **cv** table via the **cv_id** field. Every controlled vocabulary term has an accession. For example, the term gene in the Sequence Ontology has an accession number of SO:0000704. Accession numbers consist of two parts: a vocabulary "short name", followed by a unique identifier, separated by a colon. Within Chado, the accession for any term is stored in the **dbxref** table. This table has a foreign key to the **db** table via the **db_id** field, and the **cvterm** table in turn references the **dbxref** table via its **dbxref_id** field.

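
The two-part structure of an accession can be illustrated with a small, self-contained sketch. The function ``split_term_accession`` is a hypothetical helper written for this illustration and is not part of the Tripal API; it only shows which half of an accession corresponds to the **db** record and which half to the **dbxref** record.

.. code-block:: php

  <?php
  /**
   * Hypothetical helper (not a Tripal function): split a full accession
   * such as "SO:0000704" into its short name and its unique identifier.
   */
  function split_term_accession(string $full_accession): array {
    // Split on the first colon only, in case the identifier contains one.
    $parts = explode(':', $full_accession, 2);
    if (count($parts) !== 2 || $parts[0] === '' || $parts[1] === '') {
      throw new InvalidArgumentException("Not a valid accession: $full_accession");
    }
    return ['short_name' => $parts[0], 'accession' => $parts[1]];
  }

  $parts = split_term_accession('SO:0000704');
  // $parts['short_name'] is 'SO', the value held in the db table.
  // $parts['accession'] is '0000704', the value held in the dbxref table.
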
Vocabulary Short Names and Namespaces
-------------------------------------

How can you tell what the **short name** and **namespace** values will be for a vocabulary term that you want to insert into Chado for your custom Bundle or Field? Hint: use the information in the EMBL-EBI `Ontology Lookup Service <http://www.ebi.ac.uk/ols/index>`_ (OLS). The following sections walk through five different cases.

Case 1: Ontologies without a defined namespace
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Consider the term for `organism <http://www.ebi.ac.uk/ols/ontologies/obi/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FOBI_0100026>`_.

.. figure:: select_vocab_terms.1.organism.png

Notice how the teal box (the **short name**) is OBI, and the orange box contains the **full accession**, OBI:0100026, which includes both the **short name** and the unique term **accession** value. Unfortunately, the OLS does not indicate the **namespace** for this term. So, as a rule, we will use the short name converted to lower case. Before using this term in a Tripal Bundle or Field you may need to insert it into Chado. You can do so in your custom module code using the **tripal_insert_cvterm** function. The following provides a demonstration:

.. code-block:: php

  // Insert the OBI term for "organism" into Chado.
  $term = tripal_insert_cvterm([
    'id' => 'OBI:0100026',
    'name' => 'organism',
    'cv_name' => 'OBI',
    'definition' => 'A material entity that is an individual living system, such as animal, plant, bacteria or virus, that is capable of replicating or reproducing, growth and maintenance in the right environment. An organism may be unicellular or made up, like humans, of many billions of cells divided into specialized tissues and organs.',
  ]);

Note that in the code above the namespace is provided as the **cv_name** element and the full accession (including the short name) is provided as the **id** element. In this case the OBI CV already exists by default in the Tripal database, so we did not need to add the vocabulary record. If the OBI did not exist, we could have added it using the following API calls. First, we insert the "database" record for the ontology.

.. code-block:: php

  // Insert the "database" record that describes where the ontology lives online.
  tripal_insert_db(array(
    'name' => 'obi',
    'description' => 'The Ontology for Biomedical Investigation.',
    'url' => 'http://obi-ontology.org/page/Main_Page',
    'urlprefix' => 'http://purl.obolibrary.org/obo/{db}_{accession}',
  ));

Notice here that the **name** element is the **namespace** (short name converted to lower case) for the vocabulary. The **url** is the web address for the ontology online. The **urlprefix** is a URL template used to construct a link that takes the user to any term in the vocabulary. Almost all vocabularies have a common URL pattern for all terms. Tripal will automatically substitute the short name into the **{db}** token and the term **accession** into the **{accession}** token to generate the URL.

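
To make the token substitution concrete, here is a minimal sketch of how a ``urlprefix`` template expands into a term URL. The function ``build_term_url`` is a hypothetical stand-in written for this illustration; Tripal performs the substitution internally, and this is not its actual implementation.

.. code-block:: php

  <?php
  /**
   * Hypothetical helper (not a Tripal function): expand the {db} and
   * {accession} tokens of a urlprefix template into a full term URL.
   */
  function build_term_url(string $urlprefix, string $short_name, string $accession): string {
    // strtr() replaces each token with its value in a single pass.
    return strtr($urlprefix, [
      '{db}' => $short_name,
      '{accession}' => $accession,
    ]);
  }

  $url = build_term_url('http://purl.obolibrary.org/obo/{db}_{accession}', 'OBI', '0100026');
  // $url is 'http://purl.obolibrary.org/obo/OBI_0100026'

The result matches the address of the organism term used in the OLS link earlier in this section.
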
Second, we insert the record for the controlled vocabulary.

.. code-block:: php

  // Insert the controlled vocabulary record: name first, then its description.
  tripal_insert_cv(
    'OBI',
    'Ontology for Biomedical Investigation. The Ontology for Biomedical Investigations (OBI) is build in a collaborative, international effort and will serve as a resource for annotating biomedical investigations, including the study design, protocols and instrumentation used, the data generated and the types of analysis performed on the data. This ontology arose from the Functional Genomics Investigation Ontology (FuGO) and will contain both terms that are common to all biomedical investigations, including functional genomics investigations and those that are more domain specific.'
  );

Case 2: Ontologies with a defined namespace
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Consider the entry for `CDS <https://www.ebi.ac.uk/ols/ontologies/so/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FSO_0000316>`_.

.. figure:: select_vocab_terms.2.cds.png

Notice that in the Term Info box on the right there is the term **has_obo_namespace**, which is defined as the word ``sequence``. This is much better than the organism example from the OBI. We now know the correct namespace for the term! By default, Tripal loads the Sequence Ontology during install. However, if this term were not already loaded, we could add it with the following:

.. code-block:: php

  // Insert the Sequence Ontology term for "CDS" using its declared namespace.
  $term = tripal_insert_cvterm([
    'id' => 'SO:0000316',
    'name' => 'CDS',
    'cv_name' => 'sequence',
    'definition' => 'A contiguous sequence which begins with, and includes, a start codon and ends with, and includes, a stop codon. [ http://www.sequenceontology.org/browser/current_svn/term/SO:ma ].',
  ]);

Notice that in the code above we can set the **cv_name** to ``sequence``, the namespace reported by the OLS.

Case 3: Ontologies with multiple namespaces
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some ontologies are split into sub-ontologies. This includes the Gene Ontology (GO). Let's consider the example GO term `cell aggregation <http://www.ebi.ac.uk/ols/ontologies/go/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FGO_0098743>`_. Looking at the EBI entry, the teal box is GO, the orange box is GO:0098743, and the **has_obo_namespace** is ``biological_process``. However, the GO provides two other namespaces: ``cellular_component`` and ``molecular_function``. Be sure to pay attention to these different namespaces if you ever need to manually insert a term.

.. figure:: select_vocab_terms.3.go.png

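
As a rough sketch of what a manual insert of this GO term might look like, the example below builds the argument array for **tripal_insert_cvterm** and checks the namespace against the three GO namespaces named above. The helper ``build_go_term_args`` is hypothetical and exists only for this illustration. The **definition** element is left out here rather than guessed; copy it from the OLS entry before using the code on a real site.

.. code-block:: php

  <?php
  /**
   * Hypothetical helper (not a Tripal function): build the argument array
   * for tripal_insert_cvterm() and reject namespaces the GO does not define.
   */
  function build_go_term_args(string $id, string $name, string $namespace): array {
    $allowed = ['biological_process', 'cellular_component', 'molecular_function'];
    if (!in_array($namespace, $allowed, true)) {
      throw new InvalidArgumentException("Unknown GO namespace: $namespace");
    }
    // 'definition' is omitted in this sketch; copy it from the OLS entry.
    return ['id' => $id, 'name' => $name, 'cv_name' => $namespace];
  }

  $args = build_go_term_args('GO:0098743', 'cell aggregation', 'biological_process');

  // Only call the Tripal API when this runs inside a Tripal site.
  if (function_exists('tripal_insert_cvterm')) {
    $term = tripal_insert_cvterm($args);
  }

The point of the check is that the **cv_name** for a GO term is one of the sub-ontology namespaces, not the short name GO.
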
Case 4: Ontologies with multiple short names
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The EDAM ontology builds its term accessions using several different short names instead of the name of the ontology itself. Consider the EDAM term for `Sequence <http://www.ebi.ac.uk/ols/ontologies/edam/terms?iri=http%3A%2F%2Fedamontology.org%2Fdata_2044>`_. The teal box is EDAM, the orange box is data:2044, and there is no **namespace**.

.. figure:: select_vocab_terms.4.edam.png

For this case, the **namespace** is EDAM, the **short name** is ``data``, and the accession is 2044. Unfortunately, this breaks the paradigm that Chado expects. Typically the **short name** is the value in the teal box (EDAM). To make Chado handle ontologies like this properly, we are forced to reverse the **short name** and **namespace** values when creating our records:

.. code-block:: php

  // The term: "data" acts as the short name, "EDAM" as the namespace.
  $term = tripal_insert_cvterm([
    'id' => 'data:2044',
    'name' => 'sequence',
    'cv_name' => 'EDAM',
    'definition' => 'One or more molecular sequences, possibly with associated annotation.',
  ]);

  // The "database" record is named after the accession prefix.
  tripal_insert_db(array(
    'name' => 'data',
    'description' => 'Bioinformatics operations, data types, formats, identifiers and topics.',
    'url' => 'http://edamontology.org/page',
    'urlprefix' => 'http://edamontology.org/{db}_{accession}',
  ));

  // The controlled vocabulary record is named after the ontology.
  tripal_insert_cv(
    'EDAM',
    'EDAM is an ontology of well established, familiar concepts that are prevalent within bioinformatics, including types of data and data identifiers, data formats, operations and topics. EDAM is a simple ontology - essentially a set of terms with synonyms and definitions - organised into an intuitive hierarchy for convenient use by curators, software developers and end-users. EDAM is suitable for large-scale semantic annotations and categorization of diverse bioinformatics resources. EDAM is also suitable for diverse application including for example within workbenches and workflow-management systems, software distributions, and resource registries.'
  );

Case 5: You really can't find a term!
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sometimes a good CV term just doesn't exist for what you want to describe. If you can't find one, you can insert a term into the "local" CV. This is meant to be used as a last resort. Before you use a local term, consider contributing the term to an existing CV or ontology. Terms that are invented for a local site may mean that the data exposed by your site cannot be discovered by other sites or tools. For a local term, the accession is not numeric but is the same as the term name.

.. code-block:: php

  // Insert a site-specific term into the "local" vocabulary (last resort).
  $term = tripal_insert_cvterm([
    'id' => 'local:shame_on_you',
    'name' => 'shame_on_you',
    'cv_name' => 'local',
    'definition' => 'You should really find a good CVterm.',
  ]);

Notice that in the above code the **short name** and **namespace** are both "local", as this is a local term on the site.