Genomes and Genes
=================

Loading Feature Data
--------------------

Now that we have our organism and whole genome analysis ready, we can begin loading genomic data. For this tutorial only a single gene from sweet orange will be loaded into the database. This keeps the tutorial quick to work through. The following datasets will be used for this tutorial:

- `Citrus sinensis-orange1.1g015632m.g.gff3 <http://tripal.info/sites/default/files/Citrus_sinensis-orange1.1g015632m.g.gff3>`_
- `Citrus sinensis-scaffold00001.fasta <http://tripal.info/sites/default/files/Citrus_sinensis-scaffold00001.fasta>`_
- `Citrus sinensis-orange1.1g015632m.g.fasta <http://tripal.info/sites/default/files/Citrus_sinensis-orange1.1g015632m.g.fasta>`_

One of the new features available in many of the Tripal v3 data loaders is an HTML5 file upload element which allows administrators and users to upload large files reliably. This removes the requirement in previous versions of this tutorial to download these files directly on the server and provide a path to the file. Instead, if you have the file on your local machine you can now simply upload it for loading.

Another new option in the Tripal v3 data loaders is the ability to provide a remote path to the file to be loaded. This removes the need to transfer large files multiple times and eases the loading process.

Loading a GFF3 File
-------------------

The gene features (e.g. gene, mRNA, 5_prime_UTR, CDS, 3_prime_UTR) are stored in the GFF3 file downloaded in the previous step. We will load this GFF3 file and consequently load our gene features into the database. Navigate to **Tripal → Data Loaders → Chado GFF3 Loader**.

.. image:: genomes_genes.1.png

Enter the following:

.. csv-table::
  :header: "Field Name", "Value"

  "File", "Upload the file named Citrus_sinensis-orange1.1g015632m.g.gff3"
  "Analysis", "Whole Genome Assembly and Annotation of Citrus sinensis"
  "Organism", "Citrus sinensis"
  "All other options", "leave as default"

Finally, click the **Import GFF3 file** button. You'll notice a job was submitted to the jobs subsystem. To complete the process, the job needs to run. We'll run it manually:

::

  drush trp-run-jobs --username=administrator --root=/var/www/html

You should see output similar to the following:

::

  Tripal Job Launcher
  Running as user 'administrator'
  -------------------
  2018-06-29 18:00:50: There are 1 jobs queued.
  2018-06-29 18:00:50: Job ID 8.
  2018-06-29 18:00:50: Calling: tripal_run_importer(12)
  Running 'Chado GFF3 File Loader' importer
  NOTE: Loading of file is performed using a database transaction.
  If it fails or is terminated prematurely then all insertions and
  updates are rolled back and will not be found in the database
  Opening /var/www/html/sites/default/files/tripal/users/1/Citrus_sinensis-orange1.1g015632m.g.gff3
  Percent complete: 100.00%. Memory: 32,211,360 bytes.
  Adding protein sequences if CDS exist and no proteins in GFF...
  Setting ranks of children...
  Done.
  Remapping Chado Controlled vocabularies to Tripal Terms...
  Done.

.. note::

  For very large GFF3 files the loader can take quite a while to complete.

Loading FASTA files
-------------------

Using the Tripal GFF3 loader we were able to populate the database with the genomic features for our organism. However, those features now need nucleotide sequence data. To add it, we will load the nucleotide sequences for the mRNA features and the scaffold sequence. Navigate to **Tripal → Data Loaders → Chado FASTA Loader**.

.. image:: genomes_genes.2.png

Before loading a FASTA file we must know the Sequence Ontology (SO) term that describes the sequences we are about to upload. We can find the appropriate SO terms in our GFF file, where the feature types that correspond to our FASTA files are 'scaffold' and 'mRNA'.

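The feature types present in a GFF3 file can be listed before opening the loader form. The following is a minimal Python sketch and is not part of Tripal. The function name ``feature_types`` is hypothetical, and the second sample row is an invented placeholder; only the mRNA row is taken from this tutorial's GFF file. The sketch tallies column 3, which holds the type of each feature, and assumes tab-delimited columns as the GFF3 format requires.

```python
from collections import Counter


def feature_types(lines):
    """Tally the feature type (column 3) of each GFF3 data line."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment or directive lines such as "##gff-version 3".
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        # A GFF3 data line has exactly nine tab-delimited columns.
        if len(columns) == 9:
            counts[columns[2]] += 1
    return counts


sample = [
    "##gff-version 3",
    # The mRNA row from this tutorial's GFF file.
    "scaffold00001\tphytozome6\tmRNA\t4058460\t4062210\t.\t+\t.\t"
    "ID=PAC:18136217;Name=orange1.1g015632m;PACid=18136217;Parent=orange1.1g015632m.g",
    # Placeholder row, invented for illustration only.
    "scaffold00001\tphytozome6\tCDS\t1\t100\t.\t+\t0\tID=example_cds;Parent=PAC:18136217",
]

print(feature_types(sample))
```

To inspect a real file, pass an open file handle instead of the ``sample`` list, for example ``feature_types(open(path))`` where ``path`` points at the downloaded GFF3 file.
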
.. note::

  Before importing, make sure the FASTA loader will be able to match each sequence in the FASTA file with an existing feature in the database. Take special care that the definition line of each sequence uniquely identifies the feature for the specific organism and sequence type.

For example, in our GFF file an mRNA feature appears as follows:

::

  scaffold00001 phytozome6 mRNA 4058460 4062210 . + . ID=PAC:18136217;Name=orange1.1g015632m;PACid=18136217;Parent=orange1.1g015632m.g

Note that for this mRNA feature the ID is **PAC:18136217** and the name is **orange1.1g015632m**. In Chado, features always have a human readable name, which does not need to be unique, and a unique name, which must be unique for the organism and SO type. When the GFF file is loaded, the ID becomes the unique name and the Name becomes the human readable name.

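The mapping from GFF attributes to Chado names can be illustrated with a short Python sketch. This is not how Tripal itself parses the file; ``parse_gff3_attributes`` is a hypothetical helper, and it ignores URL-escaped characters and multi-valued attributes for brevity.

```python
def parse_gff3_attributes(column9):
    """Split a GFF3 attributes column into a dict of tag -> value."""
    attributes = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, _, value = pair.partition("=")
        attributes[tag] = value
    return attributes


# Column 9 of the mRNA feature shown above.
attrs = parse_gff3_attributes(
    "ID=PAC:18136217;Name=orange1.1g015632m;PACid=18136217;Parent=orange1.1g015632m.g"
)
unique_name = attrs["ID"]  # stored as the Chado unique name
name = attrs["Name"]       # stored as the human readable name
print(unique_name, name)
```
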
In our FASTA file the definition line for this mRNA is:

::

  >orange1.1g015632m PAC:18136217 (mRNA) Citrus sinensis

By default Tripal matches a sequence in a FASTA file with the feature whose name or unique name equals the first word in the definition line. In this case the first word is **orange1.1g015632m**. As defined in the GFF file, the name and unique name are different for this mRNA, and we can see that the first word in the definition line is the name and the second is the unique name. Therefore, when we load the FASTA file we should specify that we are matching by name, because the name appears first in the definition line.

If, however, we cannot guarantee that the feature name is unique, we can use a regular expression in the **Advanced Options** to tell Tripal where to find the name or unique name in the definition line of the FASTA file.

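The two matching strategies can be sketched in Python. This is only an illustration: ``first_word`` and ``UNIQUE_NAME_RE`` are hypothetical names, the pattern is one possible choice for this particular definition line, and Tripal evaluates its patterns with PHP rather than Python. The optional ``>`` in the pattern is an assumption that lets it work whether or not the leading marker is present.

```python
import re


def first_word(defline):
    """Return the first whitespace-delimited word of a FASTA definition line."""
    return defline.lstrip(">").split()[0]


defline = ">orange1.1g015632m PAC:18136217 (mRNA) Citrus sinensis"

# Default behaviour described above: match on the first word (the name).
print(first_word(defline))

# Hypothetical pattern: skip the first word and capture the second,
# which in this file is the unique name.
UNIQUE_NAME_RE = re.compile(r"^>?\S+\s+(\S+)")
match = UNIQUE_NAME_RE.match(defline)
unique_name = match.group(1) if match else None
print(unique_name)
```
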
.. note::

  When loading FASTA files for features that have already been loaded via a GFF file, always choose "Update only" as the import method. Otherwise, Tripal may add the features in the FASTA file as new features if it cannot properly match them to existing features.

Now, enter the following values in the fields on the web form:

.. csv-table::
  :header: "Field Name", "Value"

  "FASTA file", "Upload the file named Citrus_sinensis-scaffold00001.fasta"
  "Analysis", "Whole Genome Assembly and Annotation of Citrus sinensis"
  "Organism", "Citrus sinensis (Sweet orange)"
  "Sequence type", "supercontig (scaffold is an alias for supercontig in the Sequence Ontology)"
  "Method", "Update only (we do not want to insert these as they should already be there)"
  "Name Match Type", "Name"

Click the **Import FASTA file** button, and a job will be added to the jobs system. Run the job:

::

  drush trp-run-jobs --username=administrator --root=/var/www/html

Notice that the loader reports that it "Found 1 sequences(s).". Next, fill out the same form for the mRNA (transcripts) FASTA file:

.. csv-table::
  :header: "Field Name", "Value"

  "FASTA file", "Upload the file named Citrus_sinensis-orange1.1g015632m.g.fasta"
  "Analysis", "Whole Genome Assembly and Annotation of Citrus sinensis"
  "Organism", "Citrus sinensis (Sweet orange)"
  "Sequence type", "mRNA"
  "Method", "Update only"
  "Name Match Type", "Name"

The FASTA loader has some advanced options. These allow you to create relationships between features and to associate features with external databases. For example, the definition line for the mRNA in our FASTA file is:

::

  >orange1.1g015632m PAC:18136217 (mRNA) Citrus sinensis

Here we have more information than just the feature name. We have a unique Phytozome accession number (PAC:18136217) for the mRNA. Using the **External Database Reference** section under **Additional Options** we can import this information to associate the Phytozome accession with the features. A regular expression is required to uniquely capture that ID. In the example above the unique accession is 18136217. Because Tripal is a PHP application, regular expressions follow the PHP syntax, which is documented in the `PHP manual <http://php.net/manual/en/reference.pcre.pattern.syntax.php>`_. Enter the following values to make the association between the mRNA and its corresponding accession at Phytozome:

.. csv-table::
  :header: "Field Name", "Value"

  "External Database", "Phytozome"
  "Regular expression for the accession", "``^.*PAC:(\d+).*$``"

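To check what the accession pattern captures before submitting the form, it can be tried against the definition line outside of Tripal. The sketch below uses Python's ``re`` module as a stand-in for PHP's PCRE engine; for a simple pattern like this one the two behave the same, but that equivalence is an assumption and does not hold for every PCRE feature.

```python
import re

# The same pattern entered in the "Regular expression for the accession" field.
ACCESSION_RE = re.compile(r"^.*PAC:(\d+).*$")

defline = "orange1.1g015632m PAC:18136217 (mRNA) Citrus sinensis"
match = ACCESSION_RE.match(defline)

# Group 1 is the part inside the parentheses: the digits after "PAC:".
accession = match.group(1) if match else None
print(accession)
```

Only the text captured by the parentheses is used as the accession, which is why the ``PAC:`` prefix sits outside the group.
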
Remember, the name **Phytozome** appears in the **External Database** drop down because we manually added it as a database cross reference earlier in the tutorial. After adding the values above, click the **Import FASTA file** button, and manually run the submitted job:

::

  drush trp-run-jobs --username=administrator --root=/var/www/html

Now the scaffold sequence and mRNA sequences are loaded!

.. note::

  If the name of the gene to which this mRNA belongs were also on the definition line, we could use the **Relationships** section in the **Advanced Options** to link this mRNA with its parent gene. Fortunately, this information is also in our GFF file and these relationships have already been made.

.. note::

  It is not required to load the mRNA sequences, as those can be derived from their alignments with the scaffold sequence. However, in Chado the **feature** table has a **residues** column, so it is best practice to load the sequence when possible.

Creating Gene Pages
----------------------

Now that we've loaded our feature data, we must publish it. This is different from when we manually created our Organism and Analysis pages. Using the GFF and FASTA loaders we imported our data into Chado, but there are currently no published pages for the data we loaded. To publish these genomic features, navigate to **Structure → Tripal Content Types** and click the link titled **Publish Chado Content**. The following page appears:

.. image:: genomes_genes.3.png

Here we can specify the types of content to publish. For our site we want to offer both gene and mRNA pages (these types were present in our GFF file). First, to create pages for genes, select 'Gene' from the dropdown. A new **Filters** section appears and, when opened, looks as follows.

.. image:: genomes_genes.4.png

The **Filters** section allows you to limit what you want to publish. For example, if you only want to publish genes for a single organism you can select that organism in the Organism drop down list. We only have one organism in our site, but for the sake of experience, add a filter to publish only genes for Citrus sinensis by selecting it from the Organism drop down. Scroll to the bottom and click the **Publish** button. A new job is added to the job queue. Manually run the job:

::

  drush trp-run-jobs --username=administrator --root=/var/www/html

You should see output similar to the following:

::

  Tripal Job Launcher
  Running as user 'administrator'
  -------------------
  Calling: tripal_chado_publish_records(Array, 12)
  NOTE: publishing records is performed using a database transaction.
  If the load fails or is terminated prematurely then the entire set of
  is rolled back with no changes to the database
  Succesfully published 1 Gene record(s).

Here we see that 1 gene was successfully published. This is because the GFF file we used previously to import the genes only had one gene present.

Now, repeat the steps above to publish the mRNA content type. You should see that 9 mRNA records were published:

::

  Tripal Job Launcher
  Running as user 'administrator'
  -------------------
  Calling: tripal_chado_publish_records(Array, 13)
  NOTE: publishing records is performed using a database transaction.
  If the load fails or is terminated prematurely then the entire set of
  is rolled back with no changes to the database
  Succesfully published 9 mRNA record(s).

.. note::

  It is not necessary to publish all types of features in the GFF file. For example, we do not want to publish features of type **scaffold**. The feature is large and would have many relationships to other features, as well as a very long nucleotide sequence. These can greatly slow down page loading, and in general would be overwhelming to the user to view on one page. As another example, each **mRNA** is composed of several **CDS** features. These **CDS** features do not need their own page and therefore do not need to be published.

Now we can view our gene and mRNA pages. Click the **Find Tripal Content** link. Find and click the new page titled **orange1.1g015632m.g**. Here we can see the gene feature we added and its corresponding mRNAs.

.. image:: genomes_genes.5.png

Next, find an mRNA page to view. Remember that when we loaded our FASTA file for the mRNA, we associated each record with Phytozome. On these mRNA pages you will see a link in the left sidebar titled **Database Cross Reference**. Clicking it opens a panel with a link to Phytozome. This link appears because:

- We added a Database Cross Reference for Phytozome in a previous step.
- We associated the Phytozome accession with the features using a regular expression when importing the FASTA file.

All data that appears on the page is derived from the GFF file and the FASTA files we loaded.