Bulk Loader
===========

.. note::

  Remember you must set the ``$DRUPAL_HOME`` environment variable if you want to cut-and-paste the commands below. See :doc:`./install_tripal/drupal_home`

The bulk loader is a tool that Tripal provides for loading data contained in tab-delimited files. Tripal supports loading of files in standard formats (e.g. ``FASTA``, ``GFF``, ``OBO``), but Chado can support a variety of different biological data types and there are often no community standard file formats for loading these data. For example, there is no standard file format for importing genotype and phenotype data. Those data can be stored in the feature, stock and natural diversity tables of Chado. The Bulk Loader was introduced in Tripal v1.1 and provides a web interface for building custom data loaders. In short, the site developer creates a bulk loader "template". This template can then be used and re-used for any tab-delimited file that follows the format described by the template. Additionally, bulk loading templates can be exported, allowing Tripal sites to share loaders with one another.

The following commands can be executed to install the Tripal Bulk Loader using Drush:

.. code-block:: bash

  cd /var/www/
  drush pm-enable tripal_bulk_loader

Plan How to Store Data
----------------------

To demonstrate use of the Bulk Loader, we will walk through a brief example that imports a list of organisms and associates them with their NCBI taxonomy IDs. The tab-delimited input file contains the list of all *Fragaria* (strawberry) species in NCBI at the time of the writing of this document. Click the file link below and download it to ``/var/www/html/sites/default/files``, or fetch it with the commands that follow the link.

* `Fragaria.txt <http://tripal.info/sites/default/files/book_pages/Fragaria_0.txt>`_

.. code-block:: bash

  cd $DRUPAL_HOME/sites/default/files
  wget http://tripal.info/sites/default/files/book_pages/Fragaria_0.txt

This file has three columns: NCBI taxonomy ID, genus and species:

.. csv-table:: Fragaria sample file

  3747, "Fragaria", "x ananassa"
  57918, "Fragaria", "vesca"
  60188, "Fragaria", "nubicola"
  64939, "Fragaria", "iinumae"
  64940, "Fragaria", "moschata"
  64941, "Fragaria", "nilgerrensis"
  64942, "Fragaria", "viridis"

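
Because the loader addresses the file by column number, it can help to confirm that every row really has three tab-separated columns before building a template. The following is a minimal sketch, not part of Tripal: the helper name ``count_malformed_rows`` and the two-row sample file are made up for illustration.

```shell
# Hypothetical helper: count rows that do not have exactly three tab-separated columns.
count_malformed_rows() {
  awk -F '\t' 'NF != 3 { bad++ } END { print bad + 0 }' "$1"
}

# Build a small sample from two of the rows shown in the table above.
sample=$(mktemp)
printf '3747\tFragaria\tx ananassa\n57918\tFragaria\tvesca\n' > "$sample"

# A well-formed file reports zero malformed rows.
count_malformed_rows "$sample"
```

To check the downloaded file instead of the sample, pass its path, for example ``count_malformed_rows "$DRUPAL_HOME/sites/default/files/Fragaria_0.txt"``.
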
To use the bulk loader you must be familiar with the Chado database schema and have an idea of where the data should be stored. It is best practice to consult the GMOD website or the Chado community (via the `gmod-schema mailing list <https://lists.sourceforge.net/lists/listinfo/gmod-schema>`_) when deciding how to store data. For this example, we want to add the species to Chado, and we want to associate the NCBI taxonomy ID with these organisms. The first step, therefore, is to decide where in Chado these data should go. In Chado, organisms are stored in the **organism** table. This table has the following fields:

`chado.organism Table Schema`

.. csv-table::
  :header: "Name", "Type", "Description"

  "organism_id", "serial", "PRIMARY KEY"
  "abbreviation", "character varying(255)", ""
  "genus", "character varying(255)", "UNIQUE#1 NOT NULL"
  "species", "character varying(255)", "UNIQUE#1 NOT NULL. A type of organism is always uniquely identified by genus and species. When mapping from the NCBI taxonomy names.dmp file, this column must be used where it is present, as the common_name column is not always unique (e.g. environmental samples). If a particular strain or subspecies is to be represented, this is appended onto the species name. Follows standard NCBI taxonomy pattern."
  "common_name", "character varying(255)", ""
  "comment", "text", ""

We can therefore store the second and third columns of the tab-delimited input file in the **genus** and **species** columns of the organism table.

In order to store a database external reference (such as the NCBI Taxonomy ID) we need to use the following tables: **db**, **dbxref**, and **organism_dbxref**. The **db** table will house the entry for the NCBI Taxonomy database; the **dbxref** table will house the entry for the taxonomy ID; and the **organism_dbxref** table will link the taxonomy ID stored in the **dbxref** table with the organism housed in the **organism** table. For reference, the fields of these tables are as follows:

`chado.db Table Schema`

.. csv-table::
  :header: "Name", "Type", "Description"

  "db_id", "serial", "PRIMARY KEY"
  "name", "character varying(255)", "UNIQUE NOT NULL"
  "description", "character varying(255)", ""
  "urlprefix", "character varying(255)", ""
  "url", "character varying(255)", ""

`chado.dbxref Table Schema`

.. csv-table::
  :header: "Name", "Type", "Description"

  "dbxref_id", "serial", "PRIMARY KEY"
  "db_id", "integer", "Foreign Key db. UNIQUE#1 NOT NULL"
  "accession", "character varying(255)", "UNIQUE#1 NOT NULL. The local part of the identifier. Guaranteed by the db authority to be unique for that db."
  "version", "character varying(255)", "UNIQUE#1 NOT NULL DEFAULT ''"
  "description", "text", ""

`chado.organism_dbxref Table Schema`

.. csv-table::
  :header: "Name", "Type", "Description"

  "organism_dbxref_id", "serial", "PRIMARY KEY"
  "organism_id", "integer", "Foreign key organism. UNIQUE#1 NOT NULL"
  "dbxref_id", "integer", "Foreign key dbxref. UNIQUE#1 NOT NULL"

For our bulk loader template, we will therefore need to insert values into the **organism**, **dbxref** and **organism_dbxref** tables, and look up a value in the **db** table. Our input file provides the genus, species and taxonomy ID, so we can import these with a bulk loader template. However, the file has no information for the **db** table (i.e. the entry for the NCBI Taxonomy database). This is not a problem because the bulk loader can use existing data to help with the import. We simply need to use the NCBI Taxonomy database entry that already exists in the Chado instance of Tripal v3 (the template below refers to it by the name ``NCBITaxon``).

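
The plan above can be previewed on the command line before any template exists. The sketch below is only an illustration of the intended column-to-field mapping and does not touch Chado; the helper name ``preview_mapping`` and the sample file (three rows copied from the table earlier in this section) are assumptions made for the example.

```shell
# Sample input: three rows copied from the Fragaria table shown earlier.
sample=$(mktemp)
printf '3747\tFragaria\tx ananassa\n57918\tFragaria\tvesca\n60188\tFragaria\tnubicola\n' > "$sample"

# Hypothetical helper: for each row, print the Chado field each column is destined for.
# Column 1 -> dbxref.accession, column 2 -> organism.genus, column 3 -> organism.species.
preview_mapping() {
  awk -F '\t' '{
    printf "organism.genus=%s | organism.species=%s | dbxref.accession=%s\n", $2, $3, $1
  }' "$1"
}

preview_mapping "$sample"
```

The **db_id** needed for **dbxref** and the keys needed for **organism_dbxref** do not appear in this preview because they come from the database rather than from the file.
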
Creating a New Bulk Loader Template
-----------------------------------

Now that we know where all of the data in the input file will go and we have the necessary dependencies in the database (i.e. the NCBI Taxonomy database), we can create a new bulk loader template. Navigate to ``Tripal → Data Loaders → Chado Bulk Loader``, click the **Templates** tab in the top right corner, and finally click the link **Add Template**. The following page appears:

.. image:: ./bulk_loader.1.png

We first need to provide a name for our template. Try to name templates in a way that is meaningful to others. Currently only site administrators can load files using the bulk loader, but future versions of Tripal will allow other privileged users to use bulk loader templates. Thus, it is important to name templates so that others can easily identify their purpose. For this example, enter the name **NCBI Taxonomy Importer (taxid, genus, species)**. The following page appears:

.. image:: ./bulk_loader.2.png

Notice that the page is divided into two sections: **Current Records** and **Current Fields**. Before we continue with the template we need a bit of explanation of the terminology used by the bulk loader. A **record** refers to a Chado table and an action on that table. For example, to insert the data from the input file we will need to select the NCBI Taxonomy database from the **db** table and insert entries into the **dbxref**, **organism** and **organism_dbxref** tables. Therefore, we will have four records:

* An insert into the organism table
* A select from the db table (to get the database ID (db_id) of the NCBI Taxonomy database needed for the insert into the dbxref table)
* An insert into the dbxref table
* An insert into the organism_dbxref table

Each record contains a set of fields on which the action is performed. Thus, when we insert an entry into the organism table we will insert into two fields: **genus** and **species**.

To create the first record for inserting an organism, click the button **New Record/Field**. The following page appears:

.. image:: ./bulk_loader.3.png

By default, when adding a new record, the bulk loader also provides the form elements for adding the first field of the record. We are adding a new record, so we can leave the **Record** drop-down as **New Record**. Next, give this record a unique record name. Because we are inserting into the organism table, enter the name **Organism** into the **Unique Record Name** box.

We also have the opportunity with this form to add our first field to the record. Because we are adding the organism record we will first add the field for the **genus**. In the **Field** section we specify the source of the field. Because the genus value comes from the input file, select the first radio button titled **Data**. Next we need a human-readable name for the field. This field is the **genus** field so we will enter **Genus** into the **Human-readable Title for Field** box. Next, we need to specify the **Chado table** for this record. In the **Chado table** drop-down box, choose the **organism** table, and in the **Chado Field/Column** drop-down box select **genus**.

In the next section, titled **Data File Column**, we need to indicate the column in the tab-delimited file where the genus is found. For the example file this is column 2 (columns are numbered beginning with 1). Therefore, enter the number **2** in the **Column** box. There are additional options to expose the field to the user, but for now we can ignore those options. Click the **Save Changes** button at the bottom. We now see that the organism record and the first field have been added to our bulk loader template.

.. image:: ./bulk_loader.4.png

We also see that the **Mode** (or action) for this record has been set to **insert** by default. Before continuing we should edit the settings for the record so that it is more fault tolerant. Click the **Edit** link to the left of the new organism record. On the resulting page we see the record details we already provided, but now there is a section titled **Action to take when Loading Record**. By default, the **INSERT** option is selected. This is correct because we want to perform an insert. However, in the **Additional Insert Options** section there is a checkbox labeled **SELECT if duplicate (no insert)**. Check this box. This is a good option to add because it prevents the bulk loader from failing if the record already exists in the table.

Click the **Save Record** button to save these settings. The **Mode** is now set to **insert or select if duplicate**. Previously it was just **insert**.

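
The "select if duplicate" behaviour matters whenever the same genus and species pair is encountered more than once, whether in the input file or already in Chado. As a rough sketch, the following lists pairs that repeat within a file. It cannot see what is already stored in the database; the helper name ``list_duplicate_organisms`` and the sample rows (which deliberately repeat one species) are made up for illustration.

```shell
# Hypothetical helper: print each genus/species pair (columns 2 and 3)
# that occurs more than once in a tab-delimited file.
list_duplicate_organisms() {
  awk -F '\t' '
    { seen[$2 "\t" $3]++ }
    END { for (key in seen) if (seen[key] > 1) print key }
  ' "$1" | sort
}

# Made-up sample in which Fragaria vesca appears twice.
sample=$(mktemp)
printf '57918\tFragaria\tvesca\n64942\tFragaria\tviridis\n57918\tFragaria\tvesca\n' > "$sample"

# Prints the repeated pair; without "select if duplicate", a plain insert
# of the second occurrence would violate the genus/species unique constraint.
list_duplicate_organisms "$sample"
```
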
Next, we need to add the **species** field to the record. Click the **Add Field** link to the left of the organism record name. Here we are presented with the same form we used when first adding the organism record. However, this time, the **Record** section is collapsed. If we open that section, the drop-down already has the **Organism** record selected because we are not creating a new record. To add the **Species** field, provide the following values and click the **Save Changes** button:

* Type of field: Data
* Human-readable Title for Field: Species
* Chado table: organism (should already be set)
* Chado Field/Column: species
* Column: 3

We now have two fields for our organism record:

.. image:: ./bulk_loader.5.png

At this point our organism record is complete. However, there are still a few fields in the organism table of Chado that are not present in our record: **organism_id**, **abbreviation**, **common_name** and **comment**. We do not have values in our input file for any of these fields. Fortunately, the **organism_id** field is a primary key and is generated automatically when a record is inserted, so we do not need to provide a value for it. The other fields are not part of the unique constraint of the table and, as the schema above shows, are not marked NOT NULL. Therefore, those fields are optional and we do not need to specify them. Ideally, if we did have values for those optional fields we would add them as well.

To this point, we have built the loader such that it can load two of the three columns in our input file. We have one remaining column: the NCBI taxonomy ID. In order to associate an organism with the taxonomy ID we must first insert the taxonomy ID into the **dbxref** table. Examining the dbxref table, we see that **db_id** is a required field and a foreign key to the **db** table. We must therefore retrieve the **db_id** from the **db** table of Chado before we can add the entry to the **dbxref** table, and we will create a second record that does just that. On the **Edit Template** page click the button **New Record/Field**. Here we see the same form we used for adding the first organism record. Provide the following values:

* For the record:

  * Record: New Record
  * Unique Record Name: NCBI Taxonomy DB
  * Record Type/Action: SELECT ONCE: Select the record only once for each constant set.

* For the field:

  * Type of field: Constant
  * Human-readable Title for Field: DB name
  * Chado table: db
  * Chado field/column: name
  * Within the Constant section:

    * Constant Value: NCBITaxon
    * Check "Ensure the value is in the table"

Here we use a field type of **Constant** rather than **Data**. This is because we are providing the value to be used in the record rather than using a value from the input file. The value we are providing is ``NCBITaxon``, the name of the NCBI Taxonomy database entry. The goal is to match that name with an existing entry in the **db** table. Click the **Save Changes** button.

We now see a second record on the **Edit Template** page. We do not want to insert this value into the table; we want to select it, because we need the corresponding **db_id** for the **dbxref** record. If the mode for this record is shown as **insert**, click the **Edit** link to the left of the **NCBI Taxonomy DB** record, select only the option **SELECT ONCE**, and click **Save Record**. The **NCBI Taxonomy DB** record should now have a mode of **select once**.

We choose **SELECT ONCE** because the field is a constant, so the database entry returned by the record applies to the entire input file. The bulk loader therefore performs the select one time, and the record remains available for use throughout the import process. Otherwise, the select statement would execute for every row in the input file, causing excess queries.

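
The idea behind **SELECT ONCE** can be pictured as a lookup performed before the row loop and then reused by every row. The sketch below only imitates that control flow in the shell and is not how Tripal implements it: ``select_db_id`` is a hypothetical stand-in for the query against the **db** table, and the ``db_id`` value of 1 is made up.

```shell
# Sample input: three rows copied from the Fragaria table.
sample=$(mktemp)
printf '3747\tFragaria\tx ananassa\n57918\tFragaria\tvesca\n60188\tFragaria\tnubicola\n' > "$sample"

lookups=0
# Hypothetical stand-in for selecting the db record; the returned id is made up.
select_db_id() {
  lookups=$((lookups + 1))
  db_id=1
}

select_db_id   # SELECT ONCE: the lookup happens before any row is read

rows=0
tab=$(printf '\t')
while IFS="$tab" read -r taxid genus species; do
  rows=$((rows + 1))
  # Every row reuses the cached db_id for its dbxref entry.
  printf 'dbxref: db_id=%s accession=%s\n' "$db_id" "$taxid"
done < "$sample"

# Three rows were processed but the lookup ran only once.
printf 'rows=%s lookups=%s\n' "$rows" "$lookups"
```

Had ``select_db_id`` been called inside the loop, the lookup count would grow with the number of rows, which is the excess querying that **SELECT ONCE** avoids.
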
Now that we have a record that selects the **db_id** we can create the **dbxref** record. The **dbxref** table has a unique constraint on the **accession**, **db_id** and **version** fields. The **version** field has a default value, so we only need to create two fields for this new record: the **db_id** and the **accession**. We will use the **db_id** from the **NCBI Taxonomy DB** record, and the accession is the first column of the input file. First, we will add the **db_id** field. Click the **New Record/Field** button and set the following:

* For the record:

  * Record: New Record
  * Unique Record Name: Taxonomy ID
  * Record Type/Action: INSERT: insert the record

* For the field:

  * Type of field: Record referral
  * Human-readable Title for Field: NCBI Taxonomy DB ID
  * Chado table: dbxref
  * Chado Field/Column: db_id
  * In the Record Referral section:

    * Record to refer to: NCBI Taxonomy DB
    * Field to refer to: db_id

Click the **Save Changes** button. The **Edit Template** page appears.

.. image:: ./bulk_loader.6.png

Again, we need to edit the record to make the loader more fault tolerant. Click the **Edit** link to the left of the **Taxonomy ID** record. Select the following:

* Insert
* Select if duplicate

To complete this record, we need to add the accession field. Click the **Add Field** link to the left of the **Taxonomy ID** record name. Provide the following values:

* For the field:

  * Type of Field: Data
  * Human-readable Title for Field: Accession
  * Chado table: dbxref
  * Chado field/column: accession
  * In the Data File Column section:

    * Column: 1

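
The **accession** column in Chado is a character field, so the loader will accept any text found in column 1. Since NCBI taxonomy IDs are integers, a non-numeric value there usually points to a header line or shifted columns. A small sketch of such a check follows; the helper name ``count_non_numeric_ids`` and the sample rows (one of which is deliberately wrong) are made up for illustration.

```shell
# Hypothetical helper: count rows whose first column is not a plain integer.
count_non_numeric_ids() {
  awk -F '\t' '$1 !~ /^[0-9]+$/ { bad++ } END { print bad + 0 }' "$1"
}

# Made-up sample: the second row has "NA" where a taxonomy ID is expected.
sample=$(mktemp)
printf '3747\tFragaria\tx ananassa\nNA\tFragaria\tvesca\n' > "$sample"

# Reports one suspicious row for this sample.
count_non_numeric_ids "$sample"
```
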
At this stage, we should have three records: Organism, NCBI Taxonomy DB, and Taxonomy ID. We can now add the final record, which will insert a row into the **organism_dbxref** table. Create this new record with the following details:

* For the record:

  * Record: New Record
  * Unique Record Name: Taxonomy/Organism Linker
  * Check: Insert: insert the record

* For the field:

  * Type of Field: Record Referral
  * Human-readable Title for Field: Accession Ref
  * Chado table: organism_dbxref
  * Chado field/column: dbxref_id
  * In the Record Referral section:

    * Record to refer to: Taxonomy ID
    * Field to refer to: dbxref_id

Create the second field:

* For the field:

  * Type of Field: Record Referral
  * Human-readable Title for Field: Organism ID
  * Chado table: organism_dbxref
  * Chado field/column: organism_id
  * In the Record Referral section:

    * Record to refer to: Organism
    * Field to refer to: organism_id

After saving the field, edit the record and set the following:

* Change the record mode to: insert or select if duplicate

We are now done! We have created a bulk loader template that reads in a file with three columns containing an NCBI taxonomy ID, a genus and a species. The loader places the genus and species in the **organism** table, adds the NCBI taxonomy ID to the **dbxref** table, links it to the NCBI Taxonomy entry in the **db** table, and then adds an entry to the **organism_dbxref** table that links the organism to the NCBI taxonomy ID. The following screenshot shows how the template should appear:

.. image:: ./bulk_loader.7.png

To save the template, click the **Save Template** link at the bottom of the page.

Creating a Bulk Loader Job (importing a file)
---------------------------------------------

Now that we have created a bulk loader template we can use it to import a file. We will import the ``Fragaria_0.txt`` file downloaded previously. To import a file using a bulk loader template, click the **Add Content** link in the administrative menu and click **Bulk Loading Job**. A bulk loading job is required each time we want to load a file. Below is a screenshot of the page used for creating a bulk loading job.

.. image:: ./bulk_loader.8.png

Provide the following values:

* Job Name: Import of Fragaria species
* Template: NCBI Taxonomy Importer (taxid, genus, species)
* Data File: [DRUPAL_HOME]/sites/default/files/Fragaria_0.txt
* Keep track of inserted IDs: No
* File has a header: No

.. note::

  Be sure to change the [DRUPAL_HOME] token to where Drupal is installed.

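
If ``$DRUPAL_HOME`` is set in your shell, the full path to type into the **Data File** box can be printed rather than assembled by hand. The sketch below is one way to do that; the helper name ``resolve_data_file`` is hypothetical, and the fallback of ``/var/www/html`` is an assumption based on the paths used elsewhere in this tutorial.

```shell
# Hypothetical helper: replace the [DRUPAL_HOME] token with the value of $DRUPAL_HOME.
resolve_data_file() {
  printf '%s\n' "$1" | sed "s|\[DRUPAL_HOME\]|$DRUPAL_HOME|"
}

# Assumed fallback when the variable is not set, matching the paths in this tutorial.
DRUPAL_HOME=${DRUPAL_HOME:-/var/www/html}

# Print the path to paste into the Data File box.
resolve_data_file '[DRUPAL_HOME]/sites/default/files/Fragaria_0.txt'
```
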
Click **Save**. The page then appears as follows:

.. image:: ./bulk_loader.9.png

You can see details about the constants used by the template and where the fields from the input file will be stored by clicking the **Data Fields** tab in the table of contents on the left sidebar.

.. image:: ./bulk_loader.10.png

Now that we have created a job, we can submit it for execution by clicking the **Submit Job** button. This adds a job to the Tripal Jobs system, and we can launch the job as we have previously in this tutorial:

.. code-block:: shell

  cd /var/www
  drush trp-run-jobs --username=admin --root=$DRUPAL_HOME

After the job executes, you should see output similar to the following in the terminal window:

.. code-block:: shell

  Tripal Job Launcher
  Running as user 'admin'
  -------------------
  There are 1 jobs queued.
  Calling: tripal_bulk_loader_load_data(2, 7)
  Template: NCBI Taxonomy Importer (taxid, genus, species) (1)
  File: /var/www/html/sites/default/files/Fragaria_0.txt (46 lines)
  Preparing to load...
  Loading...
  Preparing to load the current constant set...
  Open File...
  Start Transaction...
  Defer Constraints...
  Acquiring Table Locks...
  ROW EXCLUSIVE for organism
  ROW EXCLUSIVE for dbxref
  ROW EXCLUSIVE for organism_dbxref
  Loading the current constant set...
  Progress:
  [|||||||||||||||||||||||||||||||||||||||||||||||||||] 100.00%. (46 of 46) Memory: 33962080

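
The loader reports the number of lines it found in the file ("46 lines") and how many it processed ("46 of 46"). Comparing those figures with a line count of the input file is a quick sanity check. The sketch below assumes the file ends with a newline; the helper name ``count_rows`` and the two-row sample are made up for illustration.

```shell
# Hypothetical helper: count the lines in an input file.
# tr strips the padding that some wc implementations add.
count_rows() {
  wc -l < "$1" | tr -d ' '
}

# Two-row sample built from the Fragaria table.
sample=$(mktemp)
printf '3747\tFragaria\tx ananassa\n57918\tFragaria\tvesca\n' > "$sample"

# Prints the number of lines in the sample.
count_rows "$sample"
```

For the tutorial file, ``count_rows "$DRUPAL_HOME/sites/default/files/Fragaria_0.txt"`` should agree with the line count shown in the loader output.
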
Our *Fragaria* species should now be loaded, and we can return to the Tripal site to see them. Click on the **Organisms** link in the **Search Data** menu. In the search form that appears, type "Fragaria" in the **Genus** text box and click the **Filter** button. We should see the list of newly added *Fragaria* species.

.. image:: ./bulk_loader.11.png

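
The load can also be checked directly in the database by joining the four tables used by the template. The sketch below only builds and prints such a query. It assumes that Chado is installed in a PostgreSQL schema named ``chado`` and that the database entry is named ``NCBITaxon`` as in the template; the variable name ``verify_sql`` is made up.

```shell
# Assumption: the Chado tables live in a schema named "chado".
verify_sql="SELECT o.genus, o.species, x.accession
FROM chado.organism o
JOIN chado.organism_dbxref od ON od.organism_id = o.organism_id
JOIN chado.dbxref x ON x.dbxref_id = od.dbxref_id
JOIN chado.db d ON d.db_id = x.db_id
WHERE o.genus = 'Fragaria' AND d.name = 'NCBITaxon'
ORDER BY o.species;"

# Show the query that would be run.
printf '%s\n' "$verify_sql"

# One way to run it, assuming Drush's sql-query command is available on the site:
#   drush sql-query "$verify_sql" --root="$DRUPAL_HOME"
```

Each returned row should pair a *Fragaria* species with the taxonomy ID from the first column of the input file.
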
Before the organisms will have Tripal pages, the Chado records need to be **Published**. You can publish them by navigating to **Tripal Content → Publish Tripal Content**. Select the **organism** table from the dropdown and run the job.

.. note::

  In Tripal 2, records were synced by navigating to **Tripal → Chado Modules → Organisms**.

Once complete, return to the search form, find a *Fragaria* species that has been published and view its page. You should see a **Cross References** link in the left table of contents. If you click that link you should see the NCBI Taxonomy ID with a link to the page:

.. image:: ./bulk_loader.12.png

Sharing Your Templates with Others
----------------------------------

Now that our template for loading organisms with NCBI Taxonomy IDs is complete, we can share it with anyone else who has a Tripal-based site. To do this we simply export the template in text format, place it in a text file or directly in an email, and send it to a collaborator for import into their site. Navigate to **Tripal → Chado Data Loaders → Bulk Loader** and click the **Templates** tab at the top. Here we find a table of all the templates we have created. We should see our template named **NCBI Taxonomy Importer (taxid, genus, species)**. In the far right column is a link to export that template. Clicking that link will redirect you to a page where the template is provided as a serialized PHP array.

.. image:: ./bulk_loader.13.png

Copy all of the text in the **Export** field and send it to a collaborator.

To import a template that may have been created by someone else, navigate to **Tripal → Chado Data Loaders → Bulk Loader** and click the **Templates** tab. A link titled **Import Template** appears above the table of existing templates. The page that appears when that link is clicked will allow you to import any template shared with you.