@@ -1,4 +1,4 @@
-# CALDER user manuel
+# CALDER user manual

CALDER is a Hi-C analysis tool that allows: (1) compute chromatin domains from whole chromosome contacts; (2) derive their non-linear hierarchical organization and obtain sub-compartments; (3) compute nested sub-domains within each chromatin domain from short-range contacts. CALDER is currently implemented in R.

@@ -13,18 +13,18 @@ CALDER is a Hi-C analysis tool that allows: (1) compute chromatin domains from w

* Support for hg19, hg38, mm9, mm10 and other genomes
* Support input in .hic format generated by Juicer tools (https://github.com/aidenlab/juicer)
-* Opitimized bin_size selection for more reliable compartment identification
+* Optimized bin_size selection for more reliable compartment identification
* Aggregated all chromosome output into a single file for easier visualization in IGV
* Added output in tabular .txt format at bin level for easier downstream analysis

Below we introduce two main updates:

-### (1) Opitimized `bin_size` selection
+### (1) Optimized `bin_size` selection

-Due to reasons such as low data quality or large scale structrual variation, compartments can be unreliablly called at one `bin_size` (equivalent to `resoltution` in the literature) but properly called at another `bin_size`. We added an opitimized `bin_size` selection strategy to call reliable compartments. This strategey is based on the observation from our large scale compartment analysis (https://www.nature.com/articles/s41467-021-22666-3), that although compartments can change between different conditions, their overall correlation `cor(compartment_rank_1, compartment_rank_2)` is high (> 0.4).
+Due to reasons such as low data quality or large scale structural variation, compartments can be unreliably called at one `bin_size` (equivalent to `resolution` in the literature) but properly called at another `bin_size`. We added an optimized `bin_size` selection strategy to call reliable compartments. This strategy is based on the observation from our large scale compartment analysis (https://www.nature.com/articles/s41467-021-22666-3), that although compartments can change between different conditions, their overall correlation `cor(compartment_rank_1, compartment_rank_2)` is high (> 0.4).
<br>
<br>
-**The strategy**: given a `bin_size` specified by user, we call compartments with extended `bin_sizes` and choose the smallest `bin_size` such that no bigger `bin_size` can increase the compartment correclation with a reference compartment more than 0.05. For example, if correclation for `bin_size=10000` is 0.2 while for `bin_size=50000` is 0.6, we are more confident that the latter is more reliable; if correclation for `bin_size=10000` is 0.5 while for `bin_size=50000` is 0.52, we would choose the former as it has higher resolution.
+**The strategy**: given a `bin_size` specified by user, we call compartments with extended `bin_sizes` and choose the smallest `bin_size` such that no bigger `bin_size` can increase the compartment correlation with a reference compartment more than 0.05. For example, if correlation for `bin_size=10000` is 0.2 while for `bin_size=50000` is 0.6, we are more confident that the latter is more reliable; if correlation for `bin_size=10000` is 0.5 while for `bin_size=50000` is 0.52, we would choose the former as it has higher resolution.
<br>
<br>
`bin_size` is extended in the following way thus contact matrices at any larger `bin_sizes` can be aggregated from the input contact matrices directly:
@@ -40,7 +40,7 @@ Note that this strategy is currently only available for `hg19`, `hg38`, `mm9` an

### (2) Support for other genomes

-Although CALDER was mainly tested on human and mouse dataset, it can be applied to dataset from other genomes. One additional information is required in such case: a `feature_track` presumably positively correlated with compartment score (thus higher values in A than in B compartment). This information will be used for correctly determing the `A/B` direction. Some suggested tracks are gene density, H3K27ac, H3K4me1, H3K4me2, H3K4me3, H3K36me3 (or negative transform of H3K9me3) signals. Note that this information will not alter the hierarchical compartment/TAD structure, and can come from any external study with matched genome. An example of `feature_track` is given in the **Usage** section.
+Although CALDER was mainly tested on human and mouse dataset, it can be applied to dataset from other genomes. One additional information is required in such case: a `feature_track` presumably positively correlated with compartment score (thus higher values in A than in B compartment). This information will be used for correctly determining the `A/B` direction. Some suggested tracks are gene density, H3K27ac, H3K4me1, H3K4me2, H3K4me3, H3K36me3 (or negative transform of H3K9me3) signals. Note that this information will not alter the hierarchical compartment/TAD structure, and can come from any external study with matched genome. An example of `feature_track` is given in the **Usage** section.

# Installation

@@ -88,7 +88,7 @@ CALDER contains three modules: (1) compute chromatin domains; (2) derive their h

### Input data format

-CALDER works on contact matrices compatable with that generated by Juicer tools (https://github.com/aidenlab/juicer), either a .hic file, or three-column `dump` table retrieved by the juicer dump (or straw) command (https://github.com/aidenlab/juicer/wiki/Data-Extraction):
+CALDER works on contact matrices compatible with that generated by Juicer tools (https://github.com/aidenlab/juicer), either a .hic file, or three-column `dump` table retrieved by the juicer dump (or straw) command (https://github.com/aidenlab/juicer/wiki/Data-Extraction):

16050000 16050000 10106.306
16050000 16060000 2259.247
@@ -211,11 +211,11 @@ CALDER(contact_file_dump=contact_file_dump,



-### Paramters:
+### Parameters:

| Name | Description |
| --------------------- | ----------------------- |
-| **chrs** | A vector of chromosome names to be analyzed, with or without 'chr'. Chromosome names should be consistent with those in `contact_file_hic` and `feature_track` if tsuch files are provided
+| **chrs** | A vector of chromosome names to be analyzed, with or without 'chr'. Chromosome names should be consistent with those in `contact_file_hic` and `feature_track` if such files are provided
| **contact_file_dump** |A list of contact files in dump format, named by `chrs`. Each contact file stores the contact information of the corresponding `chr`. Only one of `contact_file_dump`, `contact_tab_dump`, `contact_file_hic` should be provided
| **contact_tab_dump** | A list of contact table in dump format, named by `chrs`, stored as an R object. Only one of `contact_file_dump`, `contact_tab_dump`, `contact_file_hic` should be provided
| **contact_file_hic** | A hic file generated by Juicer tools. It should contain all chromosomes in `chrs`. Only one of `contact_file_dump`, `contact_tab_dump`, `contact_file_hic` should be provided
@@ -223,7 +223,7 @@ CALDER(contact_file_dump=contact_file_dump,
| **save_dir** | the directory to be created for saving outputs
| **bin_size** | The bin_size (resolution) to run CALDER. `bin_size` should be consistent with the data resolution in `contact_file_dump` or `contact_tab_dump` if these files are provided as input, otherwise `bin_size` should exist in `contact_file_hic`. Recommended `bin_size` is between **10000 to 100000**
| **single_binsize_only** | logical. If TRUE, CALDER will compute compartments only using the bin_size specified by the user and not do bin size optimization
-| **feature_track** | A genomic feature track in `data.frame` or `data.table` format. This track will be used for determing the A/B compartment direction when `genome=others` and should presumably have higher values in A than in B compartment. Some suggested tracks can be gene density, H3K27ac, H3K4me1, H3K4me2, H3K4me3, H3K36me3 (or negative transform of H3K9me3 signals)
+| **feature_track** | A genomic feature track in `data.frame` or `data.table` format. This track will be used for determining the A/B compartment direction when `genome=others` and should presumably have higher values in A than in B compartment. Some suggested tracks can be gene density, H3K27ac, H3K4me1, H3K4me2, H3K4me3, H3K36me3 (or negative transform of H3K9me3 signals)
| **save_intermediate_data** | logical. If TRUE, an intermediate_data will be saved. This file can be used for computing nested sub-domains later on
| **n_cores** | integer. Number of cores to be registered for running CALDER in parallel
| **sub_domains** | logical, whether to compute nested sub-domains
@@ -289,8 +289,8 @@ save_dir/
| --------------------- | ----------------------- |
| **all_sub_compartments.bed** | a .bed file containing the optimal compartments for all `chrs`, that can be visualized in IGV. Different colors were used to distinguish compartments (at the resolution of 8 sub-compartments)
| **all_sub_compartments.tsv** | optimal compartments stored in tabular text format. Each row represents one 10kb region
-| **cor_with_ref.ALL.txt** | a plot of correlation between compartment rank and the reference compartment rank for each of extended `bin_sizes`, and the optimimal `bin_size` that is finally selected
-| **cor_with_ref.pdf** | correlation of compartment rank with the reference compartment rank using the optimimal `bin_size`
+| **cor_with_ref.ALL.txt** | a plot of correlation between compartment rank and the reference compartment rank for each of extended `bin_sizes`, and the optimal `bin_size` that is finally selected
+| **cor_with_ref.pdf** | correlation of compartment rank with the reference compartment rank using the optimal `bin_size`



@@ -299,7 +299,7 @@ save_dir/

| Name | Description |
| --------------------- | ----------------------- |
-| **chrxx_domain_hierachy.tsv** | information of compartment domain and their hierarchical organization. The hierarchical structure is fully represented by `compartment_label`, for example, `B.2.2.2` and `B.2.2.1` are two sub-branches of `B.2.2`. The `pos_end` column specifies all compartment domain borders, except when it is marked as `gap`, which indicates it is the border of a gap chromsome region that has too few contacts and was excluded from the analysis (e.g., due to low mappability, deletion, technique flaw)
+| **chrxx_domain_hierachy.tsv** | information of compartment domain and their hierarchical organization. The hierarchical structure is fully represented by `compartment_label`, for example, `B.2.2.2` and `B.2.2.1` are two sub-branches of `B.2.2`. The `pos_end` column specifies all compartment domain borders, except when it is marked as `gap`, which indicates it is the border of a gap chromosome region that has too few contacts and was excluded from the analysis (e.g., due to low mappability, deletion, technique flaw)
| **chrxx_sub_compartments.bed** | a .bed file containing the compartment information, that can be visualized in IGV. Different colors were used to distinguish compartments (at the resolution of 8 sub-compartments)
| **chrxx_domain_boundaries.bed** | a .bed file containing the chromatin domains boundaries, that can be visualized in IGV
| **chrxx_nested_boundaries.bed** | a .bed file containing the nested sub-domain boundaries, that can be visualized in IGV
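
The `bin_size` selection rule described in the **The strategy** paragraph of this diff can be sketched in R as follows. This is only an illustration of the stated rule (pick the smallest `bin_size` such that no larger `bin_size` raises the correlation with the reference by more than 0.05). The function name `select_bin_size` and the argument `min_gain` are hypothetical and are not part of the CALDER API; the actual implementation may differ.

```r
# Hypothetical helper illustrating the documented rule; not CALDER's own code.
# bin_sizes: numeric vector of candidate bin sizes
# cors:      correlation with the reference compartment rank, one per bin size
# min_gain:  improvement a larger bin size must exceed to be preferred (assumed 0.05)
select_bin_size <- function(bin_sizes, cors, min_gain = 0.05) {
  stopifnot(length(bin_sizes) == length(cors), length(bin_sizes) > 0)

  # Examine candidates from the smallest (highest resolution) to the largest
  ord <- order(bin_sizes)
  bin_sizes <- bin_sizes[ord]
  cors <- cors[ord]

  for (i in seq_along(bin_sizes)) {
    larger <- cors[seq_along(cors) > i]
    # Keep this bin size if no larger one improves the correlation by more than min_gain
    if (length(larger) == 0 || max(larger) - cors[i] <= min_gain) {
      return(bin_sizes[i])
    }
  }
}

# The two cases from the README text
select_bin_size(c(10000, 50000), c(0.2, 0.6))   # larger bin size gains 0.4
select_bin_size(c(10000, 50000), c(0.5, 0.52))  # larger bin size gains only 0.02
```

With the README's two examples, the first call selects 50000 and the second selects 10000, matching the choices described in the text.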