4 years ago · 01b870e2ef
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
 
-# CALDER user manuel
+# CALDER user manual
 
 CALDER is a Hi-C analysis tool that allows: (1) compute chromatin domains from whole chromosome contacts; (2) derive their non-linear hierarchical organization and obtain sub-compartments; (3) compute nested sub-domains within each chromatin domain from short-range contacts. CALDER is currently implemented in R.
 
@@ -13,18 +13,18 @@ CALDER is a Hi-C analysis tool that allows: (1) compute chromatin domains from w
 
				 
			
 
 * Support for hg19, hg38, mm9, mm10 and other genomes
 * Support input in .hic format generated by Juicer tools (https://github.com/aidenlab/juicer)
-* Opitimized bin_size selection for more reliable compartment identification
+* Optimized bin_size selection for more reliable compartment identification
 * Aggregated all chromosome output into a single file for easier visualization in IGV
 * Added output in tabular .txt format at bin level for easier downstream analysis
 
 Below we introduce two main updates:
 
-### (1) Opitimized `bin_size` selection
+### (1) Optimized `bin_size` selection
 
-Due to reasons such as low data quality or large scale structrual variation, compartments can be unreliablly called at one `bin_size` (equivalent to `resoltution` in the literature) but properly called at another `bin_size`. We added an opitimized `bin_size` selection strategy to call reliable compartments. This strategey is based on the observation from our large scale compartment analysis (https://www.nature.com/articles/s41467-021-22666-3), that although compartments can change between different conditions, their overall correlation `cor(compartment_rank_1, compartment_rank_2)` is high (> 0.4).
+Due to reasons such as low data quality or large scale structural variation, compartments can be unreliably called at one `bin_size` (equivalent to `resolution` in the literature) but properly called at another `bin_size`. We added an optimized `bin_size` selection strategy to call reliable compartments. This strategy is based on the observation from our large scale compartment analysis (https://www.nature.com/articles/s41467-021-22666-3), that although compartments can change between different conditions, their overall correlation `cor(compartment_rank_1, compartment_rank_2)` is high (> 0.4).
 <br>
 <br>
-**The strategy**: given a `bin_size` specified by user, we call compartments with extended `bin_sizes` and choose the smallest `bin_size` such that no bigger `bin_size` can increase the compartment correclation with a reference compartment more than 0.05. For example, if correclation for `bin_size=10000` is 0.2 while for `bin_size=50000` is 0.6, we are more confident that the latter is more reliable; if correclation for `bin_size=10000` is 0.5 while for `bin_size=50000` is 0.52, we would choose the former as it has higher resolution.
+**The strategy**: given a `bin_size` specified by user, we call compartments with extended `bin_sizes` and choose the smallest `bin_size` such that no bigger `bin_size` can increase the compartment correlation with a reference compartment more than 0.05. For example, if correlation for `bin_size=10000` is 0.2 while for `bin_size=50000` is 0.6, we are more confident that the latter is more reliable; if correlation for `bin_size=10000` is 0.5 while for `bin_size=50000` is 0.52, we would choose the former as it has higher resolution.
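The selection rule stated in the strategy paragraph can be illustrated with a minimal sketch. This is not CALDER's R implementation: the function name `select_bin_size`, the `min_gain` parameter, and the dictionary input are hypothetical, and the sketch assumes the correlation with the reference compartment has already been computed for each candidate `bin_size`.

```python
def select_bin_size(cor_by_bin_size, min_gain=0.05):
    """Pick the smallest bin_size that no larger bin_size improves on by more than min_gain.

    cor_by_bin_size: dict mapping bin_size (int) to its correlation with the
    reference compartment (float). Hypothetical input shape, for illustration only.
    """
    if not cor_by_bin_size:
        raise ValueError("cor_by_bin_size must not be empty")
    sizes = sorted(cor_by_bin_size)
    for i, size in enumerate(sizes):
        # Correlations achieved by every strictly larger bin_size.
        larger = [cor_by_bin_size[s] for s in sizes[i + 1:]]
        # Accept this size if none of the larger ones gains more than min_gain.
        if all(c - cor_by_bin_size[size] <= min_gain for c in larger):
            return size
    # Not reached: the largest bin_size always satisfies the condition.
    return sizes[-1]


# The two cases from the strategy paragraph.
print(select_bin_size({10000: 0.2, 50000: 0.6}))   # larger bin gains 0.4, so 50000 is chosen
print(select_bin_size({10000: 0.5, 50000: 0.52}))  # larger bin gains only 0.02, so 10000 is kept
```

Scanning from the smallest `bin_size` upward returns the highest resolution that satisfies the rule, which matches the preference for resolution expressed in the second example.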
			
 
 <br>
 <br>
 `bin_size` is extended in the following way thus contact matrices at any larger `bin_sizes` can be aggregated from the input contact matrices directly:
@@ -40,7 +40,7 @@ Note that this strategy is currently only available for `hg19`, `hg38`, `mm9` an
 
				 
			
 
 ### (2) Support for other genomes
 
-Although CALDER was mainly tested on human and mouse dataset, it can be applied to dataset from other genomes. One additional information is required in such case: a `feature_track` presumably positively correlated with compartment score (thus higher values in A than in B compartment). This information will be used for correctly determing the `A/B` direction. Some suggested tracks are gene density, H3K27ac, H3K4me1, H3K4me2, H3K4me3, H3K36me3 (or negative transform of H3K9me3) signals. Note that this information will not alter the hierarchical compartment/TAD structure, and can come from any external study with matched genome. An example of `feature_track` is given in the **Usage** section.
+Although CALDER was mainly tested on human and mouse dataset, it can be applied to dataset from other genomes. One additional information is required in such case: a `feature_track` presumably positively correlated with compartment score (thus higher values in A than in B compartment). This information will be used for correctly determining the `A/B` direction. Some suggested tracks are gene density, H3K27ac, H3K4me1, H3K4me2, H3K4me3, H3K36me3 (or negative transform of H3K9me3) signals. Note that this information will not alter the hierarchical compartment/TAD structure, and can come from any external study with matched genome. An example of `feature_track` is given in the **Usage** section.
 
 # Installation
 
@@ -88,7 +88,7 @@ CALDER contains three modules: (1) compute chromatin domains; (2) derive their h
 
				 
			
 
 ### Input data format
 
-CALDER works on contact matrices compatable with that generated by Juicer tools (https://github.com/aidenlab/juicer), either a .hic file, or three-column `dump` table retrieved by the juicer dump (or straw) command (https://github.com/aidenlab/juicer/wiki/Data-Extraction):
+CALDER works on contact matrices compatible with that generated by Juicer tools (https://github.com/aidenlab/juicer), either a .hic file, or three-column `dump` table retrieved by the juicer dump (or straw) command (https://github.com/aidenlab/juicer/wiki/Data-Extraction):
 
 	16050000	16050000	10106.306
 	16050000	16060000	2259.247
@@ -211,11 +211,11 @@ CALDER(contact_file_dump=contact_file_dump,
 
				 
			
 
				 
			
 
				 
			
 
-### Paramters:
+### Parameters:
 
 | Name              | Description |
 | --------------------- | ----------------------- |
-| **chrs**                | A vector of chromosome names to be analyzed, with or without 'chr'. Chromosome names should be consistent with those in `contact_file_hic` and `feature_track` if tsuch files are provided
+| **chrs**                | A vector of chromosome names to be analyzed, with or without 'chr'. Chromosome names should be consistent with those in `contact_file_hic` and `feature_track` if such files are provided
 | **contact_file_dump**                |A list of contact files in dump format, named by `chrs`. Each contact file stores the contact information of the corresponding `chr`. Only one of `contact_file_dump`, `contact_tab_dump`, `contact_file_hic` should be provided
 | **contact_tab_dump**                | A list of contact table in dump format, named by `chrs`, stored as an R object. Only one of `contact_file_dump`, `contact_tab_dump`, `contact_file_hic` should be provided
 | **contact_file_hic**                | A hic file generated by Juicer tools. It should contain all chromosomes in `chrs`. Only one of `contact_file_dump`, `contact_tab_dump`, `contact_file_hic` should be provided
@@ -223,7 +223,7 @@ CALDER(contact_file_dump=contact_file_dump,
 
 | **save_dir**             | the directory to be created for saving outputs
 | **bin_size**         | The bin_size (resolution) to run CALDER. `bin_size` should be consistent with the data resolution in `contact_file_dump` or `contact_tab_dump` if these files are provided as input, otherwise `bin_size` should exist in `contact_file_hic`. Recommended `bin_size` is between **10000 to 100000**
 | **single_binsize_only**     |  logical. If TRUE, CALDER will compute compartments only using the bin_size specified by the user and not do bin size optimization
-| **feature_track**                | A genomic feature track in `data.frame` or `data.table` format. This track will be used for determing the A/B compartment direction when `genome=others` and should presumably have higher values in A than in B compartment. Some suggested tracks can be gene density, H3K27ac, H3K4me1, H3K4me2, H3K4me3, H3K36me3 (or negative transform of H3K9me3 signals)
+| **feature_track**                | A genomic feature track in `data.frame` or `data.table` format. This track will be used for determining the A/B compartment direction when `genome=others` and should presumably have higher values in A than in B compartment. Some suggested tracks can be gene density, H3K27ac, H3K4me1, H3K4me2, H3K4me3, H3K36me3 (or negative transform of H3K9me3 signals)
 | **save_intermediate_data**  | logical. If TRUE, an intermediate_data will be saved. This file can be used for computing nested sub-domains later on
 | **n_cores**     |  integer. Number of cores to be registered for running CALDER in parallel
 | **sub_domains**     |  logical, whether to compute nested sub-domains
@@ -289,8 +289,8 @@ save_dir/
 
 | --------------------- | ----------------------- |
 | **all_sub_compartments.bed**                | a .bed file containing the optimal compartments for all `chrs`, that can be visualized in IGV. Different colors were used to distinguish compartments (at the resolution of 8 sub-compartments)
 | **all_sub_compartments.tsv**                | optimal compartments stored in tabular text format. Each row represents one 10kb region
-| **cor_with_ref.ALL.txt**                | a plot of correlation between compartment rank and the reference compartment rank for each of extended `bin_sizes`, and the optimimal `bin_size` that is finally selected
-| **cor_with_ref.pdf**                | correlation of compartment rank with the reference compartment rank using the optimimal `bin_size`
+| **cor_with_ref.ALL.txt**                | a plot of correlation between compartment rank and the reference compartment rank for each of extended `bin_sizes`, and the optimal `bin_size` that is finally selected
+| **cor_with_ref.pdf**                | correlation of compartment rank with the reference compartment rank using the optimal `bin_size`
 
 
 
@@ -299,7 +299,7 @@ save_dir/
 
				 
			
 
 | Name              | Description |
 | --------------------- | ----------------------- |
-| **chrxx_domain_hierachy.tsv**                | information of compartment domain and their hierarchical organization. The hierarchical structure is fully represented by `compartment_label`, for example, `B.2.2.2` and `B.2.2.1` are two sub-branches of `B.2.2`. The `pos_end` column specifies all compartment domain borders, except when it is marked as `gap`, which indicates it is the border of a gap chromsome region that has too few contacts and was excluded from the analysis (e.g., due to low mappability, deletion, technique flaw)
+| **chrxx_domain_hierachy.tsv**                | information of compartment domain and their hierarchical organization. The hierarchical structure is fully represented by `compartment_label`, for example, `B.2.2.2` and `B.2.2.1` are two sub-branches of `B.2.2`. The `pos_end` column specifies all compartment domain borders, except when it is marked as `gap`, which indicates it is the border of a gap chromosome region that has too few contacts and was excluded from the analysis (e.g., due to low mappability, deletion, technique flaw)
 | **chrxx_sub_compartments.bed**                | a .bed file containing the compartment information, that can be visualized in IGV. Different colors were used to distinguish compartments (at the resolution of 8 sub-compartments)
 | **chrxx_domain_boundaries.bed**                | a .bed file containing the chromatin domains boundaries, that can be visualized in IGV
 | **chrxx_nested_boundaries.bed**                | a .bed file containing the nested sub-domain boundaries, that can be visualized in IGV
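
The `compartment_label` convention described for `chrxx_domain_hierachy.tsv` (for example, `B.2.2.2` and `B.2.2.1` as sub-branches of `B.2.2`) can be read as a dot-separated path in a tree. The sketch below assumes only that convention; the helper names `parent_label` and `are_sibling_branches` are hypothetical and are not part of CALDER.

```python
def parent_label(label):
    """Return the parent branch of a dot-separated compartment label, or None at the root."""
    parts = label.split(".")
    if len(parts) == 1:
        return None  # e.g. "A" or "B" has no parent branch
    return ".".join(parts[:-1])


def are_sibling_branches(label_a, label_b):
    """True when two distinct labels share the same immediate parent branch."""
    parent_a = parent_label(label_a)
    return label_a != label_b and parent_a is not None and parent_a == parent_label(label_b)


print(parent_label("B.2.2.2"))                       # B.2.2
print(are_sibling_branches("B.2.2.2", "B.2.2.1"))    # True
print(are_sibling_branches("B.2.2.2", "B.2.1.1"))    # False
```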