CALDER is a Hi-C analysis tool that lets you: (1) compute chromatin domains from whole-chromosome contacts; (2) derive their non-linear hierarchical organization and obtain sub-compartments; (3) compute nested sub-domains within each chromatin domain from short-range contacts. CALDER is currently implemented in R.
bin_size selection

Due to low data quality or large-scale structural variation, compartments can be called unreliably at one bin_size (equivalent to resolution in the literature) but correctly at another. We added an optimized bin_size selection strategy to call reliable compartments. This strategy is based on the observation from our large-scale compartment analysis (https://www.nature.com/articles/s41467-021-22666-3) that, although compartments can change between conditions, their overall correlation cor(compartment_rank_1, compartment_rank_2) remains high (> 0.4).

The strategy: given a bin_size specified by the user, we call compartments at extended bin_sizes and choose the smallest bin_size such that no larger bin_size increases the correlation with a reference compartment by more than 0.05. For example, if the correlation is 0.2 for bin_size=10000 and 0.6 for bin_size=50000, we consider the latter more reliable; if the correlation is 0.5 for bin_size=10000 and 0.52 for bin_size=50000, we choose the former because it has higher resolution.
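The selection rule can be sketched in a few lines of R. This is only an illustration of the rule as stated above, not CALDER's internal code: the function name `choose_bin_size` and its `min_gain` argument are hypothetical, and the correlations are assumed to have been computed already against the reference compartments.

```r
# Hypothetical helper (not part of CALDER): return the smallest bin_size such
# that no larger bin_size improves the correlation by more than `min_gain`.
choose_bin_size <- function(bin_sizes, cors, min_gain = 0.05) {
  stopifnot(length(bin_sizes) == length(cors), length(bin_sizes) > 0)
  ord <- order(bin_sizes)            # scan from smallest to largest bin_size
  bin_sizes <- bin_sizes[ord]
  cors <- cors[ord]
  for (i in seq_along(bin_sizes)) {
    larger <- cors[seq_along(cors) > i]   # correlations at all larger bin_sizes
    if (length(larger) == 0 || max(larger) - cors[i] <= min_gain) {
      return(bin_sizes[i])
    }
  }
}

# The two examples from the text:
choose_bin_size(c(10E3, 50E3), c(0.2, 0.6))   # 50000
choose_bin_size(c(10E3, 50E3), c(0.5, 0.52))  # 10000
```

The loop always terminates with a result, because the largest bin_size has no larger competitor and is therefore accepted as a fallback.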
bin_size is extended as follows, so that larger bin_sizes can be aggregated directly from the input contact matrix:
if(bin_size==5E3) bin_sizes = c(5E3, 10E3, 50E3, 100E3)
if(bin_size==10E3) bin_sizes = c(10E3, 50E3, 100E3)
if(bin_size==20E3) bin_sizes = c(20E3, 40E3, 100E3)
if(bin_size==25E3) bin_sizes = c(25E3, 50E3, 100E3)
if(bin_size==40E3) bin_sizes = c(40E3, 80E3)
if(bin_size==50E3) bin_sizes = c(50E3, 100E3)
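Every extended size listed above is an integer multiple of the input bin_size, which is what makes direct aggregation possible. The sketch below shows one way such a re-binning could look for a dump-format table. It is an assumption-laden illustration: `aggregate_contacts` is a hypothetical name, summing values is the natural choice for observed counts, and CALDER's own aggregation may differ.

```r
# Hypothetical helper (not CALDER's internal code): re-bin a dump-format
# contact table to a coarser bin_size by summing contact values.
aggregate_contacts <- function(contacts, bin_size, new_bin_size) {
  # Direct aggregation only works when the new size is a multiple of the old one
  stopifnot(new_bin_size %% bin_size == 0)
  coarse <- data.frame(
    pos_x = floor(contacts$pos_x / new_bin_size) * new_bin_size,
    pos_y = floor(contacts$pos_y / new_bin_size) * new_bin_size,
    contact_value = contacts$contact_value
  )
  aggregate(contact_value ~ pos_x + pos_y, data = coarse, FUN = sum)
}

# Toy table (made-up values) at 10kb, aggregated to 20kb
contacts <- data.frame(
  pos_x = c(0, 0, 10000, 0, 10000, 20000),
  pos_y = c(0, 10000, 10000, 20000, 20000, 20000),
  contact_value = c(10, 4, 8, 2, 3, 6)
)
coarse <- aggregate_contacts(contacts, bin_size = 10E3, new_bin_size = 20E3)
print(coarse)
```

The total contact value is preserved by construction, which is a quick sanity check after any re-binning.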
Note that this strategy is currently only available for the hg19, hg38, mm9 and mm10 genomes, for which we generated high-quality reference compartments using Hi-C data from GSE63525 (hg19), https://data.4dnucleome.org/files-processed/4DNFI1UEG1HD/ (hg38), GSM3959427 (mm9) and http://hicfiles.s3.amazonaws.com/external/bonev/CN_mapq30.hic (mm10).
git clone https://github.com/CSOgroup/CALDER.git
install.packages(path_to_CALDER, repos = NULL, type="source") ## install from the cloned source file
Please contact yliueagle@googlemail.com for any questions about installation.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GenomicRanges")
install.packages("remotes")
remotes::install_github("CSOgroup/CALDER")
The input data of CALDER is a three-column text file storing the contact table of a full chromosome (zipped format is acceptable, as long as it can be read by data.table::fread). Each row represents a contact record pos_x, pos_y, contact_value, which is the same format as that generated by the dump command of juicer (https://github.com/aidenlab/juicer/wiki/Data-Extraction):
16050000    16050000    10106.306
16050000    16060000    2259.247
16060000    16060000    7748.551
16050000    16070000    1251.3663
16060000    16070000    4456.1245
16070000    16070000    4211.7393
16050000    16080000    522.0705
16060000    16080000    983.1761
16070000    16080000    1996.749
...
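Before running CALDER it can be worth checking that a contact table really has this shape. The sketch below parses the rows shown above with base R and runs a few sanity checks. The column names and the bin size heuristic (smallest gap between distinct positions) are assumptions made for illustration; for a real file, `data.table::fread(contact_mat_file)` would replace the inlined text.

```r
# The example rows from above, inlined so the snippet is self-contained
dump_text <- "16050000 16050000 10106.306
16050000 16060000 2259.247
16060000 16060000 7748.551
16050000 16070000 1251.3663
16060000 16070000 4456.1245
16070000 16070000 4211.7393
16050000 16080000 522.0705
16060000 16080000 983.1761
16070000 16080000 1996.749"

# Column names are illustrative; the file itself has no header
contacts <- read.table(text = dump_text,
                       col.names = c("pos_x", "pos_y", "contact_value"))

# In these rows each contact is listed once, with pos_x <= pos_y
is_upper_triangular <- all(contacts$pos_x <= contacts$pos_y)

# Heuristic: the bin size is the smallest gap between distinct positions
positions <- sort(unique(c(contacts$pos_x, contacts$pos_y)))
inferred_bin_size <- min(diff(positions))

print(is_upper_triangular)
print(inferred_bin_size)
```

The inferred value should match the bin_size passed to CALDER; a mismatch usually means the wrong resolution was dumped.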
A demo dataset is included in the repository at CALDER/inst/extdata/mat_chr22_10kb_ob.txt.gz and can be accessed with system.file("extdata", "mat_chr22_10kb_ob.txt.gz", package='CALDER') once CALDER is installed. This dataset contains contact values of GM12878 on chr22 binned at 10kb (Rao et al. 2014).
CALDER contains three modules: (1) compute chromatin domains; (2) derive their hierarchical organization and obtain sub-compartments; (3) compute nested sub-domains within each compartment domain.
CALDER_main(contact_mat_file, 
			chr, 
			bin_size, 
			out_dir, 
			sub_domains=TRUE, 
			save_intermediate_data=FALSE,
			genome='hg19')
# This will not compute sub-domains, but will save the intermediate data that can be used to compute sub-domains later on
CALDER_main(contact_mat_file, 
			chr, 
			bin_size, 
			out_dir, 
			sub_domains=FALSE, 
			save_intermediate_data=TRUE,
			genome='hg19') 
# (Optional, depending on needs) Compute sub-domains using the intermediate_data_file previously saved in out_dir (named chrxx_intermediate_data.Rds)
CALDER_sub_domains(intermediate_data_file, 
				   chr, 
				   out_dir, 
				   bin_size) 
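To connect the two calls above, the path of the saved intermediate file has to be reconstructed. The helper below is a hypothetical convenience, not part of CALDER. It assumes the file name follows the chrxx_intermediate_data.Rds pattern mentioned above, with xx being the chromosome name without the 'chr' prefix, and it only calls CALDER_sub_domains when the package and the file are both present.

```r
# Hypothetical helper: build the expected path of the intermediate data file.
# Assumes the naming pattern chrxx_intermediate_data.Rds inside out_dir.
intermediate_data_path <- function(out_dir, chr) {
  chr_name <- paste0("chr", sub("^chr", "", as.character(chr)))
  file.path(out_dir, paste0(chr_name, "_intermediate_data.Rds"))
}

intermediate_data_file <- intermediate_data_path("./GM12878", 22)
print(intermediate_data_file)

# Run the sub-domain step only if CALDER is installed and the file exists
if (requireNamespace("CALDER", quietly = TRUE) && file.exists(intermediate_data_file)) {
  CALDER::CALDER_sub_domains(intermediate_data_file, 22, "./GM12878", 10E3)
}
```

If the file is not found, checking the contents of out_dir for the actual file name is the safest next step.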
| Name              | Description |
| --------------------- | ----------------------- |
| chrs                | A vector of chromosome names to be analyzed, with or without 'chr'
| contact_file_dump                |A list of contact files in dump format, named by chrs. Each contact file stores the contact information of the corresponding chr. Only one of contact_file_dump, contact_tab_dump, contact_file_hic should be provided
| contact_tab_dump                | A list of contact table in dump format, named by chrs, stored as an R object. Only one of contact_file_dump, contact_tab_dump, contact_file_hic should be provided
| contact_file_hic                | A hic file generated by Juicer tools. It should contain all chromosomes in chrs. Only one of contact_file_dump, contact_tab_dump, contact_file_hic should be provided
| ref_genome                | One of 'hg19', 'hg38', 'mm9', 'mm10', 'others' (default). These compartments will be used as reference compartments for optimized bin_size selection. If ref_genome = others, an annotation_track should be provided (see below) and no optimized bin_size selection will be performed
| annotation_track                | A genomic annotation track in data.frame or data.table format. This track will be used for determining the A/B compartment direction when ref_genome=others and should presumably have higher values in A than in B compartments. Suggested tracks include gene density, H3K27ac, H3K4me1, H3K4me2, H3K4me3, H3K36me3 (or a negative transform of H3K9me3 signals)
| bin_size         | The bin_size (resolution) at which to run CALDER. bin_size should be consistent with the data resolution in contact_file_dump or contact_tab_dump if these are provided as input; otherwise bin_size should exist in the contact_file_hic file. The recommended bin_size is between 10000 and 50000
| save_dir             | The directory in which to save outputs
| save_intermediate_data  | logical. If TRUE, an intermediate_data will be saved. This file can be used for computing nested sub-domains later on
| n_cores     |  integer. Number of cores to be registered for running CALDER in parallel
| single_binsize_only     |  logical. If TRUE, CALDER will compute compartments only using the bin_size specified by the user and not do bin size optimization
| sub_domains     |  logical, whether to compute nested sub-domains
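The table states that only one of contact_file_dump, contact_tab_dump and contact_file_hic should be provided. A wrapper script could enforce this before starting a long run. The function below is a hypothetical pre-flight check written for illustration; it is not part of CALDER, and the file paths in the usage lines are placeholders.

```r
# Hypothetical pre-flight check (not part of CALDER): make sure exactly one
# of the three contact inputs is supplied and report which one it is.
check_contact_input <- function(contact_file_dump = NULL,
                                contact_tab_dump = NULL,
                                contact_file_hic = NULL) {
  provided <- c(
    contact_file_dump = !is.null(contact_file_dump),
    contact_tab_dump  = !is.null(contact_tab_dump),
    contact_file_hic  = !is.null(contact_file_hic)
  )
  if (sum(provided) != 1) {
    stop("Provide exactly one of: ", paste(names(provided), collapse = ", "))
  }
  names(provided)[provided]
}

# Placeholder paths, used only to show the call shape
check_contact_input(contact_file_hic = "sample.hic")
check_contact_input(contact_file_dump = list("22" = "chr22_dump.txt.gz"))
```

Returning the name of the supplied argument makes it easy for the caller to branch on the input type.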
In compartment_label, for example, B.2.2.2 and B.2.2.1 are two sub-branches of B.2.2. The pos_end column specifies all compartment domain borders, except when it is marked as gap, which indicates the border of a gap chromosome region that has too few contacts and was excluded from the analysis (e.g., due to low mappability, deletion, or technical flaws).

The output of the workflow is stored in the folder specified by --save_dir ("results" by default) and will look like this:
results/
└── HiC_sample_1
    ├── 100000
    │   └── KR
    │       ├── chr1
    │       │   ├── chr1_domain_boundaries.bed
    │       │   ├── chr1_domain_hierachy.tsv
    │       │   ├── chr1_log.txt
    │       │   ├── chr1_nested_boundaries.bed
    │       │   ├── chr1_sub_compartments.bed
    │       │   └── chr1_sub_domains_log.txt
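When post-processing these outputs, the hierarchical compartment labels described earlier (B.2.2.2 and B.2.2.1 as sub-branches of B.2.2) can be unpacked with plain string handling. The sketch below is illustrative only: `parent_label` is a hypothetical helper, and it assumes labels are dot-separated with the top-level compartment as the first element.

```r
# Hypothetical helper: return the parent branch of a hierarchical label,
# or NA when the label is already at the top level.
parent_label <- function(label) {
  parts <- strsplit(label, ".", fixed = TRUE)[[1]]
  if (length(parts) == 1) return(NA_character_)
  paste(head(parts, -1), collapse = ".")
}

labels <- c("B.2.2.2", "B.2.2.1", "B.2.2")
parents <- vapply(labels, parent_label, character(1), USE.NAMES = FALSE)
top_level <- sub("\\..*$", "", labels)   # text before the first dot

print(parents)    # "B.2.2" "B.2.2" "B.2"
print(top_level)  # "B" "B" "B"
```

Truncating labels this way gives a coarser sub-compartment annotation without rerunning the analysis.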
Regarding computational requirements, running CALDER on the GM12878 Hi-C dataset at a bin size of 40kb took 36 minutes to derive the chromatin domains and their hierarchy for all chromosomes (i.e., CALDER Step1 and Step2) and 13 minutes to derive the nested sub-domains (i.e., CALDER Step3). At a bin size of 10kb, these steps took 1 h 44 minutes and 55 minutes, respectively (server information: 40 cores, 64 GB RAM, Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz). The evaluation was done using a single core, although CALDER can be run in parallel.
library(CALDER)
contact_mat_file = system.file("extdata", "mat_chr22_10kb_ob.txt.gz", package = 'CALDER')
CALDER_main(contact_mat_file, chr=22, bin_size=10E3, out_dir='./GM12878', sub_domains=TRUE, save_intermediate_data=FALSE)
The saved .bed files can be viewed directly in IGV.
If you use CALDER in your work, please cite: https://www.nature.com/articles/s41467-021-22666-3