Does running cicero on individual chromosomes affect distance parameter?

Daniel Gingerich

Aug 18, 2021, 4:35:39 PM

to cicero-users

Hello,

I am working with a very large dataset. For the sake of time, I am running cicero separately on each chromosome using parallel computing (slurm job array). Does this mean that there will be a different distance parameter used for each chromosome? If so, does this affect the validity of my results?

Thanks,

Dan

Daniel Gingerich

Aug 19, 2021, 10:52:13 AM

to cicero-users

Additional info:

I am running chromosomes individually by subsetting the 'chrom.sizes' data frame, as is done in the vignette.

# run chromosomes individually in parallel with slurm array job

j <- as.numeric(Sys.getenv('SLURM_ARRAY_TASK_ID'))

# load data

cds <- readRDS('path/to/HUGE/file/cicero.cds.rds')

hg38 <- read.table('hg38.chrom.sizes.tsv')

# subset chroms and run cicero on each

hg38.j <- hg38[j, , drop = FALSE]

conns <- run_cicero(cds, hg38.j)

My initial thoughts on this are:

When all chromosomes are run at once, the distance parameter is representative of the whole genome. When only one chromosome is run at a time, the distance parameter is representative of that chromosome, because the random sample windows are all located on it. This might be favorable because some chromosomes have more peak density than others, and it makes the distance penalty specific to the chromosome of interest.

I would like to know what you think, Hannah.

Thanks!

hpl...@gmail.com

Aug 24, 2021, 1:25:46 PM

to cicero-users

Hi Dan,

Under the hood, Cicero always runs on the chromosomes separately after calculating the distance parameter, so using a shared distance parameter would exactly reproduce the output of running it the 'slow' way. You could do this by running the three functions that run_cicero wraps separately: first run estimate_distance_parameter once on the whole dataset, then run generate_cicero_models and assemble_connections separately for each chromosome, using the value from the first function.
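
The three-step approach described above might look like the following in R. This is a minimal sketch, not a tested pipeline: it reuses the file paths from the earlier post, the output file names (`dist_param.rds`, `conns.<chrom>.rds`) are hypothetical, and averaging the per-window estimates follows the pattern shown in the Cicero vignette rather than anything stated in this thread.

```r
library(cicero)

# Paths copied from the earlier post in this thread
cds  <- readRDS('path/to/HUGE/file/cicero.cds.rds')
hg38 <- read.table('hg38.chrom.sizes.tsv')

# Step 1 (run once, outside the job array): estimate on the whole genome.
# estimate_distance_parameter returns one estimate per sampled window,
# so the estimates are averaged into a single shared value.
dist_params <- estimate_distance_parameter(cds, genomic_coords = hg38)
mean_dist_param <- mean(unlist(dist_params))
saveRDS(mean_dist_param, 'dist_param.rds')  # hypothetical file name

# Step 2 (inside each array task): reuse the shared value on one chromosome.
j <- as.numeric(Sys.getenv('SLURM_ARRAY_TASK_ID'))
hg38.j <- hg38[j, , drop = FALSE]
mean_dist_param <- readRDS('dist_param.rds')
models <- generate_cicero_models(cds,
                                 distance_parameter = mean_dist_param,
                                 genomic_coords = hg38.j)

# Step 3: assemble this chromosome's models into a connections table.
conns <- assemble_connections(models)
saveRDS(conns, paste0('conns.', hg38.j[1, 1], '.rds'))  # hypothetical file name
```

In practice, step 1 would live in its own script that finishes before the array is submitted, so that every task reads the same saved value.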

As for using a separate distance parameter for each chromosome, I haven't explored this myself. I suspect that, so long as the chromosomes are large enough to have sufficient good windows for estimating, you'd get similar results to using the shared parameter, but it would be interesting to look at in practice (one quick way is to simply check how variable the estimated parameters are). One concern to keep in mind, depending on your plans downstream, is whether the coaccessibility scores would be less comparable across chromosomes if you were using penalties with large differences.
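
One way to run the quick variability check mentioned above is sketched below. It assumes the same input files as the earlier post and estimates a distance parameter for each chromosome on its own; the `tryCatch` wrapper is an added precaution, since small contigs may not have enough usable windows for estimation.

```r
library(cicero)

# Paths copied from the earlier post in this thread
cds  <- readRDS('path/to/HUGE/file/cicero.cds.rds')
hg38 <- read.table('hg38.chrom.sizes.tsv')

# Mean estimated distance parameter for each chromosome on its own.
# Chromosomes where estimation fails are recorded as NA.
per_chrom <- sapply(seq_len(nrow(hg38)), function(i) {
  tryCatch({
    params <- estimate_distance_parameter(
      cds, genomic_coords = hg38[i, , drop = FALSE])
    mean(unlist(params))
  }, error = function(e) NA_real_)
})
names(per_chrom) <- hg38[, 1]

# Spread of the per-chromosome estimates
summary(per_chrom)

# Coefficient of variation: a rough measure of how much the penalties differ
sd(per_chrom, na.rm = TRUE) / mean(per_chrom, na.rm = TRUE)
```

A small coefficient of variation would suggest the per-chromosome and shared parameters give similar penalties; what counts as "small" here is a judgment call that this thread does not settle.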

Best,

Hannah

Daniel Gingerich

Aug 26, 2021, 3:11:38 PM

to cicero-users

Thanks, Hannah. That makes a lot of sense. In terms of parallel computing, I think it would make the most sense to run each function separately, as you mentioned. I ended up running cicero without parallel computing, as it did not take as long as I thought it would. Models ran for around 24 hours, not bad! Using a slurm array only took around 30 minutes to an hour, which was super surprising to me! I will definitely consider these tips for the next analysis, as this could greatly speed up the calculation.