Additional info:
I am running chromosomes individually by subsetting the 'chrom.sizes' data frame, as is done in the vignette.
# run chromosomes individually in parallel with a Slurm array job
library(cicero)
library(magrittr)  # provides %>%
# the array task ID selects which row (chromosome) of the chrom.sizes table to run
j <- Sys.getenv('SLURM_ARRAY_TASK_ID') %>% as.numeric
# load data
cds <- readRDS('path/to/HUGE/file/cicero.cds.rds')
hg38 <- read.table('hg38.chrom.sizes.tsv')
# subset chroms and run cicero on each
hg38.j <- hg38[j, , drop = FALSE]
conns <- run_cicero(cds, hg38.j)
My initial thoughts on this are:
When all chromosomes are run at once, the distance parameter is estimated from windows sampled across the whole genome, so it is representative of the genome as a whole. When only one chromosome is run at a time, the distance parameter is representative of that chromosome alone, because the random sample windows are all located on it. This might be favorable: some chromosomes have a higher peak density than others, and running them separately makes the distance penalty specific to the chromosome of interest.
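To make the comparison concrete, here is a minimal sketch of how the two options could be run side by side. It assumes that `run_cicero` is roughly equivalent to calling `estimate_distance_parameter`, `generate_cicero_models`, and `assemble_connections` in sequence, as the vignette describes. The helper name `run_with_fixed_distance` and its arguments are hypothetical and not part of cicero.

```r
library(cicero)

# Hypothetical helper (not part of cicero): estimate the distance parameter
# from windows sampled on the chromosomes in `coords_for_estimate`, then fit
# the models only on the chromosomes in `coords_for_models`.
run_with_fixed_distance <- function(cds, coords_for_estimate, coords_for_models,
                                    sample_num = 100) {
  # One distance parameter per sampled window
  dist_params <- estimate_distance_parameter(
    cds,
    sample_num = sample_num,
    genomic_coords = coords_for_estimate
  )
  # Average over the sampled windows, as in the vignette
  mean_dist <- mean(unlist(dist_params))

  models <- generate_cicero_models(
    cds,
    distance_parameter = mean_dist,
    genomic_coords = coords_for_models
  )
  assemble_connections(models)
}

# Usage with the objects defined above (assumed to exist in the session):
# genome-wide penalty, models fit on chromosome j only
#   conns.genome <- run_with_fixed_distance(cds, hg38, hg38.j)
# chromosome-specific penalty, which I expect to behave like run_cicero(cds, hg38.j)
#   conns.chrom <- run_with_fixed_distance(cds, hg38.j, hg38.j)
```

Because the windows are sampled at random, the two results would differ slightly between runs even with the same inputs, so any comparison would need a fixed seed or repeated runs.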
I would like to know what you think, Hannah.
Thanks!