Hi Hannah,
I have a question regarding your cell aggregation approach prior to co-accessibility score calculation. I understand why it is necessary given the sparsity of the data. However, I am a bit worried about the "duplication" of cells across different aggregates/groups.
In your publication you say: "Note that with these parameter settings in a typical experiment, a cell will be part of more than one group and therefore the groups will sometimes contain some of the same cells, which could in principle inflate co-accessibility scores across cells. However, in practice in our analyses of both GM12878 and HSMM, the median number of cells shared between pairs of groups is zero."
What did you do to keep the median number of shared cells at zero? I assume the 90% cutoff for overlap with existing groups/aggregates is not sufficient on its own. Did you simply reduce the number of groups/aggregates sampled? If so, what is the minimum number of aggregates you would recommend using?
Additionally, why did you initially choose to aggregate 50 cells per group/aggregate?
In other methods, people aggregate cells without regard to the "duplication rate" of cells across aggregates (the same cell can appear in up to 10-20 aggregates). What is your take on that? Wouldn't correlation coefficients computed on these cell aggregates be highly inflated?
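To make the overlap concern concrete, here is a minimal sketch of how I would measure it. It reports the median and maximum number of cells shared between pairs of groups, and the largest number of groups any single cell belongs to. The function name `overlap_summary`, the input layout (one list of cell IDs per aggregate), and the toy cell IDs are my own assumptions for illustration; this is not taken from your code or your data.

```python
from collections import Counter
from itertools import combinations
from statistics import median


def overlap_summary(groups):
    """Summarise cell sharing between aggregates.

    groups: list of collections of cell IDs, one collection per aggregate
            (hypothetical layout; at least two groups are required).
    """
    sets = [set(g) for g in groups]
    # Number of cells shared by every unordered pair of groups.
    shared = [len(a & b) for a, b in combinations(sets, 2)]
    # How many groups each cell appears in ("duplication rate").
    membership = Counter(cell for s in sets for cell in s)
    return {
        "median_shared": median(shared),
        "max_shared": max(shared),
        "max_groups_per_cell": max(membership.values()),
    }


# Toy example with made-up cell IDs (not real data).
toy_groups = [
    ["c1", "c2", "c3"],
    ["c3", "c4", "c5"],
    ["c6", "c7", "c8"],
    ["c9", "c10", "c1"],
]
summary = overlap_summary(toy_groups)
print(summary)
```

In this toy case the median shared count is zero even though two cells each sit in two groups, which is why I am unsure the median alone tells me whether the duplication is harmless.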
Thank you very much in advance for your answer.
Best,
Isabelle