Hi Kristina,
Thanks for asking this question; it confuses a lot of people.
Your first assumption about min.per.group is correct. During uniting/merging of samples, a CpG or region is only kept if it has been detected in at least the minimum number of samples per group (min.per.group = 3) [see code]. Please note that by default we keep only sites that are detected in all samples, irrespective of their groups.
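To make the filter concrete, here is a toy sketch in plain Python. It is not methylKit code: the function name unite_sites, the sample data, and the group labels are all made up for illustration, and each sample is reduced to the set of CpG positions it covers.

```python
def unite_sites(samples, groups, min_per_group=None):
    """Toy model of the merge filter (hypothetical helper, not methylKit).

    samples: list of sets of covered CpG positions, one set per sample
    groups:  list of group labels, in the same order as samples
    min_per_group: None mimics the default (site must be in all samples)
    """
    group_sizes = {}
    for label in groups:
        group_sizes[label] = group_sizes.get(label, 0) + 1

    kept = set()
    for site in set().union(*samples):
        ok = True
        for label, size in group_sizes.items():
            # Number of samples in this group that cover the site
            n = sum(site in s for s, g in zip(samples, groups) if g == label)
            needed = size if min_per_group is None else min_per_group
            if n < needed:
                ok = False
                break
        if ok:
            kept.add(site)
    return kept


groups = ["ctrl"] * 4 + ["treat"] * 4
samples = [
    {100, 200, 300}, {100, 200}, {100, 200, 300}, {100},       # ctrl
    {100, 200, 300}, {100, 300}, {100, 200}, {100, 200, 300},  # treat
]

print(sorted(unite_sites(samples, groups)))                   # [100]
print(sorted(unite_sites(samples, groups, min_per_group=3)))  # [100, 200]
```

Position 200 is covered by three samples in each group, so it survives only with min_per_group=3; position 300 is covered by just two control samples and is dropped either way.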
For the regional analysis (tileMethylCounts, regionCounts), the first statement is true. The cov.bases argument is used to filter regions based on the number of CpGs they cover (see code). This allows us to keep low-coverage regions where samples may not cover the exact same CpGs, but do cover close-by CpGs within the same window.
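A similar toy sketch for the cov.bases idea, again in plain Python rather than methylKit: tile_sample is a hypothetical helper, the windows are assumed to be non-overlapping and 1-based, and a window is assumed to be kept when it contains at least cov_bases covered CpGs.

```python
from collections import Counter


def tile_sample(cpg_positions, win_size=1000, cov_bases=0):
    """Toy tiling of one sample (hypothetical helper, not methylKit).

    Returns the (start, end) windows that contain at least cov_bases
    covered CpGs. Windows are non-overlapping and 1-based here.
    """
    per_window = Counter((pos - 1) // win_size for pos in set(cpg_positions))
    return {
        (idx * win_size + 1, (idx + 1) * win_size)
        for idx, n in per_window.items()
        if n >= cov_bases
    }


sample_a = {120, 450, 1500}
sample_b = {130, 700}

tiles_a = tile_sample(sample_a, win_size=1000, cov_bases=2)
tiles_b = tile_sample(sample_b, win_size=1000, cov_bases=2)

print(sorted(sample_a & sample_b))  # [] -> no CpG position is shared
print(sorted(tiles_a & tiles_b))    # [(1, 1000)] -> but the window is
```

The two samples share no exact CpG position, yet both pass the filter for the window 1-1000, so the region is still comparable between them. The lone CpG at 1500 in sample_a does not reach cov_bases=2, so that window is dropped.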
To answer your last question, the order will be important here.
a) If you first merge and then tile, your samples need to have even coverage over the genome to retain a large number of CpGs. Any CpG that is not covered in at least three of the four samples per group will be discarded. Then, during the tiling step, the genome is chunked using sliding windows, and any window that does not overlap at least two CpGs from the merged data will be discarded as well. The tiling should increase the per-region coverage and can improve the statistical testing during DMR calling. As a note, if you recover a lot of CpGs during the merging, I would also recommend doing the per-CpG DMC analysis and then using methSeg to aggregate them into DMRs.
b) You can also reverse the order of those two steps. Doing the tiling before merging will allow you to recover more regions if the per-CpG coverage is variable or you are dealing with low-coverage samples. The chance of missing a region during the merging will be lower, and you should go into DMR calling with more regions. However, this is usually only an issue when dealing with low-coverage samples, such as cfDNA or similar, not so much when dealing with tissue samples etc.
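The effect of the order in a) and b) can be sketched with a small toy simulation. This is plain Python, not methylKit: unite and tile are hypothetical stand-ins for the merging and tiling steps, windows are identified by an index only, and the data are constructed so that every sample covers two CpGs in the first window but never at the same positions.

```python
def unite(samples, groups, min_per_group):
    """Keep items seen in at least min_per_group samples of every group."""
    labels = set(groups)
    kept = set()
    for item in set().union(*samples):
        if all(
            sum(item in s for s, g in zip(samples, groups) if g == label)
            >= min_per_group
            for label in labels
        ):
            kept.add(item)
    return kept


def tile(positions, win_size, cov_bases):
    """Return indices of windows holding at least cov_bases positions."""
    counts = {}
    for pos in set(positions):
        idx = (pos - 1) // win_size
        counts[idx] = counts.get(idx, 0) + 1
    return {idx for idx, n in counts.items() if n >= cov_bases}


groups = ["ctrl"] * 4 + ["treat"] * 4
# Window 0: two CpGs per sample, at sample-specific positions.
# Window 1: the same two CpGs (1100, 1200) in every sample.
samples = [{100 + 10 * i, 500 + 10 * i, 1100, 1200} for i in range(8)]

# a) merge first, then tile the merged CpGs
merged_cpgs = unite(samples, groups, min_per_group=3)
merge_then_tile = tile(merged_cpgs, win_size=1000, cov_bases=2)

# b) tile each sample first, then merge the resulting regions
tiled = [tile(s, win_size=1000, cov_bases=2) for s in samples]
tile_then_merge = unite(tiled, groups, min_per_group=3)

print(sorted(merge_then_tile))  # [1]
print(sorted(tile_then_merge))  # [0, 1]
```

With merging first, the scattered CpGs in window 0 never reach three samples per group, so the whole window is lost. With tiling first, each sample contributes window 0 on its own, and the region survives the merge.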
I hope that helps.
Best,
Alex