Hi everyone,
I am setting up a differential methylation analysis pipeline using methylKit and have a question regarding data preparation and strand merging.
I am considering two different workflows and want to know if they are expected to yield identical results, and if one is preferred over the other:
Approach 1: Loading .cov.gz files directly from Bismark using methRead(). These files already have both strands grouped into single coordinates without explicit strand information.
Approach 2: Loading full Bismark CpG reports (which retain + and - strand info) using methRead(), and then running unite(..., destrand=TRUE) to merge the strands within methylKit.
My specific questions:
Does methylKit's unite(destrand=TRUE) algorithm merge symmetric CpG top/bottom strand counts exactly the same way Bismark groups them when generating a .cov file? Should I expect identical methylation percentages and read coverages per site?
Is one approach inherently better or more recommended for downstream differential analysis?
Any insights into how these two methods handle the math under the hood would be greatly appreciated!
(In fact, i am recovering more CpGs after merging using the CpG_report.txt files)
Thanks in advance, Álvaro
Hi Alex,
Thanks for your quick and detailed response.
I can certainly prepare some examples for you. In the meantime, I realize I should have included this information in my first post, so I have attached a copy of the same sample in both formats below.
CpG report:
I expected the data from both strands to be merged in the .cov.gz files. However, as you can see, positions 82 and 83 are treated as separate, contiguous cytosines rather than being combined. Shouldn't they be counted as a single entity in the coverage file?
Furthermore, how can methylKit::unite(..., destrand=TRUE) yield accurate results if the .cov.gz file lacks strand information and the strands aren't pre-merged? The missing strand information in the .cov.gz format seems like a notable disadvantage.
Regarding file size, if the main issue with CpG reports is that they become bloated by lines containing no data, those uninformative lines could easily be stripped out beforehand using a simple awk command.
This two files were generated using the following command:
/opt/software/Bismark-0.24.2/bismark_methylation_extractor "$input_dir"/*.bam \
--paired-end --comprehensive \
--cytosine_report \
--cutoff 1 --multicore 4 \
--samtools_path /opt/software/samtools-1.11 \
--genome_folder "$SEABASS_GENOME" \
--output "$EXTRACT_PATH"
Best regards,
You received this message because you are subscribed to a topic in the Google Groups "methylkit_discussion" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/methylkit_discussion/FTw3b10E7GU/unsubscribe.
To unsubscribe from this group and all its topics, send an email to methylkit_discus...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/methylkit_discussion/1e1ad29a-b951-4498-8f82-dbb415588a22n%40googlegroups.com.