Expected differences between Bismark coverage files vs. CpG reports with unite(destrand=TRUE)

13 views
Skip to first unread message

Álvaro Gutiérrez Rodríguez

unread,
May 25, 2026, 8:21:31 AM (14 days ago) May 25
to methylkit_discussion

Hi everyone,

I am setting up a differential methylation analysis pipeline using methylKit and have a question regarding data preparation and strand merging.

I am considering two different workflows and want to know if they are expected to yield identical results, and if one is preferred over the other:

  • Approach 1: Loading .cov.gz files directly from Bismark using methRead(). These files already have both strands grouped into single coordinates without explicit strand information.

  • Approach 2: Loading full Bismark CpG reports (which retain + and - strand info) using methRead(), and then running unite(..., destrand=TRUE) to merge the strands within methylKit.

My specific questions:

  1. Does methylKit's unite(destrand=TRUE) algorithm merge symmetric CpG top/bottom strand counts exactly the same way Bismark groups them when generating a .cov file? Should I expect identical methylation percentages and read coverages per site?

  2. Is one approach inherently better or more recommended for downstream differential analysis?

Any insights into how these two methods handle the math under the hood would be greatly appreciated!

(In fact, i am recovering more CpGs after merging using the CpG_report.txt files)

Thanks in advance, Álvaro

Alexander Blume

unread,
May 27, 2026, 6:08:46 AM (12 days ago) May 27
to methylkit_...@googlegroups.com
Hi Álvaro,

Thank you for the detailed question and for outlining the different approaches.

They should hold identical results in default mode.

The math should be the same for both approaches: just summing reads
from the same CpG pairs.
On the bismark side, there are some more checks; you can perform a
discordance check (—discordance), and the reads are processed in a
loop ( see the merging routine in coverage2cytosine).
In methylKit, we take the data as is and perform the strand shift in a
vectorised manner on the whole dataset (see the merging function in
methylKit).

The logic for reading either coverage file or cytosine report is quite
simple; there should not be a difference in the objects created,
except for the lack of strand information in the coverage file.

Given that you already reported differences in the two approaches,
could you maybe give some numbers about how many CpGs are
shared/different in your case?
Also, if you are able to prepare a small reproducible example from a
subset of your data, I would be happy to check for an explanation of
the differences.

Overall, if you want to run more qc checks and are considering
discordance checks, then perform the merging via Bismark. If this is
not so relevant, you can also do it in methylKit.

Best,
Alex




Am Mo., 25. Mai 2026 um 14:21 Uhr schrieb Álvaro Gutiérrez Rodríguez
<alvarogut...@gmail.com>:
> --
> You received this message because you are subscribed to the Google Groups "methylkit_discussion" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to methylkit_discus...@googlegroups.com.
> To view this discussion visit https://groups.google.com/d/msgid/methylkit_discussion/8678f739-ecb9-41ba-b360-680c0f064bdan%40googlegroups.com.

alex....@gmail.com

unread,
May 27, 2026, 6:12:47 AM (12 days ago) May 27
to methylkit_discussion

Álvaro Gutiérrez Rodríguez

unread,
May 27, 2026, 8:41:08 AM (12 days ago) May 27
to methylkit_...@googlegroups.com, nsanchez, alex....@gmail.com

Hi Alex,

Thanks for your quick and detailed response.

I can certainly prepare some examples for you. In the meantime, I realize I should have included this information in my first post, so I have attached a copy of the same sample in both formats below.

CpG report:

LG10    5       +       0       0       CG      CGT
LG10    6       -       0       0       CG      CGT
LG10    45      +       3       1       CG      CGC
LG10    46      -       0       0       CG      CGA
LG10    49      +       4       0       CG      CGC
LG10    50      -       0       0       CG      CGG
LG10    82      +       4       0       CG      CGC
LG10    83      -       0       1       CG      CGG

 Cov.gz:
LG10    45      45      75      3       1
LG10    49      49      100     4       0
LG10    82      82      100     4       0
LG10    83      83      0       0       1

I expected the data from both strands to be merged in the .cov.gz files. However, as you can see, positions 82 and 83 are treated as separate, contiguous cytosines rather than being combined. Shouldn't they be counted as a single entity in the coverage file?

Furthermore, how can methylKit::unite(..., destrand=TRUE) yield accurate results if the .cov.gz file lacks strand information and the strands aren't pre-merged? The missing strand information in the .cov.gz format seems like a notable disadvantage.

Regarding file size, if the main issue with CpG reports is that they become bloated by lines containing no data, those uninformative lines could easily be stripped out beforehand using a simple awk command.

This two files were generated using the following command:

/opt/software/Bismark-0.24.2/bismark_methylation_extractor "$input_dir"/*.bam \
        --paired-end --comprehensive \
        --cytosine_report \
        --cutoff 1 --multicore 4 \
        --samtools_path /opt/software/samtools-1.11 \
        --genome_folder "$SEABASS_GENOME" \
        --output "$EXTRACT_PATH"

Best regards,

Álvaro

You received this message because you are subscribed to a topic in the Google Groups "methylkit_discussion" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/methylkit_discussion/FTw3b10E7GU/unsubscribe.
To unsubscribe from this group and all its topics, send an email to methylkit_discus...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/methylkit_discussion/1e1ad29a-b951-4498-8f82-dbb415588a22n%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages