Coverage at overlapping paired end reads

342 views
Skip to first unread message

Eva König

unread,
Jul 21, 2015, 10:23:51 AM7/21/15
to qual...@googlegroups.com
Dear development team,

I think that QualiMap counts bases of overlapping paired end (PE) reads twice at the overlapping sites for the coverage statistics. I have paired end data and ran QualiMap once on a bam file generated from the raw reads (normal PE alignment, adapters removed before) and once on a bam file generated from the merged reads (adapters removed, single end alignment of the merged reads plus PE alignment of the non-overlapping PE reads). The coverage was much higher in the first case, the difference in coverage corresponds to the amount of overlapping base pairs.

In my opinion this is not correct measurement of coverage, since the general consensus of how to use overlapping PE data is to use the additional information in the overlap to improve base quality, mapping and variant calling confidence, but definitely not to count these reads twice as independent reads. So consequently, these reads should be counted only once when computing coverage.

What do you think?

Thanks
Eva

Konstantin Okonechnikov

unread,
Jul 22, 2015, 5:34:37 AM7/22/15
to qual...@googlegroups.com
Hi Eva,

thanks for interesting suggestion!

I suppose, that detection of  overlapping PE read alignments is quite important.  The number of detected overlapping PE reads can be provided in the report by default. I created an issue regarding this:


However, computing of coverage (and other alignment properties) by combining the pairs into single read can have different results in various research projects.

Here are some discussions regarding this aspect:

I suppose this can be added as a special option for BAM QC mode. I plan to create this one more issue and have some questions about this: 

Could you please provide more information how you performed the merging (tools, commands etc)? 

Also, is it possible to share a small subsample from your data with such big difference in coverage between default and modified BAM file? Or perhaps you know where some data with overlapping paired-end reads can be downloaded from? Such data will be important for testing of implemented option.

--
  Konstantin

 

--
You received this message because you are subscribed to the Google Groups "QualiMap" group.
To unsubscribe from this group and stop receiving emails from it, send an email to qualimap+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Eva König

unread,
Jul 23, 2015, 9:43:17 AM7/23/15
to QualiMap, k.okone...@gmail.com
Hi Konstantin,

thanks for creating the issue, this will be an interesting statistic to add to QualiMap.

Regarding the merging of overlapping PE reads, I am not claiming that it is a good way to deal with them, for me it was just the easiest and quickest way to test if QualiMap counts them twice in the coverage statistics or not.

Unfortunately I cannot share my data, in Italy the privacy laws are just too strict. But you should be able to get some test data from Illumina Base Space for the Nextera kit, for example for sample NA12144 or NA10859. In these samples the overlap is not as much as in my data, but there is some.

I merged the reads with the following commands:

nextera_adapter="CTGTCTCTTATACACATCTC"

SeqPrep -f Sample_R1.fastq -r Sample_R2.fastq \
    -1 Sample_trimmed_1.fastq.gz \
    -2 Sample_trimmed_2.fastq.gz \
    -s Sample_trimmed_merged.fastq.gz \
    -A $nextera_adapter -B $nextera_adapter \
    -L 40 -O 9 -o 15

Then I did a BWA sample on the two trimmed PE read files, BWA samse on the merged fastq file and then picards MergeSamFiles to merge the file into one.

I hope this helps.
Eva


Konstantin Okonechnikov

unread,
Jul 24, 2015, 9:42:38 AM7/24/15
to qual...@googlegroups.com
Hi Eva,

thanks for additional info!

I created an issue:

Currently I have some other tasks, but I will post here once there will be some results.

--
  Konstantin








Konstantin Okonechnikov

unread,
Aug 21, 2015, 8:13:33 AM8/21/15
to qual...@googlegroups.com
Hi Eva, 

I've implemented an option to detect overlapping paired-end reads during BAM QC analysis and compute mean coverage ignoring overlapping block from one read.

The option was tested on simulated data and also on dataset from Illumina (https://basespace.illumina.com/projects/609609).

Currently the option can be activated from command line analysis  with -ic parameter. Here's an example:

qualimap bamqc -ip -bam /home/kokonech/playgrnd/paired_reads_intersecting_Illumina_example/alignment.sorted.bam  -gff /data/annotations/Homo_sapiens.GRCh37.68.clean.bed 

Would be great if you will have time to test the novelty on your data and report if adapted coverage results are similar to those that you computed by modifying BAM file.

The latest package with this option is available from here:

Let me know if there are any additional questions.

--
  Konstantin










Message has been deleted

Eva König

unread,
Sep 10, 2015, 10:53:47 AM9/10/15
to QualiMap, k.okone...@gmail.com
Hi Konstantin,

sorry for the late reply. I tested the QualiMap 21-08-15 version with the -ic parameter. The additional info is only the mean (paired-end reads overlap ignored) under 2.4, right? Are the coverage related plots still computed from the coverage statistics with the overlapping PE reads counted twice?

First, I ran the QualiMap version on a bam where I previously merged the overlapping PE reads. As expected, I got the same value for mean and for mean (paired-end reads overlap ignored), which was 17.5.
Then, I ran QualiMap on a bam where I had not merged the overlapping PE reads. As expected, I got a higher value under mean (21.2), but a negative value (-4.8) for mean (paired-end reads overlap ignored). Since 21.2 - 4.8 is roughly 17, maybe you put the value that had to be subtracted under the mean (paired-end reads overlap ignored) field?

I attached the reports in case this was not clear.

Eva
PE_reads_not_merged.pdf
PE_reads_merged.pdf

Konstantin Okonechnikov

unread,
Sep 11, 2015, 8:29:06 AM9/11/15
to qual...@googlegroups.com
Hi Eva,

thanks a lot for the report! 

Actually, a bug was detected: the skipped duplicate alignments were not taken into account when computing mean coverage with overlapping read pairs. Now this issue is fixed. Would be great if you rerun analysis with novel version. Here's the link:


Also, it might be interesting to check the results without skipping duplicates. Note, that currently coverage plot is not modified with overlappers. However, number of overlapping read pairs is reported in "Globals"  or "Globals (inside regions") statistics if annotation is provided.

--
  Konstantin






--
Reply all
Reply to author
Forward
0 new messages