genomecoverageBed and Pair-end BAM files

2,768 views
Skip to first unread message

Assaf Gordon

unread,
Apr 2, 2012, 5:00:10 PM4/2/12
to bedtools...@googlegroups.com
Hi Aaron and all,

When calculating coverage of a BAM file from a Paired-end library, and using "-strand" option,
the program ignores the fact that the strand of the second mate is opposite to the first mate.

This issue has been brought up by David Citaro some time ago, but "awk sorcery" was deemed the recommended work-around :)

I'm considering adding an option for "genomecov" to correctly calculate coverage of paired-end BAM files (correctly in this case means: calculate the coverage of the logical "fragment", not the actual two reads).

Before I start coding it, any comments and suggestions would be appropriated, as can get tricky with corner-cases.

My idea is:
in "genomeCovergeBed.cpp", in "BedGenomeCoverage::CoverageBam()", add the following logic:

if ( user specified "-strand"
AND
read is mapped
AND
read is paired-end
AND
read is second mate
AND
first read mapped properly)
then
add coverage to the OPPOSITE strand of this read.
(because the coverage should be accounted towards to the strand of the first mate).

one complication:
what happens if the mates overlap ? need to take "BamAlignment::InsertSize" into account.

Comments are welcomed,
-gordon


Aaron Quinlan

unread,
Apr 2, 2012, 9:28:05 PM4/2/12
to bedtools...@googlegroups.com
Hi Assaf,

An alternative to awk sorcery is certainly welcome. I must admit, however, that I don't understand what is meant by the statement "because the coverage should be accounted towards the strand of the first mate". Do you mean to say that you want to count coverage for each end of the pair, but you always want the coverage to be counted for the strand observed for the first sequenced end of the pair? If so, why? What is the biological rationale for that?

The sentence "calculate the coverage of the logical "fragment", not the actual two reads" makes more sense to me, as this is what I would refer to as "physical" coverage. That is, even if you have alignments for 50bp reads from the (paired) 5' ends of a 500bp molecule, you intend to increment coverage for the entire 500bp span of the molecule. Is this what you want? If so, I agree that this would be very useful, but it would require that BAM files be query sorted.

Sorry if I am being dense - I've reduced my caffeine consumption.

Aaron

Assaf Gordon

unread,
Apr 11, 2012, 6:06:08 PM4/11/12
to bedtools...@googlegroups.com
Hello Aaron,

Aaron Quinlan wrote, On 04/02/2012 09:28 PM:
>
> An alternative to awk sorcery is certainly welcome.
>

> The sentence "calculate the coverage of the logical "fragment", not the actual two reads" makes more sense to me, as this is what I would refer to as "physical" coverage. That is, even if you have alignments for 50bp reads from the (paired) 5' ends of a 500bp molecule, you intend to increment coverage for the entire 500bp span of the molecule. Is this what you want? If so, I agree that this would be very useful, but it would require that BAM files be query sorted.
>

Yes, this is exactly what I meant.

Just to ensure we're talking about the same thing, this is the current genome-coverage operation (with paired-end data):
http://cancan.cshl.edu/labmembers/gordon/files/bedtools_genome_cov1.png

And this (at the bottom) is what I want to change:
http://cancan.cshl.edu/labmembers/gordon/files/bedtools_genome_cov2.png


The BAM file will not need to be sorted by query names:
for each read there's the FLAG field which will indicate whether it's the first or second mate and whether the "other" mate mapped,
the RNEXT/RPOS field will indicate the position of the other mate,
and the TLEN will indicate the length of the entire fragment.


The logic will be:
1. If it's the first mate, continue as before
2. If it's the second mate ( FLAGS & ( 0x01 | 0x80 ) )
AND
it mapped to the same chromosome as the first mate ( RNEXT == RNAME )
AND
it's properly mapped (FLAGS & 0x02)
AND
it's reversed orientation from the first mate ( (FLAGS & (0x10|0x20))==0x10 || (FLAGS & (0x10|0x20))==0x20 )
AND
the orientation of the *first* mate is the one the user requested (with "-strand X")
THEN
add the read to the coverage count (despite it being the opposite orientation from "-strand X" parameter).
ELSE
ignore this read.


There implicit assumptions are:
1. we're dealing with pair-end data (and there are only to mates: first and last - no more).
2. no fusing genes and other crazy stuff where mates map to different chromosomes
3. if the second mate mapped but the first mate didn't - we ignore it (can be probably improved in the future).

One complication is reads that overlap (if the insert size was smaller then the run lengths, e.g. insert size of 150 for a PE-101 run) - the coverage count will have to take then RPOS,TLEN fields into account when calculating proper coverage.


Comments are welcomed,
-gordon

Assaf Gordon

unread,
Apr 12, 2012, 6:11:10 PM4/12/12
to bedtools...@googlegroups.com
> Aaron Quinlan wrote, On 04/02/2012 09:28 PM:
>>
>> The sentence "calculate the coverage of the logical "fragment", not
>> the actual two reads" makes more sense to me, as this is what I
>> would refer to as "physical" coverage.

initial implementation is here:
https://github.com/agordon/bedtools/commit/1acff80aa10094f93565d902a4a008e8d0f57bc8

and the difference between using "-pe" and not using it is demonstrated here:
http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=Cdnatata&hgS_otherUserSessionName=dm3_genome_cov_be_pe1

The first track is BAM,
the second and third are Plus/minus bedgraphs generated by genomeCoverageBed.
The fourth is the same BAm, with paired-end lines between reads,
and the fifth is the result of using "-pe" with genomeCoverageBed.

One problem is overlapping mates - some extra ugly code will be required to account for that.

Comments are welcomed,
-gordon

Aaron Quinlan

unread,
Apr 16, 2012, 11:14:27 AM4/16/12
to bedtools...@googlegroups.com
Hi Gordon,

Sorry for the delayed response, I just now had some time to this some thought. It turns out that what you are proposing is different than the "physical coverage" scenario is I was describing. What I was attempting to explain was, regardless of the orientations of a pair's ends, calculate the coverage based on the "full span" of the original molecule. For example, if it's 2x100 PE from a 513bp molecule, the coverage would be incremented for all 513bp, not 200bp.

Your examples and pseudocode make it very clear how you propose to calculate coverage with this new option. In your original email, you mentioned that you wanted to calculate coverage based on the entire fragment. However, your examples show that you are only counting coverage for the aligned ends of the fragment. For example, in the browser shot, were the entire fragment being counted, the depth for interval 830840-83062 would be 3, not 1. Th
This is fine - I just wanted to make sure you agree with my understanding.

What I can't get my head around is how the coverage displayed in track 5 (the result of the proposed changes) would be different than what would be produced by only sending propoerly-paired alignments to gCB:

samtools view -uf 2 aln.bam | gCB

The only difference that I can see is if we actually correct for overlapping ends - that would eliminate the extra coverage observed at ~830888.

Sorry if I am being dense, what am I missing?!

Aaron

Assaf Gordon

unread,
Apr 20, 2012, 11:18:31 AM4/20/12
to bedtools...@googlegroups.com
Hi Aaron!

(This thread is going slow :) but that's ok)

The "full span" is definitely a goal, but the problem is that in genomic samples, the full span is (most likely) just the entire region between the two sequenced fragments.
But when dealing with transcriptome/RNA-seq or similar samples, the "full span" is unknown. If we factor in multiple isoforms, then even when using a GTF file for a reference gene model, we can't be sure we're giving the use the correct span.

So for now, I would prefer to give users the minimum coverage that we know for sure is correct. Unless we want another command line option to differentiate between "genomic" and "transcriptom" libraries.

Regarding the difference:
When a BAM file contains properly paired reads, each of them (implicitly assuming they were sequenced in a standard illumina protocol) would have a different orientation (one would be on the plus strand, the other on the minus strand).

So if I use gCB on a properly-paired BAM files, *and* ask for only coverage of plus strand, I will only get one mate of each pair (unless I'm missing something).

-gordon

vince

unread,
Apr 20, 2012, 9:31:38 PM4/20/12
to bedtools-discuss
Hi,

I would vote for having an option to do coverage either way. For RNA
seq, actual base coverage is required, but I am starting to do paired
end reads with chip-seq, and filling in the uncovered part if reads
don't overlap would be important. Dealing with overlapping ends
properly would also be important so that overlap bases were not double
counted for either scenario.

Thanks,

Vince

Aaron Quinlan

unread,
Apr 22, 2012, 12:28:10 PM4/22/12
to bedtools...@googlegroups.com
Hey Gordon,

Thanks, I now see what you are getting at. I do think we a very specific parameter name to describe what this is doing, because it seems a bit specific to RNA-seq.

As you stated earlier, overlapping pairs are a problem, and it seems to me that one must sort the BAM by query in order to have the necessary (in order to compute the exact start and end of each alignment) access to both CIGARs. With unsorted BAMs, one has access to the ISIZE, but one would not know the end coordinate of the mate's alignment without it's CIGAR. Do you agree?

An alternative to query sorting is to maintain a hash for each pair. All necessary info (e.g., CIGAR) is retained for the first mate encountered until the second mate is seen. At this point, the coverage is tabulated, and the info for the pair is purged from the hash to minimize memory consumption.

What's your take?

Best,
Aaron

Aaron Quinlan

unread,
Apr 22, 2012, 12:33:09 PM4/22/12
to bedtools...@googlegroups.com
Hi Vince,

I think this is further support for the notion of distinguishing the rules for tabulating paired-end coverage for genomic DNA versus spliced RNA/cDNA.

Thanks for your input.
Aaron

Assaf Gordon

unread,
Feb 21, 2013, 10:18:08 AM2/21/13
to bedtools...@googlegroups.com, stef...@gmail.com
Hello Stefan,

stef...@gmail.com wrote, On 02/20/2013 09:58 PM:
>
> google helped me find this thread. I very much would like to be able
> to fill in the interval between two reads that are properly paired
> when generating a coverage (wig,bedgraph) file from ChIP-seq data. Is
> there any tool out there that will allow me to do this? I can't get
> bedtools to do it (2.16.2) Thank you,
>

Because you're dealing with ChIP-Seq (and presumably, DNA/genomic reads), you might be able to do the following (haven't tested it, though):

1. Use "bedtools bamtobed -bedpe" to convert paired-end reads from a BAM file into a non-standard BED-PE file.
Each line will contain two reads (the paired reads).

2. Use an AWK script to take the start of one read and the end of the other, and construct a new standard BED file.
This assumes the reads represent genomic/DNA data (ie no splicing consideration).
The strand of the combined fragment is the same as the first mate.

3. Use "bedtools genomecov" to calculate the coverage on the new BED file.

HTH,
-gordon

p.j.a...@googlemail.com

unread,
Oct 30, 2014, 10:13:28 AM10/30/14
to bedtools...@googlegroups.com, nandy...@gmail.com
Hello all,

I was just looking over the current documentation to try and find out how "bedtools genomecov" handled paired reads, and since it didn't answer my questions I searched the mailing list and found this thread.

Presumably each read and its coverage is still calculated in isolation, rather than the full span of a properly mapped read pair?

Thanks,

Peter

On Thursday, 28 February 2013 08:28:38 UTC, nandy...@gmail.com wrote:
Hello, 

I just write you to know if finally genomeCoverage option is updated ?
I mean can I use this option to estimate the physical coverage (it takes into account the insert and not only reads).

advance thanks, 
Siva

Aaron Quinlan

unread,
Oct 30, 2014, 10:23:17 AM10/30/14
to bedtools...@googlegroups.com
Hi Peter,

That is correct, this option is not yet implemented. The main challenge is that to do this properly for both concordant and discordant pairs, one needs to have the BAM file sorted by query name (in order to work out the exact alignment start and end positions for the fragment). However, the core algorithm in genomecov requires that the data are sorted by coordinate.  As such, what we have suggested in the past is that folks first convert the pairs to BED format (representing the fragment’s coordinates) and then tossing that into genomecov.

- Aaron





--
You received this message because you are subscribed to the Google Groups "bedtools-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bedtools-discu...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Peter Cock

unread,
Oct 30, 2014, 11:14:19 AM10/30/14
to bedtools...@googlegroups.com
On Thu, Oct 30, 2014 at 2:23 PM, Aaron Quinlan <aaronq...@gmail.com> wrote:
> Hi Peter,
>
> That is correct, this option is not yet implemented. The main challenge is
> that to do this properly for both concordant and discordant pairs, one needs
> to have the BAM file sorted by query name (in order to work out the exact
> alignment start and end positions for the fragment). However, the core
> algorithm in genomecov requires that the data are sorted by coordinate. As
> such, what we have suggested in the past is that folks first convert the
> pairs to BED format (representing the fragment’s coordinates) and then
> tossing that into genomecov.
>
> - Aaron

Thank you Aaron.

Could you add a note to the documentation page?
http://bedtools.readthedocs.org/en/latest/content/tools/genomecov.html

If this does get added, another beautifully clear figure would be needed :)

Given a coordinate sorted BAM file, I think that a very good approximation
to the "pair aware" coverage is possible for properly paired reads (where
RNAME matches RNEXT), with a single pass though the BAM file. The
only problem I see right now would be the special case of overlapping
reads with CIGAR D/N operators in the overlap.

I will try to explain in words, but could sketch this if that helps?

I would count the implied spanning sequence while looking at the
"left most" read of any nice pair. At this point you know:

- left most read's start = POS
- left most reads'end = POS + ALEN (from CIGAR)
- right most read's start = PNEXT
- right most read's end = Unknown without partner's CIGAR
(so leave this until read the partner read later in the file)

My goal is to calculate the coverage from POS to PNEXT (looking
at the CIGAR of this read) now, and later on add the coverage from
the partner read from PNEXT onward.

Start by incrementing the coverage of the left-most read as normal.

In the non-overlapping case, you then increment the implicit coverage
between the reads from POS + ALEN up to PNEXT (and later on will
process the partner read).

In the overlapping case, to avoid double counting, you decrement
the coverage from PNEXT to POS + ALEN in anticipation of adding
the partner read later. There would be some corner cases with
CIGAR D/N in the overlap region - is this the stumbling block?
(Or have I missed some other consideration?)

Delaying dealing with the implied span until you reach the "right most"
aligned read would require caching information from the "left most" read,
which I understand rejecting for performance reasons.

Regards,

Peter

Aaron Quinlan

unread,
Oct 30, 2014, 11:18:40 AM10/30/14
to bedtools...@googlegroups.com
Will do Peter.  The algorithm you propose will work very nicely for "proper" pairs, but could perform very badly for BAM files with extensive structural variation and chromosomal rearrangement (e.g., solid tumors). In the latter case, one must store the info one end until the other end is seen. When rearrangment is rampant, this can lead to high (i.e., unpredictable) memory usage. Now, were this approach limited to proper pairs, it would be trivial.  That is an option to pursue.

Peter Cock

unread,
Oct 30, 2014, 11:56:15 AM10/30/14
to bedtools...@googlegroups.com
On Thu, Oct 30, 2014 at 3:18 PM, Aaron Quinlan <aaronq...@gmail.com> wrote:
> Will do Peter. The algorithm you propose will work very nicely for "proper"
> pairs, but could perform very badly for BAM files with extensive structural
> variation and chromosomal rearrangement (e.g., solid tumors). In the latter
> case, one must store the info one end until the other end is seen. When
> rearrangment is rampant, this can lead to high (i.e., unpredictable) memory
> usage. Now, were this approach limited to proper pairs, it would be trivial.
> That is an option to pursue.

Yes - I don't have that kind of data, but my attitude would be that for non
proper-pairs there isn't a well defined spanned region, so we can't do
any clever coverage calculation.

Thanks,

Peter

Aaron Quinlan

unread,
Oct 30, 2014, 11:58:17 AM10/30/14
to bedtools...@googlegroups.com
Good point, that is true for interchromosomal "improper" pairs, but arguable not for intrachromosomal "improper" pairs that suggest large deletions, inversions, or tandem duplications. Nonetheless, your point is a good one, and it makes sense to support this with the caveat that only proper pairs will be considered.

Thanks for brining this up!
Aaron


On Thursday, October 30, 2014, Peter Cock <p.j.a...@googlemail.com> wrote:
Reply all
Reply to author
Forward
0 new messages