When calculating coverage of a BAM file from a Paired-end library, and using "-strand" option,
the program ignores the fact that the strand of the second mate is opposite to the first mate.
This issue has been brought up by David Citaro some time ago, but "awk sorcery" was deemed the recommended work-around :)
I'm considering adding an option for "genomecov" to correctly calculate coverage of paired-end BAM files (correctly in this case means: calculate the coverage of the logical "fragment", not the actual two reads).
Before I start coding it, any comments and suggestions would be appropriated, as can get tricky with corner-cases.
My idea is:
in "genomeCovergeBed.cpp", in "BedGenomeCoverage::CoverageBam()", add the following logic:
if ( user specified "-strand"
AND
read is mapped
AND
read is paired-end
AND
read is second mate
AND
first read mapped properly)
then
add coverage to the OPPOSITE strand of this read.
(because the coverage should be accounted towards to the strand of the first mate).
one complication:
what happens if the mates overlap ? need to take "BamAlignment::InsertSize" into account.
Comments are welcomed,
-gordon
An alternative to awk sorcery is certainly welcome. I must admit, however, that I don't understand what is meant by the statement "because the coverage should be accounted towards the strand of the first mate". Do you mean to say that you want to count coverage for each end of the pair, but you always want the coverage to be counted for the strand observed for the first sequenced end of the pair? If so, why? What is the biological rationale for that?
The sentence "calculate the coverage of the logical "fragment", not the actual two reads" makes more sense to me, as this is what I would refer to as "physical" coverage. That is, even if you have alignments for 50bp reads from the (paired) 5' ends of a 500bp molecule, you intend to increment coverage for the entire 500bp span of the molecule. Is this what you want? If so, I agree that this would be very useful, but it would require that BAM files be query sorted.
Sorry if I am being dense - I've reduced my caffeine consumption.
Aaron
Aaron Quinlan wrote, On 04/02/2012 09:28 PM:
>
> An alternative to awk sorcery is certainly welcome.
>
> The sentence "calculate the coverage of the logical "fragment", not the actual two reads" makes more sense to me, as this is what I would refer to as "physical" coverage. That is, even if you have alignments for 50bp reads from the (paired) 5' ends of a 500bp molecule, you intend to increment coverage for the entire 500bp span of the molecule. Is this what you want? If so, I agree that this would be very useful, but it would require that BAM files be query sorted.
>
Yes, this is exactly what I meant.
Just to ensure we're talking about the same thing, this is the current genome-coverage operation (with paired-end data):
http://cancan.cshl.edu/labmembers/gordon/files/bedtools_genome_cov1.png
And this (at the bottom) is what I want to change:
http://cancan.cshl.edu/labmembers/gordon/files/bedtools_genome_cov2.png
The BAM file will not need to be sorted by query names:
for each read there's the FLAG field which will indicate whether it's the first or second mate and whether the "other" mate mapped,
the RNEXT/RPOS field will indicate the position of the other mate,
and the TLEN will indicate the length of the entire fragment.
The logic will be:
1. If it's the first mate, continue as before
2. If it's the second mate ( FLAGS & ( 0x01 | 0x80 ) )
AND
it mapped to the same chromosome as the first mate ( RNEXT == RNAME )
AND
it's properly mapped (FLAGS & 0x02)
AND
it's reversed orientation from the first mate ( (FLAGS & (0x10|0x20))==0x10 || (FLAGS & (0x10|0x20))==0x20 )
AND
the orientation of the *first* mate is the one the user requested (with "-strand X")
THEN
add the read to the coverage count (despite it being the opposite orientation from "-strand X" parameter).
ELSE
ignore this read.
There implicit assumptions are:
1. we're dealing with pair-end data (and there are only to mates: first and last - no more).
2. no fusing genes and other crazy stuff where mates map to different chromosomes
3. if the second mate mapped but the first mate didn't - we ignore it (can be probably improved in the future).
One complication is reads that overlap (if the insert size was smaller then the run lengths, e.g. insert size of 150 for a PE-101 run) - the coverage count will have to take then RPOS,TLEN fields into account when calculating proper coverage.
Comments are welcomed,
-gordon
initial implementation is here:
https://github.com/agordon/bedtools/commit/1acff80aa10094f93565d902a4a008e8d0f57bc8
and the difference between using "-pe" and not using it is demonstrated here:
http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=Cdnatata&hgS_otherUserSessionName=dm3_genome_cov_be_pe1
The first track is BAM,
the second and third are Plus/minus bedgraphs generated by genomeCoverageBed.
The fourth is the same BAm, with paired-end lines between reads,
and the fifth is the result of using "-pe" with genomeCoverageBed.
One problem is overlapping mates - some extra ugly code will be required to account for that.
Comments are welcomed,
-gordon
Sorry for the delayed response, I just now had some time to this some thought. It turns out that what you are proposing is different than the "physical coverage" scenario is I was describing. What I was attempting to explain was, regardless of the orientations of a pair's ends, calculate the coverage based on the "full span" of the original molecule. For example, if it's 2x100 PE from a 513bp molecule, the coverage would be incremented for all 513bp, not 200bp.
Your examples and pseudocode make it very clear how you propose to calculate coverage with this new option. In your original email, you mentioned that you wanted to calculate coverage based on the entire fragment. However, your examples show that you are only counting coverage for the aligned ends of the fragment. For example, in the browser shot, were the entire fragment being counted, the depth for interval 830840-83062 would be 3, not 1. Th
This is fine - I just wanted to make sure you agree with my understanding.
What I can't get my head around is how the coverage displayed in track 5 (the result of the proposed changes) would be different than what would be produced by only sending propoerly-paired alignments to gCB:
samtools view -uf 2 aln.bam | gCB
The only difference that I can see is if we actually correct for overlapping ends - that would eliminate the extra coverage observed at ~830888.
Sorry if I am being dense, what am I missing?!
Aaron
(This thread is going slow :) but that's ok)
The "full span" is definitely a goal, but the problem is that in genomic samples, the full span is (most likely) just the entire region between the two sequenced fragments.
But when dealing with transcriptome/RNA-seq or similar samples, the "full span" is unknown. If we factor in multiple isoforms, then even when using a GTF file for a reference gene model, we can't be sure we're giving the use the correct span.
So for now, I would prefer to give users the minimum coverage that we know for sure is correct. Unless we want another command line option to differentiate between "genomic" and "transcriptom" libraries.
Regarding the difference:
When a BAM file contains properly paired reads, each of them (implicitly assuming they were sequenced in a standard illumina protocol) would have a different orientation (one would be on the plus strand, the other on the minus strand).
So if I use gCB on a properly-paired BAM files, *and* ask for only coverage of plus strand, I will only get one mate of each pair (unless I'm missing something).
-gordon
Thanks, I now see what you are getting at. I do think we a very specific parameter name to describe what this is doing, because it seems a bit specific to RNA-seq.
As you stated earlier, overlapping pairs are a problem, and it seems to me that one must sort the BAM by query in order to have the necessary (in order to compute the exact start and end of each alignment) access to both CIGARs. With unsorted BAMs, one has access to the ISIZE, but one would not know the end coordinate of the mate's alignment without it's CIGAR. Do you agree?
An alternative to query sorting is to maintain a hash for each pair. All necessary info (e.g., CIGAR) is retained for the first mate encountered until the second mate is seen. At this point, the coverage is tabulated, and the info for the pair is purged from the hash to minimize memory consumption.
What's your take?
Best,
Aaron
I think this is further support for the notion of distinguishing the rules for tabulating paired-end coverage for genomic DNA versus spliced RNA/cDNA.
Thanks for your input.
Aaron
Hello,I just write you to know if finally genomeCoverage option is updated ?I mean can I use this option to estimate the physical coverage (it takes into account the insert and not only reads).advance thanks,Siva
--
You received this message because you are subscribed to the Google Groups "bedtools-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bedtools-discu...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.