pysamstats_coverage_binned

108 views
Skip to first unread message

Xiaofei Wang

unread,
Nov 16, 2016, 3:07:15 PM11/16/16
to Pysam User group
Dear folks,

How is this statistics "coverage_binned" defined for pysamstats?

Here is the screenshot for my alignment as below. I used this code 

$ pysamstats --type coverage_binned --chromosome 7 --start 3091859 --end 3091899  --truncate --window-size=6 --window-offset=0 -f ../ref/ref.fa my.sorted.bam

chrom pos gc reads_all reads_pp

7 3091859 83 4 4

7 3091865 50 0 0

7 3091871 83 0 0

7 3091877 67 2 2

7 3091883 67 0 0

7 3091889 67 0 0

7 3091895 83 0 0



Why the results is displayed above, rather than I am expecting like this as below, which are the reads depth for the the column of "pos", for the column of "reads_all"? 

What is this "coverage_binned", Binned statistics types (each row has statistics over reads aligned starting within a genome window) ,will report for the result?

pos reads_all
3091859 4
3091865 4
3091871 4
3091877 4
3091883 6
3091889 6
3091895 6

Thanks a lot!

Looking forward to response!

Best,

Xiaofei





Alistair Miles

unread,
Nov 25, 2016, 10:53:38 AM11/25/16
to pysam-us...@googlegroups.com
Hi Xiaofei,

The "coverage_binned statistic" is based on a windowed algorithm. If you set a window size of, e.g., 300, then pysamstats will start with positions 1-300 on your first chromosome, find all reads where the start of the alignment is within this window, then compute various statistics over that set of reads. E.g., the "reads_all" column is simply the total number of reads in the window. The "reads_pp" column is the number of reads aligned in a proper pair. The "gc" column is the percent G+C content for the reference sequence within the window. After computing all values for this window, it will move onto the next window, i.e., from positions 301-600. The value reported in the "pos" column is position at the centre of the window.

I hope that's helpful, please feel free to ask further questions.

Best wishes,
Alistair

--
You received this message because you are subscribed to the Google Groups "Pysam User group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pysam-user-group+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health <http://cggh.org>
The Wellcome Trust Centre for Human Genetics
Roosevelt Drive
Oxford
OX3 7BN
United Kingdom
Email: alim...@googlemail.com
Web: http://purl.org/net/aliman
Twitter: https://twitter.com/alimanfoo
Tel: +44 (0)1865 287721

Xiaofei Wang

unread,
Dec 5, 2016, 10:48:04 AM12/5/16
to Pysam User group
Hi Alistair,

Thank you so much for your reply! 

Yes, I think my understand is same as your illustration. But, I don't know why the output is not like what I expected. For example, in the above example, the region is from 3091859  to 3091899 and the window size is 6bp as below. Please see the IGV alignment screenshot, the "reads_all" should be 4 for the second window (2nd line in the output) rather than 0 which reported in the output. 

--start 3091859 --end 3091899  --truncate --window-size=6


I posted the output here again:

$ pysamstats --type coverage_binned --chromosome 7 --start 3091859 --end 3091899  --truncate --window-size=6 --window-offset=0 -f ../ref/ref.fa my.sorted.bam

chrom pos gc reads_all reads_pp

7 3091859 83 4 4

7 3091865 50 0 0

7 3091871 83 0 0

7 3091877 67 2 2

7 3091883 67 0 0

7 3091889 67 0 0

7 3091895 83 0 0


Thank you so much!

Best,

Xiaofei
To unsubscribe from this group and stop receiving emails from it, send an email to pysam-user-gro...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Alistair Miles

unread,
Dec 15, 2016, 10:54:40 AM12/15/16
to pysam-us...@googlegroups.com
Hi Xiaofei,

I'm sorry I don't have time to investigate fully, but here is a quick thought. The pysamstats coverage_binned (or any of the ..._binned functions) will assign each read to one bin, and will not count the same read again in another bin even if the read's alignment overlaps that bin. The reads processed for each bin are the reads whose alignment begins within that bin. So although depth of coverage in your second bin might be 4 (if you measured it via a pileup), if those 4 reads started their alignment in the first bin then they will only be counted there.

If you want fine-grained coverage data you might as well use the simple "coverage" statistic type, which works using a pileup rather than iterating through reads. The ..._binned functions were really only designed for the case where you want to handle data binned in larger bin sizes (e.g., 300bp or 1kbp), i.e., you want to sacrifice fine granularity for speed. If the bin size is smaller than the read size then you could get surprising results.

Hth,
Alistair

To unsubscribe from this group and stop receiving emails from it, send an email to pysam-user-group+unsubscribe@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Xiaofei Wang

unread,
Dec 19, 2016, 1:39:50 PM12/19/16
to Pysam User group
Yes, I got it. It helps to clarify. Thanks a lot!

Best,

Xiaofei

Xiaofei Wang

unread,
Apr 20, 2017, 3:59:51 PM4/20/17
to Pysam User group
Hi Alistair, 

Hope you can still reach this thread.

I want to use  coverage_binned from Python. 

As above, here is the code from terminal: 

$ pysamstats --type coverage_binned --chromosome 7 --start 3091859 --end 3091899  --truncate --window-size=6 --window-offset=0 -f ../ref/ref.fa my.sorted.bam.

But, how can I use it from Python? I used it as below, but gave me an error as below. Why did this happen? The error showed I gave one positional arguments but I gave two, start and end? How should I change the code? Thank you so much! Best,

Xiaofei


>>> for rec in pysamstats.stat_coverage_binned("G3_noP_H3_rep2_markDup.sorted.bam", chrom="7", start=3091859, end=3094899,truncate=True, f="../ref/*.fa"):

...     print rec['chrom'], rec['pos'], rec['reads_all'], rec['reads_pp']

... 

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

  File "pysamstats.pyx", line 3050, in pysamstats.stat_coverage_binned (pysamstats.c:29644)

TypeError: stat_coverage_binned() takes exactly 2 positional arguments (1 given)

Florian Finkernagel

unread,
Apr 21, 2017, 3:12:41 AM4/21/17
to Pysam User group
Dear Xiaofei, 

without knowing anything about pysamstats I see from it's source that stat_coverage_binned is defined as follows:
def stat_coverage_binned(alignmentfile, fafile, **kwargs):
 
Therefore you need to pass the fasta file as the second parameter by position, not by name.

So basically
>> for rec in pysamstats.stat_coverage_binned("G3_noP_H3_rep2_markDup.sorted.bam", "../ref/*.fa", chrom="7", start=3091859, end=3094899,truncate=True):


should work.

So long,
Florian

Xiaofei Wang

unread,
Apr 21, 2017, 10:03:11 AM4/21/17
to Pysam User group
Yes, you are right. It works. Thank you so much!

Best,

Xiaofei
Reply all
Reply to author
Forward
0 new messages