Strange behaviour of samfile.fetch()/samfile.pileup() for 1-base regions

364 views
Skip to first unread message

Christian Schudoma

unread,
Jun 12, 2012, 1:12:08 PM6/12/12
to Pysam User group
I am looking at SNPs in a genome, i.e. I want to obtain all reads that
cover a "region" of size 1 (the position of the SNP). Doing so, I
observed the following:

1. Calling samfile.fetch() or samfile.pileup() with
reference=<chromosome id>, start=<snp-pos>, end=<snp-pos> will yield
exactly 0 results. *)

2. Calling samfile.fetch() or samfile.pileup() with
reference=<chromosome id>, start=<snp-pos> will yield a list of reads
that appear to map all over the contig.

3. The desired result can be obtained by calling samfile.fetch() or
samfile.pileup() with reference=<chromosome id>, start=<snp-pos>
end=<snp-pos>+1.

This is quite confusing and not to be gathered from the documentation.


*) From prior testing I know that there are mapping reads, that is why
0 results are unexpected.

Daniel Brami

unread,
Oct 31, 2012, 8:52:22 PM10/31/12
to pysam-us...@googlegroups.com
Thanks for the post - kept me from pulling out my hair

Andreas

unread,
Nov 1, 2012, 10:20:14 AM11/1/12
to pysam-us...@googlegroups.com
Hi Christian,

pysam follows python conventions, so

reference=<chromosome id>, start=<snp-pos>, end=<snp-pos> is a
region of size 0, not of 1, just like "abcd"[1:1] will give you an empty string.

Best wishes,
Andreas

Christian Schudoma

unread,
Nov 1, 2012, 11:18:59 AM11/1/12
to pysam-us...@googlegroups.com
Ok, that makes sense in some way (both Pythonic and in "SNP-speak"). Nevertheless, would be nice to see this in the documentation. And this does not explain the behaviour described in 2.) (returning all reads, when end is left out)

[quote from documentation of "fetch"]

fetch aligned reads in a region using 0-based indexing. The region is specified by reference, start and end. Alternatively, a samtools region string can be supplied.

Without reference or region all mapped reads will be fetched.
[/quote]

Although I can infer from these statements that a missing end-position could somehow qualify as an absent region, it is not documented clearly enough. This is especially the case since end (or start or reference) has a default value of None, leading to the assumption that either value can be omitted. I think that the function should complain about the missing end-position (if start is present) and then terminate.

Finally, something I just noticed. Please have another look at the quoted part of the fetch/pileup documentation. In the first statement "reference" is a part of "region", in the second it is not. (I'd guess it should be be region or region string or something among those lines). It is these small things that add to the confusion when trying to work with the documentation.


Cheers,
Christian

Daniel Brami

unread,
Nov 1, 2012, 1:44:29 PM11/1/12
to pysam-us...@googlegroups.com
Hi Christian,

I am hoping not to hi-jack this conversation but it turns out I am still not getting what I need from code I found on this group, and what you are trying to do i think is similar to what I am trying to achieve.

I need the qual values and the sequence of aligned reads ONLY at the snp location, not the entire sequence that aligns at a SNP location.
Any ideas?





On Tuesday, June 12, 2012 10:12:08 AM UTC-7, Christian Schudoma wrote:

Ehsan Tabari

unread,
Nov 19, 2012, 2:42:24 PM11/19/12
to pysam-us...@googlegroups.com
I can see how it makes sense.
but still when I do:
     samfile.pileup(chromosome_name , 100, 101)

I get a list of pileupcolums whose positions range from 36 to 163

Is this what I should expect?

Christian Schudoma

unread,
Nov 20, 2012, 8:09:19 AM11/20/12
to pysam-us...@googlegroups.com
Addendum:

If I understand correctly, version 3 (reference=<chromosome id>, start=<snp-pos>
end=<snp-pos>+1) will yield a list of all reads aligning to that position.

It seems that you have to iterate over all columns, each time checking whether it is your desired column and only process that one.

To get only the reads that actually align (i.e. no deletion) to your desired position, you could use something like:

for col in bamfile.pileup(region='Chr1:100-101'):
 if col.pos == my_desired_position:
   reads_at_desired_position = [read for read in col.pileups if read.is_del == 0]

Given that the command-line mpileup does not require such detours (i.e., you ask for "Chr1:100-101", you get 2 columns and nothing more (you still get skips, but these are easily filtered)), wouldn't it make sense for pysam to provide the same functionality?

Cheers
Christian

Andreas Heger

unread,
Nov 20, 2012, 8:46:03 AM11/20/12
to pysam-us...@googlegroups.com
Dear Christian, Ehan,

pysam.pileup wraps the old behaviour of the C-samtools pileup
engine, which had exactly this behaviour.

I am worried that it might break other people's
code, though I assume that almost everybody is only interested in
the columns within the region.

Unless anyone objects, I will change it for the next release.

Best wishes,
Andreas

Peter Cock

unread,
Nov 21, 2012, 2:17:49 PM11/21/12
to pysam-us...@googlegroups.com
On Tue, Nov 20, 2012 at 1:46 PM, Andreas Heger <andrea...@gmail.com> wrote:
> Dear Christian, Ehan,
>
> pysam.pileup wraps the old behaviour of the C-samtools pileup
> engine, which had exactly this behaviour.
>
> I am worried that it might break other people's
> code, though I assume that almost everybody is only interested in
> the columns within the region.
>
> Unless anyone objects, I will change it for the next release.
>
> Best wishes,
> Andreas

Breaking existing scripts with this change seems unwise -
and the old behaviour can be useful depending on why you
are extracting reads in a particular region.

How about a new function, or an optional argument to the
old function (defaulting to the old behaviour)?

Peter

Andreas Heger

unread,
Nov 22, 2012, 10:28:58 AM11/22/12
to pysam-us...@googlegroups.com
Hi all,

I have added a truncate option to Samfile.pileup.

The default behaviour is unchanged.

Best wishes,
Andreas
Reply all
Reply to author
Forward
0 new messages