Sampling random reads from BAM files

16,201 views
Skip to first unread message

Alpesh Querer

unread,
Feb 1, 2012, 1:42:55 PM2/1/12
to bedtools-discuss
Hi Aaron,

Is there a way to sample say 100 random reads from a BAM file with
Bedtools or any other software?
One way is to convert from BAM to SAM and then pick up random lines,
but I would like to apply bedtool commands and find
coverage for specific features thereafter for the random samples.

Thanks,
Al

tom carroll

unread,
Feb 1, 2012, 1:56:21 PM2/1/12
to bedtools...@googlegroups.com
Hi,

Samtools has a function to randomly sample lines from a bam file where s is probabilty of keeping a read.

"Usage:   samtools view [options] <in.bam>|<in.sam> [region1 [...]]
Options: 
..
         -s FLOAT fraction of templates to subsample; integer part as seed [-1]
...
"

samtools view -s 0.1 bamfile.bam > Ten_Percent_Of_bamfile.bam

Picard has a similar tool --  DownsampleSam

best 

Tom

Aaron Quinlan

unread,
Feb 1, 2012, 1:59:45 PM2/1/12
to bedtools...@googlegroups.com
It's always nice to learn something new.  Thanks much, Tom.

Alpesh Querer

unread,
Feb 2, 2012, 1:12:55 PM2/2/12
to bedtools-discuss
Thanks Tom. Thats exactly what I wanted, the -s is not on the samtools
webpage.

On Feb 1, 12:59 pm, Aaron Quinlan <aaronquin...@gmail.com> wrote:
> It's always nice to learn something new.  Thanks much, Tom.
>
> On Feb 1, 2012, at 1:56 PM, tom carroll wrote:
>
>
>
>
>
>
>
> > Hi,
>
> > Samtools has a function to randomly sample lines from a bam file where s is probabilty of keeping a read.
>
> > "Usage:   samtools view [options] <in.bam>|<in.sam> [region1 [...]]
> > Options:
> > ..
> >          -s FLOAT fraction of templates to subsample; integer part as seed [-1]
> > ...
> > "
>
> > samtools view -s 0.1 bamfile.bam > Ten_Percent_Of_bamfile.bam
>
> > Picard has a similar tool --  DownsampleSam
>
> > best
>
> > Tom
>

Aaron Quinlan

unread,
Jul 4, 2012, 10:11:06 AM7/4/12
to bedtools...@googlegroups.com
Hi Dan,

As I can tell from the thread, there was never a response, but in principle, sampling a large file without replacement is far easier.  What are you looking for?  If you want to take a random sample that is relatively small with respect to the number of records in the file, reservoir sampling is your best best (see: http://en.wikipedia.org/wiki/Reservoir_sampling).

Best,
Aaron



On Jul 3, 2012, at 9:37 PM, dfe...@gmail.com wrote:

Hi, I just saw this thread, I have a short Q, is the sample with or without replacement?

Dan

tom carroll

unread,
Jul 4, 2012, 12:25:56 PM7/4/12
to bedtools...@googlegroups.com
Hi Dan,

My understanding is that both samtools and Picard use a probability for each read to be selected, so no replacement.
I have heard third-hand that the seed part of samtools isn't/wasn't working fully so i would use Picard.

best
Tom

madbe...@gmail.com

unread,
Jul 17, 2013, 4:54:42 AM7/17/13
to bedtools...@googlegroups.com
Hi,
Digging up this discussion, but I just wanted to point out that the samtools view command outputs a sam file, not a bam file !
So it should look like :

samtools view -s 0.1 bamfile.bam > Ten_Percent_Of_bamfile.sam

tom carroll

unread,
Jul 17, 2013, 1:40:54 PM7/17/13
to bedtools...@googlegroups.com
Wow..that was a long break in the discussion.
My bad, should be
samtools view -sb 0.1 bamfile.bam > Ten_Percent_Of_bamfile.bam

Thanks for pointing that out
tom


--
You received this message because you are subscribed to the Google Groups "bedtools-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bedtools-discu...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

pelo...@gmail.com

unread,
Sep 4, 2013, 2:29:16 PM9/4/13
to bedtools...@googlegroups.com
I had to modify your slightly:

samtools view -s 0.1 -b bamfile.bam > Ten_Percent_Of_bamfile.bam

ebet...@gmail.com

unread,
Apr 24, 2014, 2:22:09 PM4/24/14
to bedtools...@googlegroups.com
Hi guys,

   I realize this is an old thread, but I was wondering if anyone could clarify whether samtools view -s maintains read pairs intact?

Best,

Elizabeth

Aaron Quinlan

unread,
Apr 24, 2014, 5:20:14 PM4/24/14
to bedtools...@googlegroups.com
To my knowledge, it does not:

samtools view -b -s 0.01 NA18152.bam \
   | samtools view - \
   | cut -f 1 \
   | sort \
   | uniq -c \
   | head -10

      1 NA18152-SRR007381.1180700
      1 NA18152-SRR007381.1180701
      1 NA18152-SRR007381.1180702
      1 NA18152-SRR007381.1180703
      1 NA18152-SRR007381.1180704
      1 NA18152-SRR007381.1180705
      1 NA18152-SRR007381.1180706
      1 NA18152-SRR007381.1180707
      1 NA18152-SRR007381.1180708
      1 NA18152-SRR007381.1180709


--
You received this message because you are subscribed to the Google Groups "bedtools-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bedtools-discu...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Elizabeth Henaff

unread,
Apr 26, 2014, 12:01:29 PM4/26/14
to bedtools...@googlegroups.com
Thanks Aaron!

Do you know of any tool that does?

Regards,

Elizabeth


--
You received this message because you are subscribed to a topic in the Google Groups "bedtools-discuss" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/bedtools-discuss/gf0KeAJN2Cw/unsubscribe.
To unsubscribe from this group and all its topics, send an email to bedtools-discu...@googlegroups.com.

Aaron Quinlan

unread,
Apr 26, 2014, 12:13:35 PM4/26/14
to bedtools...@googlegroups.com, Neil Edward (nek3d) Kindlon
I suspect there must be such a tool in the Picard suite, but I do not know for sure. We really should add such for this in the bedtools "sample" tool as well.

- Aaron




John Marshall

unread,
Apr 26, 2014, 2:59:27 PM4/26/14
to bedtools...@googlegroups.com
On 24 Apr 2014, at 22:20, Aaron Quinlan <aaronq...@gmail.com> wrote:
> On Apr 24, 2014, at 2:22 PM, ebet...@gmail.com wrote:
>> I was wondering if anyone could clarify whether samtools view -s maintains read pairs intact?

Samtools's view -s option selects reads according to scores based on (a hash calculated from) their read names. Hence the in or out decision comes out the same way for all reads with the same name, so read pairs are indeed kept intact.

[Aaron wrote:]
> To my knowledge, it does not:
>
> samtools view -b -s 0.01 NA18152.bam \
> | samtools view - \
> | cut -f 1 \
> | sort \
> | uniq -c \
> | head -10

I suspect you've found a bunch of singleton reads in your NA18152.bam, though I don't know why the alphabetically-first reads would all be singletons.

Using a similar test to yours but without the head -10,

samtools view -s 0.001 foo.bam | cut -f 1 | sort | uniq -c | cut -c1-7 | sort -n | uniq -c

I get numbers like

880 1
737008 2

which is a similar proportion of singleton reads as in the original foo.bam.

John

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

Aaron Quinlan

unread,
Apr 26, 2014, 3:04:09 PM4/26/14
to bedtools...@googlegroups.com
Thanks so much for correcting this, John. It is great to know that pairs are sampled together. 

- Aaron





dro...@gmail.com

unread,
Jun 21, 2014, 7:41:45 PM6/21/14
to bedtools...@googlegroups.com
Hi all,
Can someone explain why this cmd working:
samtools view -s 0.5 1.bam 1:50000000-90000000
and this are not:
samtools view -s 0.95 1.bam 1:50000000-90000000

Thanks,
Pap

Aaron Quinlan

unread,
Jun 21, 2014, 8:14:24 PM6/21/14
to bedtools...@googlegroups.com
Hi Pap,

I am not an expert on samtools but I can try.  But could you be more precise about what you mean by the second command not working.  Is it seg faulting, running out of memory, or producing a sample that is not 95% of the original file?

- Aaron





dro...@gmail.com

unread,
Jun 21, 2014, 8:19:20 PM6/21/14
to bedtools...@googlegroups.com
Thanks Aaron!
This is my output for example:
samtools view 1.bam 1:50000000-50001000 | wc -l
385

samtools view -s 0.5 1.bam 1:50000000-50001000 | wc -l
192

samtools view -s 0.95 1.bam 1:50000000-50001000 | wc -l
0

The 2 first cases are working..
You can see that the third isnt seg fault.. but i also dont know what is it..

Aaron Quinlan

unread,
Jun 21, 2014, 8:23:49 PM6/21/14
to bedtools...@googlegroups.com
Hmm, I can't reproduce it with version 0.1.19-44428cd.  I would email the samtools list and provide the example you gave here as well as the version you are using.

dro...@gmail.com

unread,
Jun 21, 2014, 8:38:03 PM6/21/14
to bedtools...@googlegroups.com
Your version is newer than mine..  mine is 0.1.18 (r982:295)
I sent them also an email..
Thanks!

karl....@gmail.com

unread,
Dec 15, 2014, 6:05:36 PM12/15/14
to bedtools...@googlegroups.com, dro...@gmail.com
Today I've been trying to do a particular kind of downsampling, so I came across this discussion of samtools view -s.  As John Marshall said "Samtools's view -s option selects reads according to scores based on (a hash calculated from) their read names.  Hence the in or out decision comes out the same way for all reads with the same name, so read pairs are indeed kept intact. "  

This has implications for your -s 0.5 and -s 0.95 issue as well. Because the sampling is random-uniform, it's like a coin-toss with 95% chance of success to discard a read(pair). Therefore, on a set of 385 reads you could rarely have zero pass that filter.  On a genome-wide scale, it will approximate 5% retained, but for a given small region, could vary such that no reads are kept. 

cris...@gmail.com

unread,
Feb 3, 2015, 5:09:39 PM2/3/15
to bedtools...@googlegroups.com, dro...@gmail.com, karl....@gmail.com
This is a great thread, especially since "view -s" isn't on the samtools manuals.
However, it seems that the actual reads retained is deterministic, ie, whatever the sampling algorithm uses, it has a fixed seed.
I can't seem to figure out a way to get two samples (bootstrap) each containing 2 randomly selected 10% of reads. If I first select 10% then run again with 11% as arguments, the two files only differ by 1% of reads.
Any suggestions?
Thanks!
AM

John Marshall

unread,
Feb 3, 2015, 6:48:01 PM2/3/15
to bedtools...@googlegroups.com
On 3 Feb 2015, at 22:09, cris...@gmail.com wrote:
> This is a great thread, especially since "view -s" isn't on the samtools manuals.

The view -s option has been described in the manual since version 0.1.19; that version also added the ability to set the seed. If you can't find it in your manual, perhaps you have an even older version of samtools (in which case, the following will not work)?

> However, it seems that the actual reads retained is deterministic, ie, whatever the sampling algorithm uses, it has a fixed seed.
> I can't seem to figure out a way to get two samples (bootstrap) each containing 2 randomly selected 10% of reads. If I first select 10% then run again with 11% as arguments, the two files only differ by 1% of reads.

Current versions of samtools (ab)use the integer part of the -s value to set the seed:

-s FLOAT Integer part is used to seed the random number generator. Part after the decimal point sets the fraction of templates/pairs to subsample

So two runs with e.g. view -s 123.1 / view -s 456.1 should select two distinct randomly-selected 10% subsets.

BTW there is also a samtools-help mailing list that may come in handy:

https://lists.sourceforge.net/lists/listinfo/samtools-help

Cheers,

Ann Mongan

unread,
Feb 3, 2015, 6:53:53 PM2/3/15
to bedtools...@googlegroups.com
Thanks, John. Of course I realized that as soon as I sent the message :)
I'm hoping someone would add this feature to the current samtools manual; it's quite helpful.
Cheers,
Ann


--
You received this message because you are subscribed to a topic in the Google Groups "bedtools-discuss" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/bedtools-discuss/gf0KeAJN2Cw/unsubscribe.
To unsubscribe from this group and all its topics, send an email to bedtools-discu...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages