[Bioperl-l] randomizing fastq sequences

shalu sharma

unread,

Feb 7, 2011, 5:07:33 PM2/7/11

to biop...@lists.open-bio.org

Hi,
i am trying to test one program for which i need to change order of
sequences in a fastq file.
My fastq file contains about 50,000 sequences.
Is there any script that can do this task?

Thanks
Shalu
_______________________________________________
Bioperl-l mailing list
Biop...@lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l

simon andrews (BI)

unread,

Feb 8, 2011, 3:41:10 AM2/8/11

to biop...@lists.open-bio.org

On 7 Feb 2011, at 22:07, shalu sharma wrote:

> Hi,
> i am trying to test one program for which i need to change order of
> sequences in a fastq file.
> My fastq file contains about 50,000 sequences.
> Is there any script that can do this task?

Since FastQ is supported in SeqIO you could do something like (untested):

#!/usr/bin/perl
use warnings;
use strict;
use List::Util 'shuffle';
use Bio::SeqIO;

my @seqs;

my $in = Bio::SeqIO->new(-file => 'your_intput.fastq',
-format => 'Fastq');

while (my $seq = $in -> next_seq()) {
push @seqs,$seq;
}

@seqs = shuffle(@seqs);

my $out = Bio::SeqIO->new(-file => '>your_output.fastq',
-format => 'Fastq');

foreach my $seq (@seqs) {
$out->write_seq($seq);
}

## End

This has the disadvantage that it will hold all of the sequences in memory whilst shuffling, but I don't think there's an easy way around that.

Simon.

Frank Schwach

unread,

Feb 8, 2011, 4:08:37 AM2/8/11

to biop...@lists.open-bio.org

If memory is an issue then I guess you could create a file of just the
sequence IDs (one per line), then shuffle those (using List::Util like
Simon demonstrated). In the end you would substitute the IDs for the
whole fastq entry again, which you can do without reading an entire file
into memory (might be bit slow but that probably doesn't
matter)
Frank

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

Roy Chaudhuri

unread,

Feb 8, 2011, 6:31:15 AM2/8/11

to sharmas...@gmail.com, biop...@lists.open-bio.org

TMTOWTDI, maybe also use the Tie::File module?

Something like:

#!/usr/bin/perl
use warnings FATAL=>qw(all);
use Modern::Perl;
use Tie::File;
use Fcntl qw(O_RDONLY);
use List::Util qw(shuffle);
my @fastq;
tie @fastq, 'Tie::File', $ARGV[0], mode=>O_RDONLY or die $!;
say join "\n", @fastq[4*$_..4*$_+3] for shuffle 0..$#fastq/4;

Cheers,
Roy.

Frank Schwach

unread,

Feb 8, 2011, 6:57:12 AM2/8/11

to Roy Chaudhuri, biop...@lists.open-bio.org, sharmas...@gmail.com

nice one - but if I understand it correctly it relies on there being
exactly 4 lines for each record. This is probably the case but it would
be a good idea to double-check the fastq file in question, just to make
sure.

Frank

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

Roy Chaudhuri

unread,

Feb 8, 2011, 7:09:39 AM2/8/11

to Frank Schwach, biop...@lists.open-bio.org, sharmas...@gmail.com

Sorry, I should have included that caveat.

Cook, Malcolm

unread,

Feb 8, 2011, 10:12:47 AM2/8/11

to shalu sharma, biop...@lists.open-bio.org

Gotta chime in....

If
you're working with fastq files
are working in unix and have the `shuf` command available

I recommand you to install cdbyank http://sourceforge.net/projects/cdbfasta/ which provides for indexing fasta and fastq files and providing random access to them

Index the fastq, then extract the IDs with cdyank, pipe them through `shuf` and then through cdyank again to pull out the sequences.

Like this example, which uses a test fastq from my local install of bioperl:

> cd ~/local/src/bioperl-live/t/data/fastq/
> cdbfasta -Q example.fastq
3 entries from file example.fastq were indexed in file example.fastq.cidx
> cdbyank -l example.fastq.cidx | shuf | cdbyank example.fastq.cidx > shuf_example.fastq

There would be issues if your IDs are not unique.

Malcolm Cook
Stowers Institute for Medical Research - Bioinformatics
Kansas City, Missouri USA

Chris Fields

unread,

Feb 8, 2011, 10:53:27 AM2/8/11

to Cook, Malcolm, biop...@lists.open-bio.org, shalu sharma

Just to note, I have been thinking about wrapping this for fast indexing and retrieval of FASTQ for bioperl (this came up in a prior thread, with the same suggestion from Malcolm IIRC).

chris

shalu sharma

unread,

Feb 8, 2011, 11:48:44 AM2/8/11

to Chris Fields, biop...@lists.open-bio.org, Cook, Malcolm

Hi All,
Thanks for all the suggestions.
@Simon Andrew and Roy:
Your method worked perfect but now memory is the issue.
Now i have to select 50K fastq sequences from a illumina data (around 70 mil
reads) randomly , so is there again any module that can select random
sequences from fastq file?

I can still use same methods on 50k sequences but getting 50k from huge data
set is a problem.
Also at some point i need to shuffle the fastq reads (order of
nucleotides).

I am really sorry for asking lot of things , i know i am really bad in
handling fastq sequences.
i would really appreciate your suggestions.

Thanks
Shalu

Chris Fields

unread,

Feb 8, 2011, 12:29:56 PM2/8/11

to shalu sharma, biop...@lists.open-bio.org, Cook, Malcolm

Shalu,

(Note: this isn't a Perl solution): I do think this problem has been solved somewhat in R/BioC if you have it installed, in the ShortRead package (see 'Sampler-class' in the ShortRead docs).

I think using perl and the current BioPerl Bio::Index::Fastq indexing scheme for FASTQ will be problematic/slow for very large files with millions of sequences (i.e. pretty much anything that is rolling out of modern day sequencing pipelines), as the current indexing implementation uses a very simple indexing scheme using DB_File, originally designed years ago for much smaller sequencing samples. Think: Sanger sequencing.

Of course, this is with the caveat that I haven't tested this out personally, but I recall some complaints about this in the past (Jason?). There was an effort to deal with this at one point with AnyDBM_File (which allows SQLite now) but I don't think it progressed very far, primarily b/c there simply hasn't been enough demand. Most users seem to sample randomly from BAM files instead, which are conveniently accessible via samtools/Picard/bamtools/etc (bamtools has a 'random' option for this purpose).

chris

simon andrews (BI)

unread,

Feb 9, 2011, 3:35:34 AM2/9/11

to biop...@lists.open-bio.org

On 8 Feb 2011, at 16:48, shalu sharma wrote:

> Hi All,
> Thanks for all the suggestions.
> @Simon Andrew and Roy:
> Your method worked perfect but now memory is the issue.
> Now i have to select 50K fastq sequences from a illumina data (around 70 mil
> reads) randomly , so is there again any module that can select random
> sequences from fastq file?

The simple approach to this is to do it in two passes.

In the first pass you simply find out how many fastq entries you have in your file. You then randomly select 50K numbers from 1..[number of fastq seqs in file].

In the second pass you pull out any sequences at an index position you randomly selected. If you don't mind them being in the same order then you can just write them out immediately and use virtually no memory, or you could put them in an array and shuffle them before writing (using the same memory as the 50K experiment).

If you're going to be doing a lot of shuffling on the same dataset then it would be worth looking into doing a proper indexing of your file, as others have suggested, but if you're only going to do this once per dataset then it might not save you any time.

> Also at some point i need to shuffle the fastq reads (order of
> nucleotides).

Same basic process - extract the sequence, split it into an array, shuffle the array, reassign the sequence. If you want to keep the original quality scores associated with the same bases you'll need to shuffle indices and then reassemble both the shuffled sequences and qualities at the same time.

Simon.

Reply all

Reply to author

Forward