Thanks
Shalu
_______________________________________________
Bioperl-l mailing list
Biop...@lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
> Hi,
> I am trying to test a program, for which I need to change the order of
> sequences in a fastq file.
> My fastq file contains about 50,000 sequences.
> Is there a script that can do this task?
Since FastQ is supported in SeqIO, you could do something like this (untested):
#!/usr/bin/perl
use warnings;
use strict;
use List::Util 'shuffle';
use Bio::SeqIO;

# Read every record into memory
my @seqs;
my $in = Bio::SeqIO->new(-file   => 'your_input.fastq',
                         -format => 'Fastq');
while (my $seq = $in->next_seq()) {
    push @seqs, $seq;
}

# Randomise the order, then write the records back out
@seqs = shuffle(@seqs);
my $out = Bio::SeqIO->new(-file   => '>your_output.fastq',
                          -format => 'Fastq');
foreach my $seq (@seqs) {
    $out->write_seq($seq);
}
## End
This has the disadvantage that it will hold all of the sequences in memory whilst shuffling, but I don't think there's an easy way around that.
Simon.
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
Something like:
#!/usr/bin/perl
use warnings FATAL => qw(all);
use Modern::Perl;
use Tie::File;
use Fcntl qw(O_RDONLY);
use List::Util qw(shuffle);

# Map the file's lines onto an array without slurping the whole file into memory
my @fastq;
tie @fastq, 'Tie::File', $ARGV[0], mode => O_RDONLY or die $!;

# Shuffle the record indices, then print each 4-line record
say join "\n", @fastq[4*$_ .. 4*$_+3] for shuffle 0 .. $#fastq/4;
Cheers,
Roy.
Frank
If you're working with fastq files on unix and have the `shuf` command available, I recommend installing cdbyank (http://sourceforge.net/projects/cdbfasta/), which indexes fasta and fastq files and provides random access to them.
Index the fastq, then extract the IDs with cdbyank, pipe them through `shuf`, and then through cdbyank again to pull out the sequences.
Like this example, which uses a test fastq from my local install of bioperl:
> cd ~/local/src/bioperl-live/t/data/fastq/
> cdbfasta -Q example.fastq
3 entries from file example.fastq were indexed in file example.fastq.cidx
> cdbyank -l example.fastq.cidx | shuf | cdbyank example.fastq.cidx > shuf_example.fastq
There would be issues if your IDs are not unique.
Malcolm Cook
Stowers Institute for Medical Research - Bioinformatics
Kansas City, Missouri USA
I can still use the same methods on 50k sequences, but getting 50k out of a huge
data set is the problem.
Also, at some point I need to shuffle the fastq reads (i.e. the order of
nucleotides within each read).
I am really sorry for asking so many things; I know I am really bad at
handling fastq sequences.
I would really appreciate your suggestions.
Thanks
Shalu
(Note: this isn't a Perl solution): I do think this problem has been solved somewhat in R/BioC if you have it installed, in the ShortRead package (see 'Sampler-class' in the ShortRead docs).
I think using Perl and the current BioPerl Bio::Index::Fastq indexing for FASTQ will be problematic/slow for very large files with millions of sequences (i.e. pretty much anything rolling out of modern-day sequencing pipelines): the current implementation uses a very simple DB_File-based scheme, originally designed years ago for much smaller sequencing samples. Think: Sanger sequencing.
Of course, this is with the caveat that I haven't tested this out personally, but I recall some complaints about this in the past (Jason?). There was an effort to deal with this at one point with AnyDBM_File (which allows SQLite now), but I don't think it progressed very far, primarily because there simply hasn't been enough demand. Most users seem to sample randomly from BAM files instead, which are conveniently accessible via samtools/Picard/bamtools/etc (bamtools has a 'random' option for this purpose).
chris
> Hi All,
> Thanks for all the suggestions.
> @Simon Andrew and Roy:
> Your method worked perfectly, but now memory is the issue.
> Now I have to select 50K fastq sequences at random from an Illumina data set
> (around 70 million reads), so is there a module that can select random
> sequences from a fastq file?
The simple approach to this is to do it in two passes.
In the first pass you simply find out how many fastq entries you have in your file. You then randomly select 50K numbers from 1..[number of fastq seqs in file].
In the second pass you pull out any sequence whose index position you selected. If you don't mind them staying in their original file order then you can write them out immediately and use virtually no memory, or you can put them in an array and shuffle them before writing (using about the same memory as the 50K experiment).
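The two passes above can be sketched in shell with `shuf` and awk. Everything here is illustrative: the file names, the demo input, and the sample size of 3 (which would be 50000 for the real data) are placeholders, and it assumes plain 4-line fastq records with no wrapped sequence lines:

```shell
# Demo input: 10 fabricated 4-line fastq records (stand-in for the real file)
for i in $(seq 0 9); do printf '@read%s\nACGT\n+\nIIII\n' "$i"; done > input.fastq

# Pass 1: count the records and pick 3 record indices at random
# (for a real 70M-read file this would be -n 50000)
total=$(( $(wc -l < input.fastq) / 4 ))
shuf -i 0-$(( total - 1 )) -n 3 > picked.txt

# Pass 2: stream the file once, printing only records whose index was picked
awk 'NR==FNR { want[$1]; next }              # first file: remember picked indices
     { if (int((FNR-1)/4) in want) print }   # second file: 0-based record index
    ' picked.txt input.fastq > sample.fastq

wc -l < sample.fastq    # 12 lines = 3 records of 4 lines each
```

The output keeps the original file order; shuffle `sample.fastq` afterwards (e.g. with the Tie::File one-liner above) if the order matters.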
If you're going to be doing a lot of shuffling on the same dataset then it would be worth looking into doing a proper indexing of your file, as others have suggested, but if you're only going to do this once per dataset then it might not save you any time.
> Also at some point i need to shuffle the fastq reads (order of
> nucleotides).
Same basic process: extract the sequence, split it into an array, shuffle the array, reassign the sequence. If you want to keep the original quality scores associated with the same bases, you'll need to shuffle a list of indices instead and then reassemble both the shuffled sequence and the qualities from it at the same time.
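That index-shuffle can be sketched in awk (again assuming plain 4-line records; the input record here is fabricated demo data, and the Fisher-Yates shuffle is one way to permute the index array):

```shell
# Demo record (stand-in for a real fastq file)
printf '@read1\nACGTACGT\n+\nIIHHGGFF\n' > one.fastq

# Shuffle each read's bases, applying the SAME permutation to its qualities
awk 'BEGIN { srand() }
     NR%4==1 { hdr  = $0; next }   # header line
     NR%4==2 { seq  = $0; next }   # sequence line
     NR%4==3 { plus = $0; next }   # "+" line
     NR%4==0 {                     # quality line: now rebuild the record
       n = length(seq)
       for (i = 1; i <= n; i++) idx[i] = i
       # Fisher-Yates shuffle of the index array
       for (i = n; i > 1; i--) {
         j = int(rand() * i) + 1
         t = idx[i]; idx[i] = idx[j]; idx[j] = t
       }
       s = ""; q = ""
       for (i = 1; i <= n; i++) {
         s = s substr(seq, idx[i], 1)   # shuffled base
         q = q substr($0,  idx[i], 1)   # its original quality, same position
       }
       print hdr; print s; print plus; print q
     }' one.fastq > one_shuffled.fastq
```

Each base still carries the quality it had before the shuffle, since both strings are rebuilt from the same permuted index list.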
Simon.