greping a fastq file

sanks

unread,

Apr 17, 2015, 12:16:40 PM4/17/15

to unix-and-perl-...@googlegroups.com

I have a huge fasta file and I want to analyze reads with only a particular sequence say "AAGTTGATAACGGACTAGCCTTATTTT" in them. When I do

grep "AAGTTGATAACGGACTAGCCTTATTTT" file.fastq

I get the reads but lose other lines of the fastq format. Can you suggest an easy fix to retain the fastq format of the output. Thanks

Keith Bradnam

unread,

Apr 17, 2015, 12:20:56 PM4/17/15

to unix-and-perl-...@googlegroups.com

On Friday, April 17, 2015 at 9:16:40 AM UTC-7, sanks wrote:

I have a huge fasta file and I want to analyze reads with only a particular sequence say "AAGTTGATAACGGACTAGCCTTATTTT" in them. When I do

Just to double check, this is a FASTQ file or a FASTA file? You refer to both file formats.

grep "AAGTTGATAACGGACTAGCCTTATTTT" file.fastq

I get the reads but lose other lines of the fastq format. Can you suggest an easy fix to retain the fastq format of the output. Thanks

Assuming that it is a FASTQ format, then grep is doing exactly what you ask for, only showing lines that match your sequence of interest. You could write a simple Perl script to loop over the file, reading four lines at a time, and only print out output if the the sequence line matches the desired target sequence.

Regards,

Keith

Goutham atla

unread,

Apr 17, 2015, 12:26:05 PM4/17/15

to unix-and-perl-...@googlegroups.com

There is fast-grep which comes in handy to use grep on fast files. :)

--
You received this message because you are subscribed to the Google Groups "Unix and Perl for Biologists" group.
To unsubscribe from this group and stop receiving emails from it, send an email to unix-and-perl-for-bi...@googlegroups.com.
To post to this group, send email to unix-and-perl-...@googlegroups.com.
Visit this group at http://groups.google.com/group/unix-and-perl-for-biologists.
For more options, visit https://groups.google.com/d/optout.

--

Goutham Atla

Goutham atla

unread,

Apr 17, 2015, 12:26:42 PM4/17/15

to unix-and-perl-...@googlegroups.com

fastq-grep.

--

Goutham Atla

Steve Mount

unread,

Apr 17, 2015, 1:31:49 PM4/17/15

to unix-and-perl-...@googlegroups.com

From

$ man grep

-A NUM, --after-context=NUM

Print NUM lines of trailing context after matching lines.

Places a line containing -- between contiguous groups of

matches.

-B NUM, --before-context=NUM

Print NUM lines of leading context before matching lines.

Places a line containing -- between contiguous groups of

matches.

I've have done this with fastq files.

---------------------------------------------------

Stephen M. Mount
Center for Bioinformatics and Computational Biology
Dept. of Cell Biology and Molecular Genetics
2208 H. J. Patterson Hall
University of Maryland
College Park, MD 20742-5815

Phone 301-405-6934
alt. ph. 301-405-9904
URL www.SteveMount.org
------------------------------------------------------

Any commercial or marketing email sent to this address (including information about lab supplies, textbooks, conferences and some funding opportunities) will be tagged as spam. See SteveMount.org for alternatives.

Mario Roy

unread,

Apr 17, 2015, 6:11:56 PM4/17/15

to unix-and-perl-...@googlegroups.com

The following is a code snippet using Perl MCE retaining the other lines of the fast format. The file is process by the record separator RS => "\n@" MCE option.

## usage: perl grep_fastq.pl pattern file.fastq

use strict;
use warnings;

use MCE::Flow;

die "Not enough arguments given\n" if @ARGV < 2;

my $regex = shift;
$regex = qr/$regex/;

sub user_func {
   my ($mce, $slurp_ref, $chunk_id) = @_;
   if (${ $slurp_ref } =~ $regex) {
      $mce->print(${ $slurp_ref });
   }
}

for my $filename (@ARGV) {
  mce_flow_f {
     RS => "\n@", chunk_size => 1, use_slurpio => 1
  }, \&user_func, $filename;
}

Anthony Doran

unread,

Apr 28, 2015, 9:27:18 AM4/28/15

to unix-and-perl-...@googlegroups.com

Probably a bit late, but you could also do this with a perl one-liner, though unless you're very familiar with perl it might be best use one of the above suggestions. Also, for this I'm not sure if perl would be faster than grep anyway. The below has not been tested (don't have access to my fastq files today) but should work.

perl -ne ' $c = <>; $d = <>; $e = <>; print "$_$c$d$e" if($c =~ m/YOUR_PATTERN/); ' file.fastq > matching_seqs.fastq

Reply all

Reply to author

Forward