greping a fastq file

203 views
Skip to first unread message

sanks

unread,
Apr 17, 2015, 12:16:40 PM4/17/15
to unix-and-perl-...@googlegroups.com
I have a huge fasta file and I want to analyze reads with only a particular sequence say "AAGTTGATAACGGACTAGCCTTATTTT" in them. When I do 

grep "AAGTTGATAACGGACTAGCCTTATTTT" file.fastq

I get the reads but lose other lines of the fastq format. Can you suggest an easy fix to retain the fastq format of the output. Thanks


Keith Bradnam

unread,
Apr 17, 2015, 12:20:56 PM4/17/15
to unix-and-perl-...@googlegroups.com

On Friday, April 17, 2015 at 9:16:40 AM UTC-7, sanks wrote:
I have a huge fasta file and I want to analyze reads with only a particular sequence say "AAGTTGATAACGGACTAGCCTTATTTT" in them. When I do 

Just to double check, this is a FASTQ file or a FASTA file? You refer to both file formats.
 
grep "AAGTTGATAACGGACTAGCCTTATTTT" file.fastq

I get the reads but lose other lines of the fastq format. Can you suggest an easy fix to retain the fastq format of the output. Thanks

Assuming that it is a FASTQ format, then grep is doing exactly what you ask for, only showing lines that match your sequence of interest. You could write a simple Perl script to loop over the file, reading four lines at a time, and only print out output if the the sequence line matches the desired target sequence.

Regards,

Keith

Goutham atla

unread,
Apr 17, 2015, 12:26:05 PM4/17/15
to unix-and-perl-...@googlegroups.com
There is fast-grep which comes in handy to use grep on fast files. :)

--
You received this message because you are subscribed to the Google Groups "Unix and Perl for Biologists" group.
To unsubscribe from this group and stop receiving emails from it, send an email to unix-and-perl-for-bi...@googlegroups.com.
To post to this group, send email to unix-and-perl-...@googlegroups.com.
Visit this group at http://groups.google.com/group/unix-and-perl-for-biologists.
For more options, visit https://groups.google.com/d/optout.



--
Goutham Atla

Goutham atla

unread,
Apr 17, 2015, 12:26:42 PM4/17/15
to unix-and-perl-...@googlegroups.com
fastq-grep.
--
Goutham Atla

Steve Mount

unread,
Apr 17, 2015, 1:31:49 PM4/17/15
to unix-and-perl-...@googlegroups.com
From 
$ man grep

       -A NUM, --after-context=NUM
              Print NUM  lines  of  trailing  context  after  matching  lines.
              Places  a  line  containing  --  between  contiguous  groups  of
              matches.

       -B NUM, --before-context=NUM
              Print  NUM  lines  of  leading  context  before  matching lines.
              Places  a  line  containing  --  between  contiguous  groups  of
              matches.

I've have done this with fastq files.

--------------------------------------------------- 
Stephen M. Mount
Center for Bioinformatics and Computational Biology
Dept. of Cell Biology and Molecular Genetics
2208 H. J. Patterson Hall
University of Maryland
College Park, MD    20742-5815

Phone      301-405-6934
alt. ph.     301-405-9904
URL        www.SteveMount.org
------------------------------------------------------
Any commercial or marketing email sent to this address (including information about lab supplies, textbooks, conferences and some funding opportunities) will be tagged as spam. See SteveMount.org for alternatives. 

Mario Roy

unread,
Apr 17, 2015, 6:11:56 PM4/17/15
to unix-and-perl-...@googlegroups.com

The following is a code snippet using Perl MCE retaining the other lines of the fast format. The file is process by the record separator RS => "\n@" MCE option.

## usage: perl grep_fastq.pl pattern file.fastq

use strict;
use warnings;

use MCE::Flow;

die "Not enough arguments given\n" if @ARGV < 2;

my $regex = shift;
$regex = qr/$regex/;

sub user_func {
   my ($mce, $slurp_ref, $chunk_id) = @_;
   if (${ $slurp_ref } =~ $regex) {
      $mce->print(${ $slurp_ref });
   }
}

for my $filename (@ARGV) {
  mce_flow_f {
     RS => "\n@", chunk_size => 1, use_slurpio => 1
  }, \&user_func, $filename;
}


Anthony Doran

unread,
Apr 28, 2015, 9:27:18 AM4/28/15
to unix-and-perl-...@googlegroups.com
Probably a bit late, but you could also do this with a perl one-liner, though unless you're very familiar with perl it might be best use one of the above suggestions. Also, for this I'm not sure if perl would be faster than grep anyway. The below has not been tested (don't have access to my fastq files today) but should work.

perl -ne ' $c = <>; $d = <>; $e = <>; print "$_$c$d$e" if($c =~ m/YOUR_PATTERN/);  '   file.fastq  >  matching_seqs.fastq

Reply all
Reply to author
Forward
0 new messages