ArgumentError in read_fastq for NCBI data from SRA

39 views
Skip to first unread message

Dan Nething

unread,
Aug 5, 2015, 12:29:02 PM8/5/15
to biopieces
Bear with me, I'm mostly a wet bench biochemist. Note that I'm using a virualized version of Ubuntu 14.04 as my platform.

I'm trying to use biopieces to clean up and interpret some RNAseq (small RNAs mainly) data I pulled off of the NCBI SRA database. An example entry is SRR070293, acquired here --> http://sra.dnanexus.com/runs/SRR070293. All the results I'm using came from Illumina systems, model unspecified.

These downloads come in the .sra file type, but the SRA Toolkit provides a method for converting these into FASTQ. I used this to convert the example entry into SRR070293.fastq. An example looks like this:

@SRR070293.126 HWI-EAS382_30FC7AAXX:7:1:1617:871 length=36
AACTGAGTGGCATAAATCTTTGATCGTATGCCGTCT
+SRR070293.126 HWI-EAS382_30FC7AAXX:7:1:1617:871 length=36
IIIIIIIIIIIIIIIIIIIII=II6IIII3I";I3I

Using the biopieces HowTo guide on cleaning up NGS data, I started stepping through the instructions, only to find that the read_fastq biopiece universally rejects all of the FASTQ files I feed it, with the following output:

dan@dan-VirtualBox:/media/sf_SRA/FASTQ/S$ read_fastq -i SRA967633.fastq
/home/dan/biopieces/code_ruby/lib/maasha/biopieces.rb:537:in `block in options_check_files': File not readable: 'SRA967633.fastq' (ArgumentError)
from /home/dan/biopieces/code_ruby/lib/maasha/biopieces.rb:535:in `each'
from /home/dan/biopieces/code_ruby/lib/maasha/biopieces.rb:535:in `options_check_files'
from /home/dan/biopieces/code_ruby/lib/maasha/biopieces.rb:493:in `block in options_check'
from /home/dan/biopieces/code_ruby/lib/maasha/biopieces.rb:488:in `each'
from /home/dan/biopieces/code_ruby/lib/maasha/biopieces.rb:488:in `options_check'
from /home/dan/biopieces/code_ruby/lib/maasha/biopieces.rb:396:in `options_parse'
from /home/dan/biopieces/code_ruby/lib/maasha/biopieces.rb:75:in `options_parse'
from /home/dan/biopieces/bp_bin/read_fastq:44:in `<main>'

I looked through the google group, and I noted that you've previously suggested problems with the -e setting, so I've tried using the settings '33', '64', and illumina1.8. I've also tried it with the default. As far as I can tell, the quality scores in the FASTQ data range from ! to I, which should match Phred33, but all of the conditions I've fed it give exactly the same error.

I also did some troubleshooting. The SRA database adds some additional description to the @ and + lines, which I tried removing, so that the input looked like the native output:

@HWI-EAS382_30FC7AAXX:7:1:1617:871
AACTGAGTGGCATAAATCTTTGATCGTATGCCGTCT
+HWI-EAS382_30FC7AAXX:7:1:1617:871
IIIIIIIIIIIIIIIIIIIII=II6IIII3I";I3I

This returned the same error.

I also directly converted the FASTQ files into FASTA files using the FASTX toolkit, and used that output as input for the read_fasta biopiece. This worked exactly like the manual said it did.

So to that end, is this a PEBCAK error, or is something wrong with the read_fastq biopiece? Or is it something else?

I'll also add that when I run bp_test, the output shows 

Biopieces tested: 86  Tests run: 293  OK: 277  FAIL: 16  WARNING: 0  Time: 78 secs

The failures were in the following biopieces.

plot_distribution
uclust_seq
usearch_seq
blast_seq
blast_seq_pair

I don't think these are related, but I'm not sure if it's important. I can provide the full diagnostic if needed.

Thanks,
Dan

Martin Asser Hansen

unread,
Aug 5, 2015, 2:21:59 PM8/5/15
to biop...@googlegroups.com
The error message:

options_check_files': File not readable: 'SRA967633.fastq' 

indicates that the file is not readable. My guess it is a permission problem.

Try the command:

head SRA967633.fastq



Cheers,


Martin

--
You received this message because you are subscribed to the Google Groups "biopieces" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biopieces+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Dan Nething

unread,
Aug 5, 2015, 2:39:25 PM8/5/15
to biop...@googlegroups.com
Hi Martin,

I appreciate the quick reply.

You actually just solved it for me with that question, it was user error on my part. I was typing SRA967633.fastq instead of the actual file name SRR967633.fastq

The SRA archive uses the SR* style nomenclature based on (P)roject, (E)xperiment, (S)ample, and (R)un. I've been looking at it for so long I was blind to the obvious. 

When I tried the head command with the correct input, the header read was:

dan@dan-VirtualBox:/media/sf_SRA/FASTQ/S$ head SRR967633.fastq
@SRR967633.1 HWI-ST226:215:D0A2UACXX:6:1101:1076:2096 length=100
CNACTACCGGGTTGATCCTGCCATACTCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAGGAAAAAATAAAGAAAAAGCATGACTACAAAGGTAC
+SRR967633.1 HWI-ST226:215:D0A2UACXX:6:1101:1076:2096 length=100
;#0@2@8@((@@1>@4@@@@5(=3?9)):??=???????????????????????#############################################
@SRR967633.2 HWI-ST226:215:D0A2UACXX:6:1101:1062:2116 length=100
CANCACGACTCTCGGCAACGGATATACTCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAACAAGAGGGAGGCACAAAAAGAGCACGTGCAAGCGG
+SRR967633.2 HWI-ST226:215:D0A2UACXX:6:1101:1062:2116 length=100
>>#2<A?=<0)@=C=7AA=A<A=0:??=<:?88=9=??=-->A=777A=>BBAAAAAA##########################################
@SRR967633.3 HWI-ST226:215:D0A2UACXX:6:1101:1182:2224 length=100
CAACTACCAGTGAGGAACGAACACTACTCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAATGAAAAAGGAAAAAAAAGAATAACGACGGAGGAGA

I repeated with read_fastq and it worked fine. Sorry for wasting your time!

While I have you I did want to mention that as a complete beginner, it took me about two weeks to get a Linux system set up and installed with a (mostly) functional Biopieces installation. I kept extensive notes of all the errors I ran into, as well as their fixes and workarounds. Would it be useful if I sent you or posted a copy for any other complete newbies to look at if they are having trouble? 

Thanks,

Dan


--
You received this message because you are subscribed to a topic in the Google Groups "biopieces" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/biopieces/XiWUFA-WGQE/unsubscribe.
To unsubscribe from this group and all its topics, send an email to biopieces+...@googlegroups.com.

Martin Asser Hansen

unread,
Aug 5, 2015, 2:42:33 PM8/5/15
to biop...@googlegroups.com
You are most welcome to post you notes to this mailing list. Use a separate mail thread though.


Cheers,


Martin
Reply all
Reply to author
Forward
0 new messages