ArgumentError in read_fastq for NCBI data from SRA

Dan Nething

unread,

Aug 5, 2015, 12:29:02 PM8/5/15

to biopieces

Bear with me, I'm mostly a wet bench biochemist. Note that I'm using a virualized version of Ubuntu 14.04 as my platform.

I'm trying to use biopieces to clean up and interpret some RNAseq (small RNAs mainly) data I pulled off of the NCBI SRA database. An example entry is SRR070293, acquired here --> http://sra.dnanexus.com/runs/SRR070293. All the results I'm using came from Illumina systems, model unspecified.

These downloads come in the .sra file type, but the SRA Toolkit provides a method for converting these into FASTQ. I used this to convert the example entry into SRR070293.fastq. An example looks like this:

@SRR070293.126 HWI-EAS382_30FC7AAXX:7:1:1617:871 length=36

AACTGAGTGGCATAAATCTTTGATCGTATGCCGTCT

+SRR070293.126 HWI-EAS382_30FC7AAXX:7:1:1617:871 length=36

IIIIIIIIIIIIIIIIIIIII=II6IIII3I";I3I

Using the biopieces HowTo guide on cleaning up NGS data, I started stepping through the instructions, only to find that the read_fastq biopiece universally rejects all of the FASTQ files I feed it, with the following output:

dan@dan-VirtualBox:/media/sf_SRA/FASTQ/S$ read_fastq -i SRA967633.fastq

/home/dan/biopieces/code_ruby/lib/maasha/biopieces.rb:537:in `block in options_check_files': File not readable: 'SRA967633.fastq' (ArgumentError)

from /home/dan/biopieces/code_ruby/lib/maasha/biopieces.rb:535:in `each'

from /home/dan/biopieces/code_ruby/lib/maasha/biopieces.rb:535:in `options_check_files'

from /home/dan/biopieces/code_ruby/lib/maasha/biopieces.rb:493:in `block in options_check'

from /home/dan/biopieces/code_ruby/lib/maasha/biopieces.rb:488:in `each'

from /home/dan/biopieces/code_ruby/lib/maasha/biopieces.rb:488:in `options_check'

from /home/dan/biopieces/code_ruby/lib/maasha/biopieces.rb:396:in `options_parse'

from /home/dan/biopieces/code_ruby/lib/maasha/biopieces.rb:75:in `options_parse'

from /home/dan/biopieces/bp_bin/read_fastq:44:in `<main>'

I looked through the google group, and I noted that you've previously suggested problems with the -e setting, so I've tried using the settings '33', '64', and illumina1.8. I've also tried it with the default. As far as I can tell, the quality scores in the FASTQ data range from ! to I, which should match Phred33, but all of the conditions I've fed it give exactly the same error.

I also did some troubleshooting. The SRA database adds some additional description to the @ and + lines, which I tried removing, so that the input looked like the native output:

@HWI-EAS382_30FC7AAXX:7:1:1617:871

AACTGAGTGGCATAAATCTTTGATCGTATGCCGTCT

+HWI-EAS382_30FC7AAXX:7:1:1617:871

IIIIIIIIIIIIIIIIIIIII=II6IIII3I";I3I

This returned the same error.

I also directly converted the FASTQ files into FASTA files using the FASTX toolkit, and used that output as input for the read_fasta biopiece. This worked exactly like the manual said it did.

So to that end, is this a PEBCAK error, or is something wrong with the read_fastq biopiece? Or is it something else?

I'll also add that when I run bp_test, the output shows

Biopieces tested: 86 Tests run: 293 OK: 277 FAIL: 16 WARNING: 0 Time: 78 secs

The failures were in the following biopieces.

plot_distribution

uclust_seq

usearch_seq

blast_seq

blast_seq_pair

I don't think these are related, but I'm not sure if it's important. I can provide the full diagnostic if needed.

Thanks,

Dan

Martin Asser Hansen

unread,

Aug 5, 2015, 2:21:59 PM8/5/15

to biop...@googlegroups.com

The error message:

options_check_files': File not readable: 'SRA967633.fastq'

indicates that the file is not readable. My guess it is a permission problem.

Try the command:

head SRA967633.fastq

Cheers,

Martin

--
You received this message because you are subscribed to the Google Groups "biopieces" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biopieces+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Dan Nething

unread,

Aug 5, 2015, 2:39:25 PM8/5/15

to biop...@googlegroups.com

Hi Martin,

I appreciate the quick reply.

You actually just solved it for me with that question, it was user error on my part. I was typing SRA967633.fastq instead of the actual file name SRR967633.fastq

The SRA archive uses the SR* style nomenclature based on (P)roject, (E)xperiment, (S)ample, and (R)un. I've been looking at it for so long I was blind to the obvious.

When I tried the head command with the correct input, the header read was:

dan@dan-VirtualBox:/media/sf_SRA/FASTQ/S$ head SRR967633.fastq

@SRR967633.1 HWI-ST226:215:D0A2UACXX:6:1101:1076:2096 length=100

CNACTACCGGGTTGATCCTGCCATACTCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAGGAAAAAATAAAGAAAAAGCATGACTACAAAGGTAC

+SRR967633.1 HWI-ST226:215:D0A2UACXX:6:1101:1076:2096 length=100

;#0@2@8@((@@1>@4@@@@5(=3?9)):??=???????????????????????#############################################

@SRR967633.2 HWI-ST226:215:D0A2UACXX:6:1101:1062:2116 length=100

CANCACGACTCTCGGCAACGGATATACTCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAACAAGAGGGAGGCACAAAAAGAGCACGTGCAAGCGG

+SRR967633.2 HWI-ST226:215:D0A2UACXX:6:1101:1062:2116 length=100

>>#2<A?=<0)@=C=7AA=A<A=0:??=<:?88=9=??=-->A=777A=>BBAAAAAA##########################################

@SRR967633.3 HWI-ST226:215:D0A2UACXX:6:1101:1182:2224 length=100

CAACTACCAGTGAGGAACGAACACTACTCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAATGAAAAAGGAAAAAAAAGAATAACGACGGAGGAGA

I repeated with read_fastq and it worked fine. Sorry for wasting your time!

While I have you I did want to mention that as a complete beginner, it took me about two weeks to get a Linux system set up and installed with a (mostly) functional Biopieces installation. I kept extensive notes of all the errors I ran into, as well as their fixes and workarounds. Would it be useful if I sent you or posted a copy for any other complete newbies to look at if they are having trouble?

Thanks,

Dan

--
You received this message because you are subscribed to a topic in the Google Groups "biopieces" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/biopieces/XiWUFA-WGQE/unsubscribe.
To unsubscribe from this group and all its topics, send an email to biopieces+...@googlegroups.com.

Martin Asser Hansen

unread,

Aug 5, 2015, 2:42:33 PM8/5/15

to biop...@googlegroups.com

You are most welcome to post you notes to this mailing list. Use a separate mail thread though.