SRA headers conversion


taua...@gmail.com

Mar 8, 2017, 11:09:04 AM
to trinityrnaseq-users
Dear all,

I am running Trinity on many SRA datasets and it fails really fast with the following error:

Error, cannot convert fastq file to fasta since cannot recognize read orientation as /1 or /2 (instead: E)Thread 2 terminated abnormally: Error, cmd: /n/sw/fasrcsw/apps/Core/trinityrnaseq/2.3.2-fasrc01/util/..//trinity-plugins/fastool/fastool --illumina-trinity --to-fasta /n/regal/Giribet_lab/tauanajc/prep_for_assemblies/Solemya_velum/bowtie/Solemya_velum_Cleaned.2.fq >> right.fa died with ret 768 at /n/sw/fasrcsw/apps/Core/trinityrnaseq/2.3.2-fasrc01/util/insilico_read_normalization.pl line 769.


I searched through the topics and my understanding is that to fix this issue, I have to download data from SRA again with the --defline-seq option like:
fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files file.sra

I would love to avoid doing that though, since I have already run several steps to trim low-quality sequences, adaptors and rRNA contamination.
Can someone explain to me what is the issue with the original fastq files? Is it a header problem? Maybe I can change the headers now instead of starting from scratch?

Any help is much appreciated!

Best,
Tauana





taua...@gmail.com

Mar 9, 2017, 6:07:15 PM
to trinityrnaseq-users
Ok, just in case someone has the same issue:

I found that the problem is the .1 and .2 identifiers that the SRA headers have right after the SRR number. They make the read names differ between the two files. There is also a space, so we want to get rid of both the identifiers in the middle and the space. I wrote the following sed commands to keep basically all the important header information and get rid of the problems (also removing the "length=" field at the end).

sed -i 's:\(@SRR.*\)\.1 \(HWI.*\)\( length=.*\):\1-\2:g' ReadsFile.1.fq
sed -i 's:\(@SRR.*\)\.2 \(HWI.*\)\( length=.*\):\1-\2:g' ReadsFile.2.fq

The only difference between the two is the .1 or .2 in the first fragment.
It worked perfectly on a test run I did with reduced files.
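
To preview what the substitution does before editing files in place, here is a minimal sketch. The header is made up to match the pattern described above (it is an assumption, not taken from a real SRA run), sed is run without -i so nothing on disk changes, and the dot is written as \. so that it only matches a literal period.

```shell
# Hypothetical header, invented for illustration only.
header='@SRR0000001.1 HWI-ST000:1:1101:1000:2000 length=100'

# Apply the substitution to the single header and capture the result.
# No -i here: the input comes from a pipe and no file is modified.
fixed=$(printf '%s\n' "$header" | sed 's:\(@SRR.*\)\.1 \(HWI.*\)\( length=.*\):\1-\2:g')

# Show the header before and after.
printf '%s\n' "$header" "$fixed"
```

If the second printed line still contains the space or the "length=" field, the pattern did not match and the header format is different from the one assumed here.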


Brian Haas

Mar 9, 2017, 7:22:30 PM
to Tauana Junqueira da Cunha, trinityrnaseq-users
great advice!  thx for posting

~b

--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

 

taua...@gmail.com

Mar 10, 2017, 10:51:04 AM
to trinityrnaseq-users, taua...@gmail.com
No problem!

And after running that for all my ~50 assemblies from SRA, I noticed that the dataset I used to build the sed command is one of the few that actually have that header format. Headers vary a lot: some of them use the species name instead of the SRR number, and the words/numbers in the middle also change. I put together this more general regex that should take care of it.

sed -i 's:\(.*\)\.1 \(.*\)\( length=.*\):\1-\2:g' File.1.fq
sed -i 's:\(.*\)\.2 \(.*\)\( length=.*\):\1-\2:g' File.2.fq

I will post back if it fails again. But if anyone is having the same issue, just try that and look at the headers to figure out what might be different in your case.
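
One way to look at the headers and check that the read names end up identical in both files is sketched below. It is only an illustration: the two tiny FASTQ files and the "Sample_name" headers are made up, and it assumes the usual four lines per FASTQ record so that awk can pick out the header lines. The files themselves are not modified.

```shell
# Build two tiny hypothetical FASTQ files (one record each) in a temp directory.
tmp=$(mktemp -d)
printf '%s\n' '@Sample_name.1 HWI-ST000:1:1101:1000:2000 length=4' 'ACGT' '+' 'IIII' > "$tmp/File.1.fq"
printf '%s\n' '@Sample_name.2 HWI-ST000:1:1101:1000:2000 length=4' 'TGCA' '+' 'IIII' > "$tmp/File.2.fq"

# Keep only the header lines (lines 1, 5, 9, ... assuming 4 lines per record)
# and apply the general substitution to them, writing to a variable, not the file.
names1=$(awk 'NR % 4 == 1' "$tmp/File.1.fq" | sed 's:\(.*\)\.1 \(.*\)\( length=.*\):\1-\2:')
names2=$(awk 'NR % 4 == 1' "$tmp/File.2.fq" | sed 's:\(.*\)\.2 \(.*\)\( length=.*\):\1-\2:')

# Show the converted headers from each file.
printf '%s\n' "$names1" "$names2"

# After the conversion, the read names of a pair should be the same in both files.
if [ "$names1" = "$names2" ]; then
    echo "read names match"
else
    echo "read names differ"
fi

rm -rf "$tmp"
```

On real data, piping the awk output through head first keeps the comparison quick, and any header that comes out unchanged shows what is different in that dataset.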

Tauana


