error during trinity normalization

Laura Entrambasaguas

unread,

Jul 8, 2015, 4:18:06 AM7/8/15

to trinityrn...@googlegroups.com

Hi everybody,

After quality trimming (FASTX Toolkit) and rRNA contamination removal (SortMeRNA) I am using Trinity v2.0.6 on SE reads from Ion Proton sequencing machine with the following command:

/opt/trinityrnaseq-2.0.6/Trinity --seqType fq --max_memory 240G --single 001_chip1_qtrimmed_non_rRNA_cleaned.fq,001_chip2_qtrimmed_non_rRNA.fq,001_chip3_qtrimmed_non_rRNA.fq,002_chip1_qtrimmed_non_rRNA_cleaned.fq,002_chip2_qtrimmed_non_rRNA.fq,002_chip3_qtrimmed_non_rRNA_cleaned.fq,003_chip1_qtrimmed_non_rRNA.fq,003_chip2_qtrimmed_non_rRNA_blankline.fq,003_chip3_qtrimmed_non_rRNA.fq,004_chip1_qtrimmed_non_rRNA_cleaned.fq,004_chip2_qtrimmed_non_rRNA.fq,004_chip3_qtrimmed_non_rRNA.fq,005_chip1_qtrimmed_non_rRNA_cleaned.fq,005_chip2_qtrimmed_non_rRNA.fq,005_chip3_qtrimmed_non_rRNA.fq,006_chip1_qtrimmed_non_rRNA_cleaned.fq,006_chip2_qtrimmed_non_rRNA.fq,006_chip3_qtrimmed_non_rRNA.fq,007_chip1_qtrimmed_non_rRNA_cleaned.fq,007_chip2_qtrimmed_non_rRNA.fq,007_chip3_qtrimmed_non_rRNA.fq,008_chip1_qtrimmed_non_rRNA_cleaned.fq,008_chip2_qtrimmed_non_rRNA.fq,008_chip3_qtrimmed_non_rRNA.fq --SS_lib_type F --normalize_reads --min_contig_length 200 --min_kmer_cov 2 --inchworm_cpu 24 --bflyHeapSpaceInit 24G --bflyHeapSpaceMax 240G --bflyCalculateCPU --CPU 24 --output ./trinity_results2 > trinity_proton.log

I got the following error (I have omitted the full path of the files).

-------------------------------------------

----------- Jellyfish --------------------

-- (building a k-mer catalog from reads) --

-------------------------------------------

CMD finished (0 seconds)

CMD: /opt/trinityrnaseq-2.0.6/util/..//trinity-plugins/jellyfish/bin/jellyfish count -t 24 -m 25 -s 34582920881 single.fa

CMD finished (245 seconds)

CMD: /opt/trinityrnaseq-2.0.6/util/..//trinity-plugins/jellyfish/bin/jellyfish histo -t 24 -o jellyfish.K25.min2.kmers.fa.histo mer_counts.jf

CMD finished (29 seconds)

CMD: /opt/trinityrnaseq-2.0.6/util/..//trinity-plugins/jellyfish/bin/jellyfish dump -L 2 mer_counts.jf > jellyfish.K25.min2.kmers.fa

CMD finished (67 seconds)

CMD: touch jellyfish.K25.min2.kmers.fa.success

CMD finished (0 seconds)

CMD: /opt/trinityrnaseq-2.0.6/util/..//Inchworm/bin/fastaToKmerCoverageStats --reads single.fa --kmers jellyfish.K25.min2.kmers.fa --kmer_size 25 --num_threads 24 > single.fa.K25.stats

-reading Kmer occurences...

done parsing 157794964 Kmers, 156294025 added, taking 246 seconds

STATS_GENERATION_TIME: 1476 seconds.

CMD finished (1759 seconds)

CMD: touch single.fa.K25.stats.ok

-sorting each stats file by read name.

CMD finished (0 seconds)

CMD: /usr/bin/sort --parallel=24 -k5,5 -T . -S 240G single.fa.K25.stats > single.fa.K25.stats.sort

CMD finished (356 seconds)

CMD: touch single.fa.K25.stats.sort.ok

CMD finished (0 seconds)

CMD: /opt/trinityrnaseq-2.0.6/util/..//util/support_scripts//nbkc_normalize.pl single.fa.K25.stats.sort 50 200 > single.fa.K25.stats.sort.C50.pctSD200.accs

36101866 / 172720634 = 20.90% reads selected during normalization.

3847930 / 172720634 = 2.23% reads discarded as likely aberrant based on coverage profiles.

0 / 172720634 = 0.00% reads missing kmer coverage (N chars included?).

CMD finished (421 seconds)

CMD: touch single.fa.K25.stats.ied records have been rsort.C50.pctSD200.accs.ok

CMD finished (0 seconds)

Thread 2 terminated abnormally: Error, not all specified records have been retrieved (missing 11490858) from [path to my files] at /opt/trinityrnaseq-2.0.6/util/insilico_read_normalization.pl line 526.

Error encountered with thread.

Error, at least one thread died at /opt/trinityrnaseq-2.0.6/util/insilico_read_normalization.pl line 424.

Error, cmd: /opt/trinityrnaseq-2.0.6/util/insilico_read_normalization.pl --seqType fq --JM 240G --max_cov 50 --CPU 24 –output [path to my files] died with ret 6400 at /opt/trinityrnaseq-2.0.6/Trinity line 2116.

Previously, I deleted one blank line in one file (the error was detected byTrinity). At the same time, I have checked the top few lines of each fastq file and everything seems ok in headers (no spaces). I have also checked the blank lines, indentation, etc. A few lines of one input fastq file:

@C60IL:03002:04349

GCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTTAAGTTGTTGCAGTTAAAAAGCTCG
+
;;;9;49;;7<=7?7<7;<::;7<7==<;;<<<<<<4<6:;5;;4::;<<6==<<-==>>=

I am really interested in performing the preprocessing step with Fastx Toolkit as well as remove rRNA before de novo assembly.

Please, let me know your opinion and any possible solution.

Thanks so much!!

Laura

Tiago Hori

unread,

Jul 8, 2015, 4:41:31 AM7/8/15

to Laura Entrambasaguas, trinityrn...@googlegroups.com

Weird,

That error is a header incompatibility issue and I know Brian was looking into it However, when we seen that error it was because of spaces in the the read names, which you don't seem to have. Could there be a trailing space in the name?

You could try to remove any spaces: cat file.fastq | perl -lane 's/\s/_/g; print;' > mod.fastq

Can you send us an example of the headers from your raw files?

T.

"Profanity the is the only language all programmers understand"

Sent from my iPhone, the universal excuse for my poor spelling.

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at http://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

Tiago Hori

unread,

Jul 9, 2015, 10:20:38 AM7/9/15

to Laura Entrambasaguas, trinityrn...@googlegroups.com

How much memory do you have available to you and how many reads are you using?

T.

I cc'ed the list cause there may be folks there with more Java experience than me.

Ultimate stresses sportsmanship and fair play. Competitive play is encouraged, but NEVER AT THE EXPENSE OF RESPECT BETWEEN PLAYERS, adherence to the rules and the BASIC JOY OF PLAY.

On Jul 09, 2015, at 10:42 AM, Laura Entrambasaguas <lent...@gmail.com> wrote:

Hello,

Now, I am facing the following error:

“Number of Commands: 118
WARNING, cannot remove output directory /data01/proton/trinity/data_sed/trinity_results/read_partitions/Fb_0/CBin_442/c44235.trinity.reads.fa.out, since not created in this run. (safety precaution)
succeeded(1) 0.847458% completed. WARNING, cannot remove output directory /data01/proton/trinity/data_sed/trinity_results/read_partitions/Fb_0/CBin_442/c44248.trinity.reads.fa.out, since not created in this run. (safety precaution)
.
.
succeeded(14) 11.8644% completed. OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00002aaf75600000, 17179869184, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 17179869184 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /data01/proton/trinity/data_sed/trinity_results/read_partitions/Fb_0/CBin_442/c44244.trinity.reads.fa.out/hs_err_pid820.log
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00002b4f43780000, 25823281152, 0) failed; error='Impossibile allocare memoria' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 25823281152 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /data01/proton/trinity/data_sed/trinity_results/read_partitions/Fb_0/CBin_444/c44417.trinity.reads.fa.out/hs_err_pid7015.log
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00002b77ae480000, 25746735104, 0) failed; error='Impossibile allocare memoria' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 2

..”
Please, find also attached a file from a Cbin directory..

Before launching the trinity command I set the stack size to unlimited due to “There is insufficient memory for the Java Runtime Environment to continue” error, typing:
ulimit -s unlimited
ulimit -a

What should I do?? Perhaps something related to tweaking Trinity to play nicely with Java or including the --bflyGCThreads parameter in my trinity command (but I am using v2.0.6)??
Thanks a lot for any help!!
L.

On Wed, Jul 8, 2015 at 2:12 PM, Laura Entrambasaguas <lent...@gmail.com> wrote:
Hi Tiago,

Thanks so much!!
Before receiving your email, I have again removed any possible space (cat file.fa | sed '/^$/d;s/[[:blank:]]//g' > output.fa) and now Trinity is running. I suppose I did not remove all spaces..Let's see now what happens!!

Laura

hs_err_pid12738.log

Brian Haas

unread,

Jul 19, 2015, 8:22:46 PM7/19/15

to Tiago Hori, Laura Entrambasaguas, trinityrn...@googlegroups.com

I think the fastq read names need to have the /1 or /2 suffix for the normalization process to work correctly.

ie.

@C60IL:03002:04349/1

~b

--

--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

Brian Haas

unread,

Jul 20, 2015, 11:58:53 AM7/20/15

to Laura Entrambasaguas, trinityrn...@googlegroups.com

Hi Laura,

responses below

On Mon, Jul 20, 2015 at 11:51 AM, Laura Entrambasaguas <lent...@gmail.com> wrote:

Sorry, I have sent the email without finishing it...

Hi Brian,

Thanks for your quick but disquieting answer... So,

1. Do you thing de novo assembly is wrong although I didn't get any error during the processes and I even got a successfully message at the end of the log file??

The de novo assembly should proceed ok w/o the /1 suffix. The assembly process is less picky about this than the normalization process code, and we should update the normalization code so it's less picky too. (I'll add an issue on github for it).

2. From this point:
- Due to the fastq read names from the Ion Proton Server haven't got the /1 suffix, I suppose I should introduce it in each input file?
- After the run, I should find all the input files (24) in the insilico_read_normalization folder.

If you want the in silico normalization to run w/o error, then I think this might be needed. If you're able to assemble ok w/o doing the normalization, then I wouldn't bother going back and doing the normalization. Only if you have hundreds of millions of reads, would in silico normalization be needed.

best,

~b

Thanks a lot for your answers and warning.

L.

On Mon, Jul 20, 2015 at 4:53 PM, Brian Haas <bh...@broadinstitute.org> wrote:
Hi Laura,

I think the code still wants to see the /1 suffix. sorry!

~brian

On Mon, Jul 20, 2015 at 10:33 AM, Laura Entrambasaguas <lent...@gmail.com> wrote:
Dear Brian,

But I am working with SE reads, do you still thinking my files must have the /1 or /2 suffix or do you think the Trinity de novo assembly could be wrong??

Thanks

Laura Entrambasaguas

unread,

Nov 9, 2015, 4:07:35 AM11/9/15

to trinityrn...@googlegroups.com

Hello,

After reading the post of Cristina (both.fa.read_count does not include all reads) I do realize that also the both.fa.read_count file generated after my de novo assembly is disconcerting. This file indicates 44805904 but I performed the assembly from two pooled fastq PE files of 348.521.669 reads each.

I first run Trinity with butterfly settings (--inchworm_cpu 24 --bflyHeapSpaceInit 8G --bflyHeapSpaceMax 240G --bflyCPU 24) but after receiving an error (# There is insufficient memory for the Java Runtime Environment to continue...), I rerun Trinity without those parameters and it seemed that I must have been really close to the end of the butterfly process because the analysis finished in minutes. The .log indicated that everything was completed successfully.

The final Trinity command was:

/opt/trinityrnaseq-2.0.6/Trinity --seqType fq --max_memory 240G --left all_1.fq --right all_2.fq --SS_lib_type RF --normalize_reads --min_contig_length 200 --min_kmer_cov 2 --CPU 24 –output /data01/illumina/ddlab.sci.univr.it/results/trinity > trinity.log &

Also .stats seemed ok:

################################

## Counts of transcripts, etc.
################################
Total trinity 'genes': 104350

Total trinity transcripts: 222548

Percent GC: 40.53

########################################

Stats based on ALL transcript contigs:

########################################

Contig N10: 6171

Contig N20: 4780

Contig N30: 3924

Contig N40: 3255

Contig N50: 2690

Median contig length: 785

Average contig: 1439.30

Total assembled bases: 320312502

#####################################################

## Stats based on ONLY LONGEST ISOFORM per 'GENE':

#####################################################

Contig N10: 5691

Contig N20: 4304

Contig N30: 3370

Contig N40: 2601

Contig N50: 1886

Median contig length: 427

Average contig: 918.21

Total assembled bases: 95815280

By its part, the bowtie total alignment rate was, aprox., 82.76% (unique, aprox. 18%).

In addition, I've just assessed the read content of the transcriptome assembly, and I got scared when I saw the results:

#read_type count pct

single 1 100.00

Total aligned reads: 1

I've also run another trinity assembly. Both, .log and .stats seems ok, and these latest results are similar to the trinity first.stats results. I've also assessed the read content of this second transcriptome assembly and I get the same results:

#read_type count pct

single 1 100.00

Total aligned reads: 1

I'm really worried about these results, above all because I have been working with the first trinity.fasta assembly for some time. I would really appreciate any advice.

Thanks so much,

Brian Haas

unread,

Nov 9, 2015, 6:07:25 AM11/9/15

to Laura Entrambasaguas, trinityrn...@googlegroups.com

Hi Laura,

Since you used the --normalize_reads parameter, the number of reads assembled by Trinity will be far fewer than the total number you're starting with. Based on your assembly stats, it would seem the assembly was fine.

The thing that's not working for some reason is the bowtie_PE step, as the bam file being processed has no reads showing up as aligned. I'm happy to look into this - what I'd suggest doing is to try running the bowtie_PE step using only a few million reads from your sample and see if that works. Assuming it doesn't work correctly (fails reproducibly) you could share the output directory with me and I could look into it directly.

Another thing you can do is to just run bowtie directly w/ the paired reads, generate a bam file, sort it by read name (not coordinate) and run that through the read counting script.

best,

~b

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at http://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

Brian Haas

unread,

Nov 9, 2015, 7:14:17 AM11/9/15

to Laura Entrambasaguas, trinityrnaseq-users

Sounds right. With our salamander data we went from 1.4 billion reads down to 85 M. It all depends in the complexity if the data.

-Brian

(by iPhone)

On Nov 9, 2015, at 6:51 AM, Laura Entrambasaguas <lent...@gmail.com> wrote:

Hi Brian,
Thanks.
Yes, I did realize that the normalization significantly reduces the number of reads, but from 697.043.338 millions of reads to 44.805.904?? It's a huge quantity of "lost" reads for getting a reliable assembly??!!
I know that Trinity in-silico normalization is necessary for large data sets to reduce memory requirements and improve runtimes, but taking into account that these data came from a non-model organism and I'm mainly interested in analysing gene expression (I would want to use as many reads as possible in the assembly to maximize the coverage level), do you still recommend me to perform the in-silico read normalization??
In relation to bowtie_PE step results, was my fault due I forgot the '--' parameter. Now seems that its properly running (its working for than 1.5 h)..Let's hope everything will come right!!
Thanks again!!
Laura

Brian Haas

unread,

Nov 9, 2015, 7:17:10 AM11/9/15

to Laura Entrambasaguas, trinityrnaseq-users

Wrt bowtie_PE, what option did you forget? It sounds like an error mode we should be better handling.

Many thx

-Brian

(by iPhone)

On Nov 9, 2015, at 6:51 AM, Laura Entrambasaguas <lent...@gmail.com> wrote:

Hi Brian,
Thanks.
Yes, I did realize that the normalization significantly reduces the number of reads, but from 697.043.338 millions of reads to 44.805.904?? It's a huge quantity of "lost" reads for getting a reliable assembly??!!
I know that Trinity in-silico normalization is necessary for large data sets to reduce memory requirements and improve runtimes, but taking into account that these data came from a non-model organism and I'm mainly interested in analysing gene expression (I would want to use as many reads as possible in the assembly to maximize the coverage level), do you still recommend me to perform the in-silico read normalization??
In relation to bowtie_PE step results, was my fault due I forgot the '--' parameter. Now seems that its properly running (its working for than 1.5 h)..Let's hope everything will come right!!
Thanks again!!
Laura

On Mon, Nov 9, 2015 at 12:07 PM, Brian Haas <bh...@broadinstitute.org> wrote:

Reply all

Reply to author

Forward