trinity 2.0.3 very slow


Martin Kolisko

Feb 10, 2015, 3:18:49 PM
to trinityrn...@googlegroups.com
Hi,

I have noticed that the newest versions of Trinity (2.0.2 and 2.0.3) take much longer to complete (about 3x longer) than the older version (20140717). We have seen this on several datasets. In one of our test assemblies (~400 thousand MiSeq reads), version 2.0.3 took 65 minutes to complete, while 20140717 took only 23 minutes. We have noticed a similar time difference on larger, real datasets as well.

Is this expected as a result of improvements to the assembly process?

If not, could it be caused by a faulty installation?

Thank you
Best Regards,
Martin

Brian Haas

Feb 10, 2015, 3:45:13 PM
to Martin Kolisko, trinityrn...@googlegroups.com, ctat_t...@googlegroups.com, Ben Fulton
Hi Martin,

The newer version should be only slightly slower than the pre-2.0 releases when run on a single server, and v2.0.3 should be ~20% faster than v2.0.2.

Ben - can you comment on the time difference between the earlier trinity and v2?

If you're able to leverage LSF or SGE for the 2nd phase of Trinity v2 (via --grid_conf), then the runtime will be very short. Note that the grid-computing section is greatly improved over the earlier versions of Trinity (pre-v2), with only a single massively-distributed phase (whereas earlier Trinity used two such phases).

My guess is that you're experiencing a massive slowdown in phase 2 due to the increased complexity of that phase, and potentially related to some local hardware-related issues (I/O, etc).  If you can leverage LSF, SGE, SLURM, or PBS for phase 2, you'll hopefully experience *much* faster run times.

As far as process improvement goes, Trinity 2 is definitely an advance over Trinity v1 (pre-2), and offers many more opportunities for interfacing with compute farms and supercomputers (e.g. Cray), although it may run slower (slightly - not expected to be 3x slower) on some stand-alone systems.

Ben - ?    

:)

best,

~brian







--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at http://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.



--
--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

 

Elisabeth Hehenberger

Feb 12, 2015, 4:49:56 PM
to trinityrn...@googlegroups.com
Hi,

I am having some trouble with the speed of the 2.0.x versions of Trinity too:

My assembly of 80 million HiSeq reads with Trinity 2.0.2 took >6 days (16 cores, 100 GB memory). I am repeating the same dataset with the latest Trinity version, and it has been running samtools for almost 48 hours:

Running cmd: bash -c " set -o pipefail; bowtie -a -m 20 --best --strata --threads 16  --chunkmbs 512 -q -S -f /media/Data1/Lily/RSD_trinity_150210/chrysalis/inchworm.K25.L25.fa.min100 both.fa  | samtools view -F4 -Sb - | samtools sort  -no - - > /media/Data1/Lily/RSD_trinity_150210/chrysalis/iworm.bowtie.nameSorted.bam"  2>/dev/null

With the last pre-2.0 version, the complete assembly took me less than 2 days (dataset of the same size, closely related organism, same number of cores and memory).

my command from the current run:

perl5.18.2 /opt/trinityrnaseq-2.0.3/Trinity --seqType fq --max_memory 100G --CPU 16 --trimmomatic --quality_trimming_params "LEADING:5 TRAILING:5 MINLEN:50 SLIDINGWINDOW:10:25" --left RSD_prep_R1.fastq --right RSD_prep_R2.fastq --output RSD_trinity_150210 --SS_lib_type RF

Is there anything I can change to get roughly close to the speed of the older versions? I'm assembling dinoflagellate data and I'm not looking for expression analysis or SNPs, maybe the old version is sufficient for me.

Thanks for any suggestions!


Best,

Elisabeth

Brian Haas

Feb 12, 2015, 5:04:11 PM
to Elisabeth Hehenberger, trinityrn...@googlegroups.com
Hi Lis,

The initial part of Trinity involving jellyfish, inchworm, and chrysalis (with bowtie) is actually not significantly different from earlier pre-2.0 versions of Trinity.  A new version of jellyfish is integrated as a plugin, and if anything, the bowtie step should be much faster than before because the target is now limited to inchworm contigs of at least 100 bases, instead of the entire inchworm output (which is much larger).   All I can think of is that maybe there are some hardware-related issues you're contending with now that you weren't dealing with earlier (competition for cores or I/O with other running processes).

There were some other changes that could have an impact - earlier versions used unix sort instead of samtools sort, where both would use multiple threads if multi-threaded versions of the tools are installed.  I'm pretty sure Trinity should be using multithreaded samtools if you have the newer version installed.  

Anyway, the (slightly) slower steps have yet to begin.... 

If you're happy with the older version of Trinity, you could certainly continue to use it, but it would be useful to understand why it seems to be running much slower in this particular instance.

best,

~brian



Martin Kolisko

Feb 12, 2015, 5:48:19 PM
to trinityrn...@googlegroups.com
I noticed in the Trinity script that, to decide whether a multithreaded samtools is installed (and so whether to run samtools multithreaded), it uses this line:

if ($samtools_version_info =~ /Version: 1.1/) {


I am not very familiar with Perl, but from my test it seems to require samtools version 1.1 exactly, so versions 1.0 and 1.2 will not be considered for the multithreaded option, even though, comparing the command-line options for samtools sort, versions 1.0, 1.1, and 1.2 appear to be identical. Would it make sense to modify the condition to match 1.0, 1.1, and 1.2, or am I completely wrong?
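
To illustrate the matching behaviour described here, a minimal sketch follows, written in Python rather than Perl (the two regex dialects agree for a pattern this simple). The banner format `Version: X.Y`, the function names, and the "1.0 or later" threshold are assumptions made for illustration; this is not Trinity's actual code or its eventual fix.

```python
import re

# The pattern quoted from the Trinity script, transcribed as-is.
# Note that the unescaped "." matches any single character.
ORIGINAL = re.compile(r"Version: 1.1")

# Hypothetical relaxed check: capture major and minor version numbers.
RELAXED = re.compile(r"Version: (\d+)\.(\d+)")


def multithreaded_original(version_info: str) -> bool:
    """Mimic the quoted check: true only if the banner contains 'Version: 1<any char>1'."""
    return ORIGINAL.search(version_info) is not None


def multithreaded_relaxed(version_info: str) -> bool:
    """Assumed alternative: accept any samtools reporting version 1.0 or later."""
    match = RELAXED.search(version_info)
    if match is None:
        return False
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (1, 0)


for version in ("1.0", "1.1", "1.2"):
    banner = f"Version: {version}"  # assumed banner format
    print(version, multithreaded_original(banner), multithreaded_relaxed(banner))
```

Comparing parsed numbers instead of matching one literal string avoids having to touch the condition again for each new samtools release.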

Thanks
cheers
Martin

Brian Haas

Feb 12, 2015, 5:53:09 PM
to Martin Kolisko, trinityrn...@googlegroups.com
Right - I just saw that.   This will be updated shortly. 


Brian Haas

Feb 12, 2015, 6:05:19 PM
to Martin Kolisko, trinityrn...@googlegroups.com
'devel' branch on github is now updated.   I'm making multithreaded samtools a general requirement (it's been out long enough now), and the command will look like this (from running the sample data):

Running cmd: bash -c " set -o pipefail; bowtie -a -m 20 --best --strata --threads 4  --chunkmbs 512 -q -S -f /Users/bhaas/GITHUB/trinityrnaseq/sample_data/test_Trinity_Assembly/trinity_out_dir/chrysalis/inchworm.K25.L25.fa.min100 both.fa  | samtools view -@ 4 -F4 -Sb - | samtools sort -@ 4 -no - - > /Users/bhaas/GITHUB/trinityrnaseq/sample_data/test_Trinity_Assembly/trinity_out_dir/chrysalis/iworm.bowtie.nameSorted.bam"  2>/dev/null
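
For readers comparing this with the earlier log line in the thread, the only difference is the `-@ <threads>` option on both samtools stages. A minimal Python sketch that assembles the same pipeline string for a given thread count is shown here; the function name and the example paths are hypothetical, and the flags are copied from the log line above rather than taken from the Trinity source.

```python
import shlex


def build_alignment_pipeline(threads: int, target_fa: str, reads_fa: str, out_bam: str) -> str:
    """Assemble the bowtie | samtools view | samtools sort pipeline as one shell string."""
    t = str(int(threads))
    bowtie = ["bowtie", "-a", "-m", "20", "--best", "--strata", "--threads", t,
              "--chunkmbs", "512", "-q", "-S", "-f", target_fa, reads_fa]
    # "-@ t" asks samtools for additional threads; it is the option added in the updated command.
    view = ["samtools", "view", "-@", t, "-F4", "-Sb", "-"]
    sort = ["samtools", "sort", "-@", t, "-no", "-", "-"]
    pipeline = " | ".join(shlex.join(stage) for stage in (bowtie, view, sort))
    return f"{pipeline} > {shlex.quote(out_bam)}"


# Hypothetical paths, for illustration only.
print(build_alignment_pipeline(16, "inchworm.K25.L25.fa.min100", "both.fa",
                               "iworm.bowtie.nameSorted.bam"))
```

The `sort -no - -` form is the sort syntax used in this thread; it is reproduced unchanged here and may not apply to other samtools releases.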


This should hopefully speed things up, but note that the bowtie step was never meant to be a major rate-limiting step in Trinity.... when it's running for a very long time, it usually means there's something hardware-related going on, such as multiple processes competing with Trinity for I/O.


best,

~brian



Brian Haas

Feb 27, 2015, 8:15:05 PM
to Martin Kolisko, trinityrn...@googlegroups.com, Fulton, Ben
Hi all,

It looks like Ben may have found the bug causing the unintended slowdown.  It would only surface in very large data sets, which is why we didn't encounter it in our regular tests (which don't target massive data sets... but apparently we should do this).
The likely slowdown is due to the input files not being fanned out as they were intended to be (a single var++ was missing where it was supposed to go), which means that bazillions of files were ending up in a single directory, which can easily cause problems with most file systems... a bug manifesting with incomplete penetrance.
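
To make the "missing var++" idea concrete, here is a minimal Python sketch of fanning files out across numbered directories. It is not the Trinity code: the function name, the `bin_N` directory names, and the bin size are all hypothetical. It only shows how omitting one increment sends every file to the same directory.

```python
from collections import Counter


def assign_bins(n_files: int, files_per_dir: int, advance_bin: bool = True) -> list:
    """Return the (hypothetical) directory name chosen for each of n_files."""
    bin_no = 0
    in_current_bin = 0
    dirs = []
    for _ in range(n_files):
        if in_current_bin == files_per_dir:
            in_current_bin = 0
            if advance_bin:
                bin_no += 1  # the kind of increment that, if missing, defeats the fan-out
        dirs.append(f"bin_{bin_no}")
        in_current_bin += 1
    return dirs


# Intended behaviour: files spread over several directories.
print(Counter(assign_bins(2500, 1000)))
# With the increment skipped: every file ends up in one directory.
print(Counter(assign_bins(2500, 1000, advance_bin=False)))
```

With only a few thousand files the single-directory case is harmless, which is consistent with the bug showing up only on very large data sets.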

The Trinity github devel branch is updated and we'll do some more testing to verify this, but it is very likely to be the culprit.  As soon as we confirm this, I'll cut a patch-release.

best,

~brian


