Working out some problems with Trinity v2.2.0

Alejandro Sanchez

Aug 27, 2016, 4:59:28 PM
to trinityrn...@googlegroups.com
Hi everyone,

A few months ago I posted about an issue I had running Trinity with the Jaccard clip option. I'm now running into the same problem again, even though I'm not using that parameter. I'm using SGE, by the way...

The problem lies in the second phase of Trinity:


--------------------------------------------------------------------------------
------------ Trinity Phase 2: Assembling Clusters of Reads ---------------------
--------------------------------------------------------------------------------

$VAR1 = {
          'memory' => '16G',
          'grid' => 'SGE',
          'cmd' => 'qsub -V -cwd',
          'max_nodes' => '300',
          'cmds_per_node' => '20'
        };
-note, 153707 commands already completed successfully. Skipping them here.
-there are 213 cmds left to run here.
  CMDS: 213 / 213  [11/300 nodes in use]
* All cmds submitted to grid.  Now waiting for them to finish.

  CMDS: 213 / 213  [7/300 nodes in use]   -no record of job_id 11876, setting as state unknown
-no record of job_id 11873, setting as state unknown
-no record of job_id 11871, setting as state unknown
-no record of job_id 11874, setting as state unknown
-no record of job_id 11869, setting as state unknown
-no record of job_id 11868, setting as state unknown
  CMDS: 213 / 213  [0/300 nodes in use]
* All nodes completed.  Now auditing job completion status values
Failures encountered:
num_success: 83 num_fail: 61    num_unknown: 69
Finished.

130 commands failed during grid computing.
-failed commands written to: recursive_trinity.cmds.htc_cache_success.__failures

As you can see, once all the commands are submitted, at some point the program checks on the jobs and reports some of them as state unknown. The thing is, those jobs are actually still running!

It then tries to use ParaFly to run them, but now I get these errors:

Trying to run them using parafly...

Number of Commands: 130
-----------------------------------------
--- Chrysalis: GraphFromFasta -----------
-- (cluster related inchworm contigs) ---
-----------------------------------------

ERROR, cannot open file: single.fa
Trinity run failed. Must investigate error above.
succeeded(0), failed(1)   0.769231% completed.    -----------------------------------------
--- Chrysalis: GraphFromFasta -----------
-- (cluster related inchworm contigs) ---
-----------------------------------------

ERROR, cannot open file: single.fa
Trinity run failed. Must investigate error above.
succeeded(0), failed(2)   1.53846% completed.    -----------------------------------------
--- Chrysalis: GraphFromFasta -----------
-- (cluster related inchworm contigs) ---
-----------------------------------------

ERROR, cannot open file: single.fa
Trinity run failed. Must investigate error above.
succeeded(0), failed(3)   2.30769% completed.    -----------------------------------------
--- Chrysalis: GraphFromFasta -----------
-- (cluster related inchworm contigs) ---
-----------------------------------------

ERROR, cannot open file: single.fa
Trinity run failed. Must investigate error above.
succeeded(0), failed(4)   3.07692% completed.    -----------------------------------------
--- Chrysalis: GraphFromFasta -----------
-- (cluster related inchworm contigs) ---
-----------------------------------------

ERROR, cannot open file: single.fa
Trinity run failed. Must investigate error above.
succeeded(0), failed(5)   3.84615% completed.    -----------------------------------------
--- Chrysalis: GraphFromFasta -----------
-- (cluster related inchworm contigs) ---
-----------------------------------------

ERROR, cannot open file: single.fa
Trinity run failed. Must investigate error above.
succeeded(0), failed(6)   4.61538% completed.    -----------------------------------------
--- Chrysalis: GraphFromFasta -----------
-- (cluster related inchworm contigs) ---
-----------------------------------------

ERROR, cannot open file: single.fa
Trinity run failed. Must investigate error above.
succeeded(0), failed(7)   5.38462% completed.    -----------------------------------------
--- Chrysalis: GraphFromFasta -----------
-- (cluster related inchworm contigs) ---
-----------------------------------------

ERROR, cannot open file: single.fa
Trinity run failed. Must investigate error above.
succeeded(1), failed(8)   6.92308% completed.    -----------------------------------------
--- Chrysalis: GraphFromFasta -----------
-- (cluster related inchworm contigs) ---
-----------------------------------------

ERROR, cannot open file: single.fa
Trinity run failed. Must investigate error above.
succeeded(1), failed(9)   7.69231% completed.    -----------------------------------------
--- Chrysalis: GraphFromFasta -----------
-- (cluster related inchworm contigs) ---
-----------------------------------------

ERROR, cannot open file: single.fa
Trinity run failed. Must investigate error above.
succeeded(1), failed(10)   8.46154% completed.    -----------------------------------------
--- Chrysalis: GraphFromFasta -----------
-- (cluster related inchworm contigs) ---
-----------------------------------------

ERROR, cannot open file: single.fa
Trinity run failed. Must investigate error above.
succeeded(1), failed(11)   9.23077% completed.    -----------------------------------------
--- Chrysalis: GraphFromFasta -----------
-- (cluster related inchworm contigs) ---
-----------------------------------------

ERROR, cannot open file: single.fa
Trinity run failed. Must investigate error above.
succeeded(1), failed(12)   10% completed.    -----------------------------------------
--- Chrysalis: GraphFromFasta -----------
-- (cluster related inchworm contigs) ---
-----------------------------------------

There are many ERROR messages like the ones above, and they keep coming until the run reaches 100%.

So my questions:

-Which script checks whether a job has already finished, and how is that check done? Would it be wise to use job dependencies when submitting the jobs?
-My single.fa file is there in the output directory, but I'm not sure where Chrysalis expects to find it.
-Should I pack fewer cmds per job? Currently I use 20, as you can see in the conf file params at the very beginning of my post. I don't want to hammer the queue system, and I thought 20 would be a sensible number.

At some point Brian Haas advised just deleting the output directories of the failed commands and rerunning the Trinity script. I will try that, but I would like to find out the real reason for these failures. My guess is that some jobs are still running and the pipeline doesn't wait for them, but I haven't found what controls that. Is it hpc_conf?
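
For reference, the grid conf file I'm passing to Trinity (the one that gets parsed into the $VAR1 hash shown at the top of this post) is, as far as I understand it, just a plain key=value file, roughly like this:

grid=SGE
cmd=qsub -V -cwd
memory=16G
max_nodes=300
cmds_per_node=20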

Cheers!

Saludos!

Dr. Alejandro Sanchez-Flores
Jefe de Unidad / Head of Core Facility
Unidad de Secuenciacion Masiva y Bioinformatica

Brian Haas

Aug 27, 2016, 8:47:37 PM
to Alejandro Sanchez, trinityrn...@googlegroups.com
Hi Alejandro,

The SGE related code that you're looking for is here:

trinityrnaseq/PerlLib/HPC/SGE_handler.pm

and the methods you'll want to take a look at are:

   submit_job_to_grid()

   job_running_or_pending_on_grid()


The main control logic is in Farmit.pm, which submits the jobs.  Farmit calls SGE_handler::submit_job_to_grid() to submit each job to SGE, and then grabs the job_id that SGE reports.

If the job hasn't finished within 15 minutes, then (in the case of SGE) it will use the 'qstat' command to poll the system, look for the job, and check its status. This is where SGE_handler::job_running_or_pending_on_grid() comes into play.

If you run 'qstat' and see that your job is running, then either submit_job_to_grid() is not scraping the job_id correctly from the output that SGE provides when qsub launches the job, or job_running_or_pending_on_grid() is not able to parse the 'qstat' results to determine whether the job is still running.

I use SGE myself, and the code works with the way my system happens to be configured; I never considered that different SGE installations might behave differently (silly me!).  With a minor bit of hacking, you should be able to get it to work, though.
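
To illustrate what those two methods are doing under the hood (this is only a generic SGE sketch, not the exact code in the module, and 'worker.sh' is just a placeholder):

# On most SGE installs, qsub reports the job id on stdout:
$ qsub -V -cwd worker.sh
Your job 12345 ("worker.sh") has been submitted

# submit_job_to_grid() has to scrape that id out of the text, e.g.:
$ qsub -V -cwd worker.sh | sed -n 's/^Your job \([0-9]*\).*/\1/p'
12345

# job_running_or_pending_on_grid() then polls the scheduler; 'qstat -j <id>'
# exits 0 while the job is still known (running or pending) and non-zero
# once the scheduler has no record of it:
$ qstat -j 12345 >/dev/null 2>&1 && echo still_running_or_pending || echo no_record

If your qsub prints something other than the usual 'Your job NNN ...' line, or your qstat behaves differently, that would explain the 'no record of job_id ..., setting as state unknown' messages.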


best,

~brian




Alejandro Sanchez

Aug 27, 2016, 11:10:07 PM
to Brian Haas, trinityrn...@googlegroups.com

Thanks Brian!

I'll get into the code and see if I can figure out what's going on.

Cheers!

Alejandro Sanchez

Sep 19, 2016, 4:12:50 PM
to Brian Haas, Alejandro Sanchez, trinityrn...@googlegroups.com
Hi Brian and everyone,

I'm still stuck with my run, but I think I'm just about to fix the problem. What I found is that some jobs are failing due to memory, like these two:

sh: line 1: 25529 Aborted                 (core dumped) /share/apps/trinityrnaseq-2.2.0/Chrysalis/ReadsToTranscripts -i single.fa -f /scratch/alexsf/trinity_results/FArcos/centolla/trinity_default/read_partitions/Fb_0/CBin_786/c78685.trinity.reads.fa.out/chrysalis/bundled_iworm_contigs.fasta -o /scratch/alexsf/trinity_results/FArcos/centolla/trinity_default/read_partitions/Fb_0/CBin_786/c78685.trinity.reads.fa.out/chrysalis/readsToComponents.out -t 1 -max_mem_reads 50000000 -strand 2> tmp.25513.stderr
-------------------------------------------
---- Chrysalis: ReadsToTranscripts --------
-- (Place reads on Inchworm Bundles) ------
-------------------------------------------

Setting maximum number of reads to load in memory to 50000000
-setting num threads to: 1
Reading bundled inchworm contigs...
done!
Allocating: 6521
Resizing: 4971
SortingDone.
Assigning kmers to Iworm bundles ... done!
Processing reads:
 reading another 50000000... done.  Read 12126803 reads.
[12126803] reads analyzed for mapping.
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Trinity run failed. Must investigate error above.


[alexsf@teopanzolco trinity_default]$ more J25306.S20.sh.o12143
HOST: compute-0-0.local
sh: line 1:  9477 Aborted                 (core dumped) /share/apps/trinityrnaseq-2.2.0/Chrysalis/ReadsToTranscripts -i single.fa -f /scratch/alexsf/trinity_results/FArcos/centolla/trinity_default/read_partitions/Fb_0/CBin_840/c84058.trinity.reads.fa.out/chrysalis/bundled_iworm_contigs.fasta -o /scratch/alexsf/trinity_results/FArcos/centolla/trinity_default/read_partitions/Fb_0/CBin_840/c84058.trinity.reads.fa.out/chrysalis/readsToComponents.out -t 1 -max_mem_reads 50000000 -strand 2> tmp.9461.stderr
-------------------------------------------
---- Chrysalis: ReadsToTranscripts --------
-- (Place reads on Inchworm Bundles) ------
-------------------------------------------

Setting maximum number of reads to load in memory to 50000000
-setting num threads to: 1
Reading bundled inchworm contigs...
done!
Allocating: 15266
Resizing: 13166
SortingDone.
Assigning kmers to Iworm bundles ... done!
Processing reads:
 reading another 50000000... done.  Read 13601131 reads.
[13601131] reads analyzed for mapping.
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Trinity run failed. Must investigate error above.

These two jobs are apparently running out of memory. However, I haven't found where I can modify the amount of memory assigned to either this kind of job:

/share/apps/trinityrnaseq-2.2.0/util/support_scripts/../../Trinity --single \
 "/scratch/alexsf/trinity_results/FArcos/centolla/trinity_default/read_partitions/Fb_0/CBin_834/c83474.trinity.reads.fa" \
--output "/scratch/alexsf/trinity_results/FArcos/centolla/trinity_default/read_partitions/Fb_0/CBin_834/c83474.trinity.reads.fa.out" --CPU 1 \
--max_memory 1G --SS_lib_type F --seqType fa --trinity_complete --full_cleanup --no_version_check --no_distributed_trinity_exec

or this job:

/share/apps/trinityrnaseq-2.2.0/Chrysalis/ReadsToTranscripts -i single.fa -f /scratch/alexsf/trinity_results/FArcos/centolla/trinity_default/read_partitions/Fb_0/CBin_786/c78685.trinity.reads.fa.out/chrysalis/bundled_iworm_contigs.fasta \
-o /scratch/alexsf/trinity_results/FArcos/centolla/trinity_default/read_partitions/Fb_0/CBin_786/c78685.trinity.reads.fa.out/chrysalis/readsToComponents.out -t 1 \
-max_mem_reads 50000000 -strand 2> tmp.25513.stderr

I think the first one is the main command, the one that is actually listed in the recursive_trinity.cmds.htc_cache_success.__failures.FAILED_DURING_PARAFLY list of commands. Then I guess that Farmit sends the second command (the ReadsToTranscripts one).

So my questions are:

How do I assign more memory to the first kind of job? I tried the --grid_node_max_memory option, but I don't think it's needed, since I already specify the memory in my SGE config file. In fact, I tried with 32 GB of RAM, but apparently either that's not enough or it's not the right place to change it. My guess is that the relevant setting is the --max_memory option, which is set to just 1G in the Trinity command applied to each read partition.
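
To be concrete, what I was considering is just editing the failed commands to raise that value before re-running them, something along these lines (only a sketch of the idea, since I'm not sure this is the right place to change it):

sed -i 's/--max_memory 1G/--max_memory 10G/' recursive_trinity.cmds.htc_cache_success.__failures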

Is it OK if I run each failed job individually? Or how do I continue without re-running the whole pipeline? I've tried rerunning it, and it still can't get past the same 27 commands that fail to allocate enough memory.

Thanks in advance!


Saludos!

Dr. Alejandro Sanchez-Flores
Jefe de Unidad / Head of Core Facility
Unidad de Secuenciacion Masiva y Bioinformatica

Alejandro Sanchez

Sep 19, 2016, 4:13:46 PM
to Alejandro Sanchez, Brian Haas, trinityrn...@googlegroups.com
By the way,

I already tried removing the directories of the offending jobs in the FAILED list and reran the program, but I still get the same problem.

Saludos!

Dr. Alejandro Sanchez-Flores
Jefe de Unidad / Head of Core Facility
Unidad de Secuenciacion Masiva y Bioinformatica

Brian Haas

Sep 19, 2016, 4:30:42 PM
to Alejandro Sanchez, trinityrn...@googlegroups.com
Hi Alejandro,

Was normalization performed (via --normalize_reads)?  Usually, normalization solves this sort of problem. If it was *not* originally used, then you can just find the offending commands in the 'trinity_out_dir/recursive_trinity.cmds' file and tack '--normalize_reads' onto the end of them.  Delete the output directories that correspond to each of those jobs, and then restart the original Trinity job.  This tends to solve the problem, even if it is a hassle. Usually it's just one or two jobs that cause the issue.

If you're already using --normalize_reads, then tacking on '--min_kmer_cov 2'   should do the trick. Same thing - remove the output directories for each of those jobs, and then restart the original command.
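
As a concrete example of what I mean, using one of the component IDs from your logs (adjust the paths to your own output directory):

# find the offending command in the master command file
grep 'c78685' trinity_out_dir/recursive_trinity.cmds
# edit that line to tack '--normalize_reads' (or '--min_kmer_cov 2') onto the end,
# then remove its output directory so the job gets re-run:
rm -rf trinity_out_dir/read_partitions/Fb_0/CBin_786/c78685.trinity.reads.fa.out
# finally, restart your original Trinity command; already-completed jobs are skipped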

The '--normalize_reads' parameter will be on by default in the next release, though folks will be able to turn it off if they prefer not to use it.

best,

~brian

Alejandro Sanchez

Sep 19, 2016, 5:05:58 PM
to Brian Haas, Alejandro Sanchez, trinityrn...@googlegroups.com
Hi Brian,

I'm not using normalization, since I remember reading somewhere that it's only needed when you have more than 1 billion reads, which is not my case. However, I do have hundreds of millions of reads, and I don't mind rerunning the whole thing from scratch to test whether --normalize_reads will do the trick.

If not, then I will try --min_kmer_cov 2. I'll keep you posted.

Also, I think I have a fix for the Farmit module that uses job dependencies in the case of SGE, instead of sleeping for a fixed amount of time. I'll let you know when we finish testing it.
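
The basic idea is to let SGE handle the waiting itself through qsub's -hold_jid option instead of polling: collect the job ids of the submitted worker jobs and then submit a sentinel job that only starts once all of them have finished, roughly like this (the job ids and the all_done_sentinel.sh script are just placeholders):

qsub -V -cwd -hold_jid 12345,12346,12347 all_done_sentinel.sh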

Thanks!

Saludos!

Dr. Alejandro Sanchez-Flores
Jefe de Unidad / Head of Core Facility
Unidad de Secuenciacion Masiva y Bioinformatica

Brian Haas

Sep 19, 2016, 8:48:58 PM
to Alejandro Sanchez, trinityrn...@googlegroups.com
Hi Alejandro,

If you have more than 100M reads, I generally recommend using normalization just to avoid these sorts of performance problems.  To assemble 1 billion reads, it's an absolute necessity.

I'm glad to hear the SGE integration is working out.

best,

~b
--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

 