Commands failing in trinity phase 2

Jacob Musser

Nov 16, 2015, 7:41:19 AM
to trinityrnaseq-users
Hello,

I am having a problem running Trinity. First, some background. I am running Trinity on a large Illumina RNA-seq dataset on a single node of our cluster (80 cores with 1 TB RAM). The dataset was originally 2.5 billion paired-end reads, but after normalization it is ~130M paired-end reads. I did the normalization in two steps: I first split the dataset into 5 pieces and normalized each piece (using the --prep option), then combined the normalized reads from that round and ran the normalization again.

Since the dataset was still quite large after normalization, I tested the --grid_conf option on a small subset of the original reads, requesting a small number of cores on our cluster. That run worked fine, so I scaled up the number of cores I was requesting (to 500 cores, with 4M memory for each request) and ran it on the large dataset. The first problem occurred during the phase where Trinity farms out jobs to the cluster. According to the standard output, it had 500 nodes in use most of the time. However, when I checked which jobs I had running, those jobs did not appear. The job submission history suggested that some jobs had been submitted and had finished, that no new jobs had been submitted after them, and that Trinity was not detecting this. This led to the following error at the end of this phase, according to Trinity:

 CMDS: 574080 / 574271  [287/500 nodes in use]   
  CMDS: 574120 / 574271  [288/500 nodes in use]   
  CMDS: 574160 / 574271  [289/500 nodes in use]   
  CMDS: 574200 / 574271  [290/500 nodes in use]   
  CMDS: 574240 / 574271  [291/500 nodes in use]   
  CMDS: 574271 / 574271  [292/500 nodes in use]   
* All cmds submitted to grid.  Now waiting for them to finish.

  CMDS: 574271 / 574271  [114/500 nodes in use]   
  CMDS: 574271 / 574271  [13/500 nodes in use]   
  CMDS: 574271 / 574271  [2/500 nodes in use]   
  CMDS: 574271 / 574271  [1/500 nodes in use]   
  CMDS: 574271 / 574271  [0/500 nodes in use]   
* All nodes completed.  Now auditing job completion status values
574271 commands failed during grid computing.
-failed commands written to: recursive_trinity.cmds.htc_cache_success.__failures



Trying to run them using parafly...

Failures encountered:
num_success: 0 num_fail: 574271 num_unknown: 0
Finished.

Number of Commands: 574271

succeeded(1)   0.000174134% completed.    
succeeded(2)   0.000348268% completed.    
succeeded(3)   0.000522401% completed.    
succeeded(4)   0.000696535% completed.    


It then tried to run the failed commands with ParaFly, but encountered some errors there as well. Below is a snapshot of one of the errors:


succeeded(152688), failed(2)   26.5814% completed.    
succeeded(152689), failed(2)   26.5816% completed.    
succeeded(152690), failed(2)   26.5818% completed.    
succeeded(152691), failed(2)   26.5819% completed.    
succeeded(152693), failed(2)   26.5823% completed.    
succeeded(152693), failed(2)   26.5823% completed.    
Error, cannot rename Trinity.fasta.tmp to /scratch/musser/plat_rnaseq_data/trinity_full/2015_nov_11/trinity_2015_nov_11_full1_out_dir/read_partitions/Fb_1/CBin_1526/c152689.trinity.reads.fa.out.Trinity.fasta at /home/musser/software/trinityrnaseq-2.0.6/util/support_scripts/../../Trinity line 1114.

succeeded(152693), failed(3)   26.5825% completed.    
succeeded(152694), failed(3)   26.5826% completed.    
succeeded(152695), failed(3)   26.5828% completed.    
succeeded(152696), failed(3)   26.583% completed.    
succeeded(152697), failed(3)   26.5832% completed.    
succeeded(152698), failed(3)   26.5833% completed.    
succeeded(152699), failed(3)   26.5835% completed.    

Here is a snapshot of the end of the std output of the run:

succeeded(574265), failed(3)   99.9995% completed.    
succeeded(574266), failed(3)   99.9997% completed.    
succeeded(574267), failed(3)   99.9998% completed.    
succeeded(574268), failed(3)   100% completed.    

We are sorry, commands in file: [recursive_trinity.cmds.htc_cache_success.__failures.FAILED_DURING_PARAFLY] failed.  :-( 

Trinity run failed. Must investigate error above.


As I mentioned, I first got this error during the run where I was using the --grid_conf option. I then tried rerunning Trinity, directing it to the same output directory but without the --grid_conf option. The run picked up at the start of phase 2 but still had several failed commands, which led to Trinity exiting at the end of phase 2. The errors were of the same type ("Error, cannot rename Trinity.fasta.tmp...."), but there were 6 failures in total instead of 3:

succeeded(574261), failed(6)   99.9993% completed.    
succeeded(574262), failed(6)   99.9995% completed.    
succeeded(574263), failed(6)   99.9997% completed.    
succeeded(574264), failed(6)   99.9998% completed.    
succeeded(574265), failed(6)   100% completed.    

We are sorry, commands in file: [FailedCommands] failed.  :-( 

Trinity run failed. Must investigate error above.


I then tried a run starting from scratch (i.e., with a different output directory) and without the --grid_conf option. Again it had problems during phase 2 and quit. This time there were 9 failures instead of 6 (all "Error, cannot rename Trinity.fasta.tmp..."), and again Trinity exited at the end of the phase:

succeeded(574413), failed(9)   99.9997% completed.    
succeeded(574414), failed(9)   99.9998% completed.    
succeeded(574415), failed(9)   100% completed.    

We are sorry, commands in file: [FailedCommands] failed.  :-( 

Trinity run failed. Must investigate error above.


Can you please offer some advice about what I can do to solve this problem?





Brian Haas

Nov 21, 2015, 11:11:21 AM
to trinityrnaseq-users
Hi,

It sounds like there are a couple of problems. One is a small number of failed jobs caused by not being able to rename a file. This is generally a rare filesystem glitch that can happen on some systems, especially when the filesystem is being hit hard by many jobs running simultaneously on a compute farm. The other issue is that the --grid_conf parameter is not working well on your system.

For troubleshooting --grid_conf, you might try just running the small sample data set through.  The grid_conf system is based on this:

If you can get that working, then it should be straightforward to get it running within Trinity, but it might require some fiddling.

For the other issue, if just a few jobs are failing during the ParaFly run, you can simply rerun your original command and it should re-execute only those few failed jobs. If they keep failing, you'd need to look at the error messages to figure out why. If it's another filesystem glitch, you can remove the output directories for just those few failed jobs and then rerun your original Trinity command. This will usually resolve the issue.
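To find which jobs the error messages point at, a small script can pull the per-component output paths out of a saved copy of the Trinity output. The following is only a sketch in Python: the function name `failed_component_dirs` is hypothetical, and the path pattern (`read_partitions/.../*.trinity.reads.fa.out`) is an assumption inferred from the error messages quoted in this thread, not something taken from the Trinity source.

```python
import re

# Per-component output prefix under read_partitions/, e.g.
# .../read_partitions/Fb_0/CBin_689/c68967.trinity.reads.fa.out
# (assumption: pattern inferred from the paths quoted in this thread).
COMPONENT_RE = re.compile(r"(/\S*?/read_partitions/\S+?\.trinity\.reads\.fa\.out)")


def failed_component_dirs(log_text):
    """Return the unique component output paths named on 'Error,' lines."""
    found = []
    for line in log_text.splitlines():
        # ParaFly progress text can share a line with the error message,
        # so look for "Error," anywhere in the line, not only at the start.
        idx = line.find("Error,")
        if idx == -1:
            continue
        match = COMPONENT_RE.search(line, idx)
        if match and match.group(1) not in found:
            found.append(match.group(1))
    return found


if __name__ == "__main__":
    sample = (
        "succeeded(0), failed(1)   14.2857% completed.    \n"
        "Error, no fasta file reported as: /Path/To/Data/Trinity_test/"
        "trinity_out_dir/read_partitions/Fb_0/CBin_689/"
        "c68967.trinity.reads.fa.out/chrysalis/Component_bins/Cbin0/"
        "c0.graph.allProbPaths.fasta\n"
    )
    for path in failed_component_dirs(sample):
        print(path)
```

The function only reports paths and does not touch the filesystem, so the list can be checked by eye before anything is removed.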

Finally, if you want to try running Trinity on other systems:

best,

~brian

VSingan

Nov 22, 2015, 5:34:56 PM
to trinityrnaseq-users
Hi Brian,

I have had a similar case where the grid failed on a few jobs. However, with ParaFly I get the following errors, and even if I rerun the commands I end up with the same errors and Trinity fails. Any idea how to work around this issue?

###### WARNING: /Path/To/Data/Trinity_test/trinity_out_dir/read_partitions/Fb_0/CBin_577/c57786.trinity.reads.fa.out/inchworm.K25.L25.DS.fa.clipped.fa already exists, skipping the jaccard-clip step, using already existing output: /Path/To/Data/Trinity_test/trinity_out_dir/read_partitions/Fb_0/CBin_577/c57786.trinity.reads.fa.out/inchworm.K25.L25.DS.fa.clipped.fa

Error, no fasta file reported as: /Path/To/Data/Trinity_test/trinity_out_dir/read_partitions/Fb_0/CBin_689/c68967.trinity.reads.fa.out/chrysalis/Component_bins/Cbin0/c0.graph.allProbPaths.fasta

Trinity run failed. Must investigate error above.

succeeded(0), failed(1)   14.2857% completed.    

Error, no fasta file reported as: /Path/To/Data/Trinity_test/trinity_out_dir/read_partitions/Fb_0/CBin_268/c26824.trinity.reads.fa.out/chrysalis/Component_bins/Cbin0/c0.graph.allProbPaths.fasta

Trinity run failed. Must investigate error above.

Error, no fasta file reported as: /Path/To/Data/Trinity_test/trinity_out_dir/read_partitions/Fb_0/CBin_137/c13720.trinity.reads.fa.out/chrysalis/Component_bins/Cbin0/c0.graph.allProbPaths.fasta

Brian Haas

Nov 23, 2015, 7:37:23 AM
to VSingan, trinityrnaseq-users
Hi

For these few, try removing their output directories, i.e.:
/Path/To/Data/Trinity_test/trinity_out_dir/read_partitions/Fb_0/CBin_689/c68967.trinity.reads.fa.out/

And then try rerunning the original Trinity command.

Let's see how it goes.
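
If there are more than a handful of these directories, the removal can be scripted with a guard so that only component output directories get deleted. This is a rough sketch in Python, not part of Trinity: `remove_component_dirs` is a hypothetical helper, and the checks on the directory name (`.trinity.reads.fa.out` under `read_partitions`) are assumptions based on the paths shown in this thread. It defaults to a dry run that deletes nothing.

```python
import shutil
from pathlib import Path


def remove_component_dirs(paths, dry_run=True):
    """Delete per-component output directories; return the ones handled."""
    handled = []
    for raw in paths:
        path = Path(raw)
        # Guard: only touch directories that look like phase 2 component
        # outputs under read_partitions/ (naming is an assumption taken
        # from the error messages quoted in this thread).
        if not path.name.endswith(".trinity.reads.fa.out"):
            continue
        if "read_partitions" not in path.parts:
            continue
        if not path.is_dir():
            continue
        if not dry_run:
            shutil.rmtree(path)
        handled.append(str(path))
    return handled


if __name__ == "__main__":
    candidates = [
        "/Path/To/Data/Trinity_test/trinity_out_dir/read_partitions/"
        "Fb_0/CBin_689/c68967.trinity.reads.fa.out/",
    ]
    # Dry run: lists the directories that exist and would be removed.
    for path in remove_component_dirs(candidates, dry_run=True):
        print("would remove:", path)
```

After checking the dry-run list, call it again with `dry_run=False` and then rerun the original Trinity command.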

Best

-Brian
(by iPhone)

