Hello,
I am having a problem running Trinity. First, some background. I am running Trinity on a large Illumina RNA-seq dataset on a single node of our cluster (80 cores, 1 TB RAM). The dataset was originally 2.5 billion paired-end reads, but after normalization it is ~130M paired-end reads. I did the normalization in two steps: first I split the dataset into 5 pieces and normalized each piece (using the --prep option), then I combined the normalized reads from that round and ran the normalization again.
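For readers who want to picture the split step, here is a minimal Python sketch of dividing paired-end reads into pieces while keeping mates in sync. It is not the exact procedure used for this dataset. It assumes uncompressed, well-formed FASTQ with 4 lines per record and the same read order in both files. The function name, the round-robin assignment, and the output naming scheme are all hypothetical.

```python
import itertools


def split_fastq_pairs(left_path, right_path, n_chunks, prefix):
    """Distribute read pairs round-robin into n_chunks pairs of FASTQ files.

    Output files are named <prefix>.<i>_1.fq and <prefix>.<i>_2.fq
    (a hypothetical naming scheme). Returns the number of pairs written.
    """
    left_out = [open(f"{prefix}.{i}_1.fq", "w") for i in range(n_chunks)]
    right_out = [open(f"{prefix}.{i}_2.fq", "w") for i in range(n_chunks)]
    try:
        with open(left_path) as left, open(right_path) as right:
            for pair_index in itertools.count():
                # One FASTQ record is 4 lines: header, sequence, '+', qualities.
                left_rec = [left.readline() for _ in range(4)]
                right_rec = [right.readline() for _ in range(4)]
                if not left_rec[0] and not right_rec[0]:
                    break  # both files ended together
                if not left_rec[0] or not right_rec[0]:
                    raise ValueError("left and right files differ in read count")
                # Mates always go to the same chunk so the pairing is preserved.
                chunk = pair_index % n_chunks
                left_out[chunk].writelines(left_rec)
                right_out[chunk].writelines(right_rec)
    finally:
        for handle in left_out + right_out:
            handle.close()
    return pair_index
```

Calling `split_fastq_pairs("reads_1.fq", "reads_2.fq", 5, "piece")` (hypothetical file names) would produce five left/right file pairs that could each be normalized separately. For billions of reads, a compiled tool would be far faster than this sketch.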
Since the dataset was still quite large after normalization, I tested the --grid_config option on a small subset of the original reads, requesting a small number of cores on our cluster. That run worked fine, so I scaled up the number of cores I was requesting (to 500 cores with 4M memory for each request) and ran it with the large dataset. The first problem occurred during the phase where Trinity was farming out jobs to the cluster. According to the standard output, it had 500 nodes in use most of the time. However, when I checked which jobs I had running, those jobs did not show up. The job submission history suggested that some jobs had been submitted and had finished, but that no new jobs were being submitted, and Trinity was not detecting this. This led to the following error at the end of this phase:
CMDS: 574080 / 574271 [287/500 nodes in use]
CMDS: 574120 / 574271 [288/500 nodes in use]
CMDS: 574160 / 574271 [289/500 nodes in use]
CMDS: 574200 / 574271 [290/500 nodes in use]
CMDS: 574240 / 574271 [291/500 nodes in use]
CMDS: 574271 / 574271 [292/500 nodes in use]
* All cmds submitted to grid. Now waiting for them to finish.
CMDS: 574271 / 574271 [114/500 nodes in use]
CMDS: 574271 / 574271 [13/500 nodes in use]
CMDS: 574271 / 574271 [2/500 nodes in use]
CMDS: 574271 / 574271 [1/500 nodes in use]
CMDS: 574271 / 574271 [0/500 nodes in use]
* All nodes completed. Now auditing job completion status values
574271 commands failed during grid computing.
-failed commands written to: recursive_trinity.cmds.htc_cache_success.__failures
Trying to run them using parafly...
Failures encountered:
num_success: 0 num_fail: 574271 num_unknown: 0
Finished.
Number of Commands: 574271
succeeded(1) 0.000174134% completed.
succeeded(2) 0.000348268% completed.
succeeded(3) 0.000522401% completed.
succeeded(4) 0.000696535% completed.
It then tried to rerun the failed commands with "parafly", but encountered errors there as well. Below is a snapshot of one of the errors:
succeeded(152688), failed(2) 26.5814% completed.
succeeded(152689), failed(2) 26.5816% completed.
succeeded(152690), failed(2) 26.5818% completed.
succeeded(152691), failed(2) 26.5819% completed.
succeeded(152693), failed(2) 26.5823% completed.
succeeded(152693), failed(2) 26.5823% completed. Error, cannot rename Trinity.fasta.tmp to /scratch/musser/plat_rnaseq_data/trinity_full/2015_nov_11/trinity_2015_nov_11_full1_out_dir/read_partitions/Fb_1/CBin_1526/c152689.trinity.reads.fa.out.Trinity.fasta at /home/musser/software/trinityrnaseq-2.0.6/util/support_scripts/../../Trinity line 1114.
succeeded(152693), failed(3) 26.5825% completed.
succeeded(152694), failed(3) 26.5826% completed.
succeeded(152695), failed(3) 26.5828% completed.
succeeded(152696), failed(3) 26.583% completed.
succeeded(152697), failed(3) 26.5832% completed.
succeeded(152698), failed(3) 26.5833% completed.
succeeded(152699), failed(3) 26.5835% completed.
Here is a snapshot of the end of the standard output of the run:
succeeded(574265), failed(3) 99.9995% completed.
succeeded(574266), failed(3) 99.9997% completed.
succeeded(574267), failed(3) 99.9998% completed.
succeeded(574268), failed(3) 100% completed.
We are sorry, commands in file: [recursive_trinity.cmds.htc_cache_success.__failures.FAILED_DURING_PARAFLY] failed. :-(
Trinity run failed. Must investigate error above.
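Because the error messages are interleaved with the progress lines, they are easy to miss in a log of this size. Below is a minimal Python sketch for pulling them out of saved output. It assumes the progress lines look exactly like the ones pasted above. The function names and the shortened path in the demo are hypothetical, not part of Trinity or ParaFly.

```python
import re

# Progress lines look like "succeeded(152693), failed(2) 26.5823% completed.",
# sometimes followed on the same line by an error message.
PROGRESS = re.compile(
    r"succeeded\((\d+)\)(?:,\s*failed\((\d+)\))?\s+([\d.]+)% completed\.\s*(.*)"
)
# Captures the destination path from a "cannot rename X to Y at ..." message.
RENAME_TARGET = re.compile(r"cannot rename \S+ to (\S+) at ")


def summarize(lines):
    """Return (last succeeded count, last failed count, list of error messages)."""
    succeeded = failed = 0
    errors = []
    for line in lines:
        match = PROGRESS.search(line)
        if match is None:
            continue  # not a progress line
        succeeded = int(match.group(1))
        failed = int(match.group(2) or 0)
        trailing = match.group(4).strip()
        if trailing.startswith("Error"):
            errors.append(trailing)
    return succeeded, failed, errors


def rename_targets(errors):
    """Extract the output file each failed rename was trying to create."""
    targets = []
    for message in errors:
        match = RENAME_TARGET.search(message)
        if match:
            targets.append(match.group(1))
    return targets


# Demo on lines modeled on the snapshot above (path shortened to a placeholder).
demo = [
    "succeeded(152693), failed(2) 26.5823% completed.",
    "succeeded(152693), failed(2) 26.5823% completed. Error, cannot rename "
    "Trinity.fasta.tmp to /path/to/out_dir/read_partitions/Fb_1/CBin_1526/"
    "c152689.trinity.reads.fa.out.Trinity.fasta at Trinity line 1114.",
    "succeeded(152693), failed(3) 26.5825% completed.",
]
ok, bad, errs = summarize(demo)
print(ok, bad, rename_targets(errs))
```

Run over the full saved output (for example `summarize(open("trinity_stdout.log"))`, where the file name is hypothetical), this would list every affected read partition, which makes it easier to inspect those directories individually.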
As I mentioned, I first got this error during the run where I was using the --grid_config option. I then tried rerunning Trinity, pointing it at the same output directory but without the --grid_config option. The run picked up at the start of phase 2 but still had several failed commands, which led to Trinity exiting at the end of phase 2. The errors were of the same type ("Error, cannot rename Trinity.fasta.tmp....") but there were 6 failures in total instead of 3:
succeeded(574261), failed(6) 99.9993% completed.
succeeded(574262), failed(6) 99.9995% completed.
succeeded(574263), failed(6) 99.9997% completed.
succeeded(574264), failed(6) 99.9998% completed.
succeeded(574265), failed(6) 100% completed.
We are sorry, commands in file: [FailedCommands] failed. :-(
Trinity run failed. Must investigate error above.
I then tried a run starting from scratch (i.e. with a different output directory) and without the --grid_config option. Again it had problems during phase 2 and quit. This time there were 9 failures instead of 6 (all "Error, cannot rename Trinity.fasta.tmp..."), and again Trinity exited at the end of the phase:
succeeded(574413), failed(9) 99.9997% completed.
succeeded(574414), failed(9) 99.9998% completed.
succeeded(574415), failed(9) 100% completed.
We are sorry, commands in file: [FailedCommands] failed. :-(
Trinity run failed. Must investigate error above.
Can you please offer some advice on how to solve this problem?