A few months ago I posted about an issue I had running Trinity with Jaccard clip. I'm now running into the same problem again, even though I'm not using that parameter this time. I'm using SGE, by the way...
--------------------------------------------------------------------------------
------------ Trinity Phase 2: Assembling Clusters of Reads ---------------------
--------------------------------------------------------------------------------
$VAR1 = {
'memory' => '16G',
'grid' => 'SGE',
'cmd' => 'qsub -V -cwd',
'max_nodes' => '300',
'cmds_per_node' => '20'
};
-note, 153707 commands already completed successfully. Skipping them here.
-there are 213 cmds left to run here.
CMDS: 213 / 213 [11/300 nodes in use]
* All cmds submitted to grid. Now waiting for them to finish.
CMDS: 213 / 213 [7/300 nodes in use]
-no record of job_id 11876, setting as state unknown
-no record of job_id 11873, setting as state unknown
-no record of job_id 11871, setting as state unknown
-no record of job_id 11874, setting as state unknown
-no record of job_id 11869, setting as state unknown
-no record of job_id 11868, setting as state unknown
CMDS: 213 / 213 [0/300 nodes in use]
* All nodes completed. Now auditing job completion status values
Failures encountered:
num_success: 83 num_fail: 61 num_unknown: 69
Finished.
130 commands failed during grid computing.
-failed commands written to: recursive_trinity.cmds.htc_cache_success.__failures
As you can see, once all the commands are submitted, at some point the program decides to check the jobs and reports some of them as being in an unknown state. In particular, those jobs are actually still running!
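For reference, something like this quick loop (using the job IDs from the log above) is what shows me the scheduler still knows about them; qstat -j exits non-zero once a job has left the queue:

for jid in 11868 11869 11871 11873 11874 11876; do
    if qstat -j "$jid" > /dev/null 2>&1; then
        echo "job $jid is still known to SGE (pending or running)"
    else
        echo "job $jid has left the queue"
    fi
done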
The pipeline then tries to rerun them using ParaFly:
Number of Commands: 130
-----------------------------------------
--- Chrysalis: GraphFromFasta -----------
-- (cluster related inchworm contigs) ---
-----------------------------------------
ERROR, cannot open file: single.fa
Trinity run failed. Must investigate error above.
succeeded(0), failed(1) 0.769231% completed.
-----------------------------------------
--- Chrysalis: GraphFromFasta -----------
-- (cluster related inchworm contigs) ---
-----------------------------------------
ERROR, cannot open file: single.fa
Trinity run failed. Must investigate error above.
succeeded(0), failed(2) 1.53846% completed.
-----------------------------------------
--- Chrysalis: GraphFromFasta -----------
-- (cluster related inchworm contigs) ---
-----------------------------------------
ERROR, cannot open file: single.fa
Trinity run failed. Must investigate error above.
succeeded(0), failed(3) 2.30769% completed.
...
succeeded(1), failed(12) 10% completed.
So there are a bunch of ERROR messages like those above, and it keeps going like that until it reaches 100%.
- Which script checks whether a job has already finished, and how is that check done? Would it be wise to use job dependencies when submitting the jobs (see the sketch after these questions)?
- My single.fa file is in the output directory, but I'm not sure where Chrysalis expects to find it.
- Should I pack fewer commands per job? Currently I use 20, as you can see in the conf file parameters at the very beginning of my post. I don't want to hammer the queue system, and I thought 20 would be a sensible number.
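To make the job-dependency question concrete, this is the kind of arrangement I have in mind, sketched with plain qsub rather than Trinity's own runner (the script names are just placeholders):

jids=""
for script in cmd_batch_*.sh; do            # placeholder name for each packed batch of commands
    jid=$(qsub -terse -V -cwd "$script")    # -terse makes qsub print only the job ID
    jids="${jids:+$jids,}$jid"
done
# the audit step would be held until every batch above has left the queue, finished or failed
qsub -V -cwd -hold_jid "$jids" audit_results.sh

That way the completion audit could never start while worker jobs are still running.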
At some point Brian Haas advised just deleting the output directories of the failed commands and rerunning the Trinity script. I will try that, but I would like to find out the real reason for these failures. My guess is that some jobs are still running and the pipeline doesn't wait for them, but I haven't found what controls that. Is it hpc_conf?
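In the meantime, this is the sort of post-mortem I plan to run on the IDs that came back as failed or unknown, to see whether they exited non-zero or were killed by the scheduler (qacct only reports jobs that have already finished; 11868 is one of the IDs from the log above):

qacct -j 11868 | grep -E 'jobname|failed|exit_status|maxvmem'

If maxvmem turns out to be close to the 16G in my conf, that would point to memory rather than to the job-monitoring logic.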