Hi Julian and Kalpi,
I have also noticed similar "stalled" cstacks behavior when running on my university's cluster. When I check the denovo_map.log file while the job is running, it contains only the initial execution line of cstacks. The job will usually run for several more days without printing anything to the log file. Sometimes these jobs finish, but usually they run out of time and are killed, so my solution has been to just request more time, often 7+ days.
It was actually this issue that brought me to the forum today where I stumbled upon this post...
Let me add a few more details about the study:
The cluster is currently running Stacks v1.05.
The Data:
37 individuals, 2 populations (haploid moss gametophytes)
Each individual contains ~700,000 single-end reads after stringent filtering.
denovo_map.pl -m 2 -M 1 -N 2 -n 3 -t -T 5 -b 7 -B alaska_radtags .....the rest is probably unimportant
In the most recent run of this data set I asked for 5 processors and a total of 50 GB of RAM, with a run time of 6 days.
I routinely check the denovo_map.log file, and ustacks will often run for around 16 hours before cstacks starts. This is where the "stall" occurs: the following is usually the last thing I see in the denovo_map.log file before the job is killed, often 5 days later.
/apps/stacks/1.05/bin//cstacks -b 7 -o ./output -s ./output/alaska_cyano_pluro_20 <<long list of samples here>> -n 3 -p 5 2>&1
So I am wondering whether others have noticed this as well, and whether this is normal behavior.
I have had several runs with fewer individuals that completed in less time, and in those runs the denovo_map.log file contains cstacks output for each sample. What I am not sure of is whether this output is written while cstacks is working or only after cstacks has finished. I am assuming it is designed to write to the log file as it adds each individual to the catalog (as this is how ustacks writes to the log file), but I wanted to ask around. Additionally, in the runs that do not finish, no batch_X.catalog.X file is ever created in the output.
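In case it is useful to anyone seeing the same thing, here is a rough Python sketch of the check I have been doing by hand: how long it has been since the log was last written to, and whether any catalog file has appeared yet. The function name stall_report is made up for illustration, and I am assuming that denovo_map.log sits in the output directory and that catalog files start with batch_<id>.catalog. (with batch 7 as in my run), so adjust as needed.

```python
import glob
import os
import time


def stall_report(output_dir, batch_id, log_name="denovo_map.log", now=None):
    """Report how stale the log is and which catalog files (if any) exist.

    Assumptions (not confirmed against the Stacks docs): the log lives in
    output_dir, and catalog files are named batch_<batch_id>.catalog.<something>.
    """
    now = time.time() if now is None else now
    log_path = os.path.join(output_dir, log_name)

    # Hours since the log was last modified; None if the log is not there.
    if os.path.exists(log_path):
        hours_idle = (now - os.path.getmtime(log_path)) / 3600.0
    else:
        hours_idle = None

    # Anything matching batch_<id>.catalog.* is treated as catalog output.
    pattern = os.path.join(output_dir, "batch_{}.catalog.*".format(batch_id))
    catalog_files = sorted(glob.glob(pattern))

    return {"hours_idle": hours_idle, "catalog_files": catalog_files}


if __name__ == "__main__":
    # Paths and batch ID taken from my run; change them for yours.
    print(stall_report("./output", 7))
```

This only tells me that nothing has been written, not why, so it does not answer the question of whether cstacks is actually making progress.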
I ask in part to help inform how much time I allocate for each run. I have some rather large sample sets we plan on running in the near future, and short of trial and error I have no way of knowing how far cstacks has progressed in a given amount of time. In most runs cstacks seems to be the portion of the pipeline that takes the longest to complete.
Julian, in your recent stickleback paper your data set was vastly larger; how did you handle the catalog creation? Can you thread your way out of this issue, i.e., is cstacks a portion of the pipeline that can use multi-threading?
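On the threading question, I did notice that the cstacks line in my log carries -p 5, which I assume (but have not confirmed) is the -T 5 I gave denovo_map.pl being passed along. Below is a small Python sketch for pulling the sample count and the -p value out of a logged cstacks line. The helper summarize_cstacks_line is my own invention, it assumes each sample is given with its own -s flag, and the example line is a shortened stand-in with made-up sample names, not my real log.

```python
import shlex


def summarize_cstacks_line(line):
    """Count -s sample arguments and read the -p value from a cstacks command line."""
    tokens = shlex.split(line)
    samples = []
    threads = None
    i = 0
    # Walk the tokens, consuming a flag and its value together.
    while i < len(tokens) - 1:
        if tokens[i] == "-s":
            samples.append(tokens[i + 1])
            i += 2
        elif tokens[i] == "-p":
            threads = int(tokens[i + 1])
            i += 2
        else:
            i += 1
    return {"n_samples": len(samples), "threads": threads}


# Shortened, hypothetical stand-in for the real log line (sample names are made up).
example = (
    "/apps/stacks/1.05/bin//cstacks -b 7 -o ./output "
    "-s ./output/sample_A -s ./output/sample_B -n 3 -p 5 2>&1"
)
print(summarize_cstacks_line(example))  # {'n_samples': 2, 'threads': 5}
```

Seeing -p 5 on the command line only shows that the option was passed; whether cstacks actually makes use of the extra threads while building the catalog is exactly what I am hoping someone can tell me.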
Thanks a lot in advance for any input!
-Adam