Hi Dylan and Julian,
I have the same issue. I have run STACKS before, with the entire pipeline completing in a matter of hours. Now ustacks takes days to get through a single sample, and while my data are larger now (1 089 793 663 to 1 498 679 004 bytes per .fq file), I am wondering whether that is the only problem.
For context, I am trying to optimize the de novo pipeline over M = 1-10 on a subset of 14 samples. I am running each program of the pipeline individually rather than through the pipeline wrapper, because the wrapper has been running so slowly. I am working with STACKS v2.59 on a computing cluster, and have found that the following things help:
1) Obviously, increasing the CPUs and memory that the script requests.
2) Using -p # instead of -t # to run the script on multiple CPUs instead of multiple threads (even though I don't see the -p parameter in the STACKS manual).
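For example, here is how I pair the two settings, excerpted from my full script further below (the -p 16 is matched to the ncpus=16 requested from PBS):

#PBS -l select=1:ncpus=16:mem=32GB
...
ustacks -f ~/msc/00_samples/P01_C03_PP14.1.fq \
    -o ~/msc/05_denovo_pipeline/param_opt_ustacks/denovo.M2 \
    -m 3 -M 2 -p 16 -i 1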
I have tried zipping my files to see whether this improves processing speed, but it has not seemed to make much of a difference. I have also split my samples into four "sub-subsets" of 3-4 samples to run in parallel, having read on these forums that ustacks can be run in separate streams before being combined in cstacks. However, I didn't realize that this only applies to STACKS v2.65 and beyond; since I am working with STACKS v2.59, I'm unable to run cstacks because I have duplicate sample IDs. I will need to rerun ustacks, but am pressed for time. I think I have also read on these forums that the pipeline sometimes does take very long to run, but that this hasn't been reproducible. I have included an example script below, along with the names of three of my sample files.
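To make the sub-subsets, I simply chopped my full sample list into chunks; a minimal sketch of the idea (the file name all_samples.txt is illustrative, as I actually wrote the subset*.txt lists by hand):

# Split the 14-name sample list into files of up to 4 names each
# (produces subset_aa, subset_ab, subset_ac, subset_ad)
split -l 4 ~/msc/00_info/all_samples.txt ~/msc/00_info/subset_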
Moving forward, I will increase my CPUs and memory even more, and it looks like I could overcome the ID problem by specifying id=# in my scripts. For example, if I still want to run four sub-subsets in parallel (samples 1-3, 4-6, 7-10, 11-14), I could specify id=1, id=4, id=7, and id=11 in the respective scripts, to "pick up" where I left off with ID numbers, so to speak (see my script below, and the sketch after it).
I have included some job info below, in case it helps. I tried to attach three of my .1.fq files as examples, but even a single zipped file exceeds the 24 MB attachment limit; I would be happy to get them to you another way. Please let me know if there is anything else I can provide, and thank you in advance for your help. Julian, your promptness and thoroughness in replying to this forum are really amazing, and they are among the reasons I have been able to apply STACKS as a young researcher. Thank you.
My scripts look something like this (this example runs a sub-subset of 3 samples at M=2; it has been running for ~93 hours and has almost completed all three samples):
Script Name: 05_denovo_param_opt_ustacks_M2_4-6
#!/bin/bash
#PBS -N CHAR_M2_4-6_param_opt_ustacks
#PBS -l select=1:ncpus=16:mem=32GB
#PBS -l walltime=168:00:00
#PBS -m e
#PBS -M 2540...@sun.ac.za

cd $PBS_O_WORKDIR
module load app/stacks/2.59

# Run ustacks on each sample in this sub-subset, incrementing the
# sample ID after each run. Every sub-subset script starts at id=1,
# which is what produced the duplicate IDs across sub-subsets.
id=1
for i in $(cat ~/msc/00_info/subset4-6.txt)
do
    ustacks -f ~/msc/00_samples/$i".1.fq" \
        -o ~/msc/05_denovo_pipeline/param_opt_ustacks/denovo.M2 \
        -m 3 -M 2 -p 16 -i $id
    let "id+=1"
done
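To fix the ID problem without hard-coding a counter into each script, I could also use one generic script and pass the subset file and starting ID in at submission time (passing variables with qsub -v is standard PBS); a minimal sketch, assuming my same directory layout:

# Submit as, e.g.: qsub -v LIST=subset4-6.txt,START_ID=4 05_denovo_param_opt_ustacks_M2
# (subset 1-3 -> START_ID=1, subset 7-10 -> START_ID=7, subset 11-14 -> START_ID=11)
id=$START_ID
for i in $(cat ~/msc/00_info/"$LIST")
do
    ustacks -f ~/msc/00_samples/$i".1.fq" \
        -o ~/msc/05_denovo_pipeline/param_opt_ustacks/denovo.M2 \
        -m 3 -M 2 -p 16 -i $id
    let "id+=1"
done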
Contents of subset4-6.txt
P01_C03_PP14
P01_C09_CC22
P01_D03_CC29
Example sample file names in 00_samples
(In some scripts I tried using zipped versions of these files instead; see the sketch after this list.)
P01_C03_PP14.1.fq
P01_C03_PP14.2.fq
P01_C03_PP14.rem.1.fq
P01_C03_PP14.rem.2.fq
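For the zipped runs, I compressed the forward reads and pointed ustacks at the .fq.gz files; as far as I can tell, ustacks guesses the input type from the extension (it can also be forced with -t gzfastq). Roughly:

# Compress a sample's forward reads, keeping the original (-k needs gzip >= 1.6)
gzip -k ~/msc/00_samples/P01_C03_PP14.1.fq
# Then, inside the loop, point ustacks at the compressed file:
ustacks -f ~/msc/00_samples/$i".1.fq.gz" \
    -o ~/msc/05_denovo_pipeline/param_opt_ustacks/denovo.M2 \
    -m 3 -M 2 -p 16 -i $id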
Job info (qstat -fx 228251)
Job Id: 228251.hpc2.hpc
Job_Name = CHAR_M2_4-6_param_opt_ustacks
Job_Owner = 2540...@hpc2.hpc
resources_used.cpupercent = 1598
resources_used.cput = 1179:57:28
resources_used.mem = 305710988kb
resources_used.ncpus = 16
resources_used.vmem = 327405272kb
resources_used.walltime = 93:43:13
job_state = R
queue = week
server = hpc2.hpc
Checkpoint = u
ctime = Mon Aug 26 11:32:24 2024
Error_Path = hpc2.hpc:/home/25405942/msc/00_scripts/CHAR_M2_4-6_param_opt_ustacks.e228251
exec_host = n08.hpc/2*16
exec_vnode = (n08.hpc:ncpus=16:mem=33554432kb)
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = e
Mail_Users = 2540...@sun.ac.za
mtime = Fri Aug 30 09:16:21 2024
Output_Path = hpc2.hpc:/home/25405942/msc/00_scripts/CHAR_M2_4-6_param_opt_ustacks.o228251
Priority = 0
qtime = Mon Aug 26 11:32:24 2024
Rerunable = True
Resource_List.mem = 32gb
Resource_List.mpiprocs = 1
Resource_List.ncpus = 16
Resource_List.nodect = 1
Resource_List.place = free
Resource_List.select = 1:ncpus=16:mem=32GB
Resource_List.walltime = 168:00:00
stime = Mon Aug 26 11:32:54 2024
session_id = 3857815
jobdir = /home/25405942
substate = 42
Variable_List = PBS_O_HOME=/home/25405942,PBS_O_LANG=en_ZA.UTF-8,PBS_O_LOGNAME=25405942,PBS_O_PATH=/opt/pbs/default/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/puppetlabs/bin,PBS_O_MAIL=/var/spool/mail/25405942,PBS_O_SHELL=/bin/bash,PBS_O_WORKDIR=/home/25405942/msc/00_scripts,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=workq,PBS_O_HOST=hpc2.hpc
comment = Job run at Mon Aug 26 at 11:32 on (n08.hpc:ncpus=16:mem=33554432kb)
etime = Mon Aug 26 11:32:24 2024
run_count = 1
Submit_arguments = 05_denovo_param_opt_ustacks_M2_4-6
project = _pbs_project_default
Submit_Host = hpc2.hpc
Job -e file (qpeek -e 228251)
Assembling stacks (max. dist. M=2)...
Assembled 307219 stacks into 280282; blacklisted 548 stacks.
Coverage after assembling stacks: mean=6.23; stdev=3.76; max=75; n_reads=1740412(51.0%)
Merging secondary stacks (max. dist. N=4 from consensus)...
Merged 560150 out of 1481146 secondary reads (37.8%), 38895 merged with gapped alignments.
Coverage after merging secondary stacks: mean=8.24; stdev=5.23; max=406; n_reads=2300562(67.5%)
Assembling stacks, allowing for gaps (min. match length 80.0%)...
Assembled 280282 stacks into 278000 stacks.
Coverage after gapped assembly: mean=8.30; stdev=5.34; max=406; n_reads=2300562(67.5%)
Merging secondary stacks, allowing for gaps (min. match length 80.0%)...
Merged 70509 out of 920996 secondary reads (7.7%).
Coverage after merging gapped secondary stacks: mean=8.56; stdev=6.21; max=659; n_reads=2371071(69.5%)
Final coverage: mean=8.56; stdev=6.21; max=659; n_reads=2371071(69.5%)
Calling consensus sequences and haplotypes for catalog assembly...
Writing tags, SNPs, and alleles files...
Refetching read IDs...done.
ustacks is done.
ustacks parameters selected:
Input file: '/home/25405942/msc/00_samples/P01_C09_CC22.1.fq'
Sample ID: 2
Min depth of coverage to create a stack (m): 3
Repeat removal algorithm: enabled
Max distance allowed between stacks (M): 2
Max distance allowed to align secondary reads: 4
Max number of stacks allowed per de novo locus: 3
Deleveraging algorithm: disabled
Gapped assembly: enabled
Minimum alignment length: 0.8
Model type: SNP
Alpha significance level for model: 0.05
Loading RAD-Tags...1M...2M...3M...
Loaded 3343356 reads; formed:
302318 stacks representing 1899454 primary reads (56.8%)
1263871 secondary stacks representing 1443902 secondary reads (43.2%)
Stack coverage: mean=6.28; stdev=19.79; max=5688; n_reads=1899454(56.8%)
Removing repetitive stacks: cov > 66 (mean+3*stdev)...
Blacklisted 6833 stacks.
Coverage after repeat removal: mean=5.74; stdev=3.66; max=66; n_reads=1695983(50.7%)
Assembling stacks (max. dist. M=2)...
Assembled 295485 stacks into 266080; blacklisted 814 stacks.
Coverage after assembling stacks: mean=6.08; stdev=3.89; max=128; n_reads=1610302(48.2%)
Merging secondary stacks (max. dist. N=4 from consensus)...
Merged 516599 out of 1443902 secondary reads (35.8%), 35691 merged with gapped alignments.
Coverage after merging secondary stacks: mean=8.03; stdev=5.75; max=443; n_reads=2126901(63.6%)
Assembling stacks, allowing for gaps (min. match length 80.0%)...
Assembled 266080 stacks into 263966 stacks.
Coverage after gapped assembly: mean=8.09; stdev=5.88; max=512; n_reads=2126901(63.6%)
Merging secondary stacks, allowing for gaps (min. match length 80.0%)...
Merged 69139 out of 927303 secondary reads (7.5%).
Coverage after merging gapped secondary stacks: mean=8.36; stdev=6.99; max=591; n_reads=2196040(65.7%)
Final coverage: mean=8.36; stdev=6.99; max=591; n_reads=2196040(65.7%)
Calling consensus sequences and haplotypes for catalog assembly...
Writing tags, SNPs, and alleles files...
Refetching read IDs...done.
ustacks is done.
ustacks parameters selected:
Input file: '/home/25405942/msc/00_samples/P01_D03_CC29.1.fq'
Sample ID: 3
Min depth of coverage to create a stack (m): 3
Repeat removal algorithm: enabled
Max distance allowed between stacks (M): 2
Max distance allowed to align secondary reads: 4
Max number of stacks allowed per de novo locus: 3
Deleveraging algorithm: disabled
Gapped assembly: enabled
Minimum alignment length: 0.8
Model type: SNP
Alpha significance level for model: 0.05
Loading RAD-Tags...1M...2M...3M...
Loaded 3429628 reads; formed:
311143 stacks representing 1929693 primary reads (56.3%)
1314655 secondary stacks representing 1499935 secondary reads (43.7%)
Stack coverage: mean=6.20; stdev=13.98; max=2615; n_reads=1929693(56.3%)
Removing repetitive stacks: cov > 49 (mean+3*stdev)...
Blacklisted 6164 stacks.
Coverage after repeat removal: mean=5.79; stdev=3.50; max=49; n_reads=1764637(51.5%)
Assembling stacks (max. dist. M=2)...
Assembled 304979 stacks into 276378; blacklisted 696 stacks.
Coverage after assembling stacks: mean=6.20; stdev=3.83; max=88; n_reads=1705201(49.7%)
Merging secondary stacks (max. dist. N=4 from consensus)...
Merged 559386 out of 1499935 secondary reads (37.3%), 37375 merged with gapped alignments.
Coverage after merging secondary stacks: mean=8.23; stdev=5.51; max=381; n_reads=2264587(66.0%)
Assembling stacks, allowing for gaps (min. match length 80.0%)...