Slow-running ustacks and cstacks

Dylan Wichman

Aug 14, 2024, 5:28:49 PM
to Stacks
Hey everyone,

I'm running into some severe delays where my samples are taking an extremely long time to process in STACKS and I'm not sure why. For reference, I used a 3RAD protocol generating ~542,000,000 reads across 86 individuals. Samples were demultiplexed and RADtags were processed by the sequencing facility via ipyrad into paired-end FASTQ files. I then renamed the samples from the ipyrad designations (e.g. A1_R1_01, A1_R1_02) to file names recognized by STACKS (e.g. A1.1, A1.2). I then attempted to run denovo_map.pl with a simplified command to generate a quick Structure analysis and start optimizing parameters according to the r80 methodology. Here is the simple script:

stacks denovo_map.pl -T 32 -M 1 -o ./R80Output_Structure/Output1 --paired --popmap ./Structure_popmap.txt --samples ./raw
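
For reference, the renaming was just a small shell loop roughly like the one below (the exact ipyrad suffixes and the .fastq extension are placeholders for whatever the facility actually produced, so treat this as a sketch rather than the exact command I ran):

#!/bin/bash
# Rename ipyrad-style paired files (e.g. A1_R1_01.fastq / A1_R1_02.fastq)
# to the Stacks naming convention (A1.1.fq / A1.2.fq).
for f in *_R1_01.fastq; do
    sample=${f%%_R1_*}                              # A1_R1_01.fastq -> A1
    mv "$f" "${sample}.1.fq"                        # forward reads
    mv "${f%_01.fastq}_02.fastq" "${sample}.2.fq"   # reverse reads
done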

Here are the specs of the workstation, which has successfully run GBS and RADseq data in the past:
CPU: 2x 32-core AMD EPYC 7542
Memory: 515801.0 MiB
Storage: 15.46 TiB (7.6% used)

This machine has taken over a week to finish the ustacks portion and an additional 2+ weeks on the cstacks portion. Is there something I have not optimized that might explain this extremely long delay? Any recommendations, other than moving to an HPC, that might help shorten this time frame, especially as it seems to lengthen with each step?

Charlotte Combrink

Aug 30, 2024, 3:50:51 PM
to Stacks
Hi Dylan and Julian, 

I have the same issue. I have run STACKS before, with the entire pipeline running in a matter of hours. Now it takes days for ustacks to get through a single sample, and while my data are larger now (1 498 679 004 - 1 089 793 663 bytes per .fq file), I am just wondering if that is the only problem.

For context, I am trying to optimize the de novo pipeline for M = 1-10 on a subset of 14 samples. I am running each of the programs of the pipeline individually, rather than the pipeline wrapper, because it has been running so slowly. I am working in STACKS v2.59 on a computing cluster, and have found that the following things help:

1) Obviously, increasing the CPUs and memory that the script requests.
2) Using -p # instead of -t # to run the script on multiple CPUs instead of multiple threads (even though I don't see the -p parameter in the STACKS manual).

I have tried zipping my files to see if this improves processing speed, but it has not seemed to make much of a difference. I have also split my samples to run in parallel in 4 "sub-subsets" of 3-4 samples, having read on these forums that ustacks can be run in separate streams before being combined in cstacks. However, I didn't realize that this only applies to STACKS v2.65 and beyond; since I am working on STACKS v2.59, I'm unable to run cstacks because I have duplicate sample IDs. I will need to rerun ustacks, but am pressed for time. I think I have also read on these forums that the pipeline sometimes takes a very long time to run, but that this hasn't been reproducible. I have included an example script below, along with three of my sample files.

Moving forward, I will increase my CPUs and memory even more, and it looks like I could overcome the ID problem by specifying id=# in my script. For example, if I still want to run four sub-subsets in parallel (samples 1-3, 4-6, 7-10, 11-14), I could specify id=1, id=4, id=7, and id=11 in each script, respectively, to "pick up" where I left off with ID numbers, so to speak (see the script below and the sketch right after this paragraph).
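
As a rough sketch of what I have in mind (the generic script name and the SUBSET/START_ID variables are placeholders I would still need to set up, not something I have tested), a single generic job script could take the subset list and the starting sample ID as arguments, so the four parallel jobs never reuse an ID:

#!/bin/bash
#PBS -l select=1:ncpus=16:mem=32GB
#PBS -l walltime=168:00:00
# Usage: qsub -v SUBSET=subset4-6.txt,START_ID=4 05_denovo_param_opt_ustacks_generic
#   SUBSET   : file listing the samples in this sub-subset (one per line)
#   START_ID : first ustacks sample ID for this sub-subset (1, 4, 7, 11, ...)

cd $PBS_O_WORKDIR
module load app/stacks/2.59

id=$START_ID
for i in $(cat ~/msc/00_info/"$SUBSET")
do
    ustacks -f ~/msc/00_samples/"${i}.1.fq" \
        -o ~/msc/05_denovo_pipeline/param_opt_ustacks/denovo.M2 \
        -m 3 -M 2 -p 16 -i "$id"
    let "id+=1"
done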

I have included some job info below, if that helps. I have tried to attach three of my .1.fq files as examples, but even a single zipped file exceeds the 24 MB attachment limit--I would be happy to send them to you a different way. Please let me know if there is anything else I can provide, and thank you in advance for your help. Julian, your promptness and thoroughness in replying to this forum are really amazing, and one of the reasons I have been able to apply STACKS as a young researcher. Thank you.

My scripts look something like this (example for a subset of 3 samples set to run at M=2, which has been running for ~93 hours and has almost completed all three samples):
Script Name: 05_denovo_param_opt_ustacks_M2_4-6
#!/bin/bash
#PBS -N CHAR_M2_4-6_param_opt_ustacks
#PBS -l select=1:ncpus=16:mem=32GB
#PBS -l walltime=168:00:00
#PBS -m e
#PBS -M 2540...@sun.ac.za

cd $PBS_O_WORKDIR
module load app/stacks/2.59

# sample ID counter (incremented for each sample in this subset)
id=1
for i in $(cat ~/msc/00_info/subset4-6.txt)
do
    ustacks -f ~/msc/00_samples/"${i}.1.fq" \
        -o ~/msc/05_denovo_pipeline/param_opt_ustacks/denovo.M2 \
        -m 3 -M 2 -p 16 -i "$id"
    let "id+=1"
done


subset4-6.txt:
P01_C03_PP14
P01_C09_CC22
P01_D03_CC29


Example sample file names in 00_samples
(In some scripts I tried using zipped versions of these files instead) 
P01_C03_PP14.1.fq
P01_C03_PP14.2.fq
P01_C03_PP14.rem.1.fq
P01_C03_PP14.rem.2.fq

Job info (qstat -fx 228251) 
Job Id: 228251.hpc2.hpc
    Job_Name = CHAR_M2_4-6_param_opt_ustacks
    Job_Owner = 2540...@hpc2.hpc
    resources_used.cpupercent = 1598
    resources_used.cput = 1179:57:28
    resources_used.mem = 305710988kb
    resources_used.ncpus = 16
    resources_used.vmem = 327405272kb
    resources_used.walltime = 93:43:13
    job_state = R
    queue = week
    server = hpc2.hpc
    Checkpoint = u
    ctime = Mon Aug 26 11:32:24 2024
    Error_Path = hpc2.hpc:/home/25405942/msc/00_scripts/CHAR_M2_4-6_param_opt_ustacks.e228251
    exec_host = n08.hpc/2*16
    exec_vnode = (n08.hpc:ncpus=16:mem=33554432kb)
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = e
    Mail_Users = 2540...@sun.ac.za
    mtime = Fri Aug 30 09:16:21 2024
    Output_Path = hpc2.hpc:/home/25405942/msc/00_scripts/CHAR_M2_4-6_param_opt_ustacks.o228251
    Priority = 0
    qtime = Mon Aug 26 11:32:24 2024
    Rerunable = True
    Resource_List.mem = 32gb
    Resource_List.mpiprocs = 1
    Resource_List.ncpus = 16
    Resource_List.nodect = 1
    Resource_List.place = free
    Resource_List.select = 1:ncpus=16:mem=32GB
    Resource_List.walltime = 168:00:00
    stime = Mon Aug 26 11:32:54 2024
    session_id = 3857815
    jobdir = /home/25405942
    substate = 42
    Variable_List = PBS_O_HOME=/home/25405942,PBS_O_LANG=en_ZA.UTF-8,
        PBS_O_LOGNAME=25405942,
        PBS_O_PATH=/opt/pbs/default/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/puppetlabs/bin,
        PBS_O_MAIL=/var/spool/mail/25405942,PBS_O_SHELL=/bin/bash,
        PBS_O_WORKDIR=/home/25405942/msc/00_scripts,PBS_O_SYSTEM=Linux,
        PBS_O_QUEUE=workq,PBS_O_HOST=hpc2.hpc
    comment = Job run at Mon Aug 26 at 11:32 on (n08.hpc:ncpus=16:mem=33554432kb)
    etime = Mon Aug 26 11:32:24 2024
    run_count = 1
    Submit_arguments = 05_denovo_param_opt_ustacks_M2_4-6
    project = _pbs_project_default
    Submit_Host = hpc2.hpc


Job -e file (qpeek -e 228251) 
Assembling stacks (max. dist. M=2)...
  Assembled 307219 stacks into 280282; blacklisted 548 stacks.
Coverage after assembling stacks: mean=6.23; stdev=3.76; max=75; n_reads=1740412(51.0%)

Merging secondary stacks (max. dist. N=4 from consensus)...
  Merged 560150 out of 1481146 secondary reads (37.8%), 38895 merged with gapped alignments.
Coverage after merging secondary stacks: mean=8.24; stdev=5.23; max=406; n_reads=2300562(67.5%)

Assembling stacks, allowing for gaps (min. match length 80.0%)...
  Assembled 280282 stacks into 278000 stacks.
Coverage after gapped assembly: mean=8.30; stdev=5.34; max=406; n_reads=2300562(67.5%)

Merging secondary stacks, allowing for gaps (min. match length 80.0%)...
  Merged 70509 out of 920996 secondary reads (7.7%).
Coverage after merging gapped secondary stacks: mean=8.56; stdev=6.21; max=659; n_reads=2371071(69.5%)

Final coverage: mean=8.56; stdev=6.21; max=659; n_reads=2371071(69.5%)
Calling consensus sequences and haplotypes for catalog assembly...
Writing tags, SNPs, and alleles files...
Refetching read IDs...done.
ustacks is done.
ustacks parameters selected:
  Input file: '/home/25405942/msc/00_samples/P01_C09_CC22.1.fq'
  Sample ID: 2
  Min depth of coverage to create a stack (m): 3
  Repeat removal algorithm: enabled
  Max distance allowed between stacks (M): 2
  Max distance allowed to align secondary reads: 4
  Max number of stacks allowed per de novo locus: 3
  Deleveraging algorithm: disabled
  Gapped assembly: enabled
  Minimum alignment length: 0.8
  Model type: SNP
  Alpha significance level for model: 0.05

Loading RAD-Tags...1M...2M...3M...

Loaded 3343356 reads; formed:
  302318 stacks representing 1899454 primary reads (56.8%)
  1263871 secondary stacks representing 1443902 secondary reads (43.2%)

Stack coverage: mean=6.28; stdev=19.79; max=5688; n_reads=1899454(56.8%)
Removing repetitive stacks: cov > 66 (mean+3*stdev)...
  Blacklisted 6833 stacks.
Coverage after repeat removal: mean=5.74; stdev=3.66; max=66; n_reads=1695983(50.7%)

Assembling stacks (max. dist. M=2)...
  Assembled 295485 stacks into 266080; blacklisted 814 stacks.
Coverage after assembling stacks: mean=6.08; stdev=3.89; max=128; n_reads=1610302(48.2%)

Merging secondary stacks (max. dist. N=4 from consensus)...
  Merged 516599 out of 1443902 secondary reads (35.8%), 35691 merged with gapped alignments.
Coverage after merging secondary stacks: mean=8.03; stdev=5.75; max=443; n_reads=2126901(63.6%)

Assembling stacks, allowing for gaps (min. match length 80.0%)...
  Assembled 266080 stacks into 263966 stacks.
Coverage after gapped assembly: mean=8.09; stdev=5.88; max=512; n_reads=2126901(63.6%)

Merging secondary stacks, allowing for gaps (min. match length 80.0%)...
  Merged 69139 out of 927303 secondary reads (7.5%).
Coverage after merging gapped secondary stacks: mean=8.36; stdev=6.99; max=591; n_reads=2196040(65.7%)

Final coverage: mean=8.36; stdev=6.99; max=591; n_reads=2196040(65.7%)
Calling consensus sequences and haplotypes for catalog assembly...
Writing tags, SNPs, and alleles files...
Refetching read IDs...done.
ustacks is done.
ustacks parameters selected:
  Input file: '/home/25405942/msc/00_samples/P01_D03_CC29.1.fq'
  Sample ID: 3
  Min depth of coverage to create a stack (m): 3
  Repeat removal algorithm: enabled
  Max distance allowed between stacks (M): 2
  Max distance allowed to align secondary reads: 4
  Max number of stacks allowed per de novo locus: 3
  Deleveraging algorithm: disabled
  Gapped assembly: enabled
  Minimum alignment length: 0.8
  Model type: SNP
  Alpha significance level for model: 0.05

Loading RAD-Tags...1M...2M...3M...

Loaded 3429628 reads; formed:
  311143 stacks representing 1929693 primary reads (56.3%)
  1314655 secondary stacks representing 1499935 secondary reads (43.7%)

Stack coverage: mean=6.20; stdev=13.98; max=2615; n_reads=1929693(56.3%)
Removing repetitive stacks: cov > 49 (mean+3*stdev)...
  Blacklisted 6164 stacks.
Coverage after repeat removal: mean=5.79; stdev=3.50; max=49; n_reads=1764637(51.5%)

Assembling stacks (max. dist. M=2)...
  Assembled 304979 stacks into 276378; blacklisted 696 stacks.
Coverage after assembling stacks: mean=6.20; stdev=3.83; max=88; n_reads=1705201(49.7%)

Merging secondary stacks (max. dist. N=4 from consensus)...
  Merged 559386 out of 1499935 secondary reads (37.3%), 37375 merged with gapped alignments.
Coverage after merging secondary stacks: mean=8.23; stdev=5.51; max=381; n_reads=2264587(66.0%)

Assembling stacks, allowing for gaps (min. match length 80.0%)...
