resume option not recognizing already complete matches.tsv.gz files

Jamie Maxwell

Oct 7, 2023, 10:58:10 AM

to Stacks

Hi

I am running denovo_map.pl in v2.65 on 100 samples. The cluster I use has a time limit of 6 days that cannot be changed. I used the --resume option when the run was halfway through sstacks; however, it did not recognize that half the samples were already complete and started again from sample one. Does this mean I have to run the remaining 50 samples individually, since sstacks will take longer than 6 days to complete? And if so, once they are complete, will I be able to use --resume to run the rest of the pipeline, or will I have to run tsv2bam and gstacks independently as well?

Many thanks

Jamie

Catchen, Julian

Oct 9, 2023, 1:46:11 PM

to stacks...@googlegroups.com

Hi Jamie,

You have a couple of options:

You could run denovo_map.pl with the "--dry-run" option, which will print all the individual commands for the pipeline. You could then batch these up in different ways and run them as multiple jobs on the server, each of which can complete in less than 6 days. cstacks is typically the longest-running program in the pipeline. If you made it to sstacks, it can be run as a single execution with all sample files, or it can be run once per sample, in which case each run will complete quickly.
The --resume option should work. If you want to pursue that, you will need to give more information on what you did: the command you ran, the output, and what files are in your Stacks output directory (--resume works by scanning the output directory to see which programs produced output files from the pipeline run).
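For the per-sample route, it can help to work out which samples still lack a matches file before building the batches. The following is a minimal sketch, not part of Stacks: the function name `remaining_samples` is hypothetical, and it assumes the population map is tab-separated with the sample name in the first column and that sstacks output is named `<sample>.matches.tsv.gz` in the output directory. It also treats a matches file that fails `gzip -t` as incomplete, on the assumption that a job killed by the time limit may leave a truncated file behind.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Stacks): print the popmap lines for samples
# that do not yet have an intact <sample>.matches.tsv.gz in the output directory.
# Usage: remaining_samples POPMAP STACKS_DIR > remaining_popmap.tsv
remaining_samples() {
    local popmap=$1 stacks_dir=$2
    local sample rest matches
    # The "|| [ -n ... ]" part keeps a final line that has no trailing newline.
    while IFS=$'\t' read -r sample rest || [ -n "$sample" ]; do
        # Skip blank lines and comment lines in the popmap.
        [ -z "$sample" ] && continue
        case $sample in \#*) continue ;; esac
        matches="$stacks_dir/$sample.matches.tsv.gz"
        # Keep the sample unless its matches file exists, is non-empty, and
        # is a complete gzip stream (a truncated file fails `gzip -t`).
        if [ ! -s "$matches" ] || ! gzip -t "$matches" 2>/dev/null; then
            printf '%s\t%s\n' "$sample" "$rest"
        fi
    done < "$popmap"
}
```

The reduced population map written this way could then be split into chunks and each chunk handed to its own sstacks job. Check `sstacks --help` in your installation for the exact options before relying on this, and use the full population map again for tsv2bam and gstacks once every sample has a matches file.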

Regardless, both tsv2bam and gstacks should complete relatively quickly, and gstacks can be multithreaded for a faster run.

Best,

julian

Jamie Maxwell

Oct 17, 2023, 6:28:45 AM

to stacks...@googlegroups.com

Thanks Julian

I went with the single execution; however, it only got through 50 of the 100 samples in the time available. So now I will try to break down the remaining samples and run them individually.

When I used the --resume option, I ran the following command:

/ichec/work/nglif054c/stacksupdate/stacks-2.65/bin/denovo_map.pl -M 4 -m 3 -n 4 -o /ichec/work/nglif054c/stacksupdate/stacks-2.65/bin/all_samples_denovo/denovo_noid --samples /ichec/work/nglif054c/stacksinstall/bin/mobiseq_raw_data_redo --popmap /ichec/work/nglif054c/stacksupdate/stacks-2.65/bin/specimen_list_for_denovo_all.txt --catalog-popmap /ichec/work/nglif054c/stacksupdate/stacks-2.65/bin/pop_for_cat.txt -T 24 --paired --min-samples-per-pop 0.80 --resume

I have run everything successfully up until sstacks, and the --resume option has worked up until now. The output directory has all the alleles, tags and snps .tsv.gz files, as well as the catalog snps, tags, sample list and alleles files (I have not moved or changed anything since the initial run of denovo_map.pl). The first time sstacks ran, it got through 50 samples before I was timed out, and all the matches.tsv.gz files for these samples are present in the output directory. On resuming, however, these do not seem to have been recognized, and sstacks started matching from the first sample again. The samples in the popmap file are divided into 3 populations.

Cheers

Jamie
