denovo_map very slow

Samantha Ockhuis

Sep 27, 2021, 4:22:43 AM
to Stacks
Hi all,

I have been trying to run the denovo_map pipeline with a total of 178 individuals across six populations. Per-sample read counts range from 129,392 to 46 million reads, with a read length of 147 bp. I have paired-end data generated from quaddRAD sequencing, and I am running on a cluster with 24 cores and 128 GiB of memory.

The problem is that denovo_map is very slow, processing a total of about six samples per day, and I have 178 samples to get through. Because of this, denovo_map never completes within the specified walltime and does not even get through half of the samples. I have experimented with different core/thread counts, and 24 cores seems to be the best, but the run still never finishes.

I would greatly appreciate any help, as I am stuck at the moment and don't know how to proceed.

Thanks in advance!
Samantha

Catchen, Julian

Sep 27, 2021, 3:00:07 PM
to stacks...@googlegroups.com

Hi Samantha,

 

If you look at the denovo_map.log file, you will see all of the individual ustacks commands. Instead of running denovo_map.pl, you can queue the individual ustacks runs independently on your cluster. Once those are complete, you can run cstacks, sstacks, tsv2bam, gstacks, and populations. If you add the --dry-run flag to denovo_map.pl, it will print out all the commands it would run (without running them). You might also consider writing to your sysadmins and asking them to raise the maximum walltime. Finally, you may consider down-sampling the samples with >40 million reads, as that will significantly reduce your run time as well.
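
For example, a minimal sketch of that approach (the paths, sample names, and -M/-n/-m values below are placeholders, not recommendations; check them against the actual commands printed in your own denovo_map.log):

    # Print the commands denovo_map.pl would run, without running them:
    denovo_map.pl --samples ./samples --popmap ./popmap.tsv -o ./stacks \
        -M 2 -n 2 --paired -T 24 --dry-run

    # Submit one ustacks job per sample. The -i value must be a unique
    # integer per sample; for paired-end data, ustacks runs on the forward
    # reads only (tsv2bam brings the paired reads back in later).
    ustacks -f ./samples/sample_01.1.fq.gz -i 1 -o ./stacks -M 2 -m 3 -p 8
    ustacks -f ./samples/sample_02.1.fq.gz -i 2 -o ./stacks -M 2 -m 3 -p 8
    # ...and so on, each submitted as its own cluster job.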

 

Best,

 

julian

paige...@ucsb.edu

Nov 28, 2021, 7:49:05 PM
to Stacks
Hello, I have a follow-up question about how to actually merge all the data back together after running the commands on smaller subsets of the dataset. Is it OK to put all the output from the prior runs into one folder and then run populations on everything? Or is there a way to combine the separate populations outputs together afterwards?

Thank you for any suggestions on how to handle this appropriately, because I can't run all my data at one time.

Best,
Paige

Catchen, Julian

Nov 29, 2021, 3:41:31 PM
to stacks...@googlegroups.com

Hi Paige,

 

1) You can run ustacks independently on your samples. If you do so, the outputs should all go into a single Stacks output directory.

2) You can then run cstacks on all of these samples, or on a subset (if you have a large number of samples, you may choose to build your catalog from a representative subset, trying to get at least one copy of each locus/allele into the catalog), but it must be a single cstacks run.

3) You can run sstacks in a single run on all samples, or independently on each sample (either way, you supply the catalog you generated in the previous step).

 

All of these outputs go into the same Stacks output directory.

 

4) Finally, you have to run tsv2bam and gstacks, one time each, on the whole data set.

 

(populations can be run as many times as you like once the core pipeline steps are complete, and you can vary which samples are processed using different population maps. A command-line sketch of these steps follows below.)
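
For concreteness, a rough sketch of steps 2-4 (paths, population-map names, and parameter values are placeholders; popmap_catalog.tsv here stands for a hypothetical population map listing only the representative subset used to build the catalog):

    # 2) Build the catalog in a single cstacks run:
    cstacks -P ./stacks -M ./popmap_catalog.tsv -n 2 -p 24

    # 3) Match all samples against the catalog (or run per sample):
    sstacks -P ./stacks -M ./popmap.tsv -p 24

    # 4) Run tsv2bam and gstacks once each on the whole data set;
    #    -R points tsv2bam at the directory holding the paired-end reads:
    tsv2bam -P ./stacks -M ./popmap.tsv -R ./samples -t 24
    gstacks -P ./stacks -M ./popmap.tsv -t 24

    # populations can then be run, and re-run, with whatever filters
    # and population maps you need:
    populations -P ./stacks -M ./popmap.tsv -r 0.8 --vcf -t 24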

 

Hope this helps—

 

julian

paige...@ucsb.edu

Nov 30, 2021, 7:51:08 PM
to Stacks
Hi Julian,
This information is super helpful.  I am running ustacks by itself now, and it is working much better that way.  I think I will have to try using a subset when I get to cstacks.  Thank you for that suggestion.

I want to test multiple parameter settings, following your 2017 "Lost in parameter space" paper.  I first tried doing my Stacks analysis using the .bam files sent to me by the RADseq company I used (their protocol is a flavor of RADseq, not exactly the same but similar), but it did not work well!  It turned out that they had not actually removed many of the adapter sequences when they demultiplexed, and the sequences came in many different lengths (even really short ones) that weren't filtered out.  So I had to get them to send me the .fastq.gz files and start over.  Using these files requires a lot more compute power than I originally thought I would need.

Thanks for the help.
Best,

paige...@ucsb.edu

Dec 4, 2021, 1:19:35 PM
to Stacks
Hello Julian,
I followed the instructions to run ustacks on the samples separately and then put all the outputs into one directory to run cstacks.  Unfortunately, I now get this error when running cstacks (it processed some samples and then this happened):

"Sample ID '1' occurs more than once. Sample IDs must be unique."

After searching this Google group, I see that I have run into the problem that cstacks needs unique sample IDs; because I ran my samples separately in ustacks, I now have more than one sample with the same ID.  Is there a way to reset the IDs after ustacks has already completed?  Or will I need to go back and rerun everything, making sure the ID I set gives every sample a unique number?  If so, perhaps a comment indicating this could be added to the Stacks manual so people know in advance?

Thank you.

Best,

Catchen, Julian

Dec 6, 2021, 8:04:58 PM
to stacks...@googlegroups.com

Yes, you need to have unique IDs for your samples. If you run denovo_map.pl with the --dry-run option, it will show you each ustacks command, demonstrating the unique IDs.

 

You can either re-run ustacks with a different ID for each sample, or use some UNIX commands to search and replace the ID field in each of the output files. The former is simpler (if you don't know UNIX well), but the latter is faster (particularly if you do).
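
To illustrate both options (a sketch only: the file names and -M/-m values are placeholders, and the column layout of the ustacks output files varies between Stacks versions, so inspect your own tags/snps/alleles files before editing anything):

    # Option 1: re-run ustacks, giving each sample a distinct integer via -i.
    i=1
    for fq in ./samples/*.1.fq.gz; do
        ustacks -f "$fq" -i "$i" -o ./stacks -M 2 -m 3 -p 8
        i=$((i + 1))
    done

    # Option 2: rewrite the sample-ID field in one sample's existing output.
    # This assumes the sample ID is the first tab-separated column (check
    # yours first); '#' header lines are passed through unchanged.
    new_id=42
    for f in ./stacks/sample_02.tags.tsv.gz \
             ./stacks/sample_02.snps.tsv.gz \
             ./stacks/sample_02.alleles.tsv.gz; do
        zcat "$f" \
          | awk -v id="$new_id" 'BEGIN{FS=OFS="\t"} /^#/{print; next} {$1=id; print}' \
          | gzip > "$f.tmp" && mv "$f.tmp" "$f"
    done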

paige...@ucsb.edu

Dec 11, 2021, 10:22:38 PM
to Stacks
Hi Julian,
Thanks for letting me know about the --dry-run option.  I just reran everything this time.  I did have to restart a few jobs, so the ID issue came up again, but I checked the last ID assigned and made sure to start numbering at a higher value, which worked fine.
Best,