cstacks: adding samples to existing catalog in batch job processes on server

Maja Mucko

Sep 4, 2024, 2:54:55 AM

to Stacks

Hi to all!

Sorry if I missed the answer (if you already talked about this)!

I have 438 individuals to process together. On the bigger node of our server, which accepts jobs of up to 720 h of walltime, I could build a catalog from 300 individuals, so I have 138 left to add to the catalog.

Since the server is now pretty busy, that big node is no longer available to me, so I have to manage with smaller nodes that accept jobs of up to 168 h of walltime.

According to my calculations (using the maximum number of CPUs and the maximum memory allocation), I can split those 138 individuals into several jobs and add them separately to the same catalog. The one thing I am not sure of is this: will cstacks add the individuals from separate, simultaneously running jobs to the same catalog, or will each job write its own "copy" of the catalog containing only the individuals from that job?

Currently, I am running just the first job and it is going well, but I am afraid to start the others until I know for sure that there will be a single catalog and that every individual will be added to it.

How can I check this?

Thank you very much!

Maja

Catchen, Julian

Sep 5, 2024, 4:55:09 PM

to stacks...@googlegroups.com

Hi Maja,

If you want to add to an existing catalog, just make sure the input path of the existing catalog (--catalog) is different from your output path for cstacks (--outpath). If so, cstacks will write the new catalog in the location you specified and leave the old catalog files untouched. You can’t run multiple copies of cstacks at the same time and expect it to somehow merge them all into a single catalog. The independent program runs don’t know anything about any other cstacks runs executing at the same time. You can run cstacks one time, with as many additional samples as you want, and that one run will write a new catalog containing the old and new samples together.
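As a rough illustration of the single-run approach described above, the sketch below assembles one cstacks command that adds several new samples to an existing catalog and writes the result to a different directory. All directory and sample names are hypothetical placeholders, and the exact form of the value passed to --catalog (shown here as a prefix inside the old catalog directory) is an assumption that should be checked against `cstacks --help` for your Stacks version. The script only prints the command so it can be reviewed before going into a job script.

```shell
#!/usr/bin/env bash
# Sketch only: every path and sample name here is a hypothetical placeholder.
set -euo pipefail

old_catalog_dir="stacks_cat300"          # directory holding the existing catalog files
catalog_arg="$old_catalog_dir/catalog"   # ASSUMED form of the --catalog value; confirm with `cstacks --help`
new_outdir="stacks_cat438"               # where the combined catalog will be written
ustacks_dir="ustacks_out"                # directory with the per-sample ustacks output
new_samples=(ind301 ind302 ind303)       # list ALL samples to add; they go into one run

# Refuse to continue if the output directory is the old catalog directory.
# This is a plain string comparison; it does not resolve symlinks or relative paths.
if [ "${old_catalog_dir%/}" = "${new_outdir%/}" ]; then
    echo "error: --outpath must differ from the existing catalog's directory" >&2
    exit 1
fi

# One command, one -s per new sample, so a single run writes the merged catalog.
cmd=(cstacks --catalog "$catalog_arg" --outpath "$new_outdir")
for s in "${new_samples[@]}"; do
    cmd+=(-s "$ustacks_dir/$s")
done

# Print the command for review instead of executing it.
echo "${cmd[*]}"
```

Splitting the new samples across several such commands running at the same time would produce several unrelated catalogs, which is the situation the reply warns against.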

That said, you don’t need to put all your samples in the catalog. It is fine to put a representative number of samples in the catalog so that you are likely to see all the SNPs in your various populations one or more times. You can then just match the remaining samples to your catalog using sstacks. So, for example, you could load 10 individuals from each population into the catalog and then just match the remaining samples to that catalog using sstacks.
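One possible way to pick such a subset is to split a full population map into a small "catalog" map and a "rest" map. The sketch below assumes the usual tab-separated popmap layout (sample name, then population name); the sample names, file names, and the per-population count are made up for illustration. It simply takes the first few samples of each population, which is only representative if the popmap is not ordered in some biased way.

```shell
#!/usr/bin/env bash
set -euo pipefail

per_pop=2   # the example in the reply uses 10; 2 keeps this toy case short

# Toy full population map in a temporary directory (hypothetical sample names).
work=$(mktemp -d)
printf 'indA1\tpopA\nindA2\tpopA\nindA3\tpopA\nindB1\tpopB\nindB2\tpopB\nindB3\tpopB\n' \
    > "$work/popmap_all.tsv"

# The first $per_pop samples seen for each population go into the catalog popmap ...
awk -F'\t' -v n="$per_pop" '++seen[$2] <= n' "$work/popmap_all.tsv" > "$work/popmap_catalog.tsv"
# ... and every later sample of that population goes into the "rest" popmap.
awk -F'\t' -v n="$per_pop" '++seen[$2] > n' "$work/popmap_all.tsv" > "$work/popmap_rest.tsv"

echo "catalog samples:"; cat "$work/popmap_catalog.tsv"
echo "remaining samples:"; cat "$work/popmap_rest.tsv"
```

The catalog popmap could then be given to cstacks with -M, while sstacks would typically be run with the full popmap, since denovo_map.pl also matches the catalog samples themselves back to the catalog.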

Best,

Julian

Maja Mucko

Nov 26, 2024, 3:44:28 AM

to Stacks

Dear Julian,

thank you for your input and explanation.

I have a catalog of 320 individuals and I matched the remaining 118 with sstacks. My script is attached.

Now I wonder how to proceed. If I understand correctly, I need to run tsv2bam, gstacks and then populations, but I am confused about what to pass as the input folder for tsv2bam: my matches for those 118 individuals plus the paired-end reads of the catalog individuals?

Or do I use the paired-end reads of all 438 individuals?

Or was I supposed to match all my individuals in the sstacks step, to get matches for all 438 individuals, and proceed with that?

Sorry for the confusion, but so far I have always used denovo_map, which processed all my individuals (far fewer than 438, though), and tsv2bam, gstacks and populations always took in all samples...

Thank you

Maja

sstacks_cat320ind.sh

Catchen, Julian

Dec 7, 2024, 11:30:24 AM

to stacks...@googlegroups.com

Hi Maja,

To simplify running things by hand, you should use a population map. For example, you could have used this file to simplify running sstacks. Regardless, you will need it for tsv2bam, which will sort the data in each sample and incorporate the paired-end reads. You run it like this:

tsv2bam -P stacks_dir -M popmap -R paired_reads_dir

The popmap will list the name of each sample, e.g. Fb141_1. You then provide the path to the directory that contains the raw, paired-end reads, which should have been output by process_radtags and be named with the same prefix, e.g. Fb141_1.
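Because the sample names in the popmap have to match the file prefixes in the reads directory, it may be worth checking that before submitting a long job. The sketch below is one way to do it, on a toy setup in a temporary directory: it reports every popmap sample for which no file beginning with the sample name plus a dot exists in the reads directory. The second sample name, the population label, and the `.2.fq.gz` file ending are assumptions for illustration only.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy setup; apart from Fb141_1, names and file extensions are hypothetical.
work=$(mktemp -d)
mkdir "$work/paired_reads_dir"
printf 'Fb141_1\tpopA\nFb141_2\tpopA\n' > "$work/popmap"
touch "$work/paired_reads_dir/Fb141_1.2.fq.gz"   # Fb141_2 deliberately has no reads file

# Collect every popmap sample with no "<sample>.*" file in the reads directory.
missing=()
while IFS=$'\t' read -r sample _pop; do
    [ -n "$sample" ] || continue                 # skip empty lines
    if ! compgen -G "$work/paired_reads_dir/$sample.*" > /dev/null; then
        missing+=("$sample")
    fi
done < "$work/popmap"

echo "samples without paired-end reads: ${missing[*]:-none}"
```

On real data, the two paths would point at your own popmap and at the process_radtags output directory.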

You will then use the same popmap to run gstacks and finally populations.
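Put together, the remaining steps could look roughly like the sketch below, which reuses the placeholder names from the tsv2bam command above and passes the same popmap to all three programs. The thread count is a hypothetical value, and any filtering or output options for populations are left out because they depend on the analysis. The commands are only printed, so they can be reviewed and pasted into a job script.

```shell
#!/usr/bin/env bash
set -euo pipefail

stacks_dir="stacks_dir"        # placeholder: directory with ustacks/cstacks/sstacks output
popmap="popmap"                # placeholder: one popmap listing every sample to process
reads_dir="paired_reads_dir"   # placeholder: process_radtags output with the paired-end reads
threads=8                      # hypothetical thread count

# The three remaining steps, in order, all driven by the same popmap.
steps=(
    "tsv2bam -P $stacks_dir -M $popmap -R $reads_dir -t $threads"
    "gstacks -P $stacks_dir -M $popmap -t $threads"
    "populations -P $stacks_dir -M $popmap -t $threads"
)

# Print each command for review instead of executing it.
for step in "${steps[@]}"; do
    echo "$step"
done
```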
