Guidance Needed for SNP Mining

Praveen Dhyani

unread,

Jan 26, 2025, 6:14:12 PM1/26/25

to Stacks

Dear group members,

I am new to bioinformatics and would greatly appreciate your guidance on SNP mining and analyzing my dataset using STACKS.

Research Objective: To study population connectivity in butterflies.
Dataset: I have ddRAD PE data from 640 individuals, with 100 bp reads.

Current Progress and Results:

USTACKS: Average depth of coverage = 142x (range: 5.75x to 428x)
CSTACKS: Final catalog contains 103,954 loci.
GSTACKS:
- Genotyped 101,864 loci.
- Effective per-sample coverage: mean = 202.5x, stdev = 79.5x, min = 3.4x, max = 560.5x.
- Consistent phasing for 73.4% of diploid loci (382,808 out of 521,811).
Populations Program:
- Without filters (% populations -P "$src" -M "$popmap_second" -t 8 ): Retained 101,864 loci (20,115,155 sites; 274,261 variant sites).
- With r = 0.75 filter only:
  - Removed 101,475 loci, retaining only 389 loci (68,941 sites; 381 variant sites).

I’m concerned about the drastic reduction in loci when applying the r = 0.75 filter and would like to understand if I might be doing something wrong.

Is such a large reduction in loci expected with the r = 0.75 filter?
How can I optimize my pipeline or parameters to retain more loci without compromising data quality?
Are there common issues with data or parameter settings that might cause such low retention?

Additional context: The effective coverage of my data seems good overall, but I suspect uneven coverage or missing data might be an issue. How this thing can be fixed/improved

Any advice on troubleshooting or optimizing my pipeline would be greatly appreciated!

Thank you in advance for your time and expertise.

Praveen

Julian Catchen

unread,

Jan 26, 2025, 6:25:36 PM1/26/25

to Stacks

Hi Praveen,

Did you optimize your assembly parameters prior to running ustacks on all your samples? How is your population map set up – how many populations are you defining with how many samples (-r is dependent on the population map you use)?

You will likely want to remove the very low coverage samples from your analysis, they are certainly affecting the filtering, since they will be missing almost all the data.

Best,

Julian

Praveen Dhyani

unread,

Feb 9, 2025, 11:03:01 AM2/9/25

to Stacks

Dear Julian

Thanks a lot for replying to me!

1. I optimized the assembly parameters with M=2 to M=9. Although I could not make the type of graph you prepared for your paper, I did it on an Excel sheet (graph attached). After this, I selected the M=8 for further assembly.
2. For full assembly, the pop map is set up in two columns, one for each sample and population (popmap attached). I have 640 samples defined in 8 populations.

3. For r 0.75, I am considering all samples in one population (not defining the pop map). However, I have also defined the population(s) in one run, but the results were the same bad (population log attached for this).

Meanwhile, I have also run the pipeline afresh by increasing the samples in the catalogue (160 from 96) and with M=8 and removing samples with coverage of less than 20. This time, the number of all the sites increased (as expected) from 101,864 to 207,054. However, the final results did not improve with this run, even with the r-0.5 filter instead of r 0.75 (gsatcks log and population log attached).

I am now more confused if all my samples have very, very high missingness or if I have done a bad optimization!

Looking forward to hearing from you!

Best Regards

Praveen

Picture1.png

gstacks (new_run).log

populations (1).log

popmap_allsamples_aftercatalog.txt

populations (new_run).log

Reply all

Reply to author

Forward