MAF and sampling bias

felixvaux...@gmail.com

Jun 25, 2024, 11:19:13 PM

to Stacks

Hello all,

I'm reviewing a manuscript that uses Stacks, and I'd like to check how sample size interacts with the minimum minor allele frequency (MAF) filter.

The study has a biased sample design. There are two regions each with >600 sampled individuals, and four regions with only ~50 individuals each (i.e. ~1400 total).

The authors applied a MAF of 0.05, which excluded ~90% of the initial loci identified by populations - leaving a fairly small dataset (<1000 SNPs).

Am I correct that the minimum minor allele frequency is calculated across the metapopulation, ignoring any populations designated in the population map? I.e., it's the frequency of the minor allele at a given locus, using all individuals in the dataset.

If so, is it plausible that many loci have been excluded because 50/1400 is <0.05? In this situation, variant loci are excluded even when the alternative allele is present in 50/50 individuals from one of the less sampled regions.

I'm planning to suggest that the authors conduct an even, subsampled analysis and explore using the minimum allele count (MAC) setting for the existing, biased dataset. I'd appreciate any other ideas though!

Thanks,
Felix

Catchen, Julian

Jun 26, 2024, 2:26:18 PM

to stacks...@googlegroups.com

Hi Felix,

Yes, you are correct that MAF is applied in the metapopulation and that MAC is a good alternative (particularly for the removal of errant, low-frequency alleles). The authors could also export populations individually (with different MAFs), using a whitelist to ensure the same SNPs are exported. Regardless, any filter that removes 90% of the data is likely a bad idea and is probably producing a very biased dataset.
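As a rough illustration of why a count-based filter behaves differently from a frequency-based one under uneven sampling, here is a small sketch. It assumes diploid individuals, no missing data, and a hypothetical locus whose alternative allele segregates in only one small region. The MAC threshold of 3 is an arbitrary example value, and the code mimics the idea of the two filters rather than Stacks' internal implementation.

```python
MIN_MAF = 0.05   # threshold reported for the manuscript
MIN_MAC = 3      # hypothetical count threshold, chosen only for illustration

# (region label, diploid individuals sampled, copies of the alternative allele)
regions = [
    ("large_1", 600, 0),
    ("large_2", 600, 0),
    ("small_1", 50, 30),   # allele at frequency 0.30 within this region
    ("small_2", 50, 0),
    ("small_3", 50, 0),
    ("small_4", 50, 0),
]

total_copies = sum(2 * n for _name, n, _alt in regions)
alt_copies = sum(alt for _name, _n, alt in regions)
minor_count = min(alt_copies, total_copies - alt_copies)
pooled_maf = minor_count / total_copies

keep_by_maf = pooled_maf >= MIN_MAF    # frequency over all individuals
keep_by_mac = minor_count >= MIN_MAC   # absolute number of allele copies

# Frequency of the minor allele computed separately within each region.
per_region_maf = {
    name: min(alt, 2 * n - alt) / (2 * n) for name, n, alt in regions
}

print(f"pooled MAF = {pooled_maf:.4f}, keep by MAF: {keep_by_maf}")
print(f"minor allele count = {minor_count}, keep by MAC: {keep_by_mac}")
print(f"MAF within small_1 = {per_region_maf['small_1']:.2f}")
```

In this toy case the allele is common where it occurs, yet the pooled frequency falls under 0.05; a count threshold keeps it while still removing singletons and other very rare, possibly erroneous alleles.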

I would also suggest you review (or ask them to review) their depth of coverage for the different populations as well as the level of PCR duplicates (reported by gstacks), assuming they did paired-end sequencing (and did not use ddRAD). If their coverage is very low, and/or their PCR duplicates were very high, they will have a large bias in the data (lots of missing heterozygotes at some loci and many missing loci in general) regardless of the chosen filters, and at a minimum they need to explain that in the results.

Cheers,

Julian

felixvaux...@gmail.com

Jun 26, 2024, 6:35:39 PM

to Stacks

Hi Julian,

Thank you for the swift and detailed reply! Really helpful to get an expert opinion/sanity check.

Yes, that's also a good idea with the coverage depths among populations. The whitelist solution is also cool! They're using conventional ddRAD and so they can't use the gstacks output for PCR duplicates.

In the de novo assembly they also set -M 5 in ustacks and -n 4 in cstacks, compared to the default values of 2 and 1, respectively. I'm wondering if this could have also inflated the number of rare alleles and impacted the MAF filtering, because fairly divergent/mismatched stacks and loci have been combined?

Thanks again,
Felix

Angel Rivera-Colón

Jun 28, 2024, 9:22:05 AM

to Stacks

Hi Felix,

To add to this, in addition to the data quality checks mentioned by Julian, I have also found that using the -p (--min-populations) flag can be particularly useful in cases like this, in which the number of samples is very uneven across populations. Sometimes, the other missing data filters could exclude whole sets of smaller populations, depending on the exact flag used (-r or -R) and the value provided. Setting some combination of --min-mac/maf and -r/R alongside -p should assist in ensuring that loci/SNPs are kept in those smaller populations.

Thanks,

Angel

felixvaux...@gmail.com

Jun 30, 2024, 8:49:28 PM

to Stacks

Hi Angel,

Thanks for your reply, appreciate your insight!

Yes, I'm also going to recommend exploring setting the -p and -r parameters with a population map. Their current analysis sorts all individuals into one panmictic population, and the -r value is quite low - meaning that the smaller regions likely contain mostly missing data.
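To illustrate the point about a single panmictic population, here is a simplified model of how a per-population sample fraction (like -r) and a minimum number of populations (like -p) might interact. It assumes a locus counts for a population when at least a fraction r of that population's individuals are genotyped, and is kept when at least p populations qualify. The genotype counts and the value r = 0.5 are hypothetical, since the manuscript's actual values are not given here, and this is not Stacks code.

```python
def populations_passing(genotyped, sizes, r):
    """Populations in which at least a fraction r of individuals have the locus."""
    return [pop for pop, n in sizes.items() if genotyped.get(pop, 0) / n >= r]


def keep_locus(genotyped, sizes, r, p):
    """Keep the locus if it passes the r threshold in at least p populations."""
    return len(populations_passing(genotyped, sizes, r)) >= p


sizes = {"large_1": 600, "large_2": 600,
         "small_1": 50, "small_2": 50, "small_3": 50, "small_4": 50}

# Hypothetical locus: well genotyped in the large regions, nearly absent
# from the small ones.
genotyped = {"large_1": 540, "large_2": 540,
             "small_1": 5, "small_2": 5, "small_3": 5, "small_4": 5}

# Scenario A: every individual assigned to one panmictic population.
pooled_sizes = {"all": sum(sizes.values())}
pooled_genotyped = {"all": sum(genotyped.values())}
kept_pooled = keep_locus(pooled_genotyped, pooled_sizes, r=0.5, p=1)

# Scenario B: six regions in the population map, locus required in all six.
kept_by_region = keep_locus(genotyped, sizes, r=0.5, p=6)

print("kept with one pooled population:", kept_pooled)
print("kept with six regions and p=6:", kept_by_region)
print("regions passing r:", populations_passing(genotyped, sizes, r=0.5))
```

Under the pooled map the locus passes because the large regions dominate the overall fraction, even though 90% of genotypes are missing in each small region; with regions defined separately, the same locus is flagged.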

Thanks again,
Felix
