SNP filtering problem using PLINK

Chao Fang

Jun 20, 2023, 1:00:04 PM
to plink2...@googlegroups.com
Hi everyone,
I have two questions about filtering raw data for my investigation of population genetic structure. The first concerns minor allele frequency (MAF) filtering; the second concerns Hardy-Weinberg equilibrium (HWE) filtering.

The first question is about the MAF threshold. Many studies use 0.05 or 0.01. However, with a dataset of 200 individuals (400 allele copies), a threshold of 0.05 removes SNPs with fewer than 20 copies of the minor allele, for example SNPs carried by only about 10 homozygous individuals, and a threshold of 0.01 removes SNPs with fewer than 4 copies, that is, SNPs seen in only one or a few individuals. In other words, if a SNP is present in only one or a few breeds, its overall MAF could be below 5% and it would be filtered out.

The second question is about the HWE threshold. The commonly recommended setting is --hwe 0.001. However, my 200 individuals come from 10 different breeds, with 20 individuals per breed. HWE is expected to hold within a single population, not across all 200 individuals at once, because stratification can itself cause disequilibrium. This suggests it may be more appropriate to test HWE within each breed separately. But in many papers I have seen all individuals filtered together, so what should I do?

For my dataset consisting of 200 individuals from 10 different breeds, what thresholds are more appropriate for MAF and HWE?
Thanks
CHAO 

Christopher Chang

Jun 21, 2023, 2:11:07 PM
to plink2-users
1. This depends on what kind of analysis you are performing.  But one related rule of thumb is that MAF estimates start becoming unstable below 1 / sqrt(# of allele observations).  If you have 200 diploid individuals, that's 400 allele observations, so MAF estimates are reasonably stable down to 1 / sqrt(400) = 0.05, and you can at least do a reasonable job of distinguishing MAF >0.05 from MAF <0.05 variants.  In contrast, you want at least several thousand samples if you care about doing a good job of distinguishing MAF 0.005 from MAF 0.01 from MAF 0.02.
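The arithmetic behind that rule of thumb can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it assumes diploid samples with no missing genotypes, so every sample contributes exactly two allele observations.

```python
import math


def maf_stability_floor(n_samples, ploidy=2):
    """Rule-of-thumb MAF below which estimates start to become unstable:
    1 / sqrt(number of allele observations)."""
    n_alleles = n_samples * ploidy
    return 1.0 / math.sqrt(n_alleles)


def min_minor_allele_count(n_samples, maf, ploidy=2):
    """Smallest minor-allele count whose frequency reaches `maf`.

    Assumes no missing genotypes. The tiny epsilon only guards against
    floating-point noise pushing an exact product just above an integer.
    """
    return math.ceil(maf * n_samples * ploidy - 1e-9)


if __name__ == "__main__":
    # 200 diploid individuals -> 400 allele observations
    print(maf_stability_floor(200))
    # Minor-allele copies needed to reach each common threshold
    print(min_minor_allele_count(200, 0.05))
    print(min_minor_allele_count(200, 0.01))
```

For 200 individuals the floor is 0.05, which corresponds to 20 minor-allele copies out of 400; a threshold of 0.01 corresponds to only 4 copies, well below where the estimate is stable.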

2. Yes, stratification can result in fewer heterozygous genotypes; so plink 2.0 --hwe's 'keep-fewhet' modifier can be used to perform HWE filtering in only the other direction.  With only 200 samples, a threshold of 0.001 is reasonable.
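As a rough sketch, the two filters could be combined in a single plink 2.0 run. The snippet below only builds and prints the command line so it can be inspected before running. The file prefixes `breeds` and `breeds_filtered` are hypothetical, the input is assumed to be a PLINK 1 binary fileset (.bed/.bim/.fam), and the helper name is made up for illustration.

```python
import shlex


def build_plink2_filter_cmd(bfile, out, maf=0.05, hwe_p=0.001):
    """Build a plink2 argument list applying MAF and one-sided HWE filters.

    `bfile` and `out` are hypothetical file prefixes supplied by the caller.
    """
    return [
        "plink2",
        "--bfile", bfile,          # input .bed/.bim/.fam prefix (assumed format)
        "--maf", str(maf),         # drop variants with MAF below this value
        "--hwe", str(hwe_p), "keep-fewhet",  # keep variants that fail only through too few hets
        "--make-bed",              # write the filtered fileset
        "--out", out,
    ]


if __name__ == "__main__":
    cmd = build_plink2_filter_cmd("breeds", "breeds_filtered")
    # Print a copy-pasteable shell command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building the arguments as a list avoids shell-quoting problems if the file prefixes contain spaces, and the thresholds stay easy to change.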

Chao Fang

Jun 21, 2023, 2:22:01 PM
to Christopher Chang, plink2-users
Thanks, I will use the approach you recommend for both filters.

Best regards,
Chao
