STRUCTURE analyses with a large set of markers - ways of subsetting?


Peter Rešutík

Jun 15, 2021, 12:30:09 PM
to structure-software
Dear STRUCTURE community,

I’m analyzing the concordance between membership coefficients of a cohort of ca. 4,000 (human) samples estimated, on the one hand, from a large set of bi-allelic markers (~600k) and, on the other hand, from a much smaller set of ca. 150 markers specifically chosen to be ancestry informative. Because some of these ancestry-informative markers are tri-allelic, I see no option other than to use STRUCTURE, since, to my knowledge, neither ADMIXTURE nor FastSTRUCTURE can handle anything but bi-allelic markers.

I’m subsetting the original set of 600k bi-allelic loci down to ca. 55k loci by applying quality filters in PLINK (geno, mind, maf) followed by LD pruning. However, with 50k burn-in + 50k MCMC cycles, I estimate that a single run on 55k loci and 4,000 individuals would take approximately 40 days, which is not feasible given the time constraints of my project. I have seen other threads in this forum that suggest reducing the number of loci and claim that, as long as the loci are not under selection and are in linkage equilibrium, the results should not differ much (are there also quantitative studies that support this statement?). My question is therefore how best to subset such a set of ca. 55k loci, and how far down I should go. I was thinking of using smartPCA, which reports the most informative SNPs for each principal component (the so-called eigbestsnp output), and reducing to ca. 10k loci. Have you ever encountered such an approach, and is it something you would recommend?
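For what it’s worth, since STRUCTURE’s runtime scales roughly linearly with the number of loci, thinning from 55k down to ca. 10k should cut the ~40-day estimate to roughly a week. If random thinning (rather than the PCA-informed selection above) is acceptable, PLINK can do it directly with --thin-count 10000. Here is a minimal Python sketch of the same idea (file names are placeholders for your own data) that samples SNP IDs from a .bim file and writes a list usable with --extract:

    import random

    random.seed(42)  # fix the seed so the subsample is reproducible

    # Read SNP IDs from the pruned PLINK .bim file
    # (columns: chrom, SNP id, cM, bp, allele 1, allele 2).
    with open("pruned_55k.bim") as f:        # placeholder file name
        snp_ids = [line.split()[1] for line in f]

    subset = random.sample(snp_ids, 10000)   # uniform random thinning

    # Write the list for: plink --bfile pruned_55k --extract keep_10k.txt --make-bed
    with open("keep_10k.txt", "w") as out:
        out.write("\n".join(subset) + "\n")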

On a side note, I was wondering whether you are aware of any concordance studies between STRUCTURE and ADMIXTURE. In principle, I could run the reduced set of 55k bi-allelic loci with ADMIXTURE and then compare the results with STRUCTURE (run on the smaller set of 150 markers). However, I would then be comparing the output of two different software packages that both fit a model assuming Hardy-Weinberg equilibrium within clusters but use different optimization algorithms (MCMC vs. maximum likelihood). What would you think of such an approach?
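If you do try that comparison, one way to quantify concordance is to first align the two Q matrices (cluster labels are arbitrary, so the columns have to be permuted before any comparison) and then compute per-cluster correlations or an overall RMSE. CLUMPP does this alignment properly across many runs; below is a minimal brute-force Python sketch for a single pair of runs, assuming both Q matrices have been exported as plain whitespace-delimited files with one row per individual (in the same order) and K columns, with placeholder file names:

    from itertools import permutations
    import numpy as np

    def load_q(path):
        """Load an N x K matrix of membership coefficients."""
        return np.loadtxt(path)

    def align_and_compare(q1, q2):
        """Find the column permutation of q2 that best matches q1
        (brute force; fine for small K), then report the RMSE and
        per-cluster Pearson correlations."""
        k = q1.shape[1]
        best_perm, best_rmse = None, np.inf
        for perm in permutations(range(k)):
            rmse = np.sqrt(np.mean((q1 - q2[:, list(perm)]) ** 2))
            if rmse < best_rmse:
                best_perm, best_rmse = perm, rmse
        aligned = q2[:, list(best_perm)]
        cors = [np.corrcoef(q1[:, j], aligned[:, j])[0, 1] for j in range(k)]
        return best_rmse, cors

    # Placeholder file names for the two runs (same individuals, same order):
    q_structure = load_q("structure_q.txt")
    q_admixture = load_q("admixture.3.Q")
    rmse, cors = align_and_compare(q_structure, q_admixture)
    print("RMSE after alignment:", round(rmse, 4))
    print("Per-cluster correlations:", [round(c, 3) for c in cors])

Note that ADMIXTURE’s .Q file is already a plain matrix, whereas STRUCTURE’s output file would need the membership-coefficient columns extracted first; the brute-force permutation search is only practical for the modest K values typical of such analyses.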

I’m happy to provide more information if anything is unclear.

Thank you and best regards,
Peter
