PCA in plink


lxi...@gmail.com

Jun 13, 2023, 11:14:22 PM
to plink2-users
Q1:
For a bfile with n=3222 individuals and m=677543 variants, the .eigenvec file from --pca contains 3222 rows and 20 columns. I want to know whether the values in .eigenvec are PC scores of the n*m matrix (i.e., compute the m*m covariance matrix first, do an eigendecomposition, and project the data onto the eigenvectors) or the entries (loadings/coefficients) of the eigenvectors of an n*n matrix (i.e., compute the n*n covariance matrix first, then do an eigendecomposition).
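
To make the two options concrete, here is a toy NumPy sketch. It uses random data and plain centering, so it is not PLINK's actual standardization or algorithm, and every name in it is illustrative. Under those assumptions it shows that the two routes differ only by a per-component scale: the PC scores from the m*m route, rescaled to unit length, match the eigenvectors of the n*n matrix up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 8, 30, 3                      # toy sizes: n samples, m variants, k PCs
X = rng.normal(size=(n, m))
X = X - X.mean(axis=0)                  # center every column (variant)

# Route A: eigendecompose the m*m matrix, then project the data onto it.
evals_m, evecs_m = np.linalg.eigh(X.T @ X)
top_m = np.argsort(evals_m)[::-1][:k]
V = evecs_m[:, top_m]                   # m*k eigenvectors (variant space)
scores = X @ V                          # n*k PC scores

# Route B: eigendecompose the n*n matrix directly.
evals_n, evecs_n = np.linalg.eigh(X @ X.T)
top_n = np.argsort(evals_n)[::-1][:k]
U = evecs_n[:, top_n]                   # n*k eigenvectors (sample space)
top_evals = evals_n[top_n]

# Scores rescaled to unit length equal the n*n eigenvectors up to sign.
unit_scores = scores / np.linalg.norm(scores, axis=0)
agree = np.allclose(np.abs(unit_scores), np.abs(U), atol=1e-6)

# The squared length of each score column is the matching eigenvalue.
lengths_match = np.allclose(np.linalg.norm(scores, axis=0) ** 2, top_evals)
print(agree, lengths_match)
```

So an n*20 table can be read either way; what separates the two readings is whether each column has been scaled by the square root of its eigenvalue.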

Q2:
--pca -n can be used to specify the number of PCs returned. If I want to obtain the PCs that explain 95% of the variance in the data, I may need all eigenvalues to do the calculation. Is there a way to obtain all eigenvalues? Probably through -n 3222 in my case? Will this slow down the calculation in plink?
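
For the 95% calculation itself, a minimal sketch follows. It assumes the eigenvalues are already loaded into a list and that the total variance is known; the sum of all eigenvalues equals the trace of the decomposed matrix, so the denominator does not strictly require every eigenvalue. The function name and the toy numbers are hypothetical, not PLINK output.

```python
import numpy as np

def components_for_target(eigenvalues, total_variance, target=0.95):
    """Smallest number of leading PCs whose cumulative share of
    total_variance reaches target, or None if the supplied
    eigenvalues are not enough to get there."""
    vals = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative_share = np.cumsum(vals) / total_variance
    hits = np.nonzero(cumulative_share >= target)[0]
    return int(hits[0]) + 1 if hits.size else None

# Toy eigenvalues (made up for illustration).
eigenvalues = [5.0, 3.0, 1.0, 0.6, 0.4]
total = sum(eigenvalues)                 # equals the trace of the matrix

k95 = components_for_target(eigenvalues, total)
k_partial = components_for_target(eigenvalues[:2], total)
print(k95, k_partial)
```

The second call illustrates the situation in the question: with only the top few eigenvalues the target may not be reachable, and the function reports that instead of guessing.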

Thank you.

Xiaolv

Zuxi Cui

Jun 14, 2023, 11:53:14 AM
to lxi...@gmail.com, plink2-users
What’s the purpose of your PCA?
If your goal is to capture as much of the variance as possible regardless of the size of the matrix you get, why not use the original data itself?
If your goal is to capture a considerable percentage of the variance using a small matrix, 20 eigenvalues are usually more than enough.

My guess is you are going to use them for ancestry separation or association adjustments. For ancestry, usually the first 3 PCs show what you want. For GWAS, you can justify the number of PCs by checking the genomic inflation factor.
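
As a rough illustration of the inflation-factor check, the sketch below computes lambda as the median of the association chi-square statistics (1 degree of freedom) divided by 0.4549364, the median of that distribution under the null. The statistics here are simulated; in practice they would come from the GWAS results, and the function name is made up for this example.

```python
import numpy as np

CHI2_1DF_MEDIAN = 0.4549364              # median of chi-square with 1 df

def inflation_factor(chi2_stats):
    """Genomic inflation factor from 1-df chi-square statistics."""
    return float(np.median(chi2_stats) / CHI2_1DF_MEDIAN)

rng = np.random.default_rng(0)
null_stats = rng.normal(size=200_000) ** 2   # simulated null statistics
inflated_stats = 1.2 * null_stats            # simulated inflation of 20%

lam_null = inflation_factor(null_stats)
lam_inflated = inflation_factor(inflated_stats)
print(round(lam_null, 3), round(lam_inflated, 3))
```

One would rerun the association with different numbers of PCs as covariates and keep the smallest number that brings lambda close to 1.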

Terry 




lxi...@gmail.com

Jun 16, 2024, 9:47:12 AM
to plink2-users
Since there are 677543 variants in my genotype data, each eigenvector should be of dimension 677543. If I request 20 eigenvectors, the matrix of eigenvectors should be of dimension 677543 * 20. What I am not sure about is whether, when there are more columns than rows (i.e., more variants than individuals), the dimension of each eigenvector is truncated to the number of rows, which is 3222 for my data. Correct me if I am wrong.

The principal component scores, which are the projected values of each individual onto each eigenvector, should instead be of dimension 3222*20.

Given that the .eigenvec file is of dimension 3222*20 (not including the first two columns of individual identifiers), I want to confirm whether the numbers in the .eigenvec file are eigenvectors truncated in dimension by sample size OR principal component scores along the eigenvectors.

In addition, when correcting for PCs in GWAS, should I include the eigenvectors truncated in dimension by sample size OR the principal component scores along the eigenvectors as covariates?
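
On the covariate question, a small simulated check may help. It assumes the two candidates differ only by a per-column scale factor (as in ordinary PCA) and that the association model is a plain linear regression. All data and names are invented for the example. Under those assumptions, rescaling covariate columns does not change the estimated genotype effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
U = np.linalg.qr(rng.normal(size=(n, 3)))[0]   # stand-in for unit-length eigenvectors
scales = np.array([12.0, 7.0, 3.0])            # hypothetical sqrt(eigenvalue) factors
scores = U * scales                            # stand-in for PC scores

g = rng.integers(0, 3, size=n).astype(float)   # toy genotype coded 0/1/2
y = 0.5 * g + U @ np.array([4.0, -2.0, 1.0]) + rng.normal(size=n)

def genotype_effect(covariates):
    """Least-squares genotype coefficient with an intercept and covariates."""
    design = np.column_stack([np.ones(n), g, covariates])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(beta[1])

beta_eigenvectors = genotype_effect(U)
beta_scores = genotype_effect(scores)
print(abs(beta_eigenvectors - beta_scores) < 1e-8)
```

Only the coefficients on the PC columns themselves change with the scaling; the genotype estimate and the fitted values stay the same.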

Xiaolv