PCA variant weights vs sample weights

Gabriel Doctor

Feb 19, 2024, 11:50:05 AM

to plink2-users

Hi Christopher,

As I understand it, the plink.eigenvec table shows the weighting that each sample contributes to principal components 1...n, calculated from a variance-covariance matrix which takes SAMPLES as variables and genotypes as observations. The values reported in the plink.eigenvec file are not equivalent to the projection of the samples onto the PC axes.

My question then is about the allele-wts or var-wts modifier. Surely calculating allele weights would require VARIANTS to be treated as the variables and samples as observations when creating the covariance matrix. Or is there some other way of calculating these from the original sample variance-covariance matrix? If you can point me to a discussion of this I'd be grateful.

Best wishes
Gabriel

Christopher Chang

Feb 19, 2024, 4:05:34 PM

to plink2-users

--pca's var-wts modifier follows the SNP-weights computation described in Chen CY et al. (2013) "Improved ancestry inference using weights from external reference panels" and previously implemented in EIGENSOFT. allele-wts generalizes the computation to handle multiallelic variants properly; var-wts will generate obviously-distorted results if you try to use it on "split" multiallelic variants.
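
On the question of whether variant weights can be derived from the sample-by-sample matrix: as a general linear-algebra sketch (not a statement of the exact computation in Chen et al. or in PLINK, whose scaling conventions may differ), the two views are linked by the singular value decomposition. Here X is assumed to be the n-samples by m-variants genotype matrix after per-variant centering and scaling, and the symbols U, Σ and V are introduced only for illustration.

```latex
\begin{aligned}
X &= U \Sigma V^{\top} \\
X X^{\top} &= U \Sigma^{2} U^{\top} \quad (n \times n,\ \text{one eigenvector entry per sample}) \\
X^{\top} X &= V \Sigma^{2} V^{\top} \quad (m \times m,\ \text{one eigenvector entry per variant}) \\
V &= X^{\top} U \Sigma^{-1}, \qquad X V = U \Sigma
\end{aligned}
```

Under these assumptions, and for components with non-zero singular values, the variant weights V can be obtained from the sample eigenvectors U without ever forming the m × m matrix, and the columns of U are the sample scores XV rescaled to unit length.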

Christopher Chang

Feb 19, 2024, 4:06:57 PM

to plink2-users

(clarification: for --pca on multiallelic variants, you must use the allele-wts modifier, *and* your multiallelic variants can't be split.)

Gabriel Doctor

Mar 13, 2024, 10:05:47 AM

to plink2-users

Hi Christopher,

I'm sorry to bang on about this again, but I believe that the documentation for plink --pca is wrong in its explanation of what is being shown in the plink.eigenvec table.

It says here (https://www.cog-genomics.org/plink/2.0/formats#eigenvec) that in the plink.eigenvec table: "The first columns contain the sample ID, and the rest are principal component *weights*". (The documentation also refers to a “sample weight” in the discussion of the allele-wts modifier.)

This is incorrect. The values for each PC reported in the standard plink.eigenvec table represent normalised PC *scores* for that sample (i.e. linear combinations of the SNPs, treated as variables) and not PC weights.

I performed a test to confirm this (if you are interested, I can share the R Markdown file). Using dummy data, I ran --pca in plink, then ran PCA on the same data in R, taking care with row and column orientation. In R I performed PCA twice: once with SNPs as variables and once with samples as variables. The plot of the sample scores from the SNPs-as-variables analysis (which yields SNP weights and sample scores) clearly looks more similar to the plot of the plink.eigenvec values than the plot of the sample weights from the samples-as-variables analysis does.
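
A minimal sketch of this kind of comparison, in plain Python rather than R, is given here for readers who want to try it. It is not the R Markdown file mentioned above and not PLINK's implementation: the genotype matrix is a hypothetical toy example, only the first PC is computed (by power iteration), and variants are centered but not variance-scaled. It compares the leading eigenvector of the sample-by-sample matrix with the unit-normalised sample scores from a SNPs-as-variables PCA, using the same per-variant centering for both.

```python
import math

# Hypothetical toy genotypes (0/1/2 allele counts):
# 6 samples (rows) x 8 variants (columns).
GENOTYPES = [
    [0, 0, 1, 0, 2, 1, 0, 1],
    [0, 1, 0, 0, 2, 2, 0, 1],
    [1, 0, 0, 0, 1, 2, 1, 0],
    [2, 2, 1, 2, 0, 0, 1, 1],
    [2, 1, 2, 2, 0, 1, 2, 1],
    [1, 2, 2, 1, 0, 0, 2, 2],
]


def center_columns(rows):
    """Subtract each variant's mean so every column sums to zero."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vec):
    return [dot(row, vec) for row in matrix]


def unit(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def top_eigenvector(sym, iters=500):
    """Leading eigenvector of a symmetric PSD matrix by power iteration."""
    vec = unit([float(i + 1) for i in range(len(sym))])
    for _ in range(iters):
        vec = unit(matvec(sym, vec))
    return vec


X = center_columns(GENOTYPES)
Xt = [list(col) for col in zip(*X)]

# Sample-by-sample view: eigenvector of the 6 x 6 matrix X X^T.
sample_matrix = [[dot(a, b) for b in X] for a in X]
sample_eigvec = top_eigenvector(sample_matrix)

# SNPs-as-variables view: the eigenvector of the 8 x 8 matrix X^T X gives
# variant weights; projecting the samples onto it gives sample scores.
variant_matrix = [[dot(a, b) for b in Xt] for a in Xt]
variant_weights = top_eigenvector(variant_matrix)
sample_scores = unit(matvec(X, variant_weights))

# Eigenvector signs are arbitrary, so align them before comparing.
if dot(sample_eigvec, sample_scores) < 0:
    sample_scores = [-s for s in sample_scores]

max_diff = max(abs(a - b) for a, b in zip(sample_eigvec, sample_scores))
print(f"largest difference on PC1: {max_diff:.2e}")
```

With identical per-variant centering, the two vectors should agree up to floating-point error, which is consistent with reading the per-sample values as normalised scores. A samples-as-variables PCA that centers each sample instead of each variant works on a different matrix and need not match; whether that applies to the R test described above is not stated in this thread.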

(This is a relief as this is how this data is commonly interpreted.)

May I suggest a simple update of the documentation to state clearly that plink.eigenvec shows the score for each sample, and to correct the error in this discussion? I have found that I'm not the only person who's been confused here.

Best wishes and thanks for your ongoing support to this excellent software.

Christopher Chang

Mar 13, 2024, 12:31:55 PM

to plink2-users

Different authors use different terminology here, but okay, I see that your usage is common, and have updated the documentation.

Gabriel Doctor

Mar 13, 2024, 12:53:53 PM

to plink2-users

Brilliant, thanks.

I think it clears things up, especially as most introductory guides to PCA still teach the eigen-decomposition method rather than SVD (which is how EIGENSOFT / GCTA perform it). Starting with the sample covariance table (GRM), the eigen-decomposition method would only give you sample weights that contribute to SNP PC scores and not sample scores.
