How does PCA in PLINK handle missing data?

Ollie White

Jun 20, 2020, 8:11:20 AM

to plink2-users

Hello,

Does anyone know how PLINK handles missing data in a PCA? I think it replaces missing genotypes with an average value, but I can't find this documented anywhere.

Best wishes

Ollie

Christopher Chang

Jun 20, 2020, 8:21:27 AM

to plink2-users

From the --pca documentation: "The randomized algorithm always mean-imputes missing genotype calls. For comparison purposes, you can use the 'meanimpute' modifier to request this behavior for the standard computation."

The standard computation is based on a GRM (genetic relationship matrix) where the (sample A, sample B) entry is computed from only the variants where neither sample A nor sample B has a missing genotype. In theory, this matrix is not guaranteed to be positive semidefinite, but in practice that isn't a problem unless your dataset is borked for other reasons.
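
To make the difference between the two approaches concrete, here is a minimal pure-Python sketch. It is not PLINK's code: the function names, the `MISSING` marker, and the variance-standardized scaling are assumptions chosen for illustration, and PLINK's exact normalization may differ. It also assumes every variant is polymorphic and that every pair of samples shares at least one non-missing variant.

```python
from math import sqrt

MISSING = None  # hypothetical marker for a missing genotype call


def allele_freqs(genos):
    """Per-variant allele frequency, estimated from non-missing calls only.

    genos[sample][variant] is 0, 1, 2 or MISSING.
    """
    freqs = []
    for j in range(len(genos[0])):
        calls = [row[j] for row in genos if row[j] is not MISSING]
        freqs.append(sum(calls) / (2 * len(calls)))
    return freqs


def standardize(g, p):
    """Center and scale one genotype (assumes 0 < p < 1)."""
    return (g - 2 * p) / sqrt(2 * p * (1 - p))


def grm_pairwise(genos):
    """Entry (a, b) uses only variants where neither a nor b is missing."""
    p = allele_freqs(genos)
    n = len(genos)
    grm = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            shared = [
                j for j in range(len(p))
                if genos[a][j] is not MISSING and genos[b][j] is not MISSING
            ]
            total = sum(
                standardize(genos[a][j], p[j]) * standardize(genos[b][j], p[j])
                for j in shared
            )
            # The denominator differs from pair to pair, which is why the
            # result is not guaranteed to be positive semidefinite.
            grm[a][b] = total / len(shared)
    return grm


def grm_mean_imputed(genos):
    """Missing calls are replaced by the variant mean (2p) before the product."""
    p = allele_freqs(genos)
    m = len(p)
    # A mean-imputed genotype standardizes to exactly 0.
    z = [
        [0.0 if row[j] is MISSING else standardize(row[j], p[j]) for j in range(m)]
        for row in genos
    ]
    n = len(genos)
    return [
        [sum(z[a][j] * z[b][j] for j in range(m)) / m for b in range(n)]
        for a in range(n)
    ]
```

In this sketch the mean-imputed matrix is a scaled product of one matrix with its own transpose, so it is positive semidefinite by construction; the pairwise version has no such guarantee because each entry is averaged over a different set of variants.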

Ollie White

Jun 20, 2020, 8:47:11 AM

to plink2-users

Hi Christopher,

Many thanks for the link and the reply. That is just the detail I needed.

Best wishes

Ollie
