projection difference between XV and UD

116 views
Skip to first unread message

LME

unread,
May 26, 2016, 12:45:06 PM5/26/16
to flashpca-users
Hello,
I'd like to use the flashpca per-snp loadings from a reference sample to project a new sample onto the same PCs.

As a test, I compared the projections flashpca output as the pcs to what I computed via eqn (3) of Abraham and Inyoue 2014 (P=UD=XV).

When I compare the pcs output from flashpca to P=UD, I get a perfect match.

However, when I compute P=XV (standardized genotypes x per-snp loadings from flashpca), I get values that are different by a constant, ~5.3 in my data. The values are nearly perfectly correlated, but just different by this value, which is about the same for the first 10 PC axes. If I standardize the genotypes with sqrt(2pq), rather than sqrt(pq) as in eqn. (4) of Abraham and Inyoue 2014, this constant changes to about 3, but there remains this difference between the P=XV and P=UD=flashpca output.

Has anyone else found this to be the case? Are the per-snp loadings scaled in some way?
Thanks for any info.

Gad Abraham

unread,
May 27, 2016, 1:26:27 AM5/27/16
to LME, flashpca-users
Hi Luke,

Is this in R? A small example would be helpful here.

Gad
> --
> You received this message because you are subscribed to the Google Groups
> "flashpca-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to flashpca-user...@googlegroups.com.
> To post to this group, send email to flashpc...@googlegroups.com.
> To view this discussion on the web, visit
> https://groups.google.com/d/msgid/flashpca-users/72a1830f-02b2-469e-b0ea-e468b3d9abb9%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

LME

unread,
Jun 2, 2016, 2:25:04 PM6/2/16
to flashpca-users

Hi Gad,

Sorry for the delay in responding. I've been out of town.
It's using flashpca-1.1.2.1
Below I've gone through the example data distributed with the package. The original flashpca and plink commands are pasted below. To compare P=UD vs P=XV vs pcs.txt flashpca output, I used R, with the commands pasted into the email below.

Thanks for your help.
Best,
Luke


Here's the comparison using the example data in the distribution. 
These are the exact commands I've used, including the R commands, which I pasted below:


cd flashpca-1.1.2.1/
mkdir test
cd test
mkdir svd

#Run flashpca:
../flashpca --bfile ../HapMap3/data --seed 1001 --ndim 10 --method svd --outload loadings ----numthreads 10 > pca.outerr &

mv * svd


#Get data into 0/1/2 genotype format and get the MAF of each SNP:
plink-1.9 --bfile ../HapMap3/data --recode A --out data
plink-1.9 --bfile ../HapMap3/data --freq --out data.frq



#################### THE REST IS AN R script to compare XV, UD, and PC scores from flashpca ######################### 
##### Here, I've just pasted the commands from the Rscript instead of attaching the file


#!/usr/bin/Rscript


frq<-read.table("data.frq.frq",header=T) ### MAF for all snps

evec<-read.table(paste0(method,"/eigenvectors.txt"),header=F) ### U matrix of SVD where X=U*D*t(V)
evec<-data.matrix(evec) 

eval<-read.table(paste0(method,"/eigenvalues.txt"),header=F) ### These are the D values (sqrt of the eigenvalues from eigen decomposition eqn 2, or the singular values D from SVD eqn 1)
eval<-as.vector(eval[,1])

pcs<-read.table(paste0(method,"/pcs.txt"),header=F) ### PCA scores from flashpca used to plot. should be P=UD or P=XV from eigendecomp or SVD
pcs<-data.matrix(pcs)

loads<-read.table(paste0(method,"/loadings"),header=F) ### Flashpca output of the snp loadings
loads<-data.matrix(loads)

D<-diag(eval)




############  P=UD  ################
LD<-evec%*%D #### =U%*%D in eqn 3, THIS IS THE SAME AS THE PCS.TXT output, i.e., the PC scores to plot
print("factor of pcs/LD:")
print(apply(pcs/LD,2,mean))





############  P=XV  ################
## reading in the raw 0/1/2 genotype data:
colClasses<-c("character","character","integer","integer","integer",rep("numeric",nrow(loads)))
raw<-read.table("data.raw",header=T,colClasses=colClasses)
GT<-t(data.matrix(raw[,7:ncol(raw)]))


GTR<-GT ### BECAUSE THERE"S MISSING DATA, replacing missing GTs with the mean:
GTM<-apply(GT,1,mean,na.rm=T)
for(rr in 1:length(GTM)){
    GTR[rr,is.na(GTR[rr,])]<-GTM[rr]
}

GTsd<-(GTR-2*frq$MAF)/((frq$MAF*(1-frq$MAF))) ## Matches eqn 4, though typically divide by sqrt(2pq).  Using 2pq changes the ~5x difference to ~3x

sc<-t(GTsd)%*%loads ### THIS IS P=XV (eqn 3)

write.table(sc,paste0(method,"/ScoresProjections.txt"),quote=F,col.names=F,row.names=F)

jpeg(paste0(method,"/PCscore_vs_XV.jpg"),height=10,width=5,units='in',res=400)
par(mfrow=c(5,2))
for(pp in 1:10){
    plot(sc[,pp],pcs[,pp],main=paste0("PC axis ",pp),pch=20,cex=.4)
}
dev.off()



### OFF BY A CONSISTANT VALUE FOR EACH PC AXIS:
print("factor of sc/pcs:")
print(apply(sc/pcs,2,mean))

print("factor of sc/LD:")
print(apply(sc/LD,2,mean))

Gad Abraham

unread,
Jun 5, 2016, 8:10:54 PM6/5/16
to LME, flashpca-users
Luke,

This is indeed a scaling issue, I will get back with details a bit later.

Gad
> https://groups.google.com/d/msgid/flashpca-users/96cbe59c-71a1-4fac-ac75-c00699d129b2%40googlegroups.com.

Gad Abraham

unread,
Jun 7, 2016, 8:13:42 PM6/7/16
to LME, flashpca-users
There are three issues here:

1) By default, flashpca computes the eigen-decomposition of X^T X / (n
- 1), where n is sample size. So to get the same projection as
flashpca, you'd need to divide XV by sqrt(n-1). I will add a command
line option to the next release to turn this on or off (decompose X^T
X instead).

2) There is a bug in the projection code, where the eigenvalues are
being used instead of the singular values. So the scale of each of the
PCs is off. In your code, "eval" is from eigenvalues.txt, which are
the eigenvalues, not the singular values, so you need to take sqrt
first.

I've opened a GitHub issue for that, and will be fixed in the next
release, which I hope to get out soon.

3) The default standardisation is (x_{ij} - 2*p_j) / sqrt(p * (1 -
p)), from "Population structure and eigenanalysis" Patterson et al
2006. I will add the standardisation (x_{ij} - 2*p_j) / sqrt(2 * p *
(1 - p)) to the next release.

ps: version 1.1.2 is quite old, better to upgrade as there are bug fixes etc.
Reply all
Reply to author
Forward
0 new messages