Sorry for the delay in responding. I've been out of town.
It's using flashpca-1.1.2.1
Below I've gone through the example data distributed with the package. The original flashpca and plink commands are pasted below. To compare P=UD vs P=XV vs pcs.txt flashpca output, I used R, with the commands pasted into the email below.
Thanks for your help.
Best,
Luke
Here's the comparison using the example data in the distribution.
These are the exact commands I've used, including the R commands, which I pasted below:
cd flashpca-1.1.2.1/
mkdir test
cd test
mkdir svd
#Run flashpca:
../flashpca --bfile ../HapMap3/data --seed 1001 --ndim 10 --method svd --outload loadings --v --numthreads 10 > pca.outerr &
mv * svd
#Get data into 0/1/2 genotype format and get the MAF of each SNP:
plink-1.9 --bfile ../HapMap3/data --recode A --out data
plink-1.9 --bfile ../HapMap3/data --freq --out data.frq
#################### THE REST IS AN R script to compare XV, UD, and PC scores from flashpca #########################
##### Here, I've just pasted the commands from the Rscript instead of attaching the file
#!/usr/bin/Rscript
frq<-read.table("data.frq.frq",header=T) ### MAF for all snps
evec<-read.table(paste0(method,"/eigenvectors.txt"),header=F) ### U matrix of SVD where X=U*D*t(V)
evec<-data.matrix(evec)
eval<-read.table(paste0(method,"/eigenvalues.txt"),header=F) ### These are the D values (sqrt of the eigenvalues from eigen decomposition eqn 2, or the singular values D from SVD eqn 1)
eval<-as.vector(eval[,1])
pcs<-read.table(paste0(method,"/pcs.txt"),header=F) ### PCA scores from flashpca used to plot. should be P=UD or P=XV from eigendecomp or SVD
pcs<-data.matrix(pcs)
loads<-read.table(paste0(method,"/loadings"),header=F) ### Flashpca output of the snp loadings
loads<-data.matrix(loads)
D<-diag(eval)
############ P=UD ################
LD<-evec%*%D #### =U%*%D in eqn 3, THIS IS THE SAME AS THE PCS.TXT output, i.e., the PC scores to plot
print("factor of pcs/LD:")
print(apply(pcs/LD,2,mean))
############ P=XV ################
## reading in the raw 0/1/2 genotype data:
colClasses<-c("character","character","integer","integer","integer",rep("numeric",nrow(loads)))
raw<-read.table("data.raw",header=T,colClasses=colClasses)
GT<-t(data.matrix(raw[,7:ncol(raw)]))
GTR<-GT ### BECAUSE THERE"S MISSING DATA, replacing missing GTs with the mean:
GTM<-apply(GT,1,mean,na.rm=T)
for(rr in 1:length(GTM)){
GTR[rr,
is.na(GTR[rr,])]<-GTM[rr]
}
GTsd<-(GTR-2*frq$MAF)/((frq$MAF*(1-frq$MAF))) ## Matches eqn 4, though typically divide by sqrt(2pq). Using 2pq changes the ~5x difference to ~3x
sc<-t(GTsd)%*%loads ### THIS IS P=XV (eqn 3)
write.table(sc,paste0(method,"/ScoresProjections.txt"),quote=F,col.names=F,row.names=F)
jpeg(paste0(method,"/PCscore_vs_XV.jpg"),height=10,width=5,units='in',res=400)
par(mfrow=c(5,2))
for(pp in 1:10){
plot(sc[,pp],pcs[,pp],main=paste0("PC axis ",pp),pch=20,cex=.4)
}
dev.off()
### OFF BY A CONSISTANT VALUE FOR EACH PC AXIS:
print("factor of sc/pcs:")
print(apply(sc/pcs,2,mean))
print("factor of sc/LD:")
print(apply(sc/LD,2,mean))