PCA results in ggplot2 ? (biplot)

3,659 views
Skip to first unread message

mateus...@gmail.com

unread,
Dec 6, 2010, 6:01:49 PM12/6/10
to ggp...@googlegroups.com
Hi

For a simple PCA from a random data

x = c(runif(50,min=0,max=1),runif(50,min=0.5,max=3));

dim(x) = c(10,10)
x = apply(x,2,function(x) x-mean(x)/sd(x))
sx = svd(x);

factor = c(rep(1,5),rep(0,5));
scores = sx$u %*% t(diag(sx$d))
loadings = sx$v

Is there a tool to make properly scalled biplot in ggplot2 (with factor, scores and loadings) ?
I know I can plot scores as geom_point and loadings as geom_bar already.

Thanks in advance,
Mateusz

Brandon Hurr

unread,
Dec 7, 2010, 4:01:12 AM12/7/10
to mateus...@gmail.com, ggp...@googlegroups.com
Matuesz,

You can see my ramblings on the matter here. I still don't like how the rescale that I performed distorted the graph, but the associations that were there in the biplot were also there in the ggplot2(biplot). I haven't yet had the time to try what the statistician said should work without distortion, but I might have some time this week. I have also switched from the builtin princomp() to pca() in pcaMethods. 

Using your data here's what I would do currently:

require(pcaMethods)
require(ggplot2)

x = c(runif(50,min=0,max=1),runif(50,min=0.5,max=3))
dim(x) = c(10,10)

PCA<- pca(x, method="nipals", scale="uv", nPcs=3, center=T, completeObs=T, cv="q2")
Q2<-Q2(PCA)
Q2
PCA@R2
#Need to find the point at which Q2 no longer increases
#bio data R2~= 0.3 and Q2 <= R2 +/- 0.2 is fine. 
PCA@scores
PCA@loadings

#loadings are matrix, need data.frame for ggplot
loadings<-as.data.frame(PCA@loadings)
#carry over names of rows
loadings$Names<-rownames(loadings)

#scores=matrix too
scores<-data.frame(PCA@scores)

#pure number scaling not very good but works... for now
#rescaling Loadings to scores
loadings[[1]]<-rescale(loadings[[1]], to = c(min(scores[[1]]),max(scores[[1]])))
loadings[[2]]<-rescale(loadings[[2]], to = c(min(scores[[2]]),max(scores[[2]])))
loadings[[3]]<-rescale(loadings[[3]], to = c(min(scores[[3]]),max(scores[[3]])))

ggplot()+
geom_point(data=scores, aes(x=PC1, y=PC2))+
geom_segment(data=loadings, aes(x=0, y=0, xend=PC1, yend=PC2)
, arrow=arrow(length=unit(0.2,"cm")), alpha=0.25)+
geom_text(data=loadings, aes(x=PC1, y=PC2, label=Names), alpha=0.5, size=3)+
scale_colour_discrete("Variety")+
scale_x_continuous("Principal Component 1")+
scale_y_continuous("Principal Component 2")+
theme_bw()
Brandon


--
You received this message because you are subscribed to the ggplot2 mailing list.
Please provide a reproducible example: http://gist.github.com/270442
 
To post: email ggp...@googlegroups.com
To unsubscribe: email ggplot2+u...@googlegroups.com
More options: http://groups.google.com/group/ggplot2

pca.biplot.example.png

mateus...@gmail.com

unread,
Dec 7, 2010, 12:22:30 PM12/7/10
to Brandon Hurr, ggp...@googlegroups.com
Thanks,

That's useful.
PS.You should include explained variance at each axis as additional information.


Brandon Hurr

unread,
Dec 7, 2010, 12:55:39 PM12/7/10
to mateus...@gmail.com, ggp...@googlegroups.com
I'd love to, but how would you suggest doing so?
-B

Brandon Hurr

unread,
Dec 7, 2010, 3:54:49 PM12/7/10
to mateus...@gmail.com, ggplot2
Mateusz, 

Good idea. I refined it a little and it looks nice. 

scale_x_continuous(sprintf("PC1 (%s%%)", round(PCA@R2[1],digits=2)*100))+
scale_y_continuous(sprintf("PC2 (%s%%)", round(PCA@R2[2],digits=2)*100))+
theme_bw()
Thanks for sharing. 

-B


On Tue, Dec 7, 2010 at 20:15, mateus...@gmail.com <mateus...@gmail.com> wrote:
I think by

scale_x_continuous(sprintf("PC1 (%s)", round(PCA@R2cum[1],digits=2)))+
scale_y_continuous(sprintf("PC2 (%s)", round(PCA@R2cum[2],digits=2)))


2010/12/7 Brandon Hurr <bhi...@gmail.com>
1.png

mateus...@gmail.com

unread,
Dec 7, 2010, 4:01:32 PM12/7/10
to Brandon Hurr, ggplot2
Thanks for Your biplot method :)

2010/12/7 Brandon Hurr <bhi...@gmail.com>

mateus...@gmail.com

unread,
Dec 8, 2010, 4:57:50 AM12/8/10
to Brandon Hurr, ggplot2
Dear Brandon,

I was trying to make this biplot, but I run into problems:

1) I cannot use at the same time colour=factor(data$Treatment) and shape=factor(data$Subject) in geom_point(...)
Only one either colour or shape works.

2) When I put everything in function body, somehow it complains about missing labels in geom_text
Error: geom_text requires the following missing aesthetics: label
When I run code line by line it works...

3) Is the scaling loadings the correct way to do it ?

Here is the code:

x = read.csv('~/Desktop/dataset.csv');
factor = x[,1:5];

# Center and scale data
data = apply(x[,-(1:5)],2,function(x) (x - mean(x))/sd(x));

 
pca_biplot <- function(factor, data) {
    require(ggplot2);
    svd = svd(data);
    scores = as.data.frame(svd$u %*% t(diag(svd$d)));
    variances = svd$d;
    loadings = as.data.frame(svd$v);
    rownames(loadings) = colnames(data);


    loadings[[1]]<-rescale(loadings[[1]], to = c(min(scores[[1]]),max(scores[[1]])))
    loadings[[2]]<-rescale(loadings[[2]], to = c(min(scores[[2]]),max(scores[[2]])))

    ggplot() +
    geom_point(data=scores, aes(x=V1, y=V2, colour=factor(factor$Subject))) +
    scale_shape_manual(value=1:11) +
    geom_segment(data=loadings, aes(x=0, y=0, xend=V1, yend=V2) , arrow=arrow(length=unit(0.2,"cm")), alpha=0.25) +
    geom_text(data=loadings, aes(x=V1, y=V2,label=rownames(loadings)), alpha=0.5, size=2.2) +
    scale_colour_discrete("Variety") +
    scale_x_continuous(sprintf("PC1 (%s%%)", round(variances[1]^2 / sum(variances^2) * 100,digits=2)))+
    scale_y_continuous(sprintf("PC2 (%s%%)", round(variances[2]^2 / sum(variances^2) * 100,digits=2)))+
    theme_bw();
}

pca_biplot(factor,data);

Do You know how to fix it ?

Hadley Wickham

unread,
Dec 8, 2010, 8:48:35 AM12/8/10
to Brandon Hurr, mateus...@gmail.com, ggplot2
> #pure number scaling not very good but works... for now
> #rescaling Loadings to scores
> loadings[[1]]<-rescale(loadings[[1]], to =
> c(min(scores[[1]]),max(scores[[1]])))
> loadings[[2]]<-rescale(loadings[[2]], to =
> c(min(scores[[2]]),max(scores[[2]])))
> loadings[[3]]<-rescale(loadings[[3]], to =
> c(min(scores[[3]]),max(scores[[3]])))
> ggplot()+
> geom_point(data=scores, aes(x=PC1, y=PC2))+
> geom_segment(data=loadings, aes(x=0, y=0, xend=PC1, yend=PC2)
> , arrow=arrow(length=unit(0.2,"cm")), alpha=0.25)+
> geom_text(data=loadings, aes(x=PC1, y=PC2, label=Names), alpha=0.5, size=3)+
> scale_colour_discrete("Variety")+
> scale_x_continuous(sprintf("PC1 (%s%%)", round(PCA@R2[1],digits=2)*100))+
> scale_y_continuous(sprintf("PC2 (%s%%)", round(PCA@R2[2],digits=2)*100))+
> theme_bw()

You should also use coord_equal to ensure that distances are preserved
accurately.

Hadley

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

Brandon Hurr

unread,
Dec 8, 2010, 11:13:24 AM12/8/10
to mateus...@gmail.com, ggplot2
Matuesz, 

See comments inline...

On Wed, Dec 8, 2010 at 09:57, mateus...@gmail.com <mateus...@gmail.com> wrote:
Dear Brandon,

I was trying to make this biplot, but I run into problems:

1) I cannot use at the same time colour=factor(data$Treatment) and shape=factor(data$Subject) in geom_point(...)
Only one either colour or shape works.
 
When you are using ggplot you supply either the entire plot or each geom_* with a "data=". Hadley has recommend against using $ within a plot call to reference a column in your dataframe. Sadly, I do not have your dataset so I cannot say with certainty what is happening. 

Overall, it seems to me that you are trying to use color and shape to differentiate between certain samples of your data (rows in your original dataset). Normally this works since shape and color can be used together (I believe).  The way I've done this with my volatile data is to use cbind() to attach the descriptive information to the PCA scores (e.g. Variety, Treatments, cell lines, etc...). This way when you do your call it's simply...

geom_point(data=scoresExtended, aes(x=PC1, y = PC2, colour=Treatment, shape=Treatment)+ ...
 

2) When I put everything in function body, somehow it complains about missing labels in geom_text
Error: geom_text requires the following missing aesthetics: label
When I run code line by line it works...
 
I can't get it to work line by line without your data, but I'm guessing that what ggplot really wants is for you to make a new column in your loadings data.frame with your labels that you can reference as a column. It very easily could be something else entirely though since I'm just guessing. 
 
3) Is the scaling loadings the correct way to do it ?

I'm pretty sure it isn't. It doesn't look bad and so I stuck with it, but it is by no means the best way to do it. The reason for the scaling is because ggplot does not do more than one scale per axis. Run a princomp() on your data and plot the data using biplot() and you'll see what I mean about two scales. 

If you look at the original post about doing biplots the scaling was my main concern. If you have a suggestion, my ears are open. The statistician at my job suggested the following:

I tried 3 methods and in the end invented my own:

 

(1)    Multiply the loadings by the standard deviation of the scores (this is how SIMCA-P does it)

(2)    Multiply all the loadings by the square root of the number of variables

(3)    Divide the scores by the variance of each score column, or the eigenvalue (this is Richard Brereton’s method)

 

None of these give you what I think you want which is a superimposed plot.

 

My method, which is aesthetically pleasing at least, is to divide both the scores and the loadings by their row or column standard deviation.

 
He sent me a graph and it very closely resembled the biplot() output. Much more so than my version using simple number scaling. 

HTH,

Brandon

Brandon Hurr

unread,
Dec 8, 2010, 11:17:06 AM12/8/10
to Hadley Wickham, mateus...@gmail.com, ggplot2
Thanks Hadley. Do I understand correctly that coord_equal() simply locks the aspect ratio so that you can't skew your data when adjusting the window size?

B

Hadley Wickham

unread,
Dec 13, 2010, 1:44:44 PM12/13/10
to Brandon Hurr, mateus...@gmail.com, ggplot2
Yes, that's right.
Hadley

Gabrakadabra

unread,
Dec 16, 2010, 7:42:45 AM12/16/10
to ggplot2
I thing your scaling of the loadings will make them point a wrong
direction, compared to the output from biplot(). Compare your output
with the output from

x.pc <- princomp(data,scores=T,cor=T)
biplot(x.pc,choices=c(1,2)

Shouldn't you scale the loadings with the same value?

Regards Gabriel

On 8 Dec, 10:57, "mateusz.ka...@gmail.com" <mateusz.ka...@gmail.com>
wrote:
Reply all
Reply to author
Forward
0 new messages