PCA for high huge data

Dao Thanh Tuan

Nov 13, 2009, 2:56:50 AM

to Face Recognition Research Community

Dear all,
I'm building an algorithm in which I need to use PCA for
dimensionality reduction. However, I have to deal with huge data:
the dimension of my data can reach 1 to 2 million. As far as
I know, it is almost impossible for PCA/LDA to deal with such a high
dimension.
Do you know of any solution to my problem? Could I adapt PCA to
work with data of that dimension, or should I use another
reduction method? If the second option is viable, does anyone know
a method that can work with such a high dimension?
Thanks and regards,
Tuan.

Anouar

Nov 13, 2009, 3:13:04 AM

to face...@googlegroups.com

Try looking at the snapshot method for PCA.
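
For readers unfamiliar with the snapshot method, here is a minimal sketch, assuming NumPy and a data matrix with one sample per row. The idea is to eigendecompose the small n x n matrix of sample inner products instead of the d x d covariance matrix, then map the eigenvectors back to feature space. The function name snapshot_pca and the toy sizes are illustrative assumptions, not something from this thread.

```python
import numpy as np

def snapshot_pca(X, k):
    """PCA via the n x n Gram matrix; X has shape (n_samples, n_features)."""
    Xc = X - X.mean(axis=0)                  # center each feature
    gram = Xc @ Xc.T                         # (n, n) instead of (d, d)
    evals, evecs = np.linalg.eigh(gram)      # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:k]      # indices of the k largest
    evals, evecs = evals[order], evecs[:, order]
    # Map Gram eigenvectors back to feature space; dividing by sqrt(eigenvalue)
    # gives unit-length principal axes. Requires k <= n_samples - 1.
    components = (Xc.T @ evecs) / np.sqrt(evals)       # (d, k)
    explained_variance = evals / (X.shape[0] - 1)
    return components, explained_variance

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2000))              # toy sizes: 50 samples, 2000 dimensions
components, variances = snapshot_pca(X, k=5)
scores = (X - X.mean(axis=0)) @ components   # (50, 5) reduced data
```

The cost is driven by the number of samples rather than the number of dimensions, which is why this approach is attractive when there are far fewer samples than features.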

2009/11/13, Dao Thanh Tuan <snow.w...@gmail.com>:

> --
>
> You received this message because you are subscribed to the Google Groups
> "Face Recognition Research Community" group.
> To post to this group, send email to face...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/face-rec?hl=.
>
>
>

--
Sent from my mobile

chris

Nov 13, 2009, 10:53:03 AM

to Face Recognition Research Community

Awesome. I've been needing to know this, too. It appears that SciPy
(for Python) has scipy.linalg.eigh(), which lets you specify which
eigenvalues to return through the "eigvals" parameter (newer SciPy
releases call it "subset_by_index"). As far as I can tell, eigh() is a
symmetric eigensolver rather than a PCA implementation, but asking it
for only the leading eigenpairs of the covariance matrix should let
PCA run much faster.
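
A small sketch of that idea, assuming SciPy 1.5 or later, where the index range is passed as subset_by_index. The toy data and the variable names are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))           # toy data: 200 samples, 30 features
cov = np.cov(X, rowvar=False)            # (30, 30) symmetric covariance matrix

k = 3
d = cov.shape[0]
# eigh sorts eigenvalues in ascending order, so the k largest
# occupy indices d-k .. d-1 (inclusive).
top_vals, top_vecs = eigh(cov, subset_by_index=[d - k, d - 1])
top_vals, top_vecs = top_vals[::-1], top_vecs[:, ::-1]   # largest first

reduced = (X - X.mean(axis=0)) @ top_vecs                # (200, 3) projected data
```

Note that this still requires building the full covariance matrix, so it saves solver time but not memory.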


rathi

Nov 14, 2009, 9:02:33 PM

to face...@googlegroups.com

Hi,

Maybe you can try using the SAS package for data of this size.

Regards,

rathi

kamal shah

Nov 16, 2009, 5:51:31 AM

to face...@googlegroups.com

Hi,

How many different subjects do you have?

If you can reduce the number of poses per subject, then the database can become manageable, or you can use transforms to reduce the data.

Kamal Shah


Dao Thanh Tuan

Nov 17, 2009, 1:01:28 AM

to Face Recognition Research Community

Hi all,
Thanks for so many helpful suggestions.
I've been looking around, and the closest proposals seem to be able to
deal with only thousands of dimensions. Typically either the cost of
constructing the covariance matrix is too high, or memory runs out when
the number of individuals in the dataset exceeds several
thousand. So I have decided to chain many PCAs like this: I split
my data into many parts, each consisting of a small number of
dimensions, say 1000. Then I apply PCA to each part, keeping about 100
dimensions from each, concatenate all the results, split them
again, apply PCA again, and so on until I have a very small number of
dimensions.
Do you think this method makes sense? My data is
expected to be highly correlated, since it's randomly
generated from the same prototype.
Thanks and regards.

kamal shah

Nov 17, 2009, 2:32:12 AM

to face...@googlegroups.com

Hi

If the data is highly correlated, then it should work. This method seems similar to a data mapping method. You could also study some data warehousing algorithms, because they handle huge data in the way you are suggesting.

kamal


kamal shah

Nov 17, 2009, 2:37:42 AM

to face...@googlegroups.com

--- On Tue, 17/11/09, kamal shah <shah....@yahoo.com> wrote:

hulijo

Nov 17, 2009, 4:39:46 AM

to Face Recognition Research Community

Hi all,
I am not sure I understand your problem. What are you working on? The
dimensionality should never explode like that. If you have, let's say,
1 million face images of a few thousand people and the images are, let's
say, 100x100 pixels, then the covariance matrix needed for PCA would be
10000x10000, which should be feasible to hold in RAM. On the other
hand, if you have fewer images, e.g., 5000, and the dimensionality of
your images is huge, 10000x10000 (100 Mpixels), then the matrix needed
for PCA (the sample-by-sample Gram matrix) would be 5000x5000. So what
are you working on? Could you give more details on your problem with
respect to the dimensionality of the images and the number of samples
you have?

Regarding your separation technique, the problem is the same: it is
impossible to say whether it is OK unless you give details about your
experimental setup.

Regards,

Vito


Dao Thanh Tuan

Nov 18, 2009, 1:38:21 AM

to Face Recognition Research Community

Hi,
Thanks, Hulijo and Kamal, for your suggestions.
Let me describe my problem in a little more detail. As Kamal realized,
it's a mapping problem. I don't want to run PCA on the pixels; I
have already extracted the features of the images.
1. Assume I have a set of 64 features, each of which is 0 or 1.
2. I use an algorithm that randomly perturbs those features. Now
I have another, very large set of features, say 1 million.
3. I cannot apply standard PCA to this kind of dataset, with its
1 million features, because, as you said, it will not be possible to
construct the covariance matrix.
4. What I am sure about regarding the huge dataset is:
i. It has a lot of redundancy
ii. The features are probably highly correlated
5. My proposal is to divide the 1 million dimensions into 1000
subsets, apply PCA to each part, reunite the results, then divide again,
apply PCA, reunite, and so on until we have an acceptably small dimension.
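
The divide / PCA / reunite proposal in step 5 could be sketched roughly as follows, assuming NumPy, a single round of splitting, and toy sizes. The function names, the block and component counts, and the way the synthetic data is generated from 64 binary features are all assumptions made for illustration. Note that this is an approximation: each block's PCA only sees correlations within that block, so the result is generally not identical to exact PCA on the full matrix.

```python
import numpy as np

def pca_scores(X, k):
    """Project X (n_samples, n_features) onto its top-k principal axes."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k]                      # (n_samples, k) scores

def blockwise_pca(X, block_size, k_per_block, k_final):
    """One divide / PCA / reunite round, followed by a final PCA."""
    blocks = [X[:, i:i + block_size] for i in range(0, X.shape[1], block_size)]
    reduced = np.hstack([pca_scores(b, k_per_block) for b in blocks])
    return pca_scores(reduced, k_final)

# Synthetic stand-in for the data: 64 binary features expanded into
# 5000 redundant, highly correlated dimensions plus a little noise.
rng = np.random.default_rng(0)
prototype = rng.integers(0, 2, size=(300, 64)).astype(float)
mixing = rng.normal(size=(64, 5000))
X = prototype @ mixing + 0.01 * rng.normal(size=(300, 5000))

Z = blockwise_pca(X, block_size=500, k_per_block=20, k_final=10)   # (300, 10)
```

For more rounds, the same function can be applied to the concatenated output before the final PCA.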

Do you think that makes sense, and what could I additionally do to
improve the result?

Thanks and regards,
Tuan

Xue

Nov 18, 2009, 3:44:57 AM

to face...@googlegroups.com

I think this idea will in the end produce new random features that may be entirely unrelated to the human face, unlike what the original eigenfaces capture.


Alberto Escalante

Nov 18, 2009, 4:31:46 AM

to face...@googlegroups.com

Dear Tuan,

The kind of data analysis you want to do can be done using hierarchical networks. Take a look at the MDP library:
http://mdp-toolkit.sourceforge.net/

Then take a look at mdp.nodes.PCANode, mdp.hinet.Layer, and mdp.hinet.Switchboard.

Regards,
Alberto

elkhiyari

Nov 19, 2009, 9:19:25 AM

to Face Recognition Research Community

You may want to look at Singular Value Decomposition (SVD) for your
PCA implementation instead of the covariance matrix method. SVD is
more numerically stable and works better for high-dimensional data. If
you have MATLAB, use the svd() command.
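
As a rough illustration of PCA through SVD, here is a NumPy sketch rather than MATLAB, with toy sizes chosen as an assumption. The thin SVD of the centered data gives the principal axes directly, and no d x d covariance matrix is ever formed.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))             # toy sizes: 40 samples, 500 dimensions
Xc = X - X.mean(axis=0)                    # center each feature

# Thin SVD: U is (40, 40), s is (40,), Vt is (40, 500).
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 5
components = Vt[:k]                                    # principal axes, one per row
explained_variance = s[:k] ** 2 / (X.shape[0] - 1)     # variance along each axis
scores = Xc @ components.T                             # equals U[:, :k] * s[:k]
```

The squared singular values divided by (n - 1) are the eigenvalues of the covariance matrix, so this yields the same decomposition as the covariance route.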

Hachim

hulijo

Nov 22, 2009, 7:56:47 AM

to Face Recognition Research Community

You have again not specified the number of samples you have. If you
had to arrange your data into a matrix, how large would it be?
64x100000? If so, then you simply compute the 64x64 covariance matrix
and combine it with the data to produce the principal components.
While I agree that SVD would be an alternative, as it works on
non-square matrices, it is still computationally expensive for very large
datasets. I suggest you specify both the data dimensionality and
the number of samples you have. It should be feasible to perform PCA
on large sets as well.


Dao Thanh Tuan

Nov 24, 2009, 12:49:24 AM

to Face Recognition Research Community

Hi all,
Yes, SVD so far seems to be the only way for me if I use PCA
(it's the fastest way to compute exact PCA so far, right?).
My matrix size is 3000 x 1000000, which is too big to do anything with directly.
So I'm going to split it into 1000 matrices of size 3000 x 1000.
Here, 3000 is my sample size and 1000000 is the number of
dimensions.
Thanks all.
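
One possible way to use such column blocks without giving up exact PCA, sketched here under the assumption that each 3000 x 1000 block can be loaded on its own: accumulate the 3000 x 3000 sample-by-sample Gram matrix block by block (about 72 MB in 8-byte floats), then eigendecompose it once. The names gram_in_chunks and load_block are hypothetical, and the toy matrix stands in for the real data.

```python
import numpy as np

def gram_in_chunks(load_block, n_samples, n_features, chunk):
    """Accumulate the centered (n, n) Gram matrix one column block at a time."""
    gram = np.zeros((n_samples, n_samples))
    for start in range(0, n_features, chunk):
        block = load_block(start, min(start + chunk, n_features))
        # Centering is done per feature, so centering each block is exact.
        block = block - block.mean(axis=0)
        gram += block @ block.T
    return gram

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 1000))        # toy stand-in for the 3000 x 1000000 matrix

def load_block(lo, hi):
    """Hypothetical loader; in practice this might read columns from disk."""
    return X[:, lo:hi]

gram = gram_in_chunks(load_block, 30, 1000, chunk=100)
evals, evecs = np.linalg.eigh(gram)                    # ascending eigenvalues
k = 4
# Reduced data (PCA scores) for the k largest components, up to sign.
scores = evecs[:, ::-1][:, :k] * np.sqrt(evals[::-1][:k])
```

Only one block needs to be in memory at a time, and the resulting scores match those of a full SVD up to the sign of each column.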

nilima kachare

Nov 24, 2009, 12:54:21 AM

to face...@googlegroups.com

Hi friend,

Can you forward me the code for SVD?

Thanks and Regards

To unsubscribe from this group, send email to face-rec+u...@googlegroups.com.

For more options, visit this group at http://groups.google.com/group/face-rec?hl=en.

--
Nilima Kachare
MTech -CSE,
College of Engg, Pune

Zou Wilman

Nov 24, 2009, 2:20:30 AM

to face...@googlegroups.com

It is easy to decompose a matrix consisting of only 3000 samples.

Try using C++ rather than MATLAB to do that.

wilman


--
W.W. ZOU
HKBU, Ph.D Candidate
