In class, we were taught that PCA can be formulated as minimizing the projection residual.
Here are two formulations:
1) Without centering (zero mean shift)
min sum_i (|| W*W^T*x_i - x_i ||^2)
s.t. W^T*W = I
where x_i is the i-th feature vector and W is d*k, with k the number of PCs we want.
2) With mean shift, I guess the optimization could be formulated as
min sum_i (|| W*W^T*(x_i - m) + m - x_i ||^2)
s.t. W^T*W = I
where both W and m are optimization variables. If we form the Lagrangian, the stationarity condition with respect to m is satisfied when m is the mean of the x_i, and then it's basically
a PCA with centered data (mean shifted to zero).
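One way to spell out the stationarity step for m, as a sketch only. The shorthand P = I - W*W^T, the mean \bar{x}, and the sample count n are introduced here for illustration and are not notation from class. The constraint W^T*W = I does not involve m, so the gradient of the Lagrangian with respect to m is just the gradient of the objective:

```latex
\begin{align*}
J(W, m) &= \sum_{i=1}^{n} \lVert W W^\top (x_i - m) + m - x_i \rVert^2
         = \sum_{i=1}^{n} \lVert P (x_i - m) \rVert^2,
         \qquad P = I - W W^\top, \\
\nabla_m J &= -2 \sum_{i=1}^{n} P^\top P \, (x_i - m)
            = -2 P \sum_{i=1}^{n} (x_i - m)
            = -2 n \, P \, (\bar{x} - m),
            \qquad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i .
\end{align*}
```

The middle step uses that P is a symmetric projector when W^T*W = I, so P^T*P = P. Setting the gradient to zero gives P*(\bar{x} - m) = 0, which m = \bar{x} satisfies. Strictly speaking, any m = \bar{x} + W*c also satisfies it, since shifting m within the span of W does not change the residual.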
But I think the second formulation is weird. Think of the case where most of the data come from the same vector v with tiny perturbations (v + eps): doing PCA on the centered data will then give us useless PCs that merely characterize the noise. On the other hand, the
first formulation gives the first PC as (approximately) the direction of v.
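To make the v + eps case concrete, here is a minimal NumPy sketch, not a definitive experiment. The helper name `top_pc`, the dimension, the sample count, and the noise scale are all assumptions chosen for illustration. It computes the leading PC via the SVD of the data matrix, with and without centering, and compares each to the direction of v.

```python
import numpy as np

def top_pc(X, center):
    """Return the leading principal direction of the rows of X.

    center=False corresponds to formulation 1 (no mean shift),
    center=True to formulation 2 (subtract the sample mean first).
    """
    if center:
        X = X - X.mean(axis=0)
    # Rows of Vt are right singular vectors, sorted by decreasing singular value.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[0]

rng = np.random.default_rng(0)
d, n = 5, 200                            # assumed dimension and sample count
v = rng.normal(size=d)                   # the common vector v
v_unit = v / np.linalg.norm(v)
X = v + 1e-3 * rng.normal(size=(n, d))   # each row is v + eps (assumed noise scale)

pc_uncentered = top_pc(X, center=False)
pc_centered = top_pc(X, center=True)

# Absolute cosine between each leading PC and the direction of v
# (absolute value because a PC is only defined up to sign).
cos_uncentered = abs(pc_uncentered @ v_unit)
cos_centered = abs(pc_centered @ v_unit)
print("uncentered:", cos_uncentered)
print("centered:  ", cos_centered)
```

In this setup the uncentered leading direction should align closely with v, while the centered one is determined only by the noise, so its direction depends on the random sample.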
Thanks!