Gradient plateaus


daniel hernandez

Feb 26, 2016, 12:31:36 PM
to Manopt

Hi all,


I have a factor model where I’m assuming data Y is generated in terms of latent variables X according to:


Y = CX + s


where s is some Gaussian noise. Given Y, I have implemented an EM algorithm to optimize for X, C and s. In the case of C, I want to optimize restricted to the orthogonal matrices, hence manopt.
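(For illustration only, a minimal sketch of one way such data could be generated; the dimensions and noise level below are arbitrary placeholders:)

% Minimal sketch of the generative model Y = CX + s (placeholder sizes).
k = 50; j = 5; N = 1000;
C = orth(randn(k, j));       % k x j matrix with orthonormal columns
X = randn(j, N);             % latent variables
s = 0.1 * randn(k, N);       % Gaussian noise
Y = C*X + s;                 % observed data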


I find that:


1.- When I run manopt on simulated data, using the SAME value for the number of latent dimensions as in the simulation, manopt does its job perfectly and the EM algorithm converges to the correct minimum.


2.- When I run manopt on simulated data, but imposing a DIFFERENT value for the number of latent dimensions (a situation more akin to what happens with real data), it has some convergence issues. Specifically, there are cases in which the gradient stabilizes at a plateau for many outer iterations. Eventually, though, it does converge. I have been able to mitigate this problem considerably by tweaking Delta_bar and Delta0.


3.- When I run manopt on real data, the gradient stays in the plateau forever. Changes to Delta_bar and/or Delta0 seem to make no difference.
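(For context, Delta_bar and Delta0 are the maximum and initial trust-region radius options of Manopt's trustregions solver; a minimal sketch of how they are set, with placeholder values:)

% Tweaking the trust-region radii (placeholder values):
options.Delta_bar = 200;     % maximum trust-region radius
options.Delta0 = 50;         % initial trust-region radius
[x, xcost, info] = trustregions(problem, x0, options);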


So my question is: is there something else I could do to achieve convergence? Or is there any reason why it would be impossible for this problem to converge?


best, Daniel

Nicolas Boumal

Feb 26, 2016, 8:26:34 PM
to Manopt
Hello Daniel,

I believe I am missing something to understand your question correctly.

What are the dimensions of the objects you work with? Are X, Y and s vectors? Or are they matrices? Do they all have the same size?

In what norm do you optimize?

(It feels to me that there are too many parameters compared to the number of measurements for this estimation task to be well-defined; but most likely I am missing something.)

Cheers,
Nicolas

daniel hernandez

Feb 28, 2016, 2:53:06 AM
to Manopt
Hi Nicolas,

Sorry, I was not explicit enough.

Y is a k x N matrix and X is a j x N matrix, with k > j; N is the number of observations.

Thus, C is a k x j matrix whose j columns are k-dimensional orthonormal vectors. I want to optimize over this Stiefel manifold using manopt.

There is a bit more to it, since the actual data is noisy. I assume the noise is Gaussian, and I also want to optimize for the variances. So in truth, I am using a product manifold where the other factor is a k-dimensional Euclidean manifold corresponding to the variances.

Since Y = CX + s where s is Gaussian noise, the distribution of Y is itself Gaussian.

In order to optimize this problem, I use an EM algorithm; I only use manopt in the M step. The inputs to the M step are the current values of the X and Y matrices. The cost I want to optimize is simply the log-likelihood of (Y, X), which depends on the unknowns C and s. This is a quadratic function of C and a somewhat more complicated function of the noise variances.
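(For reference, assuming Gaussian noise with diagonal covariance \Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_k^2), the complete-data negative log-likelihood minimized in the M step typically takes the form

-\log p(Y \mid X, C, \Sigma) = \frac{N}{2}\log\det\Sigma + \frac{1}{2}\sum_{n=1}^{N} (y_n - C x_n)^\top \Sigma^{-1} (y_n - C x_n) + \text{const},

which is indeed quadratic in C, while the variances enter through both the quadratic term and the log-determinant.)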

Nonetheless, as I say, I run this algorithm on data I generated myself and it works fine (although I had to tweak Delta_bar and Delta0). But when I use it on real data, it doesn't converge for as long as I have been able to wait.

BM

Feb 28, 2016, 11:15:08 AM
to Manopt
Hello Daniel,

Thanks for the explanation. 

Do you jointly optimize X as well? Also, could you share the code snippet that deals with the Manopt part?

In my experience, the plateau effect could be an ill-conditioning issue arising from the variable C. Since the cost function is quadratic in C, would it be possible to find the second-order derivative with respect to C only (assuming that s is fixed)? If that is possible, one could look at preconditioning the algorithm.
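(For illustration only, a minimal, untested sketch of how a preconditioner could be attached to a Manopt problem structure; 'invHessC' is a hypothetical operator applying an approximation of the inverse Hessian with respect to C to a tangent vector:)

% Minimal sketch of registering a preconditioner in Manopt.
% 'M' is the manifold structure and 'invHessC' a hypothetical, user-supplied
% operator. If invHessC is symmetric positive definite on the ambient space,
% projecting its output back to the tangent space yields a symmetric positive
% definite preconditioner there, as the trust-region solver requires.
problem.M = M;
problem.precon = @(X, U) M.proj(X, invHessC(X, U));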

Regards,
Bamdev

Nicolas Boumal

Feb 29, 2016, 10:33:26 AM
to Manopt
Thanks for the clarifications.

I agree with Bamdev: it would be good to see the exact cost functions at play, because things remain a bit fuzzy for me. In particular, I am under the impression that perhaps your cost function in terms of C is ||CX - Y||_F (Frobenius norm); is that right? Because if so, this should have a closed-form solution in terms of the SVD. But you mentioned there being a more complicated component, a "function of the variance s" (but s is the noise matrix, right?).
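(For reference, a minimal sketch of that closed-form solution, the orthogonal Procrustes problem, with Y of size k x N and X of size j x N as described earlier:)

% Closed-form minimizer of ||C*X - Y||_F over C with orthonormal columns:
% take the thin SVD of Y*X' and discard the singular values.
[U, ~, V] = svd(Y*X', 'econ');
C = U*V';               % k x j, with C'*C = eye(j)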

Cheers,
Nicolas

daniel hernandez

Mar 20, 2016, 3:03:23 PM
to Manopt
Dear Nicolas, Bamdev,

I'm sorry for how long it took me to answer; I had some personal issues to deal with that prevented me from spending time on this project.

Please find below the segment of code with the cost function and gradient I'm using. This is a simplified version of the function I'm actually interested in, but it still displays the same non-convergence problem (the norm of the gradient does not diminish). I haven't implemented the Hessian; I will do so today, given that it is the most straightforward thing to do. Please note that I am tweaking Delta0 and Delta_bar; as I said, that helped in some other tests.

Is there any other reason why manopt may be failing to converge?
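(Side note: until a Hessian is supplied, trustregions falls back on a finite-difference approximation. Once implemented, the Hessian can be registered and numerically verified along these lines, where 'ehessfun' is a hypothetical name for a function returning the Euclidean Hessian applied to a tangent vector:)

% Hypothetical sketch of registering and checking a Hessian, once written.
problem.ehess = @(T, Tdot) ehessfun(T, Tdot);
checkhessian(problem);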

function [tempW, tempnoise, Tcost] = Mstep0(Y, Wold, X, CovX, Wabs, eps)
% M step for the EM algorithm for the case with no shared dimensions
%
% Inputs:
%       Y       -> nxN matrix of data measurements
%       X       -> mxN matrix of latent variable measurements
%       CovX    -> mxm matrix of total covariance of latent variables
%       Wold    -> OLD orthogonal W matrix (component of C), used as the
%                  initial guess for the solver
%       Wabs    -> fixed matrix entering the cost through the products below
%       eps     -> noise vector, used as the initial guess for epsntemp
%                  (note: this input shadows MATLAB's built-in eps)
% Outputs (results from the optimization):
%       tempW     -> orthogonal matrix that is a component of C
%       tempnoise -> noise vector

%% Define useful variables
[num_total, N] = size(Y);
[twice_num_lat, ~] = size(X);
num_neu = num_total;
Yn = Y(1:num_neu, :);
L = eye(twice_num_lat);

%% Define manifold, cost and gradient (add Hessian?)
tuple.Wtemp = stiefelfactory(num_neu, twice_num_lat);
tuple.epsntemp = euclideanfactory(num_neu, 1);
manifold = productmanifold(tuple);

%% Compute-only-once variables (combinations of the inputs that appear in
%% the cost and are constant throughout each call to Mstep0)
YnToWabs = Yn*X'*L'*Wabs';
WabsToWabs = Wabs*L*CovX*L'*Wabs';
YnYn = Yn*Yn';

    function [c, store] = cost(T, store)
        Wtemp = T.Wtemp;
        epsntemp = T.epsntemp;
        if ~all(isfield(store, {'eYn', 'eW'}))
            store.eYn = diag(epsntemp)*YnToWabs;
            store.eW = diag(epsntemp)*Wtemp*WabsToWabs;
        end
        eYn = store.eYn;
        eW = store.eW;

        % COST FUNCTION
        c = -trace(eYn*Wtemp') + trace(eW*Wtemp')/2 + trace(YnYn*diag(epsntemp))/2 - ...
            N*sum(log(epsntemp))/2;
    end

    % GRADIENT (returned as a Riemannian gradient via egrad2rgrad)
    function [g, store] = grad(T, store)
        Wtemp = T.Wtemp;
        epsntemp = T.epsntemp;
        if ~all(isfield(store, {'eYn', 'eW'}))
            store.eYn = diag(epsntemp)*YnToWabs;
            store.eW = diag(epsntemp)*Wtemp*WabsToWabs;
        end
        eYn = store.eYn;
        eW = store.eW;

        egrad.Wtemp = -eYn + eW;
        egrad.epsntemp = -diag(YnToWabs*Wtemp') + diag(Wtemp*WabsToWabs*Wtemp')/2 + ...
            diag(YnYn)/2 - (N/2)*ones(num_neu, 1)./epsntemp;

        g = manifold.egrad2rgrad(T, egrad);
    end

%% Problem definition
problem.M = manifold;
problem.cost = @cost;
% grad already converts to the Riemannian gradient, so register it as
% problem.grad; registering it as problem.egrad would make Manopt apply
% egrad2rgrad a second time.
problem.grad = @grad;
% d = problem.M.dim();

% Numerical check of the gradient (diagnostic; can be disabled once verified).
checkgradient(problem)

%% Solver
Told.Wtemp = Wold;
Told.epsntemp = eps(1:num_neu);
options.tolgradnorm = 1e-07;
options.maxiter = 1000;
options.maxinner = 100;
options.Delta_bar = 200;
options.Delta0 = 50;
[T, Tcost, ~, ~] = trustregions(problem, Told, options);

%% M-step assignments
tempW = T.Wtemp;
tempepsn = T.epsntemp;
tempnoise = tempepsn;

end

BM

Mar 21, 2016, 12:47:06 AM
to Manopt
Hello Daniel, 

Thanks for the code. 

Could you also send some sample input data (Y, Wold, X, CovX, Wabs, eps) so that we can do some experimentation?

Regards,
BM

daniel hernandez

Mar 21, 2016, 5:28:00 PM
to Manopt
Hi Bamdev,

Thanks for the reply. As for posting some input data, I need to ask, since the data is not public. I will get back to you.

Best, Daniel

BM

Mar 21, 2016, 10:59:18 PM
to Manopt
Hello Daniel, 

Private data is not required; synthetic input data would suffice. As we are (or at least I am) not familiar with the problem, could you suggest ways to generate it, e.g., Y = randn(n, N)?

Regards,
Bamdev

daniel hernandez

Mar 22, 2016, 2:25:34 AM
to Manopt
As I said before, the point is that the optimization seems to work, after tweaking Delta0 and Delta_bar, for the synthetic data I generated. It fails to converge for the real stuff.

Anyway, here is some simulated data.

If you run an iteration of the function above on this, it will converge. Hopefully, however, you will notice that the gradient stalls in a plateau for a while before converging, like so:

acc       k:     1     num_inner:     3     f: +1.200970e+03   |grad|: 8.789260e+02   reached target residual-kappa (linear)
acc       k:     2     num_inner:     5     f: +5.994562e+02   |grad|: 4.773828e+02   reached target residual-kappa (linear)
REJ TR-   k:     3     num_inner:     3     f: +5.994562e+02   |grad|: 4.773828e+02   negative curvature
REJ TR-   k:     4     num_inner:     3     f: +5.994562e+02   |grad|: 4.773828e+02   negative curvature
acc       k:     5     num_inner:     3     f: +3.549255e+02   |grad|: 4.533871e+02   negative curvature
REJ TR-   k:     6     num_inner:     3     f: +3.549255e+02   |grad|: 4.533871e+02   negative curvature
acc TR+   k:     7     num_inner:     1     f: +1.642852e+02   |grad|: 1.344847e+02   exceeded trust region
acc TR+   k:     8     num_inner:     3     f: +6.857179e+01   |grad|: 7.153111e+01   exceeded trust region
REJ TR-   k:     9     num_inner:     3     f: +6.857179e+01   |grad|: 7.153111e+01   exceeded trust region
acc       k:    10     num_inner:     3     f: +5.170228e+01   |grad|: 8.143036e+01   exceeded trust region
acc TR+   k:    11     num_inner:     5     f: +3.814884e+01   |grad|: 5.318295e+01   exceeded trust region
acc       k:    12     num_inner:     9     f: +2.597676e+01   |grad|: 1.094646e+01   reached target residual-kappa (linear)
acc       k:    13     num_inner:    10     f: +2.551829e+01   |grad|: 1.278534e+00   reached target residual-kappa (linear)
acc       k:    14     num_inner:    13     f: +2.551283e+01   |grad|: 1.081235e-01   reached target residual-kappa (linear)
acc       k:    15     num_inner:     7     f: +2.551279e+01   |grad|: 9.587191e-03   reached target residual-kappa (linear)
acc       k:    16     num_inner:    14     f: +2.551279e+01   |grad|: 5.474783e-05   reached target residual-theta (superlinear)
acc       k:    17     num_inner:    20     f: +2.551279e+01   |grad|: 5.922343e-09   reached target residual-theta (superlinear)

With real data, instead of ultimately converging, it remains in the plateau for as long as I can keep it running.
sim_data.mat

daniel hernandez

Mar 22, 2016, 2:31:38 AM
to Manopt
I forgot: the correspondence between the variables in the file and the inputs to the function Mstep0 above is as follows:

Var in file      Input

Ydat             Y
tempW            Wold
expX             X
expCovX          CovX
Wabs             Wabs
tempnoise        eps

Nicolas Boumal

Mar 22, 2016, 2:40:42 PM
to Manopt
Hello,

The plateau is not abnormal; it's not something you want, but it's not abnormal. It happens when your initial guess is fairly far away from the critical point you will eventually converge to: far away, the gradient is pretty big in vast areas; the algorithm will need to make a few steps before it enters the local convergence regime, at which point the gradient will finally start to decrease very fast.

The observation that the plateau is a stronger problem on real data could indicate either that the initial guess is even worse for real data, or that, with real data, the cost function makes it even more difficult to reach the local regime.

This being said, barring numerical issues and possible bugs in the code, RTR converges regardless of initialization under mild conditions; the conditions notably entail some smoothness in the cost and boundedness of the sublevel sets, and those are pretty standard. It doesn't guarantee that you get there fast though, unfortunately. Did you try running RTR for a very long time?
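(For illustration, a minimal sketch of options that let trustregions run much longer and record the gradient norm at each outer iteration; the limits below are arbitrary placeholders:)

% Give the solver a much larger budget and log its progress.
options.maxiter = 1e5;                 % many more outer iterations
options.maxtime = 12*3600;             % wall-clock limit, in seconds
options.tolgradnorm = 1e-7;
[T, Tcost, info] = trustregions(problem, Told, options);
% The info struct array records one entry per outer iteration.
semilogy([info.iter], [info.gradnorm], '.-');
xlabel('outer iteration'); ylabel('gradient norm');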

Nicolas