Optimizing using productmanifold with multiple summations in cost and grad


solomo...@gmail.com

May 21, 2015, 12:19:44 PM
to manopt...@googlegroups.com
Hello,

I'm working on a project that seems like it would lend itself well to implementation with Manopt, but my familiarity with optimizing over a manifold with a complicated objective function is limited. If any experts could assist, I would greatly appreciate it.

A. Objective Data and Parameters

Data: a panel of descriptor data, x, where the number of observations per date is not balanced, and y_x, the target for prediction for each observation on each date.

Parameters to learn:

1. A feature dictionary, D (a matrix), that is fixed (neither time- nor observation-varying, though this may be relaxed in future iterations). Its columns are individually constrained to have unit length.

2. y_d, a time series of vectors that are unconstrained and live in Euclidean space.

B. Objective Function (Cost)

Within each date, for simplicity of explanation, the objective function is characterized as:

argmin{D,y_d(t)} 0.5 * (1/n(t)) sum([y_x(t) - x(t) * D * y_d(t)]^2)

Aggregating over all dates, the objective becomes the simple average over all periods.
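In LaTeX notation (with T the number of dates; this is my reading of the setup above), the aggregated objective would read:

\min_{D,\,\{y_d(t)\}} \;\; \frac{1}{T} \sum_{t=1}^{T} \frac{1}{2\,n(t)} \left\| y_x(t) - x(t)\, D\, y_d(t) \right\|^2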

C. Gradient

One of the issues with this formulation is that the matrix derivative is not defined for x(t) * D * y_d(t), where the forms are matrix/matrix/vector. It is, however, defined for the case where x(t) is a vector. Hence, for any given date slice, I believe I can average the derivatives over the observations in an explicit form to get the derivative I care about for each date, and similarly average over all dates to get the derivative for the full panel.

Road blocks:

1. Specifying Cost and Gradient.

I believe specifying the manifold can be done as easily as:
manifold = productmanifold(struct('D', obliquefactory(x_cols,20), 'y_d', euclideanfactory(x_rows,1)));

But specifying the cost and gradient is more complicated. Is it ok to write a cost function with a loop in it and then reference its handle? Similarly for the gradient: I believe I need to specify the update for each of the two variables differently, since they are being optimized over different manifolds, but the updates are similarly complicated and I do not quite understand how to combine them.

One of the vexing issues is that the y_d vector is time-specific and can be updated with a simpler cost/grad structure, but the update for D needs the data for all dates. For this reason, I think specifying the updates for D and y_d(t) at each date is preferred, and then averaging over all t in order to get the correct update for D. A sketch of what I have in mind follows.
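Here is a minimal, untested sketch of that structure. All names here are placeholders of mine: I assume the panel is packed in a cell array data, with data{t}.x and data{t}.y_x for date t, and that y_d is stored with one column per date (which, if I understand the toolbox correctly, would mean euclideanfactory(20, T), one 20-vector per date, rather than a single vector).

function f = panel_cost(X, data)
    % X is a struct with fields D and y_d, as produced by productmanifold.
    T = numel(data);
    f = 0;
    for t = 1:T
        e_t = data{t}.y_x - data{t}.x * X.D * X.y_d(:, t);   % residual for date t
        f = f + (e_t' * e_t) / (2 * size(data{t}.x, 1));     % per-date cost
    end
    f = f / T;   % average over dates
end

function g = panel_egrad(X, data)
    % Euclidean gradient of panel_cost: a struct with the same fields as X.
    T = numel(data);
    g.D = zeros(size(X.D));
    g.y_d = zeros(size(X.y_d));
    for t = 1:T
        x_t = data{t}.x;
        n_t = size(x_t, 1);
        e_t = data{t}.y_x - x_t * X.D * X.y_d(:, t);
        for i = 1:n_t   % per-observation contributions to the gradient in D
            g.D = g.D - e_t(i) * (x_t(i,:)' * X.y_d(:, t)') / (n_t * T);
        end
        g.y_d(:, t) = -(x_t * X.D)' * e_t / (n_t * T);   % only date t touches y_d(t)
    end
end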

2. Solving.

Once I have the problem specified, I would like to solve it by gradient descent: get back the Euclidean update at each step, define a learning rate, take the 'step', and continue to iterate. Is it possible to do this? Or is there a more efficient way?
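If I understand the toolbox correctly, an untested sketch of such a fixed-rate loop (using the placeholder panel_egrad from above) would be the following; I gather that steepestdescent(problem) would do the same with a line search instead of a fixed rate:

X = problem.M.rand();                                    % random initial point (struct)
alpha = 0.01;                                            % my choice of learning rate
for iter = 1:500
    g = problem.M.egrad2rgrad(X, panel_egrad(X, data));  % Riemannian gradient
    X = problem.M.retr(X, g, -alpha);                    % step of size alpha along -g
end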

Nicolas Boumal

May 21, 2015, 3:46:40 PM
to manopt...@googlegroups.com
Dear Solomon,

Thank you for posting your question here.

I'll need more time to parse through your question adequately. It is still not clear to me what your cost function is. I understand that your variables are D and y_d. So the cost should be:

f(D, y_d) = ..... (an expression that evaluates to a real number.)

It will have a gradient with respect to D (a matrix) and with respect to y_d (a vector). Using productmanifold, points and vectors (including gradients) are represented as structures, with fields D and y_d.

You asked the following question: "Is it ok to write a cost function with a loop in it and then reference its handle?"
The answer is yes: you need to pass a function handle; it does not have to be defined "inline". It is absolutely ok (and often necessary) to define the cost / gradient functions in separate files, and to pass handles to those.
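For instance, schematically (with placeholder names such as the panel_cost sketched in your first message):

problem.M = manifold;
problem.cost  = @(X) panel_cost(X, data);    % defined in its own file
problem.egrad = @(X) panel_egrad(X, data);   % Euclidean gradient; Manopt converts it

Here X, and likewise the struct returned by the gradient handle, has fields D and y_d.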

I do not understand this, specifically:
"One of the issues with this formulation is that the matrix derivative is not defined for x(t) * D * y_d(t), where the forms are matrix/matrix/vector. It is however defined for the case where x(t) is a vector."

Your cost function should always return a real number. And consequently, it should always have a gradient (if it is smooth). What am I missing?

Cheers,
Nicolas





solomo...@gmail.com

May 21, 2015, 9:45:47 PM
to manopt...@googlegroups.com
Yes, let me clarify.

The cost function is the mean squared reconstruction error that evaluates to a scalar.

Because the full cost (and objective) is a function of the full data panel (multiple dates, where each date is characterized by a rectangular matrix with observations along the rows and characteristics along the columns), I will write out the cost for just one date; the necessary extension is simply to average over all dates. Please interpret the '(t)' as a time subscript.

cost(t) = (1/(2*n(t))) * e(t)' * e(t)

where e(t) is the reconstruction error for one date.

e(t) = [y_x(t) - x(t) * D * y_d(t)]

The objective is then to solve argmin_{D, y_d(t)} of the average of cost(t) over all dates t.



In response to the question about my statement about the derivative 'x(t) * D * y_d(t)' in my original question:

In order to obtain the derivative of the cost with respect to D, it is necessary to differentiate the term 'x*D*y_d' with respect to the middle term, D (chain rule). This derivative has a very simple matrix form when x is a row vector. This is from the Matrix Cookbook by Petersen & Pedersen, page 10, eq. (70).

It is presented in the text as d(a'Xb)/dX = ab', where a = column vector, X = matrix, b = column vector. (The book is available online as the first search result for its title.)

However, the x in my problem is not a vector, it is a matrix. For this reason, I believe it is necessary to break it into its rows (one per observation), find the derivative associated with each row, and then average over the set of these.
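Spelled out (again, please forgive any errors), applying eq. (70) with a = x_i(t)' (the transposed i-th row of x(t)) and b = y_d(t) to each squared residual gives, in LaTeX notation:

\frac{\partial}{\partial D} \, \tfrac{1}{2} e_i(t)^2 = -\, e_i(t) \, x_i(t)^\top y_d(t)^\top,
\qquad
\frac{\partial \, \mathrm{cost}(t)}{\partial D} = -\frac{1}{n(t)} \sum_{i=1}^{n(t)} e_i(t) \, x_i(t)^\top y_d(t)^\top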

*edited to correct a number of mistakes (apologies for filling up the mailbox of any thread subscribers)

Nicolas Boumal

May 22, 2015, 1:24:29 PM
to manopt...@googlegroups.com
When you say that the x in your problem is not a vector, what does that imply about the size of e(t)? Is that always a vector? If it is not, then e(t)' * e(t) is not a scalar. Perhaps you mean to take the trace of it? (that would return the squared Frobenius norm of e(t)).

solomo...@gmail.com

May 22, 2015, 1:58:35 PM
to manopt...@googlegroups.com
Hey Nicolas,

Thanks for taking the time to respond with your thoughts and expertise.

x can be thought of as a three-dimensional panel. The slices of x, denoted x(t) when sliced along the date dimension, are rectangular matrices.

e(t) will always be a column vector of the reconstruction errors for date t, so e(t)'*e(t) will always be a scalar.

solomo...@gmail.com

May 22, 2015, 2:14:04 PM
to manopt...@googlegroups.com
Please forgive any errors in the following. I believe this reasonably defines the structure for the cost and for the gradients of the Euclidean and non-Euclidean components. These are for a single period only.

function c_t = cost_t(x_t, D, yx_t, yd_t)
    n_obs_t = size(x_t, 1);

    e_t = yx_t - x_t*D*yd_t;            % error column vector
    c_t = e_t' * e_t;                   % cost (sum of squared errors)
    c_t = (1/(2*n_obs_t)) * c_t;        % average
end

function g_d_t = grad_d_t(x_t, D, yx_t, yd_t)
    n_obs_t = size(x_t, 1);

    e_t = yx_t - x_t*D*yd_t;            % error column vector

    g_d_t = zeros(size(D));             % preallocate gradient matrix
    for i = 1:n_obs_t
        % Accumulate gradients over all observations at time t.
        %
        % I do not know a simple matrix formula for this gradient
        % that would allow me to avoid the for-loop,
        % though I imagine one could be formulated.
        %
        % e_t(i)    scalar
        % x_t(i,:)  row vector
        % yd_t      column vector
        g_d_t = g_d_t + (-1) * e_t(i) * (x_t(i,:)' * yd_t');   % grad(i,t)
    end

    g_d_t = (1/n_obs_t) * g_d_t;        % average
end

function g_yd_t = grad_yd_t(x_t, D, yx_t, yd_t)
    n_obs_t = size(x_t, 1);

    e_t = yx_t - x_t*D*yd_t;            % error column vector

    g_yd_t = (-1) * (x_t * D)' * e_t;   % grad(t)

    g_yd_t = (1/n_obs_t) * g_yd_t;      % average
end

Nicolas Boumal

May 27, 2015, 5:31:45 AM
to manopt...@googlegroups.com
Thank you for the details.

Is your question that something doesn't work, or that you'd like to make it faster?

The for-loop is not too much of a problem for preliminary testing. This can always be optimized later on, when the other aspects of the problem are more stable (oftentimes, the exact cost function changes as we adapt the model of our problem). That being said, I often find that the tools multiprod, multitrace, multitransp, etc., which ship with Manopt, are a great help in vectorizing operations on 3D matrices (not sure you have that here). Likewise, bsxfun is a great Matlab function for vectorizing common operations. Lastly, writing down the formulas for the gradient (as a sum of terms) may help identify that the sum really comes down to a matrix product of some sort. That often happens.
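For instance, unless I am mistaken, the accumulation loop in your grad_d_t collapses to a single product, since summing e_t(i) * x_t(i,:)' over i yields x_t' * e_t:

g_d_t = -(1/n_obs_t) * (x_t' * e_t) * yd_t';   % replaces the for-loop in grad_d_t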

Cheers,
Nicolas

solomo...@gmail.com

May 27, 2015, 10:49:47 AM
to manopt...@googlegroups.com
Thanks Nicolas,

My main reason for writing them out was to get feedback on whether the form was acceptable as a first step. I'm used to optimization setups that either 1. require only the objective function and solve numerically, or 2. require the objective and gradient in matrix form for efficiency. Being able to provide the gradient as a function handle is a new concept to me.


Nicolas Boumal

May 27, 2015, 12:03:44 PM
to manopt...@googlegroups.com
I see. Well then, indeed I confirm that there is no harm in having loops and even if/else blocks in the code for your cost and gradient functions, as long as the resulting function has the required mathematical properties (typically, sufficient smoothness). If you find that your current code solves your problem, then the only reason for putting in the extra effort to get rid of loops would be to gain in terms of computation time (and perhaps clarity, sometimes).

Did you use checkgradient to make sure the cost and gradient seem compatible? (I may have asked you that already.)
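That is, once problem.M, problem.cost and problem.egrad are populated (with placeholder names as in the sketches above), a single call suffices:

checkgradient(problem);   % log-log error plot: a slope of 2 indicates cost and gradient agree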

(We do recognize that the necessity to write explicit code for the gradient is a pain, sometimes. For now though, we do not have a simple solution to spare this effort to the user.)

Cheers,
Nicolas
