R operator for fast exact multiplication by Hessian


James Bergstra

unread,
Dec 14, 2010, 11:03:35 PM12/14/10
to thean...@googlegroups.com, George Dahl, Aaron Courville
I was talking with George Dahl at NIPS and he was suggesting the next killer-feature for Theano - the 'R' operator.

Hessians have come up on the mailing list from time to time, and I understand that this operator is a key part of James Martens' Hessian-free algorithm. It looks to me like the 'R' operator would be pretty straightforward to implement as a graph-transformation macro in Theano. If anyone's interested in implementing this functionality, I'm offering encouragement and technical support.
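For context, Pearlmutter's R operator is just the directional derivative R_v{g(w)} = d/dr g(w + r*v) at r = 0; applied to the gradient it yields the Hessian-vector product H*v. A tiny numerical sketch (the function and values here are made-up toys, not anything Theano-specific):

```python
import numpy as np

def grad_f(w):
    # gradient of the toy function f(w) = sum(w**4)
    return 4 * w ** 3

w = np.array([1.0, -2.0, 0.5])
v = np.array([0.2, 1.0, -1.0])

# R_v{grad_f}(w) = d/dr grad_f(w + r*v) at r = 0, i.e. H(w) @ v,
# approximated here by a central finite difference
eps = 1e-6
Hv = (grad_f(w + eps * v) - grad_f(w - eps * v)) / (2 * eps)

# exact Hessian of sum(w**4) is diagonal: 12 * w**2
H = np.diag(12 * w ** 2)
assert np.allclose(Hv, H @ v, atol=1e-4)
```

The point of a proper R operator is to compute this product exactly and in one pass, without the finite-difference approximation used in this sketch.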

I'm also making this a new ticket in our TRAC.

Recommended reading:
- Pearlmutter - Fast Exact Multiplication by the Hessian
- Schraudolph - Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent

James
-- 

Olivier Grisel

unread,
Dec 15, 2010, 5:30:40 AM12/15/10
to thean...@googlegroups.com, George Dahl, Aaron Courville
2010/12/15 James Bergstra <james.b...@gmail.com>:

I agree this would be a killer feature. Isn't David working on
something related?

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

David Warde-Farley

unread,
Dec 15, 2010, 5:47:18 AM12/15/10
to thean...@googlegroups.com, Olivier Grisel, George Dahl, Aaron Courville

Razvan and I had talked about specifically implementing the Gauss-Newton version, but we haven't gotten anywhere, mostly because of other time constraints. Maybe over the winter break I'll be able to finally digest those papers; the Schraudolph one is a bit dense.

David

James Bergstra

unread,
Dec 15, 2010, 1:39:14 PM12/15/10
to thean...@googlegroups.com, Olivier Grisel, George Dahl, Aaron Courville

Yoshua Bengio

unread,
Dec 15, 2010, 1:58:01 PM12/15/10
to thean...@googlegroups.com

It's indeed a good idea!

-- Yoshua


Brian Vandenberg

unread,
Feb 23, 2011, 4:21:54 PM2/23/11
to thean...@googlegroups.com
  I don't have a lot of free time to help with coding, but if you guys still have some questions on Pearlmutter & Schraudolph's papers I'd be happy to help out.

  As David mentioned, these papers are a little dense (Schraudolph's more so than Pearlmutter's), mostly because the terminology and notation are fairly abstract, but I think I've finally figured out most of it.  There are a few details I'm no doubt missing, but I have enough to put it down in code.

  Email may not be the best medium to carry on a discussion about this (latency of responses, math notation in emails, etc), so if anyone wants to meet in a chat room or whatever, I'd be happy to help in whatever way I can.

-Brian

Razvan Pascanu

unread,
Feb 27, 2011, 5:18:07 PM2/27/11
to thean...@googlegroups.com, Brian Vandenberg
I intend to take a stab at implementing the R operator, and will likely need your help, but I have an exam a week from now (March 8th) for which I really need to study, so I will only start working on this afterwards. I mainly wrote this email so that you know your proposal did not go unnoticed ;).

 If anyone wants to join in the effort let me know :). 

Razvan

David Warde-Farley

unread,
Feb 27, 2011, 5:51:27 PM2/27/11
to thean...@googlegroups.com, Brian Vandenberg
Yeah, a good number of Theano developers are taking the UdeM AI area predoc next week. There's a sprint tentatively planned for the week after though.

Brian Vandenberg

unread,
Feb 28, 2011, 11:55:31 AM2/28/11
to thean...@googlegroups.com
  Sounds good.  I wish I were in the middle of doing that, but I'm afraid it'll be a few years before I have the time to start a PhD.

  If I can find the time this week I'll start working on a write-up in LaTeX.  I have Matlab code written for most of it, but I don't own it, so I'll just have to do some pseudo-code or something.

-Brian

Brian Vandenberg

unread,
Mar 7, 2011, 11:49:19 PM3/7/11
to thean...@googlegroups.com

Brian Vandenberg

unread,
Mar 8, 2011, 12:27:25 AM3/8/11
to thean...@googlegroups.com
  As a follow-up, there's one more thing to consider here.

  Pearlmutter came up with an interesting way to combine calculations that allows you to get an f1 pass at the same time as performing an f0 pass, and similarly allow you to get an r2 pass when performing an r1 pass.  This is Nic Schraudolph's description of the process (personal email):

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pearlmutter's trick: Change the code computing your function/gradient to work in the complex domain. Let dx denote a perturbation on input x. Stick the perturbation into the imaginary channel, scaled down by a factor of 1e100 - in other words, the new complex input is x + 1e-100*i*dx. Now if you compute f in the complex domain, you get 1e-100*df(x) in the imaginary component. In short, an f_1 pass can hitch a ride in the imaginary channel of an f_0 pass, and likewise for an r_2 pass with an r_1 (gradient) pass.

This works because differential arithmetic is almost the same as complex arithmetic, the only difference being that in the complex product rule, the product of the imaginary components contributes (with a minus sign) to the real result; this term is zero in the differential product rule. By scaling the imaginary component with 1e-100, the offending term is scaled down by 1e-200, close enough to zero for practical purposes.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
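The trick Nic describes is easy to try numerically. A small sketch (the function below is an arbitrary smooth stand-in, not anything from the thread): the f1 pass rides in the imaginary channel of the f0 pass.

```python
import numpy as np

def f(x):
    # any smooth function built from complex-safe operations
    return np.sum(np.sin(x) ** 2)

x = np.array([0.3, -1.2, 0.7])
dx = np.array([1.0, 0.5, -0.25])    # perturbation direction
eps = 1e-100

# f0 and f1 passes at once: the perturbation rides in the imaginary channel
val = f(x + eps * 1j * dx)
f0 = val.real                        # ordinary function value
jvp = val.imag / eps                 # directional derivative J . dx

# analytic check: grad of sum(sin(x)**2) is sin(2x)
assert np.isclose(f0, f(x))
assert np.allclose(jvp, np.sin(2 * x) @ dx)
```

Because eps is 1e-100, the "offending" imaginary-times-imaginary term Nic mentions is scaled by 1e-200 and underflows to zero, so the directional derivative comes out exact to machine precision (no finite-difference cancellation).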

I ran across a slide-show written by Pearlmutter describing this trick awhile back, but I'm having trouble locating it.  It was for a summer school class on machine learning, I think.

I'm a little dubious about its usefulness, but I also haven't implemented/tried it (and it may be that it'd be far more useful for other algorithms).  To calculate Gv you don't need an r2 pass, but you do repeatedly perform r1/f1 passes during conjugate gradient (or whatever).  At least for implementing James Martens' algorithm, I perform f0 and r1 on the entire data-set and then don't touch the f0 pass again until the next weight update/epoch.

If you implement something to calculate Hv, then you should be able to skip the f1 pass altogether and calculate Hv in a single r1 pass -- so that'd be handy.

-Brian

Yoshua Bengio

unread,
Mar 8, 2011, 7:09:04 AM3/8/11
to thean...@googlegroups.com

Brian,

Can you please explain how you can get a gradient *vector* out of this complex fprop pass? I don't get it.
If f is a scalar and x is a vector, the backprop should yield a vector, but a complex fprop like you describe will only yield the scalar that is the sum of that vector in df(x).

-- Yoshua

Brian Vandenberg

unread,
Mar 8, 2011, 11:06:50 AM3/8/11
to thean...@googlegroups.com
Yoshua,

  It's a shortcut, basically.  In the case of the loss function and the output-layer non-linearities not matching, you do a full forward pass completely through the loss function to generate intermediate results necessary for later passes.  To back-prop, you start with a scalar, u=1, and back-prop that through the loss function as well as the network.

  In the matching loss function case, you don't have to do the full forward pass; instead you can stop at the point Schraudolph describes in section 2.1 of his paper, where the Jacobian of M(N) is JM' = Az + b, and in the logistic, softmax, and linear output cases A is the identity matrix and b = -grad(f), so instead of back-propagating a scalar I can back-prop that gradient.

  If the output non-linearity & loss function match, the math works out the same either way you do it ... though, I'm not entirely sure how to handle biases yet.
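As a concrete numerical illustration of the matching-loss shortcut (here for the softmax/cross-entropy pair, with made-up values), the vector you start back-prop from collapses to p - t:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1.0, -0.5, 2.0])     # pre-softmax outputs (made up)
t = np.array([0.0, 1.0, 0.0])      # one-hot target

p = softmax(z)
shortcut = p - t                    # start back-prop from this vector directly

# long route: central finite differences through cross-entropy(softmax(z))
def loss(z):
    return -t @ np.log(softmax(z))

eps, num = 1e-6, np.zeros_like(z)
for k in range(z.size):
    zp, zm = z.copy(), z.copy()
    zp[k] += eps
    zm[k] -= eps
    num[k] = (loss(zp) - loss(zm)) / (2 * eps)
assert np.allclose(shortcut, num, atol=1e-6)
```

The same collapse happens for logistic output with cross-entropy and linear output with squared error, which is exactly why those pairs are called "matching" losses.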

-Brian


Brian Vandenberg

unread,
Mar 8, 2011, 12:28:24 PM3/8/11
to thean...@googlegroups.com
  Oh, one more thing I meant to explain.

  The error function E is a scalar, and its gradient with respect to W is another matrix whose size is the same as the transpose of W.

  I've found very little readable material on calculus as it relates to more complex structures like matrices, but luckily someone out there has produced an incredibly readable resource on the subject: http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/calculus.html.  I found it while reading a book on dynamics & estimation theory (ISBN 158488391X).

  The equations in my write-up are sub-scripted, presumably because it's easier to follow, but my code looks more like this (with a lot more asserts and whatnot):

function [JTV] = r1( ... )
  % back-prop the vector u through the network; ds{kk} holds the elementwise
  % derivative of layer kk's nonlinearity, y{kk} the activations feeding layer kk
  dx = ds{end} .* u;

  for kk = (num_layers-1):-1:2
    JTV{kk} = y{kk}' * dx;          % outer product, accumulated over samples
    dx = ds{kk} .* (dx * W{kk}');   % push dx back through layer kk's weights
  end;

  JTV{1} = y{1}' * dx;
end

  JTV (the output of the back-prop pass) is an array of smaller weight matrices that connect each layer together as opposed to having one huge W with lots of zeros.  The arithmetic used to calculate JTV{...} is an outer-product.

  Each row of y{...} is a sample.  y{...} and dx always have the same number of rows, so size(y' * dx) = size(W) (for the particular W{...} being updated).

  disclaimer: I might've messed up the expressions above a little, I'm doing this from memory. I'm banking on your intuition filling in where I screw up.
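The same back-prop loop can be sketched in Python/NumPy with a numerical check attached (a minimal two-layer network with a linear output and squared error; the layer sizes, data, and names here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))    # rows are samples, as in the loop above
t = rng.normal(size=(5, 2))    # targets
W1 = rng.normal(size=(3, 4))
W2 = rng.normal(size=(4, 2))

def forward(W1, W2):
    h1 = np.tanh(x @ W1)
    out = h1 @ W2                       # linear output layer
    return h1, out, 0.5 * np.sum((out - t) ** 2)

h1, out, E = forward(W1, W2)

# back-prop pass, same shape as the loop above: each weight gradient is an
# outer product of that layer's input activations with the running dx
dx = out - t                            # ds{end} .* u for a linear output
dW2 = h1.T @ dx
dx = (1 - h1 ** 2) * (dx @ W2.T)        # ds .* (dx * W') with rows as samples
dW1 = x.T @ dx

# central-difference check on a single weight
eps = 1e-5
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
num = (forward(W1p, W2)[2] - forward(W1m, W2)[2]) / (2 * eps)
assert abs(num - dW1[0, 0]) < 1e-6
```

Note the gradients here come out with the shape of W rather than its transpose; that's just the rows-as-samples layout convention, as in the Matlab loop.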

  Lastly, sorry about that last email.  Apparently, the grammar of me is good today.

-Brian

Yoshua Bengio

unread,
Mar 8, 2011, 6:15:32 PM3/8/11
to thean...@googlegroups.com

I still do not understand. Are you saying this only applies to linear predictors (up to the output non-linearity / loss)?

So you still need to backprop? I thought you said the complex thing allows one to avoid the back-prop, hence my surprise.

-- Yoshua

Brian Vandenberg

unread,
Mar 8, 2011, 6:48:24 PM3/8/11
to thean...@googlegroups.com
  OH.  I thought you were confused on the other stuff.  My apologies.
Scratch what I said.

  The document I referenced earlier is this:

http://cnl.salk.edu/~schraudo/teach/MLSS4up.pdf

Near the end (page 14), he gives some details on how this can work.

A couple of things stand out:

* The f0 and f1 passes both produce intermediate outputs of the same dimension for each layer. R{x} and R{y} are zero in layer 1, but that doesn't affect the size of the results.
* The same sort of thing applies to the r1 and r2 passes. The 'output' of the r1 pass is dEdx, dEdy, and dEdW.

From what Barak/Nic have said, I think this means if an r1 pass is performed using weights WW = W + i * V, then the resulting weight gradient will have J^T u in the real part, and Hv in the imaginary part.

Unfortunately, I haven't looked into this too deeply (lack of free time), so I've mostly set it aside in the back of my mind for now.
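The WW = W + i*V idea can at least be sanity-checked on a toy gradient function (a sketch; the hand-written gradient below is a stand-in for a real r1 pass, and the values are made up). The gradient comes out in the real part and the Hessian-vector product in the imaginary part, as described:

```python
import numpy as np

def grad_E(w):
    # hand-written gradient of the toy error E(w) = sum(tanh(w)**2)
    t = np.tanh(w)
    return 2 * t * (1 - t ** 2)

w = np.array([0.4, -0.9, 1.5])
v = np.array([1.0, -2.0, 0.5])
eps = 1e-100

# run the gradient code on complex weights w + i*eps*v
g = grad_E(w + eps * 1j * v)
gradient = g.real          # the ordinary gradient, unperturbed
Hv = g.imag / eps          # Hessian-vector product, from the same pass

# analytic check: the Hessian of this elementwise E is diagonal,
# with entries 2*(1 - tanh(w)**2)*(1 - 3*tanh(w)**2)
t = np.tanh(w)
assert np.allclose(gradient, grad_E(w))
assert np.allclose(Hv, 2 * (1 - t ** 2) * (1 - 3 * t ** 2) * v)
```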

-Brian


Yoshua Bengio

unread,
Mar 8, 2011, 11:00:26 PM3/8/11
to thean...@googlegroups.com

Ok, I see. But this only helps for 'forward-mode' AD, i.e., to compute the *product* of a vector by the Jacobian of any function. Thus it can't be used to (efficiently) compute gradients (i.e. the stuff we need for gradient-based optimization).

It might be useful in the R{.} computation though; I don't have enough intuition there.

-- Yoshua

Brian Vandenberg

unread,
Mar 9, 2011, 2:13:39 AM3/9/11
to thean...@googlegroups.com
It should be possible with the r1/r2 passes as well, though. Using this trick on the f0 pass gets you the f1 pass in one pass instead of two. Likewise, an r1 pass gets you an r2 pass in one pass.

This doesn't get you the back-prop passes for free (so to speak) from the forward passes. My apologies if I gave that impression.

-Brian


Razvan Pascanu

unread,
Mar 11, 2011, 4:21:35 PM3/11/11
to thean...@googlegroups.com, James Bergstra, Guillaume Desjardins, David Warde-Farley, Yoshua Bengio, Brian Vandenberg
Thanks for the write-up. It's pretty good. I wanted to keep everyone informed on the progress in this direction. 

We had a meeting yesterday and talked at length about the R operator, and I think we reached a few conclusions. What follows might be tuned a bit towards my own intuitions, but I hope you all agree with me.

What we want is an R operator in Theano that works the way the grad function does, i.e. one that can be applied to any function, so that we do not need to derive the algorithm by hand for each class of f (a class of f here being an architecture: MLP, deep AA, RNN, etc.).  We also want to make sure that whatever graph is generated is computationally optimal, i.e. close to what James Martens and others get by hand-writing it.

Theano currently implements the grad operation, which is in some regards similar to the R operator. Gradients are computed in Theano by having each op implement a method called ``grad``. ``grad`` is a function that, given a vector ``v``, returns ``v^T \dot J``, where ``J`` is the Jacobian of the op. I.e., if we have an op that implements the function
  g: R^m -> R^n
then grad is the following function:
 grad : R^m x R^n -> R^m
 grad(input, v) = v^T \dot J
(v is transposed here because, by the usual convention, a vector is a column vector, and you need to transpose it to get a row vector)
where J = \frac{ \partial g(input) }{ \partial input }

We were all hoping that we could take advantage of this grad function and use it to compute the R operator, which is defined as:
  r : R^m x R^m -> R^n
  r(input, v) = J \dot v = (v^T \dot J^T)^T

I don't think there is a way, though :(, because any such ``grad`` does not compute the Jacobian J and then the dot product, but rather computes that product directly. So there is no way to get J and then transpose it. And because the matrix-vector product does not commute (in strict notation v \dot J doesn't even make sense, since v is a column vector, so the dimensions mismatch), there is not much we can do.

Note though that if J is symmetric (for example diagonal, which happens for elemwise operations and subtensor; softmax's Jacobian is symmetric as well), then J = J^T, so ``grad`` is the same as the ``r`` operation.


I think the only clean way to go about it is to go through all the ops and add a method ``r`` (or maybe ``r_operator``, if ``r`` is too short) that implements what we want. For the ops where ``r`` is the same as ``grad`` we can define ``r`` as grad, i.e.:
  def r(..):
     return self.grad(..)
and keep all the conventions that we have for grad (if the method ``r`` doesn't exist we throw an exception; if the op is not differentiable we return None). Because ``r`` is so similar to ``grad``, I think implementing all these methods shouldn't be too hard, but rather something quite similar to implementing grad.
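To make the proposal concrete, here is a rough sketch in plain NumPy (invented class names, not actual Theano internals) of what per-op ``grad`` and ``r`` methods could look like, including the symmetric-Jacobian case where ``r`` simply delegates to ``grad``:

```python
import numpy as np

class Tanh:
    """Elemwise op: J is diagonal, so v^T J and J v coincide."""
    def grad(self, x, v):          # v^T J
        return v * (1 - np.tanh(x) ** 2)
    def r(self, x, v):             # J v: same as grad for a diagonal J
        return self.grad(x, v)

class Dot:
    """y = x @ W, differentiated wrt the (row-)vector x, so J = W^T."""
    def grad(self, x, W, v):       # v^T J, with v in R^n
        return v @ W.T
    def r(self, x, W, v):          # J v, with v in R^m
        return v @ W

# diagonal J: the two routes agree
x = np.array([0.2, -1.0])
v = np.array([0.5, 2.0])
assert np.allclose(Tanh().r(x, v), Tanh().grad(x, v))

# non-symmetric J: r multiplies by W where grad multiplies by W.T
W = np.arange(6.0).reshape(3, 2)
vin = np.array([1.0, 0.0, -1.0])
assert np.allclose(Dot().r(None, W, vin), W.T @ vin)
```

The point of the sketch: ``r`` for Dot multiplies by W instead of W.T, and there is no way to recover that from ``grad`` alone, which only ever produces the v^T J product.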

Now, in order to reach the full speed of a hand-derived algorithm, besides optimizing the graph with the optimizations we already have, there is one step missing: caching intermediate results and reusing them. This is what James proposed in a different email on this list, and I think it is the only thing the handcrafted code does that Theano will not do.

Does anyone have any comments on my proposal?

Razvan

James Bergstra

unread,
Mar 11, 2011, 4:37:22 PM3/11/11
to Razvan Pascanu, thean...@googlegroups.com, Guillaume Desjardins, David Warde-Farley, Yoshua Bengio, Brian Vandenberg
Thanks for the writeup in Theano terms, that's a big step toward getting this implemented!

What does my previous comment about reusing inputs have to do with the R operator? Would the R operator not just build a bigger graph (as the grad operator does) in which any results used in multiple places will be re-used without re-computation in the normal way?

James

David Warde-Farley

unread,
Mar 11, 2011, 4:51:29 PM3/11/11
to thean...@googlegroups.com, Razvan Pascanu, Guillaume Desjardins, David Warde-Farley, Yoshua Bengio, Brian Vandenberg
On Fri, Mar 11, 2011 at 04:37:22PM -0500, James Bergstra wrote:
> Thanks for the writeup in Theano terms, that's a big step toward getting
> this implemented!
>
> What does my previous comment about reusing inputs have to do with the R
> operator? Would the R operator not just build a bigger graph (as the grad
> operator does) in which any results used in multiple places will be re-used
> without re-computation in the normal way?

That's what I thought too, except that Guillaume has been trying to grok the math a little more, and I think our understanding of it as of yesterday may have been somewhat incomplete. He's composing a write-up on the subject as we speak.

In particular, using the "R" operator in the backward pass necessitates altering the way a forward pass works, in a way I don't totally understand yet.
