I agree this would be a killer feature. Isn't David working on
something related?
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Razvan and I had talked about implementing specifically the Gauss-Newton version but we haven't gotten anywhere, mostly because of other time constraints. Maybe over the winter break I'll be able to finally digest those papers; the Schraudolph one is a bit dense.
David
-- Yoshua
On 14-Dec-10, at 11:03 PM, James Bergstra wrote:
Brian,
Can you please explain how you can get a gradient *vector* out of this complex fprop pass?
I don't get it.
If f is a scalar and x is a vector, the backprop should yield a vector, but a complex fprop like you
describe will only yield the scalar that is the sum of that vector in df(x).
-- Yoshua
On 8-Mar-11, at 12:27 AM, Brian Vandenberg wrote:
As a follow-up to this, there's one more thing to consider when looking at this.
Pearlmutter came up with an interesting way to combine calculations that allows you to get an f1 pass at the same time as performing an f0 pass, and similarly allow you to get an r2 pass when performing an r1 pass. This is Nic Schraudolph's description of the process (personal email):
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pearlmutter's trick: Change the code computing your function/gradient to work in the complex domain. Let dx denote a perturbation on input x. Stick the perturbation into the imaginary channel, scaled down by a factor of 1e100 - in other words, the new complex input is x + 1e-100*i*dx. Now if you compute f in the complex domain, you get 1e-100*df(x) in the imaginary component. In short, an f_1 pass can hitch a ride in the imaginary channel of an f_0 pass, and likewise for an r_2 pass with an r_1 (gradient) pass.
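A minimal numerical sketch of the quoted trick (function, names, and values are illustrative, not from the thread):

```python
import numpy as np

# A smooth test function mapping a vector to a scalar.
def f(x):
    return np.sum(np.tanh(x))

def complex_step_directional(f, x, dx, h=1e-100):
    # Ride the perturbation dx in the imaginary channel, scaled by h.
    # f(x + i*h*dx).imag / h recovers the directional derivative
    # df(x) . dx to machine precision (no subtractive cancellation).
    z = x.astype(complex) + 1j * h * dx
    return f(z).imag / h

x = np.array([0.3, -1.2, 2.0])
dx = np.array([1.0, 0.0, 0.0])      # probe the first coordinate
d = complex_step_directional(f, x, dx)
exact = 1.0 / np.cosh(0.3) ** 2     # d/dx tanh(x) = sech(x)^2
```

Note that one such pass yields only the scalar directional derivative df(x)·dx, which is exactly the objection Yoshua raises in this thread.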
So you still need to backprop? I thought you said the complex-thing
allows one to avoid the back-prop, hence my surprise.
-- Yoshua
The document I referenced earlier is this:
http://cnl.salk.edu/~schraudo/teach/MLSS4up.pdf
Near the end (page 14), he gives some details on how this can work.
A couple of things stand out:
* The f0 and f1 passes both produce intermediate outputs of the same
dimension for each layer. R{x} and R{y} are zero in layer 1, but that
doesn't affect the size of the results.
* The same sort of thing applies to the r1 and r2 passes. The
'output' of the r1 pass is dEdx, dEdy, and dEdW.
From what Barak/Nic have said, I think this means if an r1 pass is
performed using weights WW = W + i * V, then the resulting weight
gradient will have J^Tu in the real part, and Hv in the imaginary
part.
Unfortunately, I haven't looked into this too deeply (lack of free
time) so I've mostly set it aside in the back of my mind for now.
-Brian
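Brian's claim above (run the gradient routine on complex weights W + i·V to get the ordinary gradient in the real part and a Hessian-vector product in the imaginary part) can be sketched on a toy energy; the function E(w) = Σ log cosh(wᵢ) is my stand-in for a real back-prop routine, not something from the thread:

```python
import numpy as np

# Toy energy E(w) = sum(log(cosh(w))); its gradient is tanh(w) and its
# Hessian is diag(sech(w)^2), so both quantities are easy to verify.
def grad_E(w):
    return np.tanh(w)

w = np.array([0.5, -0.8, 1.1])
v = np.array([1.0, 2.0, -1.0])    # direction for the Hessian-vector product
h = 1e-100

g = grad_E(w.astype(complex) + 1j * h * v)
grad = g.real                     # ordinary gradient tanh(w)
Hv = g.imag / h                   # Hessian-vector product: sech(w)^2 * v
Hv_exact = v / np.cosh(w) ** 2
```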
On Tue, Mar 8, 2011 at 4:15 PM, Yoshua Bengio <ben...@iro.umontreal.ca> wrote:
>
> I still do not understand. Are you saying this only applies to linear predictors (up to the output non-linearity / loss)?
>
> So you still need to backprop? I thought you said the complex-thing allows one to avoid the back-prop, hence my surprise.
>
> -- Yoshua
>
> On 8-Mar-11, at 11:06 AM, Brian Vandenberg wrote:
>
>> Yoshua,
>>
>> It's a shortcut, basically. In the case of the loss function and the output-layer non-linearities not matching, you do a full forward pass completely through the loss function to generate intermediate results necessary for later passes. To back-prop, you start with a scalar, u=1, and back-prop that through the loss function as well as the network.
>>
>> In the matching loss function case, you don't have to do the full forward pass; instead you can stop at the point Schraudolph describes in section 2.1 of his paper, where the Jacobian of M(N) is JM' = Az + b, where in the logistic, softmax, and linear output cases A is the identity matrix and b = (-1)grad(f), so instead of back-propagating a scalar I can instead back-prop that gradient.
>>
>> If the output non-linearity & loss function match, the math works out the same either way you do it ... though, I'm not entirely sure how to handle biases yet.
>>
>> -Brian
>>
>> On Tue, Mar 8, 2011 at 5:09 AM, Yoshua Bengio <ben...@iro.umontreal.ca> wrote:
>>
>> Brian,
>>
>> Can you please explain how you can get a gradient *vector* out of this complex fprop pass?
>> I don't get it.
>> If f is a scalar and x is a vector, the backprop should yield a vector, but a complex fprop like you
>> describe will only yield the scalar that is the sum of that vector in df(x).
>>
>> -- Yoshua
>>
>>
>> On 8-Mar-11, at 12:27 AM, Brian Vandenberg wrote:
>>
>> As a follow-up to this, there's one more thing to consider when looking at this.
>>
>> Pearlmutter came up with an interesting way to combine calculations that allows you to get an f1 pass at the same time as performing an f0 pass, and similarly allow you to get an r2 pass when performing an r1 pass. This is Nic Schraudolph's description of the process (personal email):
>>
>> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>> Pearlmutter's trick: Change the code computing your function/gradient to work in the complex domain. Let dx denote a perturbation on input x. Stick the perturbation into the imaginary channel, scaled down by a factor of 1e100 - in other words, the new complex input is x + 1e-100*i*dx. Now if you compute f in the complex domain, you get 1e-100*df(x) in the imaginary component. In short, an f_1 pass can hitch a ride in the imaginary channel of an f_0 pass, and likewise for an r_2 pass with an r_1 (gradient) pass.
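Brian's matching-loss shortcut can be checked concretely for the softmax/cross-entropy pairing (this particular example is mine, assuming the standard matching pair): the gradient at the pre-activations collapses to p − t, so back-prop can start from that vector rather than pushing a scalar back through the loss.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, t):
    return -np.sum(t * np.log(softmax(z)))

z = np.array([1.0, -0.5, 2.0])   # pre-activation outputs
t = np.array([0.0, 0.0, 1.0])    # one-hot target

# Matching-loss shortcut: dE/dz = softmax(z) - t, so back-prop can start
# here instead of propagating a scalar through the loss function.
delta = softmax(z) - t

# Cross-check against central finite differences on the full loss.
eps = 1e-6
num = np.array([
    (cross_entropy(z + eps * e, t) - cross_entropy(z - eps * e, t)) / (2 * eps)
    for e in np.eye(3)
])
```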
It might be useful in the R{.} computation, though; I don't have enough
intuition there.
-- Yoshua
This doesn't get you the back-prop passes for free (so to speak)
from the forward passes. My apologies if I gave that impression.
-Brian
>>>> Pearlmutter's trick: Change the code computing your function/gradient to
That's what I thought too, except that Guillaume has been trying to grok the
math a little more and I think our understanding of it as of yesterday may
have been somewhat incomplete. He's composing a write-up on the subject as we
speak.
In particular, using the "R" operator in the backward pass necessitates
altering the way a forward pass works in a way I don't totally understand
yet.
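One reading of "altering the way a forward pass works" is that each activation must carry its R{.} quantity alongside it through the net. A single-tanh-layer sketch (my illustration, not Guillaume's write-up), cross-checked against the complex trick from earlier in the thread:

```python
import numpy as np

# Forward pass augmented to carry R{.} quantities: each activation a
# comes with Ra = (d a / d W) applied to the direction V.
def forward_with_R(x, W, V):
    z = W @ x
    Rz = V @ x                  # z is linear in W, so R{z} = V x
    a = np.tanh(z)
    Ra = (1.0 - a ** 2) * Rz    # chain rule: R{tanh(z)} = tanh'(z) R{z}
    return a, Ra

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))

a, Ra = forward_with_R(x, W, V)

# Complex-trick cross-check: perturb the weights in the imaginary channel.
h = 1e-100
a_c = np.tanh((W + 1j * h * V) @ x)
```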