CAE implementation

338 views
Skip to first unread message

Ian Goodfellow

unread,
Jul 17, 2012, 2:04:41 PM7/17/12
to pylearn-dev
I think the CAE implementation in pylearn2 is fairly inefficient and
I'm not sure why it is implemented the way it is.
Right now it explicitly computes the Jacobian for each example, which
ends up being a 3 tensor of shape
(batch_size, nhid, nvis). That's a lot of memory.

I don't see why we explicitly compute the Jacobian. If we compute the
contractive penalty directly we can get away
without using the full Jacobian but we can still be general about what
kind of non-linearity the autoencoder uses.

Let J be the Jacobian of h with respect to v.
Let C_k = Fro(J)^2 be the contractive penalty on example k.
C_k = sum_i sum_j ( d h_kj / d v_ki )^2
Assuming the autoencoder hidden layer is based on an elemwise nonlinearity, this
= sum_i sum_j ( d h_kj / d a_kj )^2 ( d a_kj / d v_ki)^2
(where a_j is the input to h_j before the nonlinearity)
= sum_i sum_j ( d h_kj / d a_kj )^2 W_ij^2
= sum_i sum_j g_kj W_ij^2
(where g = T.sqr(T.grad( h.sum(), a)) )
= sum_i g_kj sum_j W_ij^2
= T.dot( g_k, w)
(where w = T.sqr(W).sum(axis=0) )

so the whole penalty C = T.dot( g, w).sum()

Is pylearn2 trying to be more general than that, or did I make a
mistake in my derivation somewhere?

Pascal Lamblin

unread,
Aug 2, 2012, 11:06:59 PM8/2/12
to pylea...@googlegroups.com
On Tue, Jul 17, 2012, Ian Goodfellow wrote:
> Assuming the autoencoder hidden layer is based on an elemwise nonlinearity,
> [...]
> Is pylearn2 trying to be more general than that, or did I make a
> mistake in my derivation somewhere?

I think David used that assumption to make CAEs more efficient in a
previous implementation in the past, I don't know what happened to that
code.

I would think that pylearn2 does, indeed, try to be more general,
and consider the case where the non-linearity is not an elemwise
function, but can have some kind of normalization, or interaction
between neighbouring units, etc.

If there is a way to select between both implementation, that would be
guaranteed not to give erroneous results (even when using something like
softmax as non-linearity), I would be all for it.

--
Pascal

Yoshua Bengio

unread,
Aug 16, 2012, 6:46:56 AM8/16/12
to pylea...@googlegroups.com
I think it is fine to have two implementations. An efficient specific one and an inefficient generic one.

-- Yoshua

Ian Goodfellow

unread,
Aug 16, 2012, 8:14:43 AM8/16/12
to pylea...@googlegroups.com
I haven't looked at this for a few weeks, but I think the current one already
assumes that the nonlinearity is elementwise. I think it uses this assumption
in order to compute the Jacobian without using scan.

The problem is that it computes the CAE penalty by first computing the
Jacobian and then computing its Frobenius norm. theano isn't smart enough
to realize that this causes a big memory bottleneck so it doesn't optimize
this down into something more sensible. Instead we should just compute
the CAE penalty directly in order to avoid computing a 3-tensor.

I think it should be
T.dot( T.grad(H.sum(), z) , T.sqr(W).sum(axis=0) )
but someone should check that since that's just off the top of my head.
(The T.grad(H.sum(),z) is the part that assumes elementwise nonlinearity
but I think that's already used to compute the Jacobian in the current
implementation)

Frédéric Bastien

unread,
Aug 16, 2012, 11:29:51 AM8/16/12
to pylea...@googlegroups.com
On Thu, Aug 16, 2012 at 8:14 AM, Ian Goodfellow
<goodfel...@gmail.com> wrote:
> I haven't looked at this for a few weeks, but I think the current one already
> assumes that the nonlinearity is elementwise. I think it uses this assumption
> in order to compute the Jacobian without using scan.
>
> The problem is that it computes the CAE penalty by first computing the
> Jacobian and then computing its Frobenius norm. theano isn't smart enough
> to realize that this causes a big memory bottleneck so it doesn't optimize
> this down into something more sensible. Instead we should just compute
> the CAE penalty directly in order to avoid computing a 3-tensor.

Can you tell what optimization is missing in Theano that would help that?

Fred

David Warde-Farley

unread,
Aug 16, 2012, 3:20:04 PM8/16/12
to pylea...@googlegroups.com
I guess that what we'd need is an optimization that recognizes when a
broadcast is immediately followed by a reduction, and does that
reduction in-place instead.

So a simple case would be

x = tensor.matrix('x')
y = tensor.vector('y')
z = ((x.dimshuffle(0, 1, 'x') * y)**2).sum()

Currently if x is (1000,) and y is (500, 500), to compute z you need
to allocate a (500, 500, 1000) before the sum takes place. It's
obvious how to do better, but it involves doing the elemwise and the
reduction "at the same time".

Frédéric Bastien

unread,
Aug 16, 2012, 3:44:48 PM8/16/12
to pylea...@googlegroups.com
Hi,

Theano do some optimization of reduction and broadcasted dimensions,
but as you said, in this case, the only optimization is to do the sum
and elemwise at the same time.

Ian told me he will look more into the math later today, so we will
know I saw another option.

If you end up having a profile that show this is a bottleneck, it
would be useful to know what % is spend in the elemwise and in the
sum.

thanks

Fred

David Warde-Farley

unread,
Aug 16, 2012, 8:07:29 PM8/16/12
to pylea...@googlegroups.com
On Thu, Aug 16, 2012 at 3:44 PM, Frédéric Bastien <no...@nouiz.org> wrote:

> If you end up having a profile that show this is a bottleneck, it
> would be useful to know what % is spend in the elemwise and in the
> sum.

I don't think the problem was ever computation time. The problem is memory.

David

Ian Goodfellow

unread,
Aug 16, 2012, 9:41:17 PM8/16/12
to pylea...@googlegroups.com
pylearn2 computes the Jacobian as:
Z = V W + b
H = g(Z)
G = T.grad( H.sum(), Z) #this assumes elemwise nonlinearity
J = W.dimshuffle('x',0,1) * G.dimshuffle(0,'x',1)

the penalty is then sq(J).sum(axis=1,2).mean()

let's forget about the mean issue for the moment and just
worry about computing sq(J).sum().

sum_i,j,k W_jk^2 G_ik^2
= sum_i,k G_ik^2 sum_j W_jk^2
= sum_i, k G_ik^2 sq(W).sum(axis=0)_k
= sum_i dot( sq(G), sq(W).sum(axis=0) )_i

OK, so it looks like what we want is an optimization that replaces
T.sqr( W.dimshuffle('x',0,1) * G.dimshuffle(0,'x',1) ).sum(axis=(1,2))
with
T.dot( T.sqr(G), T.sqr(W).sum(axis=0) )

I haven't actually tried running the pylearn2 code, but the optimization
should be faster as well as use less memory.

If we say W is of shape (v,h) and Z is of shape (b,h) then the original has
space and runtime requirements of O(bhv) while the final has space
and runtime requirements of O( bh + vh ).

Eng.Hani Almousli

unread,
Oct 10, 2012, 1:43:50 PM10/10/12
to pylea...@googlegroups.com
Hi,
 I would like  to  confirm what Ian said. I have applied both methods in theano and it seems that getting rid off dimshuffle   really makes things faster (at least 2 to 3 times)  because the hidden unit in my case is large, so it is not only better for reducing memory consumption but to get better performance as well.
Thanks...

Çağlar Gülçehre

unread,
Oct 11, 2012, 6:01:45 PM10/11/12
to pylea...@googlegroups.com
I agree with Hani, with the Ian's tip, I got a very significant
performance increase in my CA implementation as well.

2012/10/10 Eng.Hani Almousli <hani....@gmail.com>:
--
Caglar GULCEHRE

Ian Goodfellow

unread,
Oct 11, 2012, 6:07:54 PM10/11/12
to pylea...@googlegroups.com
Would either of you like to submit a pull request with the improved version?

Çağlar Gülçehre

unread,
Oct 12, 2012, 9:49:09 AM10/12/12
to pylea...@googlegroups.com
OK I'll do the pull request with the improved version.

2012/10/11 Ian Goodfellow <goodfel...@gmail.com>:
--
Caglar GULCEHRE

Ian Goodfellow

unread,
Oct 12, 2012, 9:59:38 AM10/12/12
to pylea...@googlegroups.com
Thanks.

Çağlar Gülçehre

unread,
Oct 15, 2012, 12:45:22 PM10/15/12
to pylea...@googlegroups.com
OK I've made the pull-request here (Also I had to fix bunch of bugs
in order to test it as well):

https://github.com/lisa-lab/pylearn/pull/127

2012/10/12 Ian Goodfellow <goodfel...@gmail.com>:
--
Caglar GULCEHRE

Ian Goodfellow

unread,
Oct 15, 2012, 12:55:51 PM10/15/12
to pylea...@googlegroups.com
ok, I've reviewed it. I think one of your bug fixes introduces a new bug.
I think we should just get rid of the ability to pass in a collection as a cost,
and use SumOfCosts instead.

Frédéric Bastien

unread,
Oct 23, 2012, 3:07:38 PM10/23/12
to pylea...@googlegroups.com
I made a ticket for this optimization:
https://github.com/Theano/Theano/issues/1022

It would be great if someone can do it. This way, we won't need to do
it manually each time.

Fred

On Thu, Aug 16, 2012 at 9:41 PM, Ian Goodfellow
<goodfel...@gmail.com> wrote:
Reply all
Reply to author
Forward
0 new messages