pylearn2 computes the Jacobian as:
Z = V W + b
H = g(Z)
G = T.grad( H.sum(), Z) #this assumes elemwise nonlinearity
J = W.dimshuffle('x',0,1) * G.dimshuffle(0,'x',1)
the penalty is then sq(J).sum(axis=1,2).mean()
let's forget about the mean issue for the moment and just
worry about computing sq(J).sum().
sum_i,j,k W_jk^2 G_ik^2
= sum_i,k G_ik^2 sum_j W_jk^2
= sum_i, k G_ik^2 sq(W).sum(axis=0)_k
= sum_i dot( sq(G), sq(W).sum(axis=0) )_i
OK, so it looks like what we want is an optimization that replaces
T.sqr( W.dimshuffle('x',0,1) * G.dimshuffle(0,'x',1) ).sum(axis=(1,2))
with
T.dot( T.sqr(G), T.sqr(W).sum(axis=0) )
I haven't actually tried running the pylearn2 code, but the optimization
should be faster as well as use less memory.
If we say W is of shape (v,h) and Z is of shape (b,h) then the original has
space and runtime requirements of O(bhv) while the final has space
and runtime requirements of O( bh + vh ).