Fwd: [rl-list] Deep Learning Overview - Draft


Craig Corcoran

May 16, 2014, 2:24:10 PM
to ut-f...@googlegroups.com
Hi Folks, 

This was just sent out on the RL listserv and may be of interest to the group. It's a very high-level view of the history of neural networks and deep learning with lots of references (and no equations...).

Craig

---------- Forwarded message ----------
From: Schmidhuber Juergen <jue...@idsia.ch>
Date: Fri, May 16, 2014 at 11:52 AM
Subject: [rl-list] Deep Learning Overview - Draft
To: rl-...@googlegroups.com


Here is the preliminary draft of an invited Deep Learning overview - it also briefly discusses applications of deep (possibly recurrent) neural networks to  Reinforcement Learning:

http://www.idsia.ch/~juergen/DeepLearning15May2014.pdf

It mostly consists of references (about 800 entries so far). Important citations are still missing though. As a machine learning researcher, I am obsessed with credit assignment. In case you know of references to add or correct, please send them with brief explanations to jue...@idsia.ch (NOT TO THE ENTIRE LIST!), preferably together with URL links to PDFs for verification. Please also do not hesitate to send me additional corrections / improvements / suggestions / Deep Learning success stories with feedforward and recurrent neural networks. I'll post a revised version later. Thanks a lot!

Abstract. In recent years, deep artificial neural networks (including recurrent ones) have won numerous contests in pattern recognition and machine learning. This historical survey compactly summarises relevant work, much of it from the previous millennium. Shallow and deep learners are distinguished by the depth of their credit assignment paths, which are chains of possibly learnable, causal links between actions and effects. I review deep supervised learning (also recapitulating the history of backpropagation), unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.

Juergen Schmidhuber
http://www.idsia.ch/~juergen/
http://www.idsia.ch/~juergen/whatsnew.html






--
You received this message because you are subscribed to the Google Groups "Reinforcement Learning Mailing List" group.
To post to this group, send email to rl-...@googlegroups.com
To unsubscribe from this group and stop receiving emails from it, send an email to rl-list+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jason Liang

May 24, 2014, 11:58:59 PM
to ut-f...@googlegroups.com
Thanks for the paper, it was very informative. Just wondering, has there been any published work regarding evolutionary approaches to deep learning, for example evolving convolutional neural networks?

Matthew Hausknecht

May 26, 2014, 12:15:56 AM
to Jason Liang, ut-f...@googlegroups.com
Thanks Craig for sharing this resource. It has a wealth of citations for just about any type of NN related research. I'm not sure how I feel about his new notation / CAP analysis. 

Jason - Check out the section at the end of the paper - 6.7. One of the author's students evolved a rather large network for the TORCS racing game that used visual-like input. I'm not aware of explicitly convolutional networks being weight-evolved.


--
You received this message because you are subscribed to the Google Groups "FLARE" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ut-flare+u...@googlegroups.com.

Jason Liang

May 28, 2014, 11:25:05 PM
to ut-f...@googlegroups.com, Jason Liang
Thanks Matt for pointing this out. It seems like there hasn't been a lot of work done on combining evolutionary algorithms with deep learning. I am wondering if it is possible to replace the gradient descent training of autoencoder-type networks with evolutionary algorithms such as HyperNEAT or CMA-ES. The biggest challenge, of course, is the computational complexity of population-based methods compared to gradient-based methods. This could be a possible future area of research.

Matthew Hausknecht

May 28, 2014, 11:55:38 PM
to Jason Liang, ut-f...@googlegroups.com
You're right - the largest challenge to evolving such networks would likely be the computational complexity of evolution versus gradient descent. I suspect this is why there isn't much previous work on the subject. Evolutionary-type algorithms are often used to solve complex problems - ones in which it is impossible to compute gradients (and therefore impossible to do gradient descent). Nearly always, if you can express the problem in terms of an objective function for which you can compute gradients, gradient descent will solve the problem much faster than evolutionary methods.
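To make the speed difference concrete, here is a toy comparison (illustrative constants only, not a benchmark): gradient descent versus a simple (1+1) evolution strategy on a smooth quadratic, given the same budget of 100 steps.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: float(np.sum(x ** 2))  # smooth objective; analytic gradient is 2x

# Gradient descent: uses the exact gradient at every step.
x = rng.normal(size=10)
for _ in range(100):
    x -= 0.2 * (2 * x)
gd_val = f(x)

# (1+1) evolution strategy: mutate, keep the child only if it improves.
# Roughly the same budget, but no gradient information at all.
y = rng.normal(size=10)
for _ in range(100):
    cand = y + 0.3 * rng.normal(size=10)
    if f(cand) < f(y):
        y = cand
es_val = f(y)
```

On this convex problem the ES is many orders of magnitude behind after the same budget, which is the usual argument for preferring gradients whenever they exist.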

Elad Liebman

May 29, 2014, 12:51:38 AM
to Matthew Hausknecht, Jason Liang, ut-flare
>>Nearly always, if you can express the problem in terms of a objective function for which you can compute gradients, gradient descent will solve the problem much faster than evolutionary methods.

Just to weigh in on that last point, there's the question of what "solving the problem" means. If it means finding a solution that's good enough, that may very well be true. But one of the arguments for evolutionary algorithms (or simulated annealing, or other approaches with a strong stochastic component) is that they have, hypothetically at least, a better chance of avoiding local optima, whereas gradient methods are much more susceptible to those if the function you're optimizing isn't strictly concave or convex in the parameter space you're optimizing over. In practice, though, these methods can be extremely slow to converge, and judging convergence is also hard, precisely because of that stochastic property: you may seem to plateau, but then a single crossover can propel you to a whole new level of fitness. Had you stopped a generation earlier you would never have found it, or you could have missed it even after running a hundred more generations. That's probably why people aren't huge fans of these methods when they can be avoided.

Jason Liang

May 29, 2014, 1:02:35 AM
to ut-f...@googlegroups.com
Yes, if the objective function is differentiable, gradient descent is almost always faster than black-box optimization methods like evolutionary algorithms. Also, I suspect being stuck in a local minimum for deep learning-type problems is usually OK, since too small an error usually leads to overfitting. However, evolutionary algorithms would be beneficial in three cases:

1) Your fitness landscape is so deceptive/problematic (lots of ravines, local minima, plateaus) that gradient descent finds an unsatisfactory solution or is very inconsistent at finding good solutions. I am pretty sure most deep learning-type problems like image classification do not fall into this category, but there might be exceptions.

2) Your activation function isn't differentiable, for example the recently published maxout activation function.

3) Generatively evolving deep neural networks with the minimal complexity necessary for the problem. I think a big problem with current deep learning approaches is finding the right architecture (necessary number of layers, hidden units) to do well on a task. Having a TWEANN-type algorithm that evolves both the topology and weights would be very beneficial. Also, most deep learning architectures follow the feedforward network layout very closely. It would be interesting to see if an arbitrary network layout specifically evolved for some type of problem would be more powerful.

Jason Liang

May 29, 2014, 1:22:44 AM
to ut-f...@googlegroups.com, Matthew Hausknecht, Jason Liang
I think a big reason why most people avoid evolutionary algorithms is that our understanding and theory of them is not as mature as that for gradient-based methods. There aren't any guarantees of convergence within some X timesteps as with gradient descent. Also, most evolutionary algorithms require a lot of parameter tuning (there is a lot of active research on how to automatically tune parameters, both online and offline). However, evolutionary algorithms do have advantages over gradient-based methods; for example, if run for an infinitely long time, they will always find the global optimum.

Peter Stone

May 29, 2014, 10:02:47 AM
to Jason Liang, ut-f...@googlegroups.com, Matthew Hausknecht

>
> I think a big reason why most people avoid evolutionary algorithms is because our
> understanding and theory of them is not as mature as that for gradient based
> methods. There arent any guarantees of convergence within some X timesteps like with
> gradient descent. Also most evolutionary algorithms require a lot of parameter tuning
> (there is a lot of active research on how to automatically tune parameters both in an
> online and offline manner). However evolutionary algorithms do have advantages over
> gradient based methods, like for example, evolutionary algorithms if run for an
> infinitely long time, will always find the global optima. 

So will brute-force search....

Peter
Matthew Hausknecht

May 29, 2014, 4:58:06 PM
to Jason Liang, ut-f...@googlegroups.com
A highly deceptive fitness landscape is problematic for nearly any type of optimization method. I'm not aware of any work comparing the robustness of evolutionary search to gradient descent in deceptive landscapes. I suspect both would have problems.

I believe the maxout activations are differentiable. However, the derivative may not be smooth - similar to ReLU activations.
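A quick numerical check of that (toy sizes, purely illustrative): a maxout unit is the max over a few linear pieces, so away from tie points its gradient is simply the gradient of the active piece, and it matches finite differences.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 3))  # 4 linear pieces over a 3-dimensional input
b = rng.normal(size=4)

def maxout(x):
    return float(np.max(W @ x + b))

def maxout_grad(x):
    j = int(np.argmax(W @ x + b))  # active piece; undefined only at exact ties
    return W[j]

x = rng.normal(size=3)
eps = 1e-6
numeric = np.array([(maxout(x + eps * e) - maxout(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
```

The kinks at tie points are measure-zero, just as with ReLU, so backprop works fine in practice.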

You're right that finding the best architecture for a NN is often a challenge. Some of the problem is addressed by regularization, which can encourage a large network to be sparse (e.g., many weights are zero or very small). Le et al have good results from creating large, overparameterized networks that are then highly regularized. In my mind this regularization is similar to forcing the network to identify and remove weights and nodes that aren't needed, and hence reduce its complexity.

Evolution of network topology and weights is possible - but runs into trouble as the number of parameters grows. Current NNs have tens of millions of weights - some maybe even billions. Searching a million dimensional space using evolution is a difficult proposition. This motivates indirect encoding methods such as HyperNEAT which reduce the dimension of the search space. 

Perhaps you could evolve networks that solve existing problems either using smaller architectures or with better performance than their gradient descent equivalents. My intuition is that this would be feasible for problems of reasonably low complexity, for which networks consisting of tens of thousands of parameters can solve the problem. I would bet that for complex image recognition problems like CIFAR/SVHN/ImageNet, any network would need at least a million parameters to get good performance on the task.

To summarize, I wouldn't trust evolution in high-dimensional search spaces. If you really want to use evolution to tackle highly complex problems, I would suggest either a) using an indirect encoding to reduce search-space dimensionality and hoping that it is capable of encoding interesting solutions to the problem, or b) inventing new network structures which enable small networks to solve highly complex problems. One example would be a network node that performs convolution.


Craig Corcoran

May 30, 2014, 7:14:54 PM
to Matthew Hausknecht, Jason Liang, ut-f...@googlegroups.com
This discussion of the pros and cons of 0th-order (derivative free) / black-box / evolutionary optimization vs. gradient based optimization methods makes me wonder if there has been much work in combining the two. Gradient-based methods are great at finding local optima efficiently and many (most?) 0th order algorithms are designed to deal with non-convex losses. In the deep neural network setting we have a model which is differentiable, but has a non-convex loss with many local optima. Perhaps ideas from black box optimization (BBO), such as population-based search, can be combined with gradient information to reduce the required size of the population (and thus the computational expense). With some smoothness assumptions, you may only need one sample/population member per local minimum; gradient descent can find the local minimum and BBO can choose between the multiple potential global minima. Has anyone seen any work exploring similar ideas? I guess it would fall under gradient-based global optimization for smooth, non-convex losses.

Peter Stone

May 30, 2014, 7:26:10 PM
to Craig Corcoran, Matthew Hausknecht, Jason Liang, ut-f...@googlegroups.com

Craig,

It's common with gradient-based methods to use random restarts, with the
hope that the new random starting points will land you in different
basins of attraction.
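For example (a toy 1-D sketch, purely illustrative - evenly spaced starts rather than random ones, for reproducibility):

```python
import numpy as np

def f(x):       # non-convex toy loss with a shallow and a deep basin
    return (x ** 2 - 1) ** 2 + 0.3 * x

def grad(x):
    return 4 * x ** 3 - 4 * x + 0.3

def descend(x, lr=0.01, steps=500):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

stuck = f(descend(1.5))                       # unlucky start: shallow basin (~0.29)
starts = np.linspace(-2.0, 2.0, 8)            # restarts spread over the search box
best = min(f(descend(x0)) for x0 in starts)   # at least one lands in the deep basin
```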

Is that what you have in mind?

Peter





Elad Liebman

May 30, 2014, 7:32:01 PM
to Craig Corcoran, Matthew Hausknecht, Jason Liang, ut-flare
I believe that methods such as CMA-ES
(http://en.wikipedia.org/wiki/CMA-ES) and XNES
(http://www.idsia.ch/~juergen/xNES2010gecco.pdf) combine an
evolutionary approach with gradient-based methods. But I don't think
that's quite what you mean.

As an aside, this is a very interesting and worthwhile discussion,
perhaps we should devote a meeting to exploring it further at some
point...

Elad

Aditya Rawal

May 30, 2014, 8:51:45 PM
to Elad Liebman, Craig Corcoran, Matthew Hausknecht, Jason Liang, ut-flare
Shimon Whiteson's work on Lamarckian evolution - combining evolutionary optimization and gradient-based learning - is perhaps one such example?

In one of my experiments for image classification on MNIST, I used NEAT to search NN topologies and backpropagation to learn the weights. Combining the two optimization approaches did improve the accuracy, but with a run-time penalty. It was therefore difficult to scale this approach to larger problems. Perhaps there is a better way of combining black-box optimization with gradient descent (which is what Craig is trying to suggest). I also agree with Matt that indirect encoding is a more scalable approach for large problems like ImageNet.

Aditya Rawal
PhD student
Computer Science Department
UT Austin




Jason Liang

May 30, 2014, 9:41:25 PM
to ut-f...@googlegroups.com, matthew.h...@gmail.com, jason...@gmail.com
I see evolutionary algorithms as an intermediate point on a spectrum between complete exploitation (following the gradient) and complete exploration (random search). Evolutionary algorithms, while not completely following the gradient, do tend to move their populations in the general direction of the gradient in a stochastic manner. Depending on the heuristics and parameters used, EAs can resemble gradient-based search more or random search more.

Another thing I would like to point out is that a big appeal of EAs is that the fitness function (or objective function) can be arbitrary. It does not have to be smooth, differentiable, or convex, which allows us to be much more lenient in defining the fitness function. Still, I can see the usefulness of explicitly knowing the gradient, since directly moving in the direction of the gradient when the local fitness landscape is not deceptive can speed up convergence.

Here is one possible way of combining gradient-based search with evolutionary algorithms while retaining the advantages of both methods:

1) Train a neural network to approximate the fitness landscape of the problem. Every time you evaluate an individual on some fitness function, activate your neural network with the individual's genome as input, calculate the error between the neural network's output and the output of the fitness function, and backpropagate the error. Alternatively, you could maintain a separate population of neural networks for fitness-landscape approximation and coevolve them along with the main population. As you evaluate more and more individuals, your neural network should approximate the fitness landscape better.

2) Compute the gradient of the neural network's activation with respect to the input (using an individual as input) and bias the mutation of the individual in the direction of the gradient. The degree of bias is inversely proportional to the amount of error between the approximate fitness outputted by the neural network and the true fitness outputted by the fitness function. 

3) As you become more confident that the neural network is closely approximating the true fitness landscape, you can also reduce the number of evaluations you do on the true fitness function and rely more on the approximate fitness instead.

With such a method you get the following advantages:
1) optimize arbitrary, nonconvex, nondifferentiable functions
2) have individuals move directly in the direction of the gradient when the approximated fitness landscape is not deceptive
3) reduce number of evaluations of true fitness function (which might be very time consuming to evaluate)
4) start out with a high degree of exploration but rely more and more on exploitation as time goes on, which helps with convergence
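A minimal numpy sketch of steps 1 and 2 (toy sizes; every constant here is illustrative, and step 3's schedule for trusting the surrogate over the true fitness is left out for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 5

def true_fitness(x):
    # Non-differentiable objective (lower is better) -- EA territory.
    return float(np.sum(np.abs(x)))

# Step 1: a tiny one-hidden-layer MLP surrogate of the fitness landscape.
H = 16
W1 = rng.normal(scale=0.5, size=(H, DIM)); b1 = np.zeros(H)
W2 = rng.normal(scale=0.5, size=H);        b2 = 0.0

def train_step(x, y, lr=0.01):
    # One SGD step on the surrogate's squared error at (genome x, fitness y).
    global W1, b1, W2, b2
    h = np.tanh(W1 @ x + b1)
    err = (h @ W2 + b2) - y
    dh = err * W2 * (1 - h ** 2)        # backprop through the hidden layer
    W2 -= lr * err * h
    b2 -= lr * err
    W1 -= lr * np.outer(dh, x)
    b1 -= lr * dh

def surrogate_input_grad(x):
    # Step 2: d(surrogate)/d(genome), used to bias mutation downhill.
    h = np.tanh(W1 @ x + b1)
    return W1.T @ (W2 * (1 - h ** 2))

# Elitist (5+5) EA with gradient-biased mutation.
pop = [rng.normal(size=DIM) for _ in range(10)]
fits = [true_fitness(x) for x in pop]
for x, y in zip(pop, fits):
    train_step(x, y)
init_best = min(fits)

for gen in range(30):
    order = np.argsort(fits)
    parents = [pop[i] for i in order[:5]]
    pfits = [fits[i] for i in order[:5]]
    children = [p - 0.05 * surrogate_input_grad(p) + 0.1 * rng.normal(size=DIM)
                for p in parents]
    cfits = [true_fitness(c) for c in children]
    for c, y in zip(children, cfits):   # keep refining the surrogate online
        train_step(c, y)
    pop, fits = parents + children, pfits + cfits

best = min(fits)
```

Elitism guarantees the best fitness never regresses, and the surrogate gradient just nudges the mutation distribution; the true fitness function is still the final arbiter.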

Jason Liang

Here are some related papers on this method if you are interested:
http://epubs.surrey.ac.uk/7610/2/SC2005.pdf
http://www.researchgate.net/publication/220316557_A_Multi-Objective_Evolutionary_Algorithm_Using_Neural_Networks_to_Approximate_Fitness_Evaluations/file/9fcfd5139fb3c9a31b.pdf
http://www.soft-computing.de/willmes.jin.pdf
http://link.springer.com/chapter/10.1007/978-3-540-44511-1_14#page-1

Matthew Hausknecht

May 31, 2014, 5:37:16 PM
to Jason Liang, ut-f...@googlegroups.com
It seems reasonable to use a NN as an approximator of a fitness landscape. Perhaps it could perform better than using a multivariate Gaussian & covariance matrix as CMA-ES does. The main difficulty that I can see is the large number of parameters in the NN. For example, if an individual consists of one million weights and the NN has a hidden layer of just 1000 units, then you're already dealing with a billion-parameter NN. Also, I think it's safe to say that the number of parameters of the NN is going to be far larger than the number of datapoints - so it would likely be highly prone to overfitting. Finally, performance would likely depend on the hyperparameterization of the NN, which still must be manually defined. However, this seems like a reasonable and testable idea. If it outperformed CMA-ES across a number of domains then I suspect it would be publishable.

Matthew Hausknecht

Jun 2, 2014, 2:20:14 AM
to Jason Liang, ut-f...@googlegroups.com
Actually, in order to compute the gradient of the network with respect to an individual input (as in step 2), you'd need a differentiable objective function. This would forfeit the arbitrary-objective appeal of EAs.

Jason Liang

Jun 2, 2014, 2:37:40 AM
to ut-f...@googlegroups.com, jason...@gmail.com
I am not so sure I understand you. The neural network is approximating the real objective function, so even if the real objective function is not differentiable, the neural network itself is. If the original objective function has a discontinuity, like a piecewise function, the neural network will still be smooth.

Also, you are right that the large number of parameters of the approximating neural network would be problematic, especially given the high dimensionality of the original deep neural network.
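That smoothing effect is easy to see in a toy 1-D example (illustrative numpy sketch, all constants made up): fit a small tanh network to a step function. The target jumps at 0, but the network, being a finite composition of smooth functions, stays smooth there.

```python
import numpy as np

rng = np.random.default_rng(0)
target = lambda x: np.where(x > 0, 1.0, -1.0)   # discontinuous at 0
xs = rng.uniform(-1, 1, size=200)
ys = target(xs)

# One-hidden-layer tanh net: smooth everywhere by construction.
H = 20
W1 = rng.normal(size=H); b1 = rng.normal(size=H)
W2 = rng.normal(scale=0.1, size=H); b2 = 0.0

def net(x):
    return float(np.tanh(W1 * x + b1) @ W2 + b2)

# Full-batch gradient descent on the squared error.
for _ in range(3000):
    h = np.tanh(np.outer(xs, W1) + b1)          # (200, H)
    err = h @ W2 + b2 - ys                      # (200,)
    dh = (err[:, None] * W2) * (1 - h ** 2)     # (200, H)
    W2 -= 0.01 * (err @ h) / len(xs)
    b2 -= 0.01 * err.mean()
    W1 -= 0.01 * (dh * xs[:, None]).mean(axis=0)
    b1 -= 0.01 * dh.mean(axis=0)

eps = 1e-5
jump_target = float(target(eps) - target(-eps))   # 2.0: a genuine jump
jump_net = abs(net(eps) - net(-eps))              # tiny: no discontinuity
```

So gradients of the surrogate exist everywhere, even where the true objective has none.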

Matthew Hausknecht

Jun 2, 2014, 10:01:45 PM
to Jason Liang, ut-f...@googlegroups.com
Yes, you're correct. It seems I've been using automatic differentiation for too long :)