Stochastic gradient descent doubt


Rio2020

May 19, 2020, 4:14:21 AM
to Machine Learning for Physicists
Hello, there is a theoretical point on which I am stuck. In stochastic gradient descent, we effectively truncate the real cost function C(W), which could be composed of millions of pieces (the sum over the red bars/samples in the video lectures), down to around 100-1000 pieces. I cannot understand why such an approximation can guarantee that the minimum of the real C(W) and that of the truncated one are almost the same. Is there some intuitive reasoning that can help to grasp this?
Many thanks

Florian Marquardt

May 19, 2020, 1:52:52 PM
to Machine Learning for Physicists
Well, the point is that for each new step (each batch) we sample fresh. And if the steps are small enough (which they will be near the minimum and if the learning rate is small), then we effectively average over many more than just a single batch. This is how stochastic gradient descent can approximate the correct gradient. 
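This can be seen numerically. Below is a small sketch (with a hypothetical least-squares toy problem, not the one from the lectures): a single batch gradient is noisy, but averaging over many freshly drawn batches, which is effectively what many small SGD steps do, recovers the full gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy cost: C(w) = mean_i (w*x_i - y_i)^2 over N "pieces".
N = 100_000
x = rng.normal(size=N)
y = 3.0 * x + rng.normal(scale=0.1, size=N)

def full_grad(w):
    """Exact gradient of C(w), summing over all N samples."""
    return np.mean(2.0 * (w * x - y) * x)

def batch_grad(w, batch_size=100):
    """Stochastic estimate from one freshly sampled batch."""
    i = rng.integers(0, N, size=batch_size)
    return np.mean(2.0 * (w * x[i] - y[i]) * x[i])

w = 0.0
single = batch_grad(w)                                  # noisy estimate
averaged = np.mean([batch_grad(w) for _ in range(1000)])  # average of 1000 fresh batches
exact = full_grad(w)
print(single, averaged, exact)
```

The averaged value lands very close to the exact gradient, because the statistical error of the batch estimate shrinks as the number of independent batches grows.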

It is the same as in other 'Monte Carlo' approaches: One has to average some function f(x), with an x that is a random variable, but instead of doing the exact integral ∫ p(x) f(x) dx, one just produces N random x values (according to the probability p(x)), and then averages over those. This can produce a good-enough approximation.
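As a minimal illustration of this Monte Carlo idea (my own example, with f(x) = x² and x drawn from a standard normal, where the exact average is 1):

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo estimate of <f(x)> = integral of p(x) f(x) dx,
# for f(x) = x^2 with x ~ Normal(0, 1); the exact answer is 1.
def f(x):
    return x**2

for n in (100, 10_000, 1_000_000):
    samples = rng.normal(size=n)
    print(n, f(samples).mean())  # approaches 1 as n grows
```

The statistical error shrinks like 1/sqrt(n), so even a modest number of samples can already give a usable estimate.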

Rio2020

May 20, 2020, 4:56:14 AM
to Machine Learning for Physicists
Ok! Many thanks, now it is clear!