Hi all,
In lecture 7, Prof. Marquardt proposes a homework problem about the random walker, which has not been discussed yet. The idea is to check how much the learning can be improved by including the reward baseline. I am posting my solution in case someone is interested in comparing, or has comments.
On slide 75 the reinforcement learning algorithm is implemented without any averaging over a batch (i.e. batchsize = 1), but to implement a baseline, batch averaging is needed. Fig. 1 shows the learning process (without baseline) when averaging over a relatively small batch (batchsize = 50). The learning process already looks pretty smooth; however, if we look at the actual increments per training step (second plot in Fig. 1), the noise is still large (the increments are often even negative). The variance follows the analytical predictions from the lecture well (solid lines in the third plot of Fig. 1).
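For reference, here is a minimal sketch of the batch-averaged policy gradient I used. This is not the official solution, and the setup is my own assumption of the lecture's random walker: the policy takes a step +1 with probability p = sigmoid(theta) and -1 otherwise, and the reward is the final position after T steps; all names and parameter values are illustrative.

```python
import numpy as np

def train(batchsize=50, steps=200, T=100, eta=0.001,
          use_baseline=False, seed=0):
    # Walker steps +1 with probability p = sigmoid(theta), -1 otherwise;
    # reward = final position after T steps (my assumed setup).
    rng = np.random.default_rng(seed)
    theta = 0.0
    mean_rewards = []
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-theta))
        up = rng.random((batchsize, T)) < p      # True = step up
        R = np.where(up, 1.0, -1.0).sum(axis=1)  # reward per trajectory
        # d log pi(a_t)/d theta, summed over the trajectory:
        # (1 - p) for an up step, -p for a down step
        dlogp = np.where(up, 1.0 - p, -p).sum(axis=1)
        b = R.mean() if use_baseline else 0.0    # batch-mean baseline
        theta += eta * np.mean((R - b) * dlogp)  # batch-averaged update
        mean_rewards.append(R.mean())
    return theta, np.array(mean_rewards)
```

With or without the baseline the mean update is the same (the baseline only reduces the variance), so both versions converge towards p close to 1.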
As expected, adding the baseline to the rewards (it is one line of code) strongly suppresses the noise: the variance decreases dramatically. The plots in Fig. 2 show the result (only the baseline is added; all the other parameters are the same as before):
I found this quite interesting, and it is probably something to keep in mind when working in more complicated situations, since less noise allows one to increase the learning rate and reduce the batch size.
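The variance reduction can also be checked directly, at a fixed policy, by comparing the spread of the batch-averaged gradient estimate with and without the baseline. Again this assumes my illustrative setup (step up with probability p, reward = final position), not the official solution:

```python
import numpy as np

# Variance of the batch-averaged policy-gradient estimate, with and
# without the batch-mean reward baseline, at a fixed policy.
# (p = probability of an up step; all values are illustrative.)
rng = np.random.default_rng(1)
p, T, batchsize, trials = 0.8, 100, 50, 1000

n_up = (rng.random((trials, batchsize, T)) < p).sum(axis=2)
R = 2 * n_up - T                  # reward = final position
dlogp = n_up - p * T              # sum_t d log pi(a_t) / d theta
b = R.mean(axis=1, keepdims=True) # batch-mean baseline

g_plain = np.mean(R * dlogp, axis=1)        # estimate without baseline
g_base = np.mean((R - b) * dlogp, axis=1)   # estimate with baseline
print(g_plain.var(), g_base.var())
```

In this setting both estimators have (essentially) the same mean, 2*T*p*(1-p), but the baseline version has a much smaller variance, consistent with the plots above.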