Is There a Testing Phase in Reinforcement Learning?

David

Nov 8, 2009, 10:34:19 AM

to Reinforcement Learning Mailing List

Hi

I have read the book "Reinforcement Learning: An Introduction" by
Richard S. Sutton and Andrew G. Barto. My understanding of Reinforcement
Learning (RL) is that an agent learns a policy that optimizes the
accumulated reward by selecting the optimal action at each
state, starting at an initial state and continuing until the agent
reaches a goal state. This learning process with that specific initial state is
called an episode. After this episode, the agent starts at a different
initial state and learns another policy that optimizes the accumulated
reward. There can be as many episodes as you wish. After a specified
number of episodes, the problem is solved. So in RL, there is only a
learning or training phase and no separate testing phase. This is in
contrast to supervised learning, where, after a classifier is learnt from a
training dataset, it is used to classify the instances of a testing
dataset to evaluate its classification performance.

Is my understanding of RL right?

David

Tom Dietterich

Nov 8, 2009, 3:51:51 PM

to Reinforcement Learning Mailing List

Yes, RL is inherently an online learning setting. The agent never
stops learning, so there is never a separate test phase. The standard
way to compare algorithms is to compare cumulative reward curves (or
cumulative regret curves wrt some fixed policy such as the optimal
policy, when it is known).
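
As a rough illustration of such curves, here is a minimal sketch on a toy Bernoulli bandit. The bandit, the epsilon-greedy learner, and the names `run_epsilon_greedy` and `cumulative` are assumptions made for the example, not something from the thread or the book. Regret is measured against the best arm, which is only possible here because the arm means are known.

```python
import random

def run_epsilon_greedy(arm_means, steps, epsilon, seed):
    """Run epsilon-greedy on a Bernoulli bandit (hypothetical toy problem).

    Returns the per-step rewards and the per-step expected regret with
    respect to the best arm, which is known only because we built the bandit.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    best_mean = max(arm_means)
    rewards, regrets = [], []
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        # Incremental sample-average update of the arm's value estimate.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        rewards.append(reward)
        regrets.append(best_mean - arm_means[arm])
    return rewards, regrets

def cumulative(values):
    """Turn a per-step sequence into a running-total curve."""
    total, curve = 0.0, []
    for v in values:
        total += v
        curve.append(total)
    return curve

if __name__ == "__main__":
    for eps in (0.01, 0.1):
        rewards, regrets = run_epsilon_greedy([0.2, 0.5, 0.8], 1000, eps, seed=0)
        print(eps, cumulative(rewards)[-1], cumulative(regrets)[-1])
```

Plotting `cumulative(rewards)` or `cumulative(regrets)` against the step index for each algorithm gives the kind of online comparison described above; a single run is noisy, so in practice one would average over several seeds.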

But if we want to compare different RL algorithms to see which one
learns faster, it could also make sense to define a learning phase
followed by a test phase (in which learning is disabled and we then
evaluate the learned policies). The advantage of having a separate
test phase is that we can obtain more accurate estimates of the value
of the learned policy than we get from a single trajectory in the
online setting.
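
A minimal sketch of that two-phase protocol, assuming a made-up six-state chain with a small chance that a move is reversed, and tabular Q-learning as the learner. All names and constants (`N_STATES`, `SLIP`, `run_episode`, `train_then_test`) are illustrative assumptions. During the test phase both exploration and updates are switched off, and the value of the learned policy is estimated by averaging over many rollouts rather than one.

```python
import random

N_STATES = 6          # hypothetical toy chain: states 0..5, state 5 is terminal
ACTIONS = (-1, +1)    # action 0 moves left, action 1 moves right
SLIP = 0.1            # probability that the chosen move is reversed

def step(state, action, rng):
    """Apply one noisy move; reward 1 on reaching the goal, small cost otherwise."""
    move = ACTIONS[action]
    if rng.random() < SLIP:
        move = -move
    nxt = min(max(state + move, 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else -0.01), done

def greedy(q, state):
    return max(range(len(ACTIONS)), key=lambda a: q[state][a])

def run_episode(q, rng, learn, epsilon=0.1, alpha=0.2, gamma=0.95, max_steps=200):
    """Run one episode from state 0; update q only when learn is True."""
    state, total = 0, 0.0
    for _ in range(max_steps):
        explore = learn and rng.random() < epsilon
        action = rng.randrange(len(ACTIONS)) if explore else greedy(q, state)
        nxt, reward, done = step(state, action, rng)
        if learn:
            # Standard Q-learning update; skipped entirely in the test phase.
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
        state, total = nxt, total + reward
        if done:
            break
    return total

def train_then_test(train_episodes=500, test_episodes=200, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(train_episodes):          # learning phase
        run_episode(q, rng, learn=True)
    frozen = [row[:] for row in q]
    returns = [run_episode(q, rng, learn=False) for _ in range(test_episodes)]
    assert q == frozen                        # test phase left the table untouched
    return sum(returns) / len(returns), q

if __name__ == "__main__":
    mean_return, _ = train_then_test()
    print("mean test return:", mean_return)
```

Because the environment is stochastic, any single test rollout can be unlucky; averaging over `test_episodes` rollouts of the frozen policy is what gives the tighter estimate.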

There are also situations in which one can define a notion of
"supervised RL". That is, for some set of trajectories, a teacher (or
an oracle of some kind) provides an optimal policy. It makes sense to
see how well this information can be used by an RL algorithm to learn
a policy and then evaluate it in a separate test phase (where learning
is disabled).

Finally, in some cases, one wishes to separate the question of
efficient exploration from the question of efficient learning. In
such cases, one can imagine applying a fixed exploration policy to
generate a set of trajectories and then holding those trajectories
fixed while comparing a set of RL algorithms all trained on the same
trajectories. Again, a separate test phase might be appropriate.

Summary: in general, RL is an online learning process, so an
end-to-end evaluation of an RL system is an online evaluation. But for
experimental or engineering purposes, offline learning and evaluation
can help control variability and guide research.

Thomas G. Dietterich Voice: 541-737-5559
School of EECS FAX: 541-737-1300
1148 Kelley Engineering Center URL: http://eecs.oregonstate.edu/~tgd
Oregon State University, Corvallis, OR 97331-5501

Rob Zumwalt

Nov 8, 2009, 4:29:21 PM

to rl-...@googlegroups.com

In addition, there are some fields where production use should not
include exploration. Learning the (near) optimal policy is accomplished
in training, using exploration. Exploration and learning may be turned
off for production use.

In this sort of regime, comparing the learned policies (from different
RL techniques) in an out-of-sample test can be useful.
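
A small sketch of what turning exploration and learning off might look like for a tabular agent. The class `TabularAgent` and its `freeze` method are hypothetical names chosen for the example; a real system would likely have its own way of switching modes.

```python
import random

class TabularAgent:
    """Epsilon-greedy Q-learning agent with a switch for production use."""

    def __init__(self, n_states, n_actions, epsilon=0.1, alpha=0.1, gamma=0.95, seed=0):
        self.q = [[0.0] * n_actions for _ in range(n_states)]
        self.epsilon, self.alpha, self.gamma = epsilon, alpha, gamma
        self.rng = random.Random(seed)
        self.frozen = False

    def freeze(self):
        """Disable both exploration and learning, e.g. before deployment."""
        self.frozen = True

    def act(self, state):
        # Explore only while training; a frozen agent is always greedy.
        if not self.frozen and self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.q[state]))
        row = self.q[state]
        return max(range(len(row)), key=lambda a: row[a])

    def update(self, state, action, reward, next_state, done):
        if self.frozen:
            return  # production mode: the policy no longer changes
        target = reward + (0.0 if done else self.gamma * max(self.q[next_state]))
        self.q[state][action] += self.alpha * (target - self.q[state][action])

if __name__ == "__main__":
    agent = TabularAgent(n_states=2, n_actions=2)
    agent.update(0, 1, 1.0, 1, True)   # one training update
    agent.freeze()
    print(agent.act(0))                # greedy action under the frozen table
```

Once frozen, the agent behaves as a fixed policy, so policies learned by different RL techniques can be compared on the same held-out episodes.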

-Rob

Curt Welch

Nov 9, 2009, 10:51:29 AM

to rl-...@googlegroups.com

On Sun, Nov 8, 2009 at 10:34 AM, David <dtian...@googlemail.com> wrote:
>
> Hi
>
> I have read the book "Reinforcement Learning: An Introduction" by
> Richard S. Sutton and Andrew G. Barto. My understanding of Reinforcement
> Learning (RL) is that an agent learns a policy that optimizes the
> accumulated reward by selecting the optimal action at each
> state, starting at an initial state and continuing until the agent
> reaches a goal state.

Many RL problems have no goal state. The goal of RL agents is to
produce higher rewards in the future than in the past.

> This learning process with that specific initial state is
> called an episode. After this episode, the agent starts at a different
> initial state and learns another policy that optimizes the accumulated
> reward.

It's uncommon to throw away the old policy and learn a new one if that
is what you are suggesting.

The general idea is usually to keep improving the current policy
forever. The only reason the learned policy is thrown away is when
testing the learning ability of a given algorithm.

> There can be as many episodes as you wish. After a specified
> number of episodes, the problem is solved.

Only trivial RL domains have a solution. Most real-world RL domains
are never solved - and can never be solved. That's because the domain
is more complex than the learning algorithm can ever hope to fully
understand (aka learn, aka optimize a policy for).

For example, tic tac toe seems simple enough to learn to play
perfectly. But that is seldom true because you are not just learning
the game, you are also learning the behavior of your opponent. If
your opponent is not always playing perfect moves, you can win a game
every once in a while by playing moves that are most likely to trick
your opponent. But you have to learn what moves are most likely to
trick a given opponent.

In addition, if the RL agent is playing an opponent that also learns,
it has to learn not only how to play, but how to adjust its own play
based on how its opponent's play will change over time through
learning. So the optimal policy would have to include full knowledge
of how the other agent learns. This is seldom practical - leaving the
problem that there is no optimal solution.

Any agent that has to interact with the real world faces an unsolvable
learning problem. That is, no agent that's part of the universe can
fully understand and predict everything that will happen to it that
affects the rewards it will get. So these types of RL problems are
always a question of how well the agent can do with limited knowledge,
because perfect knowledge is not possible.

> So in RL, there is only a
> learning or training phase and no separate testing phase.

Generally. But it is normally possible, and sometimes useful, as
others have already pointed out, to turn off learning and test the
agent without learning enabled. However, the more normal case is for
the agent to learn from every action.

> This is in
> contrast to supervised learning, where, after a classifier is learnt from a
> training dataset, it is used to classify the instances of a testing
> dataset to evaluate its classification performance.
>
> Is my understanding of RL right?
>
> David

Yes, most of what you wrote is basically correct. However, RL in my
view is a far broader and more abstract idea than how you are trying
to portray it.

Reinforcement learning covers a very wide range of systems that
attempt to maximize a reward signal through interaction with an
environment. That's about all you can say about it. It's so broad that
if you are creative enough, you can actually show that a rock is a
reinforcement learning machine. However, it's framed a bit more
narrowly in the Sutton and Barto book as a finite state machine which
operates in discrete time steps.

If you are trying to understand the difference between what we talk
about as supervised learning vs reinforcement learning, a key
difference in my view is that in supervised learning, the trainer must
know some of the correct answers whereas in RL, the trainer is not
required to know any of the correct answers.

Curt
