Many RL problems have no goal state. The goal of an RL agent is to
produce higher rewards in the future than it did in the past.
> This learning process with that specific initial state is
> called an episode. After this episode, the agent starts at a different
> initial state and learns another policy that optimizes the accumulated
> rewards.
If that is what you are suggesting, it's uncommon to throw away the old
policy and learn a new one for each episode.
The general idea is usually to keep improving the current policy
forever. About the only reason a learned policy gets thrown away is to
test the learning ability of a given algorithm from scratch.
> There can be as many episodes as you wish. After a specified
> number of episodes, the problem is solved.
Only trivial RL domains have a solution. Most real-world RL domains
are never solved, and can never be solved. That's because the domain
is more complex than the learning algorithm can ever hope to fully
understand (aka learn, aka optimize a policy for).
For example, tic-tac-toe seems simple enough to learn to play
perfectly. But that is seldom the whole story, because you are not just
learning the game, you are also learning the behavior of your opponent.
If your opponent does not always play perfect moves, you can win a game
every once in a while by playing the moves that are most likely to
trick them. But you first have to learn which moves are most likely to
trick a given opponent.
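To make that concrete, here is a minimal sketch that treats a few
candidate openings as bandit arms and estimates each one's win rate
against one specific, imperfect opponent. The opponent model, the move
names, and the blunder rates are all invented for illustration.

```python
import random

# Hypothetical probability (unknown to the agent) that each opening
# tricks this particular opponent into a losing reply.
OPPONENT_BLUNDER_RATE = {"corner": 0.30, "edge": 0.10, "center": 0.05}

def play_game(opening):
    """Return 1 for a win, 0 otherwise (stand-in for a full game)."""
    return 1 if random.random() < OPPONENT_BLUNDER_RATE[opening] else 0

value = {m: 0.0 for m in OPPONENT_BLUNDER_RATE}  # estimated win rates
plays = {m: 0 for m in OPPONENT_BLUNDER_RATE}

for episode in range(5000):
    # Epsilon-greedy: mostly exploit the best-known opening, sometimes explore.
    if random.random() < 0.1:
        move = random.choice(list(value))
    else:
        move = max(value, key=value.get)
    reward = play_game(move)
    plays[move] += 1
    value[move] += (reward - value[move]) / plays[move]  # incremental mean

print(value)  # settles near this opponent's actual blunder rates
```

Play a different opponent and the learned values, and therefore the
greedy move, come out differently; the agent is learning the opponent,
not just the game.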
In addition, if the RL agent is playing an opponent that also learns,
it has to learn not only how to play, but how to adjust its own play
based on how its opponent's play will change over time through
learning. So the optimal policy would have to include full knowledge
of how the other agent learns. This is seldom practical, leaving a
problem with no optimal solution.
Any agent that has to interact with the real world faces an unsolvable
learning problem. That is, no agent that is part of the universe can
fully understand and predict everything that will happen to it that
affects the rewards it will get. So these types of RL problems are
always a question of how well the agent can do with limited knowledge,
because perfect knowledge is not possible.
> So in RL, there is only a
> learning or training phase and not a separate testing phase.
Generally, yes. But it is normally possible, and sometimes useful, as
others have already pointed out, to turn off learning and test the
agent with learning disabled. However, the more common case is for
the agent to learn from every action.
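As a rough sketch of what that toggle can look like, assuming a simple
tabular Q-learning agent (the class and method names here are mine, not
from any particular library):

```python
import random
from collections import defaultdict

class QAgent:
    def __init__(self, actions, alpha=0.1, epsilon=0.1, gamma=0.9):
        self.q = defaultdict(float)  # Q-values keyed by (state, action)
        self.actions = actions
        self.alpha, self.epsilon, self.gamma = alpha, epsilon, gamma

    def act(self, state, learning=True):
        # Explore only while learning; act purely greedily during a test.
        if learning and random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Standard Q-learning step toward reward plus discounted best next value.
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])

agent = QAgent(actions=["left", "right"])
# Training: explore, and update after every action.
a = agent.act("s0", learning=True)
agent.update("s0", a, 1.0, "s1")
# Testing: act greedily and skip update(), so the policy is frozen.
a = agent.act("s0", learning=False)
```

Nothing about the agent changes between the two phases; testing just
freezes the Q-table and drops the exploration.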
> This is in
> contrast to supervised learning, where, after a classifier is learned
> from a training dataset, it is used to classify the instances of a
> testing dataset to evaluate its classification performance.
> Is my understanding of RL right?
Yes, most of what you wrote is basically correct. However, RL in my
view is a far broader and more abstract idea than how you are trying
to portray it.
Reinforcement learning covers a very wide range of systems that
attempt to maximize a reward signal through interaction with an
environment. That's about all you can say about it. It's so broad
that, if you are creative enough, you can actually argue that a rock
is a reinforcement learning machine. However, it's framed a bit more
narrowly in the Sutton and Barto book as a finite state machine that
operates in discrete time steps.
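That framing boils down to a short interaction loop. Here is a minimal
sketch of it, with a made-up two-state environment purely so the loop
runs:

```python
import random

def env_step(state, action):
    """Invented two-state environment: return (next_state, reward)."""
    if state == "A":
        return ("B", 1.0) if action == "go" else ("A", 0.0)
    return ("A", 0.0) if action == "go" else ("B", 0.5)

state = "A"
total_reward = 0.0
for t in range(100):  # discrete time steps t = 0, 1, 2, ...
    action = random.choice(["go", "stay"])  # stand-in for the agent's policy
    state, reward = env_step(state, action)
    total_reward += reward  # the signal the agent tries to maximize
print(total_reward)
```

Everything interesting in RL lives in how the `action = ...` line gets
smarter over time.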
If you are trying to understand the difference between what we talk
about as supervised learning vs reinforcement learning, a key
difference in my view is that in supervised learning the trainer must
know some of the correct answers, whereas in RL the trainer is not
required to know any of the correct answers.
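A toy contrast, with made-up numbers, to pin that down: the supervised
update moves a prediction toward a known correct answer, while the RL
update only ever sees a scalar reward for the action it happened to
try.

```python
import random

# Supervised: the trainer supplies the correct answer y for input x.
w = 0.0
x, y = 2.0, 1.0             # (input, known correct answer)
w += 0.1 * (y - w * x) * x  # gradient step toward the label

# RL: no correct answer is ever given. The learner tries an action and
# receives only a scalar reward saying how well that action worked out.
value = {0: 0.0, 1: 0.0}              # estimated value of each action
for _ in range(100):
    a = random.choice([0, 1])         # try an action
    r = 1.0 if a == 1 else 0.0        # reward signal, not a label
    value[a] += 0.1 * (r - value[a])  # update based on reward alone
```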