I had a somewhat related problem once, as we used the information-gain as
reward for a planned exploration scheme.
First, it is obvious that the average of the observed rewards is not
the appropriate concept.
In the information-gain setting we can build a better estimate of the
reward function by using the most recent reward r(s,a,s') of a
transition instead of the average reward for that transition.
In a less discontinuous "mouse and cheese" setting, the most recent
reward might be an acceptable estimate of the reward function as well
(e.g. if there is lots of cheese in a cell and the mouse can eat only a
tiny bit of it, or if the cells with cheese are clustered in a larger
region ...).
The question is, can you formulate an adequate estimate of the reward
function in your setting?
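For concreteness, here is a minimal Python sketch of the two estimators
I mean (the class and method names are my own, purely for illustration):

    from collections import defaultdict

    class RewardEstimator:
        """Estimate r(s, a, s') from observed transitions."""

        def __init__(self, use_most_recent=True):
            self.use_most_recent = use_most_recent
            self.last = {}                      # (s, a, s') -> latest reward
            self.sums = defaultdict(float)      # (s, a, s') -> running sum
            self.counts = defaultdict(int)      # (s, a, s') -> visit count

        def update(self, s, a, s_next, r):
            key = (s, a, s_next)
            self.last[key] = r
            self.sums[key] += r
            self.counts[key] += 1

        def estimate(self, s, a, s_next):
            key = (s, a, s_next)
            if self.use_most_recent:
                # Tracks a changing reward (e.g. information gain)
                # instead of averaging it away.
                return self.last.get(key, 0.0)
            return self.sums[key] / max(self.counts[key], 1)

With use_most_recent=True the estimate follows a drifting reward
immediately; the running average would lag behind it indefinitely.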
-
Steffen
--
Dr. Steffen Udluft
Siemens AG
Corporate Research and Technologies
Intelligent Systems & Control
regards
Francisco Martinez Gil
This reminds me of the trap domain [1], which is a grid world with one
state containing a flag. The agent has to collect the flag and deliver
it to the goal state. The domain is a 3x3 grid, so at first sight it
would have 9 states. However, one must also consider the position of
the flag: it can either be in the flag state or be carried by the
agent. Either way, there are two possibilities, so the state space is
doubled and has 18 states.
In your mouse and cheese environment, one could for each possible cheese
position add a state variable denoting whether cheese is there or not. The
agent would receive the high reward when it gets to a state with the
cheese. Once the cheese is eaten, the state changes. With the state coded
this way, the environment would be stationary. If, however, you omit
this information, the state representation would not satisfy the Markov
property. One could probably call the environment non-stationary then.
One can always reconstruct the Markov property by considering not just
the current state but also a number of previous ones, though this of
course blows up the state space enormously.
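For illustration, a small Python sketch of this augmentation, assuming
fixed candidate cheese cells and a reward of 1 per piece (both
assumptions are mine, not from the original post):

    # Augmented state = (mouse position, frozenset of remaining cheese
    # cells). Eating cheese moves the agent to a different augmented
    # state, so the environment stays stationary and Markov.

    CHEESE_CELLS = {(0, 2), (1, 1)}            # hypothetical locations

    def initial_state(start_pos):
        return (start_pos, frozenset(CHEESE_CELLS))

    def step(state, new_pos):
        pos, remaining = state
        if new_pos in remaining:
            reward = 1.0                       # cheese found and eaten
            remaining = remaining - {new_pos}  # the state changes for good
        else:
            reward = 0.0
        return (new_pos, remaining), reward

With N candidate cells this yields at most |S| * 2^N augmented states,
which is exactly the doubling seen in the trap domain for N = 1.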
If you think that your environment is non-stationary, maybe you can change
the state representation so that the non-stationarity becomes observable
(like putting the information whether there is cheese in a field into the
state). You could also look at approaches that explicitly deal with
partially observable MDPs, because this is exactly what you have in your
description of the mouse and cheese environment.
You could as well tell us more about your actual problem; maybe then
people on the list could give suggestions. Unfortunately, I know almost
nothing about POMDPs ...
Hope that helps,
Alex
[1] Dearden, R., Friedman, N., and Andre, D. Model-based Bayesian
exploration. Proceedings of the Fifteenth Conference on Uncertainty in
Artificial Intelligence, pp. 150–159, 1999.
--
One straightforward solution is to augment the state space by all possible cheese configurations. If the number of cheese locations is small, this may be an option (as in the taxi domain). Suppose you have n possible cheese locations; then the state space grows exponentially, by a factor of 2^n, and you hit the curse of dimensionality. I'm not sure whether it is possible to find irrelevant state variables and perform state abstraction on this transformed representation, but at least the problem becomes an MDP again.
Ersin
--
What should the mouse learn? What is the goal of the problem? What can
the mouse observe?
It may be a good idea to first specify the full problem (what is the
situation, what should be learned, etc.). E.g., if cheese may or may
not appear at any time at any given state with uniform probabilities,
the mouse can learn nothing and reinforcement learning is not
applicable. If the appearance of cheese only occurs at the start of an
episode (and the mouse gets more than one episode to learn!) and
cheese can occur more often at some locations than others, the remarks
of others (regarding non-Markov, non-stationarity properties of the
problem) apply.
Best,
Hado
By "an estimate of the reward function" you mean to be able to have a
good approximation of the reward as a function of the state? If that's
the case, the answer is yes. In fact the reward is precisely a
function of the state; however, as with the cheese, I only get the
reward the first time I visit the state.
Thanks. At least in principle this approach should work. As Ersin
pointed out, augmenting the state space in this way implies that the
number of states is 2^N * |S|, where N is the number of cells with
cheese and |S| is the number of states in the original representation.
> You could as well tell us more about your actual problem; maybe then
> people on the list could give suggestions. Unfortunately, I know almost
> nothing about POMDPs ...
The actual problem is more complicated. I get a reward, let's say
between 0 and 1, the first time I visit a state, and 0 otherwise. Using
the approach described above, that would mean 2^|S| * |S| states. I
think I can threshold some of the rewards (let's say everything smaller
than 0.5 is 0), so N is not that big.
Alejandro.
> What should the mouse learn?
Where the cheese is.
> What is the goal of the problem?
To find as much cheese as possible, in a fixed amount of time.
>What can the mouse observe?
Where it is (in which cell it is), and whether or not there is cheese
in that cell. If there is cheese, it will eat the cheese.
> It may be a good idea to first specify the full problem (what is the
> situation, what should be learned, etc.). E.g., if cheese may or may
> not appear at any time at any given state with uniform probabilities,
> the mouse can learn nothing and reinforcement learning is not
> applicable. If the appearance of cheese only occurs at the start of an
> episode (and the mouse gets more than one episode to learn!) and
> cheese can occur more often at some locations than others, the remarks
> of others (regarding non-Markov, non-stationarity properties of the
> problem) apply.
The cheese locations are fixed at the beginning of each episode. Maybe
with some non-uniform distribution that puts the pieces of cheese
close to the "kitchen".
Alejandro.
Then it's not "precisely a function of the state". If the state is
not a predictor of the cheese (which it is not in your case since the
there is nothing about the state that tells the mouse if there is
cheese soon to come or not), then the state alone is not enough to
define the odds of finding cheese.
The state of the ENVIRONMENT (in your simulation) includes information
about where the cheese is located and where the mouse is located, as
well as the full topology of the 2D grid world. The mouse, however (per
your description), does not have access to much of that information at
all. As such, the "state signal" sent to the mouse does NOT
"precisely" (even in the least) define the reward. Don't confuse your
idea of the "state of the environment" with the agent's "state signal".
If they are one and the same (i.e. the full state of the environment is
encoded in the state signal), then you have a Markov problem. But if
the agent's state signal does not give it full information, then it's
some form of non-Markov problem - which can sometimes be partially
solved by RL if there is at least a non-uniform distribution of reward
probability across the state space of the "state signal".
To make any good headway on such a problem, the mouse needs to track
where it has been and attempt to cover as much virgin ground as
possible in the limited time. However, the knowledge of how to do that
is not encoded in the state signal, so the agent can't be driven simply
by the state signal. It must construct its own internal concept of
state based on history - which is typical of all real-world non-Markov
problems. And in general, such an approach only works if the engineer
building the algorithm knows that it applies to this environment, so
the problem is typically only well solved by hard-coding knowledge
about the behavior of the environment into the RL agent (i.e. that
it's a 2D grid world, and that rewards can only happen once per state
in an episode).
If you don't hard-code that knowledge into the solution used by the
agent, then the problem becomes many orders of magnitude harder for
the RL agent to solve.
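A rough Python sketch of such a hard-coded, history-based agent
(neighbors() and observe_reward() are hypothetical stand-ins for the
grid world's dynamics, not anything from the original problem):

    def choose_action(pos, neighbors, visited):
        """Prefer moves onto cells not yet visited this episode."""
        options = neighbors(pos)               # list of (action, next_pos)
        unvisited = [(a, p) for a, p in options if p not in visited]
        action, _ = (unvisited or options)[0]  # fall back to any move
        return action

    def run_episode(start, neighbors, observe_reward, max_steps):
        visited = {start}                      # internal memory, NOT part
        pos, total = start, 0.0                # of the state signal
        for _ in range(max_steps):
            a = choose_action(pos, neighbors, visited)
            pos = dict(neighbors(pos))[a]      # deterministic move assumed
            total += observe_reward(pos)       # cheese pays only on a
            visited.add(pos)                   # first visit, per the problem
        return total

The visited set is exactly the "internal concept of state based on
history" described above; nothing in the state signal itself would let
the agent distinguish virgin cells from exhausted ones.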
Also, keep in mind that if the agent has an option to not move, and
if not moving avoids the -1 reward, then the optimal solution for the
world could be to stand still and do nothing whenever it takes more
than 10 moves on average to find cheese.
--
Curt Welch
I am sorry. I was not clear with what I said. The mouse and cheese
environment is a simplification of the actual problem I want to work
on. I am using the mouse and cheese because it captures the main
problem I have, and I think it is easier to understand. When I said
that the reward is a function of the state, I was not thinking about
the mouse and cheese case; I was thinking about the original problem.
Alejandro.
It might be worthwhile to revisit the definitions of state,
stationarity and Markovness.
>> So for the mouse and cheese environment, if I define the state as
>> the cell position, that will not allow me to predict future rewards. If I
>> augment the state with the list of cheese pieces that have been eaten, then
>> I will be able to predict the distribution of the future observations. With
>> this definition, I can call the environment Markov.
>
> Well, this only works if you assume that the cheese position is
> randomly chosen, say, at the beginning of the episodes.
> Under this additional assumption, yes.
I'd just like to remark that a list of past locations of cheese pieces
is not sufficient. As Amit noted earlier, the mouse should know
about all the locations it has visited this episode; otherwise it will
be tempted to return over and over again to locations that did not have
cheese this episode but often had cheese in the past. (You probably
caught this, but just in case :)
Unfortunately, that would imply that you would need to store a
potentially large list of visited locations (or a vector of all
locations indicating whether or not they have been visited). Even for
a relatively small NxM grid (e.g., 4x4), this means the size of the
state space potentially increases to NM * 2^(NM-1).
(I got this number by multiplying all NM possible current locations
with all 2^(NM-1) possible configurations of previous locations. E.g.,
for a 4x4 grid: 16 * 2^15 = 524,288. The actual state space will be
smaller, because many configurations will never occur - for instance
because the mouse cannot jump over intermediate locations - but for
reasonably sized NxM grids, the state space can still be enormous.)
A large state space is not a problem in theory, but it may severely
slow learning if you use tabular value-based temporal-difference
reinforcement learning. (E.g., Q-learning, Sarsa, and similar
algorithms.)
Assuming the distribution of the cheese is stationary over different
episodes, you may have a better option. Depending on the actual
problem you are trying to solve, it may be easier to learn a model:
for each location store the estimated probability of cheese (i.e.,
simply the number of times cheese was seen divided by the number of
episodes - not the number of visits! - on which that location was
visited) and for each location and action store the probability of
transitioning to another location. (This last step may be trivial,
e.g., if moving 'left' from (1,1) deterministically gets you to
(0,1).)
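A small Python sketch of that bookkeeping (all names invented for
illustration; note the per-episode counting):

    from collections import defaultdict

    cheese_seen = defaultdict(int)       # location -> episodes with cheese
    episodes_visited = defaultdict(int)  # location -> episodes visited

    def record_episode(visited_locations, cheese_found_at):
        # Count each location at most once per episode, per the remark
        # above (episodes, not visits, are what matter).
        for loc in set(visited_locations):
            episodes_visited[loc] += 1
            if loc in cheese_found_at:
                cheese_seen[loc] += 1

    def cheese_probability(loc):
        n = episodes_visited[loc]
        return cheese_seen[loc] / n if n else 0.0  # no data yet: assume 0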
Then, you can plan ahead on each step, using the expected rewards
(which follow from the probabilities of cheese). To avoid revisiting
locations unnecessarily, you still have to store a list of all visited
locations. You can use this list to set the expected rewards for those
locations to -1 in the planning phase, because you know there cannot
be cheese there.
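In code, the planning-phase reward could look like this (building on
the cheese_probability() sketch above; the -1 step cost is the one
mentioned earlier in the thread):

    def expected_reward(loc, visited):
        if loc in visited:
            return -1.0                        # cheese gone: step cost only
        return -1.0 + cheese_probability(loc)  # step cost plus cheese bonus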
The planning may take longer (could be too long, depending on the size
of your problem), but the learning will be faster and quite reliable,
since only location-dependent rewards have to be estimated.
One final remark: for the planning phase, you should not use dynamic
programming with the locations as states and the expected rewards,
since this would not take into account the change in reward after a
visit to each location. However, simple heuristic planners probably
perform quite well and maybe dedicated planners exist for problems
with disappearing resources, but I am no expert on that.
Best,
Hado
It might be worthwhile to revisit the definitions of state and
stationary Markov environments.
Cheers,
- Csaba
> Assuming the distribution of the cheese is stationary over different
> episodes, you may have a better option. Depending on the actual
> problem you are trying to solve, it may be easier to learn a model:
> for each location store the estimated probability of cheese (i.e.,
> simply the number of times cheese was seen divided by the number of
> episodes - not the number of visits! - on which that location was
> visited) and for each location and action store the probability of
> transitioning to another location. (This last step may be trivial,
> e.g., if moving 'left' from (1,1) deterministically gets you to
> (0,1).)
[...]
> One final remark: for the planning phase, you should not use dynamic
> programming with the locations as states and the expected rewards,
> since this would not take into account the change in reward after a
> visit to each location. However, simple heuristic planners probably
> perform quite well and maybe dedicated planners exist for problems
> with disappearing resources, but I am no expert on that.
One could use dynamic programming if the problem were re-stated in
each time step, every time with a changed reward function. Actually,
re-planning would only be necessary once the reward function changes,
i.e., a piece of cheese was eaten or a cell that was expected to contain
cheese was found empty. Maybe using prioritized sweeping would speed it
up somewhat.
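A sketch of that inner planning step, assuming a finite state set and
deterministic transitions as in a grid world (reward() here is the
current expected-reward function, supplied anew after every change):

    def value_iteration(states, actions, transition, reward,
                        gamma=0.95, tol=1e-6):
        """Plain value iteration for a deterministic MDP."""
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                best = max(reward(s, a) + gamma * V[transition(s, a)]
                           for a in actions)
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:
                return V

Prioritized sweeping would replace the full sweep with a priority queue
ordered by Bellman error, so that after a local reward change only the
affected states get re-evaluated.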
Still, all we need to find is the optimal order of cells and then visit
them in that order. Assuming the cheese probability distribution to be
stationary, that order would be the same for each episode. So some
dedicated planning algorithm would probably be more suitable here.
However, note that in general it is not trivial to find the order in
which to visit the cells; for example, if the cell containing cheese
with the highest probability is far away, it might be better to visit
other cells first. I'm not quite sure, but this looks somewhat similar
to the traveling salesman problem, which is NP-hard, so an exact
solution cannot be computed efficiently (only in exponential time).
Cheers,
Alex
> The mouse and cheese environment is a simplification of the actual
> problem I want to work on. I am using the mouse and cheese because it
> captures the main problem I have, and I think it is easier to
> understand.
You might find hints about "natural solutions" interesting. See two
recent papers and the references therein.
Rat experiments with a cheeseboard:
Dupret, D., O'Neill, J., Pleydell-Bouverie, B., & Csicsvari, J. (2010).
The reorganization and reactivation of hippocampal maps predict spatial
memory performance. Nature Neuroscience, 13, 995–1002.
DOI: 10.1038/nn.2599
Potential solutions and dissociation of problems in humans:
Glascher, J., Daw, N. D., Dayan, P., & O'Doherty, J. P. (2010).
States versus rewards: dissociable neural prediction error signals
underlying model-based and model-free reinforcement learning.
Neuron. DOI: 10.1016/j.neuron.2010.04.016
http://www.gatsby.ucl.ac.uk/~dayan/papers/glasch10.pdf
Have fun!
Andras
_____________________
András Lőrincz
Department of Software Technology and Methodology
Eötvös Loránd University
--