I had a somewhat related problem once, as we used the information-gain as
reward for a planned exploration scheme.
First, it is obvious that the average of the observed rewards is not
the appropriate concept.
In the information-gain setting we can build a better estimate of the
reward function by using the most recent reward r(s,a,s') of a
transition instead of the average reward for that transition.
In a less discontinuous "mouse and cheese" setting, the most recent
reward might be an acceptable estimate of the reward function as well
(e.g. if there is lots of cheese in a cell and the mouse can eat only a
tiny bit of it, or if the cells with cheese are clustered in a larger
region ...).
The question is, can you formulate an adequate estimate of the reward
function in your setting?
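For concreteness, here is a minimal Python sketch of the two estimators
I mean (the class and method names are my own, purely for illustration):

    from collections import defaultdict

    class RewardEstimator:
        """Estimate r(s, a, s') from observed transitions."""

        def __init__(self, use_most_recent=True):
            self.use_most_recent = use_most_recent
            self.last = {}                      # (s, a, s') -> latest reward
            self.sums = defaultdict(float)      # (s, a, s') -> running sum
            self.counts = defaultdict(int)      # (s, a, s') -> visit count

        def update(self, s, a, s_next, r):
            key = (s, a, s_next)
            self.last[key] = r
            self.sums[key] += r
            self.counts[key] += 1

        def estimate(self, s, a, s_next):
            key = (s, a, s_next)
            if self.use_most_recent:
                # Tracks a changing reward (e.g. information gain)
                # instead of averaging it away.
                return self.last.get(key, 0.0)
            return self.sums[key] / max(self.counts[key], 1)

With use_most_recent=True the estimate follows a drifting reward
immediately; the running average would lag behind it indefinitely.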
-
Steffen
--
Dr. Steffen Udluft
Siemens AG
Corporate Research and Technologies
Intelligent Systems & Control
regards
Francisco Martinez Gil
This reminds me of the trap domain [1], which is a grid world with one
state containing a flag. The agent has to collect the flag and deliver
it to the goal state. The domain is a 3x3 grid, so at first sight it
would have 9 states. However, one must also consider the position of
the flag: it can either be in the flag state or be carried by the
agent. Either way, there are two possibilities, so the state space is
doubled and has 18 states.
In your mouse and cheese environment, one could for each possible cheese
position add a state variable denoting whether cheese is there or not. The
agent would receive the high reward when it gets to a state with the
cheese. Once the cheese is eaten, the state changes. With the state coded
this way, the environment would be stationary. If, however, you omit
this information, the state representation would not satisfy the Markov
property. One could probably call the environment non-stationary then.
One can always reconstruct the Markov property by considering not just
the current state but also a number of previous ones, though this of
course blows up the state space enormously.
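For illustration, a small Python sketch of this augmentation, assuming
fixed candidate cheese cells and a reward of 1 per piece (both
assumptions are mine, not from the original post):

    # Augmented state = (mouse position, frozenset of remaining cheese
    # cells). Eating cheese moves the agent to a different augmented
    # state, so the environment stays stationary and Markov.

    CHEESE_CELLS = {(0, 2), (1, 1)}            # hypothetical locations

    def initial_state(start_pos):
        return (start_pos, frozenset(CHEESE_CELLS))

    def step(state, new_pos):
        pos, remaining = state
        if new_pos in remaining:
            reward = 1.0                       # cheese found and eaten
            remaining = remaining - {new_pos}  # the state changes for good
        else:
            reward = 0.0
        return (new_pos, remaining), reward

With N candidate cells this yields at most |S| * 2^N augmented states,
which is exactly the doubling seen in the trap domain for N = 1.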
If you think that your environment is non-stationary, maybe you can change
the state representation so that the non-stationarity becomes observable
(like putting the information whether there is cheese in a field into the
state). You could also look at approaches that explicitly deal with
partially observable MDPs, because this is exactly what you have in your
description of the mouse and cheese environment.
You could as well tell us more about your actual problem; maybe then
people on the list could give suggestions. Unfortunately, I know almost
nothing about POMDPs ...
Hope that helps,
Alex
[1] Dearden, R., Friedman, N., and Andre, D. Model-based Bayesian
exploration. Proceedings of the Fifteenth Conference on Uncertainty in
Artificial Intelligence, pp. 150–159, 1999.
--
One straightforward solution is to augment the state space by all possible cheese configurations. If the number of cheese locations is small, this may be an option (as in the taxi domain). Suppose you have n possible cheese locations; then the state space grows exponentially, by a factor of 2^n, and you hit the curse of dimensionality. I'm not sure whether it is possible to find irrelevant state variables and perform state abstraction on this transformed representation, but at least the problem becomes an MDP again.
Ersin
--
What should the mouse learn? What is the goal of the problem? What can
the mouse observe?
It may be a good idea to first specify the full problem (what is the
situation, what should be learned, etc.). E.g., if cheese may or may
not appear at any time at any given state with uniform probabilities,
the mouse can learn nothing and reinforcement learning is not
applicable. If the appearance of cheese only occurs at the start of an
episode (and the mouse gets more than one episode to learn!) and
cheese can occur more often at some locations than others, the remarks
of others (regarding non-Markov, non-stationarity properties of the
problem) apply.
Best,
Hado
By "an estimate of the reward function" you mean to be able to have a
good approximation of the reward as a function of the state? If that's
the case, the answer is yes. In fact the reward is precisely a
function of the state; however, as with the cheese, I only get the
reward the first time I visit the state.
Thanks. At least in principle this approach should work. As Ersin
pointed out, augmenting the state space in this way implies that the
number of states is 2^N * |S|, where N is the number of cells with
cheese and |S| is the number of states in the original representation.
> You could as well tell us more about your actual problem; maybe then
> people on the list could give suggestions. Unfortunately, I know almost
> nothing about POMDPs ...
The actual problem is more complicated. I get a reward, let's say
between 0 and 1, the first time I visit a state, and 0 otherwise. Using
the approach described above, that would mean 2^|S| * |S| states. I
think I can threshold some of the rewards (let's say everything smaller
than 0.5 is 0), so N is not that big.
Alejandro.
> What should the mouse learn?
Where the cheese is.
> What is the goal of the problem?
To find as much cheese as possible, in a fixed amount of time.
>What can the mouse observe?
Where it is (in which cell it is), and whether or not there is cheese
in that cell. If there is cheese, it will eat the cheese.
> It may be a good idea to first specify the full problem (what is the
> situation, what should be learned, etc.). E.g., if cheese may or may
> not appear at any time at any given state with uniform probabilities,
> the mouse can learn nothing and reinforcement learning is not
> applicable. If the appearance of cheese only occurs at the start of an
> episode (and the mouse gets more than one episode to learn!) and
> cheese can occur more often at some locations than others, the remarks
> of others (regarding non-Markov, non-stationarity properties of the
> problem) apply.
The cheese locations are fixed at the beginning of each episode. Maybe
with some non-uniform distribution that puts the pieces of cheese
close to the "kitchen".
Alejandro.
Then it's not "precisely a function of the state". If the state is
not a predictor of the cheese (which it is not in your case since the
there is nothing about the state that tells the mouse if there is
cheese soon to come or not), then the state alone is not enough to
define the odds of finding cheese.
The state of the ENVIRONMENT (in your simulation) includes information
about where the cheese is located and where the mouse is located, as
well as the full topology of the 2D grid world. The mouse, however (per
your description), does not have access to much of that information at
all. As such, the "state signal" sent to the mouse does NOT
"precisely" (even in the least) define the reward. Don't confuse your
idea of the "state of the environment" with the agent's "state signal".
If they are one and the same (i.e. the full state of the environment is
encoded in the state signal), then you have a Markov problem. But if
the agent's state signal does not give it full information, then it's
some form of non-Markov problem - which can sometimes be partially
solved by RL if there is at least a non-uniform distribution of reward
probability across the state space of the "state signal".
To make any good headway on such a problem, the mouse needs to track
where it has been and attempt to cover as much virgin ground as
possible in the limited time. However, the knowledge of how to do that
is not encoded in the state signal, so the agent can't be driven simply
by the state signal. It must construct its own internal concept of
state based on history - which is typical of all real-world non-Markov
problems. And in general, such an approach only works if the engineer
building the algorithm knows that it applies to this environment, so
the problem is typically only well solved by hard-coding knowledge
about the behavior of the environment into the RL agent (i.e. that
it's a 2D grid world, and that rewards can only happen once per state
in an episode).
If you don't hard-code that knowledge into the solution used by the
agent, then the problem becomes many orders of magnitude harder for
the RL agent to solve.
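A rough Python sketch of such a hard-coded, history-based agent
(neighbors() and observe_reward() are hypothetical stand-ins for the
grid world's dynamics, not anything from the original problem):

    def choose_action(pos, neighbors, visited):
        """Prefer moves onto cells not yet visited this episode."""
        options = neighbors(pos)               # list of (action, next_pos)
        unvisited = [(a, p) for a, p in options if p not in visited]
        action, _ = (unvisited or options)[0]  # fall back to any move
        return action

    def run_episode(start, neighbors, observe_reward, max_steps):
        visited = {start}                      # internal memory, NOT part
        pos, total = start, 0.0                # of the state signal
        for _ in range(max_steps):
            a = choose_action(pos, neighbors, visited)
            pos = dict(neighbors(pos))[a]      # deterministic move assumed
            total += observe_reward(pos)       # cheese pays only on a
            visited.add(pos)                   # first visit, per the problem
        return total

The visited set is exactly the "internal concept of state based on
history" described above; nothing in the state signal itself would let
the agent distinguish virgin cells from exhausted ones.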
Also, keep in mind that if the agent has an option to not move, and
if not moving avoids the -1 reward, then the optimal solution for the
world could be to stand still and do nothing whenever it takes more
than 10 moves on average to find cheese.
--
Curt Welch
I am sorry. I was not clear with what I said. The mouse and cheese
environment is a simplification of the actual problem I want to work
on. I am using the mouse and cheese because it captures the main
problem I have, and I think it is easier to understand. When I said
that the reward is a function of the state, I was not thinking about
the mouse and cheese case; I was thinking about the original problem.
Alejandro.
It might be worthwhile to revisit the definitions of state,
stationarity and Markovness.
>> So for the mouse and cheese environment, if I define the state as
>> the cell position, that will not allow me to predict future rewards. If I
>> augment the state with the list of cheese pieces that have been eaten, then
>> I will be able to predict the distribution of the future observations. With
>> this definition, I can call the environment Markov.
>
> Well, this only works if you assume that the cheese position is
> randomly chosen, say, at the beginning of the episodes.
> Under this additional assumption, yes.
I'd just like to remark that a list of past locations of cheese pieces
is not sufficient. As Amit noted earlier, the mouse should know
about all the locations it has visited this episode; otherwise it will
be tempted to return over and over again to locations that did not have
cheese this episode but often had cheese in the past. (You probably
caught this, but just in case :)
Unfortunately, that would imply that you would need to store a
potentially large list of visited locations (or a vector of all
locations indicating whether or not they have been visited). Even for
a relatively small NxM grid (e.g., 4x4), this means the size of the
state space potentially increases to NM * 2^(NM-1).
(I got this number by multiplying all NM possible current locations
with all 2^(NM-1) possible configurations of previous locations. E.g.,
for a 4x4 grid: 16 * 2^15 = 524,288. The actual state space will be
smaller, because many configurations will never occur - for instance
because the mouse cannot jump over intermediate locations - but for
reasonably sized NxM grids, the state space can still be enormous.)
A large state space is not a problem in theory, but it may severely
slow learning if you use tabular value-based temporal-difference
reinforcement learning. (E.g., Q-learning, Sarsa, and similar
algorithms.)
Assuming the distribution of the cheese is stationary over different
episodes, you may have a better option. Depending on the actual
problem you are trying to solve, it may be easier to learn a model:
for each location store the estimated probability of cheese (i.e.,
simply the number of times cheese was seen divided by the number of
episodes - not the number of visits! - on which that location was
visited) and for each location and action store the probability of
transitioning to another location. (This last step may be trivial,
e.g., if moving 'left' from (1,1) deterministically gets you to
(0,1).)
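A small Python sketch of that bookkeeping (all names invented for
illustration; note the per-episode counting):

    from collections import defaultdict

    cheese_seen = defaultdict(int)       # location -> episodes with cheese
    episodes_visited = defaultdict(int)  # location -> episodes visited

    def record_episode(visited_locations, cheese_found_at):
        # Count each location at most once per episode, per the remark
        # above (episodes, not visits, are what matter).
        for loc in set(visited_locations):
            episodes_visited[loc] += 1
            if loc in cheese_found_at:
                cheese_seen[loc] += 1

    def cheese_probability(loc):
        n = episodes_visited[loc]
        return cheese_seen[loc] / n if n else 0.0  # no data yet: assume 0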
Then, you can plan ahead on each step, using the expected rewards
(which follow from the probabilities of cheese). To avoid revisiting
locations unnecessarily, you still have to store a list of all visited
locations. You can use this list to set the expected rewards for those
locations to -1 in the planning phase, because you know there cannot
be cheese there.
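In code, the planning-phase reward could look like this (building on
the cheese_probability() sketch above; the -1 step cost is the one
mentioned earlier in the thread):

    def expected_reward(loc, visited):
        if loc in visited:
            return -1.0                        # cheese gone: step cost only
        return -1.0 + cheese_probability(loc)  # step cost plus cheese bonus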
The planning may take longer (could be too long, depending on the size
of your problem), but the learning will be faster and quite reliable,
since only location-dependent rewards have to be estimated.
One final remark: for the planning phase, you should not use dynamic
programming with the locations as states and the expected rewards,
since this would not take into account the change in reward after a
visit to each location. However, simple heuristic planners probably
perform quite well and maybe dedicated planners exist for problems
with disappearing resources, but I am no expert on that.
Best,
Hado
It might be worthwhile to revisit the definitions of state and
stationary Markov environments.
Cheers,
- Csaba
> Assuming the distribution of the cheese is stationary over different
> episodes, you may have a better option. Depending on the actual
> problem you are trying to solve, it may be easier to learn a model:
> for each location store the estimated probability of cheese (i.e.,
> simply the number of times cheese was seen divided by the number of
> episodes - not the number of visits! - on which that location was
> visited) and for each location and action store the probability of
> transitioning to another location. (This last step may be trivial,
> e.g., if moving 'left' from (1,1) deterministically gets you to
> (0,1).)
[...]
> One final remark: for the planning phase, you should not use dynamic
> programming with the locations as states and the expected rewards,
> since this would not take into account the change in reward after a
> visit to each location. However, simple heuristic planners probably
> perform quite well and maybe dedicated planners exist for problems
> with disappearing resources, but I am no expert on that.
One could use dynamic programming if the problem were re-stated in
each time step, every time with a changed reward function. Actually,
re-planning would only be necessary once the reward function changes,
i.e., a piece of cheese was eaten or a cell that was expected to contain
cheese was found empty. Maybe using prioritized sweeping would speed it
up somewhat.
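A sketch of that inner planning step, assuming a finite state set and
deterministic transitions as in a grid world (reward() here is the
current expected-reward function, supplied anew after every change):

    def value_iteration(states, actions, transition, reward,
                        gamma=0.95, tol=1e-6):
        """Plain value iteration for a deterministic MDP."""
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                best = max(reward(s, a) + gamma * V[transition(s, a)]
                           for a in actions)
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:
                return V

Prioritized sweeping would replace the full sweep with a priority queue
ordered by Bellman error, so that after a local reward change only the
affected states get re-evaluated.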
Still, all we need to find is the optimal order of cells and then visit
them in that order. Assuming the cheese probability distribution to be
stationary, that order would be the same for each episode. So some
dedicated planning algorithm would probably be more suitable here.
However, note that in general it is not trivial to find the order in
which to visit the cells; for example, if the cell containing cheese
with the highest probability is far away, it might be better to visit
other cells first. I'm not quite sure, but this looks somewhat similar
to the traveling salesman problem, which is NP-hard, so an exact
solution cannot be computed efficiently (only in exponential time).
Cheers,
Alex
> The mouse and cheese environment is a simplification of the actual
> problem I want to work on. I am using the mouse and cheese because it
> captures the main problem I have, and I think it is easier to
> understand.
You might find hints about "natural solutions" interesting. See two
recent papers and the references therein.
Rat experiments with a cheeseboard:
Dupret, D., O'Neill, J., Pleydell-Bouverie, B., & Csicsvari, J. (2010).
The reorganization and reactivation of hippocampal maps predict spatial
memory performance. Nature Neuroscience, 13, 995–1002.
DOI: 10.1038/nn.2599
Potential solutions and dissociation of problems in humans:
Glascher, J., Daw, N. D., Dayan, P., & O'Doherty, J. P. (2010).
States versus rewards: dissociable neural prediction error signals
underlying model-based and model-free reinforcement learning.
Neuron. DOI: 10.1016/j.neuron.2010.04.016
http://www.gatsby.ucl.ac.uk/~dayan/papers/glasch10.pdf
Have fun!
Andras
_____________________
András Lőrincz
Department of Software Technology and Methodology
Eötvös Loránd University
--