PSR rank estimation problem

#ZHENG LEI#

unread,

Dec 3, 2009, 4:05:24 AM12/3/09

to rl-...@googlegroups.com

Hi,

I am stuck with this problem for several weeks. I am posting here to see if any one can help.

I implemented a suffix tree based PSR algorithm according to Michael James’ thesis. I am testing the rank estimation in dynamical system without reset. Since there is no test result using system without reset in James’ Thesis, I decide to benchmark against Wolfe’s paper “Learning Predictive State Representations in Dynamical Systems Without Reset”. I am having trouble getting their results, although I am practically using the same algorithm. James’ thesis uses Chebyshev’s inequality to obtain an error bound of the system matrix obtained through sampling. The problem for me is that the bound does not seem to work. The confidence parameter used seems rather arbitrary. If too small, the rank estimation algorithm just blows out, and the estimated rank just keeps growing in each iteration. If too big, then rank is under-estimated. I’ve read several articles about rank estimation. One common conclusion seems that the rank estimation only works if the norm of the sampling error matrix is well below the smallest singular value of the matrix whose rank is to be estimated. But none of my simulation result meets that condition. I can derive the exact system matrix from the pomdp model (using the stationary distribution as the initial state distribution, which makes the result exact for sampling in a system without reset), and thus obtain the exact sampling error, but the error is always much larger (an order of magnitude at least) than the requirement, even with millions of samples in a simple environment such as the shuttle space environment.

Could anybody having experience in PSR give me some advice? How is your testing result with the rank estimation algorithm?

Regards

Zheng Lei

"José Antonio Martín H."

unread,

Dec 3, 2009, 2:07:54 PM12/3/09

to rl-...@googlegroups.com

Hi all.

Does anybody knows of any research about solving "non renewable reward"
problems?

I mean that there is a limited total reward in the environment and every
time the agent gets some reward the global level decrease.

One key point in this "sustainable" dynamic is that the Agent's
objective is not just to get all the future cumulative reward but
instead that the Agent only needs just some level of reward for which
the agent can feel satisfied. So if the agent get higher reward that its
level of satisfaction then it is squandering resources (reward).

Of course there are some possible variants such as that the non
renewable resources "reward" are renewed after some period of time and
so on.

Also the same situation with two competing agents seems to be very pretty.

bests,
-Jos�

Thomas G. Dietterich

unread,

Dec 3, 2009, 2:48:07 PM12/3/09

to rl-...@googlegroups.com

In this case, I think you want to make a distinction between the goal of
the agent (e.g., to live forever) and the "reward" in the environment.
Let's think of the "reward" in the environment as a resource that is
renewed only at a certain rate. If the resource is consumed too quickly,
it becomes temporarily exhausted, and the agent dies before the resource is
renewed. The agent receives an immediate positive reward when consuming
the resource but a large negative reward for dying.

For the single agent case, the optimal policy is obviously to consume the
resource at exactly the rate it is renewed. If there are fluctuations in
the renewal process (e.g., good and bad crop years), then the optimal
policy requires introducing some form of buffering (storage of grain). If
storage is subject to decay in the quality of the resource (also
stochastic), then this adds further complexity.

Thomas G. Dietterich Voice: 541-737-5559
School of EECS FAX: 541-737-1300
1148 Kelley Engineering Center URL: http://eecs.oregonstate.edu/~tgd
Oregon State University, Corvallis, OR 97331-5501

> -José
>
>
>
>
>
>
> --
> You received this message because you are subscribed to the
> "Reinforcement Learning Mailing List" group.
> To post to this group, send email to rl-...@googlegroups.com
> To unsubscribe from this group, send email to
> rl-list-u...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/rl-list?hl=en

winmail.dat

Csaba Szepesvari

unread,

Dec 3, 2009, 6:25:31 PM12/3/09

to rl-...@googlegroups.com

To model the aspect that eating with ones tummy full is not rewarding,
introduce a variable that describes how full your tummy is. Eating with
your tummy full can actually induce negative reward.
- Csaba

>> -Jos�

"José Antonio Martín H."

unread,

Dec 5, 2009, 10:18:52 AM12/5/09

to rl-...@googlegroups.com

Thanks a lot, that is exactly what I am looking for.
bests,
jamh
Thomas G. Dietterich escribi�:

>> -Jos�

"José Antonio Martín H."

unread,

Dec 5, 2009, 10:25:16 AM12/5/09

to rl-...@googlegroups.com

Dear Csaba
Yes it seems to work but is it not the idea I want to study (from a
theoretical point of view only) I want that the agent receive as a the
reward the pure environmental signal so I must use its own guesses to
infer that eating to much is not good.

bests,
jamh.

Michael Littman

unread,

Dec 5, 2009, 11:10:30 AM12/5/09

to rl-...@googlegroups.com

There has been some work on combining an evolution optimization level to construct the reward function from the environmental signals with an RL algorithm to optimize that reward function. That might fit your description as well.

Satinder Singh has a paper, but I can't seem to find it online. Another is older work I did with David Ackley:
http://scholar.google.com/scholar?cluster=7612659753305704566&hl=en&as_sdt=2000

Csaba Szepesvari

unread,

Dec 5, 2009, 11:12:09 AM12/5/09

to rl-...@googlegroups.com

Dear Jos�,

Jos� Antonio Mart�n H. wrote:
> Dear Csaba
> Yes it seems to work but is it not the idea I want to study (from a
> theoretical point of view only) I want that the agent receive as a the
> reward the pure environmental signal so I must use its own guesses to
> infer that eating to much is not good.
>

Interesting, I see.
Let's see how this could work:
Take an environment with a renewable source.
Let the amount available at time t be X_t; say X_{t+1} = X_t + Z_t,
where Z_t>=0 is a sequence of independent, identically distributed
random variables, the amount of growth, assuming that there is no
consumption (quite unrealistic, but OK for now). The state of the
environment is X_t.
The action of the agent, A_t>=0, is the amount consumed at time t.
Hence, X_{t+1} = max(0, X_t + Z_t - A_t). The reward could be the amount
consumed: R_t = X_t+Z_t-X_{t+1}. The goal could be to maximize the
average reward consumed per time step.
However, this is not sufficient. The optimal policy would be simply to
consume the maximum amount every time step: The dieing aspect is
missing. In order to model this we need the fullness of tummy variable,
F_t>=0. F_{t+1} = max(0,F_t + R_t - f), say F_0 = 1 and f>0 is the
average energy consumption of the agent per time step.
The agent dies when (say) F_t<1/2 and the goal becomes to maximize the
total reward before dieing. The state is now (X_t,F_t). It could be
interesting to look into what is the optimal policy in this case. If one
is lucky, it is even possible to construct the optimal policy explicitly
I mean, you may be able to determine the analytic form of the optimal
policy (which is often of the threshold form).

Models like this are called production/consumption models. Economists
and Operations Research people study them in slightly different contexts
(and probably in more complicated forms). Sheldon Ross has a nice little
book, Introduction to Stochastic Dynamic Programming, where he
calculates explicit solution to a bunch of similar problems. This is a
book from 1983 when it was still somewhat popular to derive analytic
expressions for the optimal policy. So, it might be that someone has
considered this problem and the optimal solution is known.

Bests,
Csaba

Richard L. Lewis

unread,

Dec 5, 2009, 12:14:24 PM12/5/09

to rl-...@googlegroups.com

Below is the reference for the paper Michael refers to, which you can get here:

http://www-personal.umich.edu/~rickl/recent.html

Singh, S., Lewis, R. L., and Barto, A. G. (2009). Where do rewards come from? In Proceedings of the Annual Conference of the Cognitive Science Society, pages 2601-2606, Amsterdam.

(We have a few more in the works but not ready for distribution...)

--
You received this message because you are subscribed to the "Reinforcement Learning Mailing List" group.
To post to this group, send email to rl-...@googlegroups.com
To unsubscribe from this group, send email to
rl-list-u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/rl-list?hl=en

--------------------------

Richard L. Lewis ri...@umich.edu

Professor http://www-personal.umich.edu/~rickl/

Department of Psychology Voice: (734) 763-1466

University of Michigan Fax: (734) 763-7480

530 Church Street Office: East Hall 4428F

Ann Arbor, MI 48109-1043

Juergen Schmidhuber

unread,

Dec 7, 2009, 4:46:49 AM12/7/09

to rl-...@googlegroups.com

There are methods that evolve or otherwise search for reward functions
in the context of automatic subgoal generation. Perhaps the first
example is:

M. Wiering and J. Schmidhuber. HQ-Learning. Adaptive Behavior 6(2):
219-246, 1997

The system solves tasks by discovering useful reward functions for
several subordinate
reinforcement learning agents, essentially by searching for useful
subgoals to be reached
by these agents, until it finds a working subgoal combination.

More on this: http://www.idsia.ch/~juergen/subgoals.html

Cheers,
Juergen

Curt Welch

unread,

Dec 7, 2009, 12:34:43 PM12/7/09

to rl-...@googlegroups.com

2009/12/3 "José Antonio Martín H." <jama...@fdi.ucm.es>:

> Hi all.
>
> Does anybody knows of any research about solving "non renewable reward"
> problems?

Sorry no. I don't keep up with RL research so I can't help you there.

But I have a few comments about your thinking here.

The standard reinforcement learning abstraction of reward maximizing
can be made to fit I wide range of problems. I would say every
problem because I don't know of any it doesn't fit, but that would be
too broad a claim. You don't have to define a new type of
reinforcement learning in order to fit the type of problem you are
talking about. It's just wrong to think about it that way. You
simply define the reward signal to match your problem and then the
agent's goal remains the same as it always is in reinforcement
learning - which is reward maximizing.

> I mean that there is a limited total reward in the environment and every
> time the agent gets some reward the global level decrease.

That alone, as you specify it, doesn't change anything. The general
goal of the agent is still to get all the rewards as quickly as
possible. If there are 100 rewards in the environment and one RL
agent only gets 10 after interacting with the environment for hours,
and another agent gets all 100 in 10 seconds, the second agent is
clearly "better" than the first for that environment. This is just
the standard RL abstraction.

> One key point in this "sustainable" dynamic is that the Agent's
> objective is not just to get all the future cumulative reward but
> instead that the Agent only needs just some level of reward for which
> the agent can feel satisfied. So if the agent get higher reward that its
> level of satisfaction then it is squandering resources (reward).

If you want to implement a "feel satisfied" effect you do it by
changing the reward generator in the environment and not by changing
the RL abstraction itself to some new type of learning abstraction.

You can for example create an environment that has some type of token
that you want the agent to collect. We can call it food to parallel
what it seems like you are talking about here. We don't give the
agent a constant reward for every bit of food it collects (eats). If
we do, then the RL maximizing agent will eat as much as it can as fast
as it can without stopping. We instead could create something that
roughly parallels what life forms have to deal with - which is energy
collection and storage. Life forms store energy for later use, but
there's a limit to how much they can store. Actions consume energy
and when the agent runs low on energy it needs more.

Let me switch here from talking about life to talking about a robot
that we want to build which we will include an RL system so it can
adapt to it's environment.

We as the designers of the robot decide we want the robot to survive,
so we design it in a way that we think will maximize its odds of
survival. We create a reward signal that we think will make the odds
of survival easier for the reinforcement learning algorithm. Our
robot has a battery to store energy and the robot will need to
recharge it's battery to survive. We want the robot to learn to keep
its batteries charged so it will survive.

There are a lot ways to approach this problem in our design. We can
for example give it a reward signal that's a measure of the charge in
the battery. With a full charge, it will get a constant stream of
rewards. With a half-charge it will get half the rewards over time,
etc. With such a reward to deal with, the RL algorithm will in theory
learn to keep itself connected to the charger and to never move. If
it moves, it uses energy and gets less reward. If it disconnects from
the charger, its charge level will start to drop, and it will get less
rewards. The optimum (simple) behavior for this problem is for the
robot to learn to connect itself to the charger and never do anything
else.

Now, this robot won't be very "smart" about surviving a problem such
as a power failure because it's got not inherent motivation top ever
leave the home charger and go exploring and learning about the
environment - and learning for example that there are many other
places to plug in its charger. A robot that had gone exploring while
its battery was charged might have learned more about the environment,
so when that first power failure happened, it would already have
learned behaviors for driving next door and trying their power outlet.
A robot that is motivated to explore while it's got plenty of energy
might have a better chance of surviving. So we as the designs of the
robot have the option to change our design to motivate such behavior.
But we do this not by changing the RL abstraction, and not by changing
our learning algorithm in the robot, but instead, by changing the
reward signal we send the learning module. So from the perspective of
the learning module, we are changing the environment to make the robot
have different motivations.

We do it by adding your idea of "satisfied" to the design. Instead of
generating a reward signal based on charge level, we change the
hardware to create a reward signal based on rate of charge. Once the
batter is charged, the rate of charge goes to zero - and the rewards
for charging stop. If the robot sits there and does nothing, the
battery stays well charged, and the robot has to just wait for the
battery to drain by leakage before it can get more rewards for
charging. However, if it disconnects from the charger and starts
running in circles, it will drain the battery and then be able to get
more rewards, when it connects back to the charger. So now by
changing the reward signal, we have changed the orbots motivation from
"keep the battery charged" to "use as much energy as you can get".
And we have added the idea of "satisfied" to the design so it won't
just keep itself connected to the charger. If it gets a full battery,
it will stop consuming the limited research.

However, that design is actually worse about dealing with a limited
power resource because that second design will consumer as much
electricity as fast as it can. If the power suddenly runs out on the
robot, it will "die" shortly after that. But since the second design
is motivated to actually use its energy instead of conserve it, it's
more likely to develop exploring behaviors that could help it to
survive.

By making more adjustments to the reward signal generator (not to the
reinforcement learning algorithm we use or to the basic RL
abstraction), we can trade off the tendency of our robot to consume vs
conserve power to any level we like. We can keep the design for
giving it rewards for changing, but we can balance that with negative
rewards for using power in such a way to regulate it's typical usage
to some average level of power consumption that we feel is optimal for
the environment the robot will be trying to survive in. So if the
robot is on Mars and living off a limited solar cell power source, we
can regulate rewards so the robots typical behavior is to use only 70%
of the power we expect it to be able to get each day from the sun.
Building it that way motivates the robot to explore by regulates it's
use of power to fit what we expect to be available. We are
controlling in this way the robots tendency to trade off exploration
vs exploitation not by adjusting the learning algorithm, but by
instead, adjusting the reward signal we are sending to the learning
algorithm.

If you have a very very good learning algorithm, it will on its own,
given enough time, learn how to optimize its behavior to the
environment without us (acting as the designer) giving it "hints" by
designing a reward signal into the system that makes it naturally do
"what is right" for the environment. We can use a very simplistic
reward signal which gives it huge negative rewards for having it's
battery run down and let the RL algorithm figure everything else out.
But in this case, the robot would have to "die" (run out of power)
many many tine on it's own before it would be able to learn how to
prevent that from happening. If we were trying to make our robot
survive on its own on Mars, that wouldn't work, because we would have
to keep going up there and recharging its battery for years before it
learned how to do the right thing and conserve it's energy and
maximize its charge.

The stronger the learning algorithm, the less important the reward
signal becomes in terms of making it easy for the agent to learn to do
the right thing. But a weaker learning algorithm can be put to good
use in an agent if you build a lot of the "smarts" into the reward
signal instead of into the raw strength of the learning algorithm
itself.

All this happens however, without any need to define a new type of RL
abstraction, or a new type of reinforcement learning algoirthm. The
normal one always works. You don't need to re-define the interface
between the learning agent, and the reward generator to do any of
this. The interface is a reward signal which the agent tries to
maximize - end of story. You don't add "conserve your reward level to
x units of reward per time" into the interface definition and then
push that part of the requirement off on to the learning algorithm.
You keep that requirement implemented in the reward generator instead
and keep the work of the learning algorithm the same as always -
maximize the reward signal you are given.

All problems can be translated into RL problems like that. The
environment defines the problem, and the reward signal (from the
perspective of the learning algorithm) is part of the environment.
The reward signal is what defines the motivation of the learning agent
by translating environmental conditions into the reward signal which
the learning agent is trying to maximize.

Because all these problems can be translated into pure RL problems, we
have fractured the domain of learning in half by using the RL
abstraction. The first half of the problem is creating strong generic
RL learning algorithms with no a priori knowledge of what reward
signal or environment the algorithm will be used with. That's the
problem generally studied in RL research. The second half of the
problem is designing good reward signals for solving practical
problems using RL learning systems. This second half is where your
problem seems to lay in my view and it's one that is not typically
explored in RL research because it's really more of a practical
engineering problem than a theoretical machine learning problem.

By creating better RL algorithms, we are solving _all_ problems that
can be defined with the right reward generator.

> Of course there are some possible variants such as that the non
> renewable resources "reward" are renewed after some period of time and
> so on.
>
> Also the same situation with two competing agents seems to be very pretty.

You seem to be wandering off into what tends to be more of a game
theory area of study than RL.

When two RL agents compete against each other in the same environment
at the same time, then the first RL agent becomes part of the
environment the second RL agent is trying to understand and manipulate
and vice versa. The two agents typically each have their own reward
signal generators however and as such, are solving different problems.
They each for example might be trying to maximize the charge in their
own battery. Or they each might be trying to consume a maximum number
of food pellets. When their goals conflict, their learned behaviors
will naturally conflict (you get some limited form of what we could
call war between the agents).

If they both get the same reward signal, then they stop acting like
two completing agents, and start working together as if they were just
one agent.

When one RL system is trying to deal with an environment that has one
or more competing RL agents of the same power as part of the
environment, the problem becomes very difficult simply because the
environment becomes far more complex than the agent can possibly
model. One agent of a given power to model an environment can't hope
to model a complex environment, and another gent with the same sized
model at the same time. So you are looking at what happens when none
of the agents can fully "understand" what the other agents will do.
They have to use simplifying assumptions about the environment and
simply do the best they can. What sort of behaviors emerge as whole
from such systems of competing agents is a complex function of the
ability of the agents, and the motivations they have each been given,
and the nature of the rest of the environment they are interacting
with. It's generally so complex it's hard at times to find emergent
behaviors that even can be studied.

But my point in all this, is that pure RL research normally limits
itself to solving only the problem that is defined by the standard RL
abstraction of reward maximizing and it uses test environments that
are roughly suited to the learning power of the algorithm(s) being
tested. There is no need to redefine what RL is if you want to
explore the behavior and power of RL algorithms in a specific problem
domain created by one specific reward signal paired with one specific
environment. Defining a specific new problem domain doesn't change in
any way, what the RL algorithm is trying to do. It's still just
trying to discover the behaviors that work best for maximizing future
rewards.

> bests,
> -José

#ZHENG LEI#

unread,

Dec 8, 2009, 5:16:51 AM12/8/09

to rl-...@googlegroups.com

　Hi,

I am testing the PSR rank estimation algorithm. And from the literature, it seems that all proposed algorithms are using some singular value cutoff proposed by Golub & Van Loan (Matrix Computer 3rd ed., section 5.5.8).

There is a bug in my previous implementation, which is fixed now. I obtained the estimated system matrix through sampling, and also the exact system matrix through the given pomdp models, therefore, I have the exact error matrix. The average error is very small with large samples, something around 0.0007 with millions of samples. But the cutoff doesn’t seem to work at all. To illustrate my observations, let’s see a simple example.

For a matrix t = 1,0,0,0

0,1,0,0

0,0,1,0

0,0,0,0

Suppose, it is corrupted by an constant error 0.001 on all entries, except t(4,4), where the error is 0.002. So the matrix becomes t2 = 1.001, 0.001, 0.001, 0.001

0.001, 1.001, 0.001, 0.001

0.001, 0.001, 1.001, 0.001

0.001, 0.001, 0.001, 0.002

The average error is 0.0010625, norm(t2,inf)=1.004, so the cutoff is 0.001067, but the smallest singular value of t2 is close to 0.002.

To summarize, my observation is, if using the actual error, the published PSR rank estimation algorithm is very likely to overestimate the rank, and thus the estimated rank will grow in each iteration without termination. In practice, however, the error is estimated using Chebyshev’s inequality, which often overestimates the error, and in turn may or may not terminate the iteration and obtain a 'correct' rank. I cannot see how to calculate an effective error bound except with luck. I am also not quite sure I understand the derivation of the cutoff value in Golub’s short section. Is it just a heuristic?

Is there anybody tried the algorithm and would like to share your results? Am I missing something?

Now, I do find another algorithm that might work. It is from G. W. Stewart (Determining Rank in the Presence of Error), who propose to use the error matrix norm as a cutoff. This criterion works quite well according to my experiments using the actual error. So the problem is then, how can I estimate the error matrix norm? Anybody got any suggestions? We know that each matrix entry can be roughly treated as a binomial random variable and we have all the counts. Is there anyway to estimate the error norm?

Regards

Zheng Lei

Michael Littman

unread,

Dec 8, 2009, 9:07:19 AM12/8/09

to rl-...@googlegroups.com

I would have thought the key element is not the absolute error, but how quickly the singular values drop off. Have you made a "scree plot" (a plot of the singular values found by the algorithm)?

#ZHENG LEI#

unread,

Dec 8, 2009, 11:29:49 PM12/8/09

to rl-...@googlegroups.com

Finally, I got some response. Thanks for the reply! Yes, the singular value often does have a large drop at the correct rank position. But, we don’t know the rank in advance, and it has large drops at other places as well, which means we still need some guidance derived from the knowledge of the data. For example, the singular values of a rank 7 matrix with sampling error is something like,

1.4415212623268

0.712269148373198

0.496129575624723

0.242763801556421

0.0692843513186729

0.0231185390966675

0.00482166307232688

0.00225751769791029

…

There are two large drops, but none is correct. The actual matrix (with rounding error) has singular values as,

1.44114296316861

0.714772866090152

0.493542639058541

0.241856292269964

0.0696866722117172

0.0232027101185032

0.00469792834118186

5.77410323015977e-017

…

So, the actual matrix does have large drops at ‘wrong’ places. I have to admit, there are many times when the first large drops tells the correct rank, but I think my point is still valid. Without analyzing the data, we cannot judge the rank by drops alone.

William Uther

unread,

Dec 9, 2009, 12:20:02 AM12/9/09

to rl-...@googlegroups.com

Hi all,

I also found that singular values didn't drop off cleanly when I played with PSRs. There's actually some discussion of this in one of the NIPS papers this year (rank estimation in matrices with missing values - not PSRs specifically). Apparently if you have the right (well, wrong) sort of noise in your matrix entries then you can get this effect, and the system dynamics matrices that I remember had that distribution of missing values (from memory).

Cheers,

Will :-}

On 08/12/2009, at 8:29 PM, #ZHENG LEI# wrote:

> Finally, I got some response. Thanks for the reply! Yes, the singular value often does have a large drop at the correct rank position. But, we don’t know the rank in advance, and it has large drops at other places as well, which means we still need some guidance derived from the knowledge of the data. For example, the singular values of a rank 7 matrix with sampling error is something like,
> 1.4415212623268
> 0.712269148373198
> 0.496129575624723
> 0.242763801556421
> 0.0692843513186729
> 0.0231185390966675
> 0.00482166307232688
> 0.00225751769791029
> …
> There are two large drops, but none is correct. The actual matrix (with rounding error) has singular values as,
> 1.44114296316861
> 0.714772866090152
> 0.493542639058541
> 0.241856292269964
> 0.0696866722117172
> 0.0232027101185032
> 0.00469792834118186
> 5.77410323015977e-017
> …
> So, the actual matrix does have large drops at ‘wrong’ places. I have to admit, there are many times when the first large drops tells the correct rank, but I think my point is still valid. Without analyzing the data, we cannot judge the rank by drops alone.
>

#ZHENG LEI#

unread,

Dec 9, 2009, 12:37:06 AM12/9/09

to rl-...@googlegroups.com

　Hello William. Could you please give me the citation of the paper you mentioned? I'd like to read it. Since it's quite recent, I cannot find it online.

William Uther

unread,

Dec 9, 2009, 6:49:57 PM12/9/09

to rl-...@googlegroups.com

Hi,
I honestly can't remember exactly. I think it was either poster T7 (Linearly Constrained Bayesian matrix factorisation for blind source separation) or T8 (Matrix Completion from Noisy Entries). I looked at those papers though and I can't find the diagram I remember seeing on the poster that really demonstrated the issue.

Sorry I can't be more help,

Will :-}

Reply all

Reply to author

Forward