State/action dependent discounts


clement...@gmail.com

Dec 3, 2019, 10:45:50 PM
to julia-pomdp-users
I'm trying to implement an extension to the POMDPs.jl API that supports semi-Markov decision processes and other similar abstractions, but it seems that all solvers assume a constant discount factor. I ran into this issue when trying to implement an MDP transformation that combines an MDP with options to produce an SMDP. With state-dependent discounts, this would be fairly easy to do. In addition to modelling a larger class of problems, this also benefits other use cases, like fitting values for termination probabilities that differ from the environment's (in an online setting), and seems useful enough for DeepMind to have included variable discounts in their environment API.

This seems like a fairly simple extension and I would expect that adding solver support should be reasonably straightforward in most cases. In addition, keeping backwards compatibility should be trivial. Are there any packages that support non-constant discount factors? If not, is there any interest in extending the API to support this?

Zachary Sunberg

Dec 4, 2019, 12:35:31 AM
to julia-pomdp-users
Hi Clement,

Thanks for reaching out. This is an interesting question that we have not considered before. I want to be careful about adding optional things to the interface since they increase complexity* and make it more difficult to explain. That being said, if this is the only way to accomplish a particular task, I am open to considering it.

It isn't clear to me exactly what a state-dependent discount means. What is the optimization objective? Is it something like E[ sum_t gamma(s_t)^t r_t ], or do the discounts build on each other, i.e. E[ sum_t ( prod_{k<=t} gamma(s_k) ) r_t ]?
In either case, it seems like this discount could be made part of the state, or a terminal state could be added with an appropriate probability of transitioning into it. Of course, state augmentation is not a free solution, so we should consider the consequences. We should also consider whether basic algorithms like value and policy iteration will work (seems like they should), and what will happen in a POMDP when the state is not known.
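To sketch the terminal-state idea a bit more (assuming the compounding interpretation, which may not be the one you mean): read gamma(s') as a probability of continuing. After each transition the agent reaches s', collects the reward as usual, and then jumps to an absorbing, zero-reward terminal state with probability 1 - gamma(s'). The expected undiscounted return of that modified problem matches the compounding discounted return of the original, which is why I suspect state augmentation or an added terminal state could cover this.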

It would be helpful to have some more details about why it's needed in the SMDP conversion and what DeepMind was using it for (feel free to link to papers).

For now, it would also be reasonable to define and use discount(m, s, a) in your own packages and then maybe it will get incorporated into POMDPs later if it becomes more obvious that it is needed.
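As a rough sketch (none of this is part of POMDPs.jl; the specialized model type and helper below are made up), that could look like:

    import POMDPs

    # fallback: ignore (s, a) and use the model's constant discount
    discount(m, s, a) = POMDPs.discount(m)

    # a hypothetical specialization for your own SMDP-style problem type:
    # discount(m::MyOptionsSMDP, s, a) = m.gamma ^ duration(m, s, a)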

*In fact, I hope that soon we can have only one version of `observation` and `reward` and get rid of the (a, sp) and (s, a) versions.

clement...@gmail.com

Dec 4, 2019, 2:29:49 AM
to julia-pomdp-users
Hi Zachary,

Thanks for the quick response!

What is the optimization objective?

The latter: the discounts build on each other. With this formulation, a constant discount reduces to the familiar discounted return objective. The 2nd edition of the intro to RL book [1, sec 12.8] covers this case briefly, but the high-level takeaway is that it grants you the freedom to decouple the "flow" of the MDP from what is being "learned". There are several advantages to doing this.
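To write it out (my notation, roughly following the termination-function treatment in [1]):

    G_t = R_{t+1} + gamma(S_{t+1}) R_{t+2} + gamma(S_{t+1}) gamma(S_{t+2}) R_{t+3} + ...
        = sum_{k>=0} ( prod_{j=1}^{k} gamma(S_{t+j}) ) R_{t+k+1}

which reduces to the usual sum_{k>=0} gamma^k R_{t+k+1} when gamma is constant.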

On a more abstract level, one such advantage is that it allows us to generalize the idea of a value function and define general value functions (GVFs) [2]. GVFs follow a definition similar to the value function's, but the rewards, discounts, and policies (in the off-policy case) are all arbitrarily defined (or learned). One notable fact about GVFs is that choosing a specific subset of them gives you the successor representation [4] and allows you to compute the value function for any reward function (e.g., for transfer learning). This connection has led to a strong interest in better understanding how fitting several GVFs simultaneously could improve generalization and robustness [5]. The machinery required to compute these GVFs is no different from that for the vanilla value function, so, in this context, it makes sense to prefer the slightly more general MDP formulation to allow compatibility with any value function solver.

But more specifically to my use case: you can model the decision points of an agent executing options as an SMDP [3]. In this setting, the number of timesteps between decisions is random and can vary throughout a trajectory. If you wish to abstract away the low-level MDP and plan only at the level of options, but still optimize the original low-level constant-discount objective, you need to consider a framework with non-constant discounts for the high-level planning problem.
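Concretely, in the setting of [3], if an option o is initiated in state s, runs for a random number of low-level steps tau, and ends in s', the option-level quantities are roughly

    R(s, o) = E[ R_{t+1} + gamma R_{t+2} + ... + gamma^(tau-1) R_{t+tau} ]

and the value of s' gets weighted by gamma^tau, so even with a constant low-level gamma, the effective discount at the option level varies with the (random) duration of the option.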

In either case, it seems like this discount could be made part of the state, or a terminal state could be added with an appropriate probability of transitioning into it. Of course, state augmentation is not a free solution, so we should consider the consequences.

If I understand what you are saying, the augmentation would be a workable solution for something like value iteration but might cause issues with a more generative-oriented solver, which isn't ideal for my use case (assigning a discount of 0 to some states is perfectly "legal").

We should also consider whether basic algorithms like value and policy iteration will work (seems like they should), and what will happen in a POMDP when the state is not known.

All the MDP-related solvers should be trivial to generalize to variable discounts. I haven't thought hard enough about the POMDP case to say for certain that it wouldn't cause issues, but I would expect it to be fine for most solvers. Generalizing to variable discounts should not require any major change in design/architecture unless the solver is structured in a very unusual way or leverages some less intuitive property that breaks under variable discounts. Most (all?) solvers evaluate some form of (possibly approximate) conditional expectation and scale it by the discount; in the variable-discount case, you simply condition the discount on the state in addition to the expected next value.
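For example, a tabular value-iteration backup only changes in where the discount comes from. A self-contained sketch (not tied to any POMDPs.jl types; here T[s][a] is assumed to be a vector of (sp, p) pairs and gamma(s, a, sp) is the variable discount):

    function value_iteration(S, A, T, R, gamma; iters=100)
        V = Dict(s => 0.0 for s in S)
        for _ in 1:iters, s in S
            # Bellman backup: the discount is looked up per transition instead of being a constant
            V[s] = maximum(
                sum(p * (R(s, a, sp) + gamma(s, a, sp) * V[sp]) for (sp, p) in T[s][a])
                for a in A)
        end
        return V
    end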

For now, it would also be reasonable to define and use discount(m, s, a) in your own packages and then maybe it will get incorporated into POMDPs later if it becomes more obvious that it is needed.

That is what I'm planning to do but I'll also have to modify any solver I want to use to correctly use the variable discount factors (which shouldn't be difficult).

That being said, if this is the only way to accomplish a particular task, I am open to considering it.

There are probably several ways to achieve a similar goal, but I don't think any would offer the same level of solver compatibility. Though I'll point out that I am fairly new to Julia, so I'm still working on shaking off my Python (bad) habits ;) It is still very possible that there is an elegant solution that I can't see yet.


[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
[2] Sutton, Richard S., et al. "Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction." The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 2. International Foundation for Autonomous Agents and Multiagent Systems, 2011.
[3] Sutton, Richard S., Doina Precup, and Satinder Singh. "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning." Artificial intelligence 112.1-2 (1999): 181-211.
[4] Dayan, Peter. "Improving generalization for temporal difference learning: The successor representation." Neural Computation 5.4 (1993): 613-624.
[5] Schaul, Tom, et al. "Universal value function approximators." International Conference on Machine Learning. 2015.

Zachary Sunberg

Dec 6, 2019, 12:25:19 AM
to julia-pomdp-users
Hi Clement,

Thanks for the detailed discussion. Your thoroughness is much appreciated. I haven't had time to think through it all, but I did a little bit tonight:

First to be more concrete, I sketched out my rough ideas for augmenting the state space: https://gist.github.com/zsunberg/67fb8fabda30c7bb6228277ada98885f. It seems like your SMDP task might be accomplished by something like the second implementation, but I am not sure. I need to read the references you sent.

Both the SMDP conversion and GVF ideas are new to me, so I am going to think about them separately even though it seems like they might have the same solution. I'll try to think about the SMDP conversion first and see if there is another way, then move on to GVFs and other learning-related advantages.

One important question to ask is: are we 1) trying to broaden the class of problems that we can define, or 2) trying to develop new ways to solve constant-discount (PO)MDPs more efficiently?

Thanks for your patience - hopefully you can get by with manually modified solvers in the short run. In the long run, it would be very useful to add these features, so thanks for bringing it up. If anyone else has any thoughts, they'd be much appreciated. Has anyone else used state-dependent discounts? Should they be a core part of the interface? Is there another way to accomplish this?

Zachary Sunberg

Dec 6, 2019, 12:41:18 AM
to julia-pomdp-users
One thing I don't understand yet is why you would want to keep going after the discount is zero. No further reward can be collected, so this has no meaning for the optimization problem. (I suppose I will find this out in one of the references :) )

clement...@gmail.com

Dec 6, 2019, 2:35:37 AM
to julia-pomdp-users
First to be more concrete, I sketched out my rough ideas for augmenting the state space: https://gist.github.com/zsunberg/67fb8fabda30c7bb6228277ada98885f. It seems like your SMDP task might be accomplished by something like the second implementation, but I am not sure. I need to read the references you sent.

Here is a bit of context about options and how I'd like to use them by formulating an SMDP: https://gist.github.com/gehring/27654c4ee5e4b8bff0b596765578e896 . Also, similar to your discount-tracking idea, you could simply track the amount of time elapsed since the last option decision, but that would still require access to the state when computing the discount, unless we want a mutable MDP (which doesn't sound appealing at all).

Both the SMDP conversion and GVF ideas are new to me, so I am going to think about them separately even though it seems like they might have the same solution. I'll try to think about the SMDP conversion first and see if there is another way, then move on to GVFs and other learning-related advantages.

I think the ultimate solution would be to have an explicit/generative API that extends to termination and discounts. Termination and discounts already sort of have an explicit interface. The most general solution might be to simply treat them like the other random variables, but it's not clear if that is too big a hammer.
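To make that a bit more concrete, here is roughly what I mean by a generative step that also yields the discount, and how a simulator would consume it (all names are made up; this is not an existing POMDPs.jl signature):

    # `step` is any user-supplied function (s, a, rng) -> (sp, r, gamma)
    function rollout(step, policy, s0, rng; maxsteps=1000)
        s, ret, disc = s0, 0.0, 1.0
        for _ in 1:maxsteps
            a = policy(s)
            sp, r, gamma = step(s, a, rng)
            ret += disc * r      # reward weighted by the product of discounts so far
            disc *= gamma        # discounts compound along the trajectory
            disc <= 0 && break   # a discount of 0 truncates the return
            s = sp
        end
        return ret
    end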

One important question to ask is: are we 1) trying to broaden the class of problems that we can define, or 2) trying to develop new ways to solve constant-discount (PO)MDPs more efficiently?

Concretely 1), but indirectly 2) (sort of). This relates to your "why keep going after discount zero" question. Abstracting portions of complex (PO)MDPs can help you solve constant-discount (PO)MDPs more efficiently but that is very dependent on how you do so. How to learn/find good options is a (very) active area of research but also a very difficult question to answer (in fact, just asking the right question seems to be challenging) [1-7]. I'm interested in broadening the class of problems we can define so that we can more easily play around with these types of abstractions.

As for the discount=0 question: the optimization can still optimize states before or after the discount=0 region; it just will have truncated returns for trajectories that go through it. For an online agent, it doesn't change anything but how it updates its value function. It makes more sense to think of it in a setting where there is some higher-order goal imposed by you, the designer, or even by some other part of your decision-making system. Maybe you would like to learn/fit a policy that is more cautious than "rationally" needed (in the expected-value sense), say, in the case of a self-driving car. Effectively, this treats these states as terminal states as far as the value function is concerned, but in reality, your agent might still keep going. Should you stop collecting data in that case? You should definitely keep going, since you would want your agent to learn how to correct and steer back to somewhere "safe", but you don't want the agent to associate any future rewards with the decisions that led it to a "bad" state. It's probably apparent that my perspective is mostly from an online setting, but I believe it is still somewhat relevant to the offline case.

For the record, I think it is completely warranted to be cautious/nervous about extending the core API like this! I currently think it is a good idea, but I am by no means certain. I think prototyping ports of some of the other POMDPs*.jl packages and solvers before committing would be wise (if we ever decide to go down that route).


Disclaimer: I was just looking for a specific paper but the number of relevant papers that showed up in the search results was too large to not spam them here ;)

[1] Bacon, Pierre-Luc, Jean Harb, and Doina Precup. "The option-critic architecture." Thirty-First AAAI Conference on Artificial Intelligence. 2017.
[2] Tiwari, Saket, and Philip S. Thomas. "Natural option critic." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019.
[3] Vezhnevets, Alexander Sasha, et al. "Feudal networks for hierarchical reinforcement learning." Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017.
[4] Harutyunyan, Anna, et al. "The Termination Critic." arXiv preprint arXiv:1902.09996 (2019).
[5] Konidaris, George, and Andrew G. Barto. "Building Portable Options: Skill Transfer in Reinforcement Learning." IJCAI. Vol. 7. 2007.
[6] Liu, Miao, et al. "The eigenoption-critic framework." arXiv preprint arXiv:1712.04065 (2017).
[7] Jain, Arushi, Khimya Khetarpal, and Doina Precup. "Safe option-critic: Learning safety in the option-critic architecture." arXiv preprint arXiv:1807.08060 (2018).

Zachary Sunberg

Jan 31, 2020, 1:43:05 PM
to julia-pomdp-users
Sorry I never responded to this - I got very busy buying a house and starting up my course. Clement, I imagine you have moved on, but just in case someone else decides to pick it up:

I haven't had time to review these papers, but it seems like this is probably something worth supporting. I am less worried about adding new things to the interface now, because we have QuickPOMDPs, which keeps it easy to define problems even if the interface is more complex.

If someone wants to take this problem on, I think it is likely that we would put it in the interface. Before adding it, we need to
1) Describe how it would work in the POMDP case [or decide that it should only be part of the MDP interface]
2) Implement a simulator and make sure it works
3) See if some of the core solvers will work with it - Value Iteration and MCTS for MDPs, perhaps QMDP and DESPOT for POMDPs.