Hi William,
Let me reorganize your questions a little bit so they're easier to answer. Quick questions first:
Would I be able to convert the pomdp to a mdp
There are two MDPs that you could be referring to - the fully observable MDP, or the belief MDP. Which one do you mean?
If you mean the belief MDP, you can combine the POMDP with an updater (like a particle filter, or the exact discrete updater once Louis finishes updating it: https://github.com/JuliaPOMDP/POMDPs.jl/issues/173) and wrap them in a GenerativeBeliefMDP from POMDPToolbox. That would give you a generative model of the belief MDP, which is all you'd need for reinforcement learning.
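To make that concrete, here is a minimal sketch of the wiring, assuming the GenerativeBeliefMDP and DiscreteUpdater types from POMDPToolbox and the SIRParticleFilter constructor from ParticleFilters.jl (names may differ slightly in your package versions); FaucetPOMDP is just a placeholder for your problem type:

    using POMDPs
    using POMDPToolbox    # GenerativeBeliefMDP, DiscreteUpdater
    using ParticleFilters # SIRParticleFilter

    pomdp = FaucetPOMDP()   # placeholder for your problem definition

    # Belief updater: a particle filter works for large state spaces...
    up = SIRParticleFilter(pomdp, 1000)   # 1000 particles
    # ...or, once issue #173 is resolved, the exact discrete updater:
    # up = DiscreteUpdater(pomdp)

    # Wrap the POMDP and updater into a generative model of the belief MDP.
    # The "states" of bmdp are beliefs, so you can treat it like any other
    # MDP with a generative model (e.g. for RL or MCTS).
    bmdp = GenerativeBeliefMDP(pomdp, up)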
use the temporal difference learning package
The TabularTDLearning package only works out of the box for discrete problems, and the belief MDP is a continuous MDP, so you would have to write your own reinforcement learning code with some way to approximate the value function on a continuous domain. (Or, I suppose, you could use the action-observation history as the state, since that is discrete, though large; let me know if you want to try this and I can give you more guidance.)
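If you do want to try the history-as-state idea, here is a rough sketch of tabular Q-learning keyed on the action-observation history. The environment functions (reset_env!, step_env!) and the ACTIONS range are placeholders you would fill in by wrapping a simulator of your POMDP; none of them come from a package:

    # Tabular Q-learning with the action-observation history as the "state".
    # reset_env! and step_env! are placeholders for your own POMDP simulator.

    const ACTIONS = 1:3      # placeholder: however you index your action set
    const GAMMA   = 0.95     # discount
    const ALPHA   = 0.1      # learning rate
    const EPSILON = 0.1      # exploration probability

    Q = Dict{Tuple{Vector{Any},Int},Float64}()   # (history, action) => value
    qval(h, a) = get(Q, (h, a), 0.0)

    function epsilon_greedy(h)
        rand() < EPSILON && return rand(ACTIONS)
        best_a, best_q = first(ACTIONS), -Inf
        for a in ACTIONS
            if qval(h, a) > best_q
                best_a, best_q = a, qval(h, a)
            end
        end
        return best_a
    end

    for episode in 1:10_000
        o = reset_env!()        # placeholder: start a new simulated interaction
        h = Any[o]              # the history starts with the first observation
        done = false
        while !done
            a = epsilon_greedy(h)
            o, r, done = step_env!(a)   # placeholder: one simulated step
            hp = vcat(h, Any[a, o])     # extend history with action and observation
            target = r + (done ? 0.0 : GAMMA * maximum(a2 -> qval(hp, a2), ACTIONS))
            Q[(h, a)] = qval(h, a) + ALPHA * (target - qval(h, a))
            h = hp
        end
    end

The histories grow quickly, so this is mostly useful as a sanity check on a small version of the problem; approximating the value function is the more scalable route.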
The examples I've seen seem to use mdps
Yes, typically reinforcement learning is done on an MDP, but RL people are sometimes sloppy about the distinction between an MDP and a POMDP, so be skeptical when you are reading. If the problem is truly a POMDP, vanilla RL techniques that use the observation as the state may not work well; instead you would typically need something like a recurrent neural network/LSTM to approximate the belief. This is also why people feed multiple frames of a video game to RL algorithms - a single frame is really only a POMDP observation, but several frames together give a good idea of what the actual state is.
Would the resulting policy generated from q or sarsa learning on a converted mdp be similar to what the pomcp policy would generate?
In principle, yes: reinforcement learning on the belief MDP optimizes the same objective as POMCP, so both should approximate the same optimal policy. Neither is exact in practice, so expect the policies to be similar rather than identical.
Those are the answers to the quick questions about how POMDPs.jl will work, but I think you need to focus on problem formulation a little more before jumping into implementation. I don't think the right answer is to do RL on the belief MDP. Figure out exactly what problem you're trying to solve, and then figure out how to implement it afterwards. First I'll comment on your initial plan:
I would create training and test set of simulated users have them "interact" with the generated policy, and use the resulting actions for reinforcement learning.
Creating training and test data sounds like batch RL (project 2 of CS238), but you need not do batch learning in this case; you could do online learning.
I see two options:
OPTION A:
1. Learn the human-using-a-faucet POMDP. You should be able to break this up into small pieces. Since you can ask the human what they are trying to accomplish (what their state is), you can separate the transition, observation, and reward functions and learn them separately. For learning, you can use a fixed policy for the water temperature (immediately match the temp you read from the encoder). Note that you are *not* doing reinforcement learning here because you are *not* trying to learn a policy. You may need to do *inverse* reinforcement learning to learn the reward function.
2. Solve the human-using-a-faucet POMDP and put that policy on the faucet. You should do this with SARSOP.
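For step 2, the solve call itself should be short. Here is a sketch, assuming the SARSOP.jl wrapper and the standard POMDPs.jl solve/action interface, with FaucetPOMDP as a placeholder for the model you learn in step 1:

    using POMDPs
    using SARSOP   # JuliaPOMDP wrapper around the SARSOP solver

    pomdp = FaucetPOMDP()          # placeholder: the model learned in step 1

    solver = SARSOPSolver()        # defaults; see SARSOP.jl for precision/time options
    policy = solve(solver, pomdp)  # offline solve; returns an alpha-vector policy

    # On the faucet, maintain a belief b with an updater and act with:
    # a = action(policy, b)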
OPTION B:
Do online reinforcement learning with real humans in the loop on the history MDP corresponding to the human-using-a-faucet POMDP. You could use POMCP, MCTS, or a heuristic to get a better initial policy than random. One issue with this option is how you would measure the reward during real interactions.
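If you go this route, getting an initial POMCP policy to start from is only a few lines. A sketch, assuming BasicPOMCP.jl and a particle filter updater (tree_queries and the particle count are just illustrative values, and FaucetPOMDP is again a placeholder):

    using POMDPs
    using BasicPOMCP       # POMCP online solver
    using ParticleFilters  # belief updater for running the planner

    pomdp = FaucetPOMDP()                    # placeholder for your model
    solver = POMCPSolver(tree_queries=1000)  # simulations per decision; tune as needed
    planner = solve(solver, pomdp)

    up = SIRParticleFilter(pomdp, 1000)
    # In the interaction loop: pick a = action(planner, b), execute it on the
    # faucet, observe o, then update the belief with b = update(up, b, a, o).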
Let me know if this makes sense. If you need clarification, we can definitely talk in person. I don't get back to Stanford until the 5th, but if you want we can skype before then. We can set that up via email.
- Zach