How to evaluate a policy learnt from batch Q learning?


wing...@gmail.com

Nov 9, 2015, 11:32:07 AM
to Reinforcement Learning Mailing List
Hi all,

I learnt a policy from historical data using batch Q-learning, but now I am not sure how to evaluate the policy other than by online testing or simulation. I know there are existing policy evaluation methods, but I thought policy evaluation means finding the value function of a given policy. Since I learnt my policy from a value function, does it make sense for me to back-calculate the state value function using the same data?

Thanks,
Yunshi

Vukosi Marivate

Nov 9, 2015, 3:41:04 PM
to rl-...@googlegroups.com, wing...@gmail.com
Hey Yunshi.

You are looking to evaluate how well a policy you have learnt from batch data would do in the real world (unfortunately you don't have access to the online system). 

Comparing the state value function estimated from the data (you would need the optimal one) with the one estimated by Q-learning might lead to wrong conclusions. We compared a couple of potential metrics for doing offline evaluation like this (Chapter 4 of my thesis: http://cs.brown.edu/~mlittman/theses/marivate.pdf).

Another approach you might try is taking your data, building a model from it, and then simulating the policy in that model. With enough data and a nice state-action space, you might get a good idea of how you would do in the real system. If you don't have enough data, or would like a better understanding of your uncertainty, you might want to build a model that also incorporates uncertainty modelling (Chapter 5 of my thesis: http://cs.brown.edu/~mlittman/theses/marivate.pdf).
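To make the model-based route concrete, here is a minimal sketch (my own illustration, not from the thread): estimate a tabular maximum-likelihood model by counting transitions in the batch, then Monte-Carlo roll out the learnt policy in that model. The `transitions` format `(s, a, r, s')` and all function names are assumptions for illustration.

```python
import numpy as np

def estimate_model(transitions, n_states, n_actions):
    """Tabular maximum-likelihood model from batch data.
    `transitions` is a list of (s, a, r, s') tuples (hypothetical format)."""
    counts = np.zeros((n_states, n_actions, n_states))
    rewards = np.zeros((n_states, n_actions))
    for s, a, r, s2 in transitions:
        counts[s, a, s2] += 1
        rewards[s, a] += r
    visits = counts.sum(axis=2)
    # Unvisited (s, a) pairs fall back to a uniform next-state distribution;
    # this is exactly where data sparsity makes the evaluation unreliable.
    P = np.where(visits[:, :, None] > 0,
                 counts / np.maximum(visits, 1)[:, :, None],
                 1.0 / n_states)
    R = rewards / np.maximum(visits, 1)
    return P, R, visits

def simulate_policy(policy, P, R, start_state, gamma=0.95,
                    horizon=100, n_rollouts=1000, rng=None):
    """Monte Carlo estimate of the policy's discounted return in the learnt model."""
    rng = rng or np.random.default_rng(0)
    n_states = P.shape[0]
    returns = []
    for _ in range(n_rollouts):
        s, g, disc = start_state, 0.0, 1.0
        for _ in range(horizon):
            a = policy[s]
            g += disc * R[s, a]
            disc *= gamma
            s = rng.choice(n_states, p=P[s, a])
        returns.append(g)
    return float(np.mean(returns))
```

Note that the estimate is only as good as the model: in rarely visited parts of the state-action space the counts are noisy (or zero), which is the uncertainty issue the chapter-5 approach addresses.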

Regards,
Vukosi Marivate
http://www.vima.co.za

--
You received this message because you are subscribed to the "Reinforcement Learning Mailing List" group.
To post to this group, send email to rl-...@googlegroups.com.
To unsubscribe from this group, send email to rl-list-u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/rl-list?hl=en

wing...@gmail.com

Nov 9, 2015, 5:29:39 PM
to Reinforcement Learning Mailing List, wing...@gmail.com
Hi Vukosi,
 
Thanks for your reply! I actually read your thesis, especially Chapter 4, when I was doing research on policy evaluation. Are you suggesting that the potential metrics you discussed in your thesis are not appropriate for my case? I only have one policy, learnt from the historical data using Q-learning.
 
Thanks,
Yunshi

Vukosi Marivate

Nov 10, 2015, 2:48:57 AM
to rl-...@googlegroups.com, wing...@gmail.com
We mostly compared Q-values between policies, or against the optimal policy. If you want to say something about the real-world performance of that policy, then you need a good amount of data covering the full state space. Data sparsity also becomes a problem: how will you know the performance in states that have rarely been visited in your collected data?
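One simple way to see how much the sparsity problem bites (my own illustration, not from the thread) is to count how often the batch data actually contains the action the learnt policy would take in each state. The `transitions` format and the helper below are hypothetical.

```python
from collections import Counter

def coverage_report(transitions, policy, min_visits=10):
    """Flag states where the batch data rarely contains the action the
    learnt policy would take there (hypothetical diagnostic helper).
    `transitions` is a list of (s, a, r, s') tuples; `policy[s]` gives
    the learnt policy's action in state s."""
    counts = Counter((s, a) for s, a, _, _ in transitions)
    visited_states = {s for s, _, _, _ in transitions}
    # States where the policy's Q estimate rests on very few samples:
    thin = sorted(s for s in visited_states
                  if counts[(s, policy[s])] < min_visits)
    return thin
```

If many states land in the "thin" list, any offline estimate of the policy's value in those states is essentially extrapolation.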

Regards,
Vukosi Marivate
http://www.vima.co.za

Georgios Theocharous

Nov 11, 2015, 1:11:21 PM
to Reinforcement Learning Mailing List
You could also take a look at this paper:
P. Thomas, G. Theocharous, and M. Ghavamzadeh. High Confidence Off-Policy Evaluation. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
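For context (my addition, not from the thread): the basic ingredient of such off-policy estimators is per-trajectory importance sampling, which reweights logged returns by how likely the evaluation policy was to produce each trajectory. A minimal sketch, assuming you logged the behaviour policy's action probabilities; the trajectory format and names are my own:

```python
import numpy as np

def importance_sampling_estimate(trajectories, pi_e, gamma=1.0):
    """Ordinary importance-sampling estimate of the evaluation policy's return.
    Each trajectory is a list of (s, a, r, mu_prob) tuples, where mu_prob is
    the behaviour policy's probability of the logged action (hypothetical
    format). pi_e(s, a) returns the evaluation policy's probability of taking
    action a in state s."""
    estimates = []
    for traj in trajectories:
        weight, ret, disc = 1.0, 0.0, 1.0
        for s, a, r, mu_prob in traj:
            weight *= pi_e(s, a) / mu_prob  # cumulative likelihood ratio
            ret += disc * r                 # discounted return of the trajectory
            disc *= gamma
        estimates.append(weight * ret)
    return float(np.mean(estimates))
```

The estimator is unbiased but can have very high variance when the two policies differ; the high-confidence methods in the paper build concentration bounds on top of estimators like this.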

Georgios

Csaba Szepesvari

Nov 11, 2015, 4:42:26 PM
to rl-...@googlegroups.com
Hi,
I am not sure that there is any sound way of evaluating *any* policy (no matter how it was learned) other than using it on the real system (or the simulator, if you trust the simulator), unless you are ready to make strong assumptions.

These papers discuss the difficulties:

Farahmand, A.M. and Szepesvári, Cs., Model Selection in Reinforcement Learning, Machine Learning Journal, 85(3), pp. 299-332, 2011. http://www.ualberta.ca/~szepesva/papers/RLModelSelect.pdf

Li, L., Munos, R., and Szepesvári, Cs., Toward Minimax Off-policy Value Estimation, AISTATS, 2015. http://www.ualberta.ca/~szepesva/papers/AISTAT15-OffPolicy.pdf
Cheers,
Csaba