Hi,
I am not sure that there is any sound way of evaluating *any* policy (no
matter how it was learned) other than using it on the real system (or
the simulator, if you trust the simulator), unless you are ready to make
strong assumptions.
These papers discuss the difficulties:
Farahmand, A.m. and Szepesvári, Cs., Model Selection in Reinforcement
Learning, Machine Learning Journal, 85(3), pp. 299--332, 2011.
http://www.ualberta.ca/~szepesva/papers/RLModelSelect.pdf
Li, L., Munos, R., and Szepesvári, Cs., Toward Minimax Off-policy Value
Estimation, AISTATS, 2015.
http://www.ualberta.ca/~szepesva/papers/AISTAT15-OffPolicy.pdf
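
To make the point about strong assumptions concrete, here is a minimal
sketch (not taken from either paper) on a made-up two-armed bandit. The
payoff probabilities, the policies, and all function names are
assumptions for illustration only. The first estimator simply runs the
policy on the "real system"; the second uses data from a different
behavior policy and plain importance sampling, which is only valid if
the behavior probabilities are known and nonzero wherever the target
policy puts mass.

```python
import random

def reward(arm, rng):
    # Hypothetical bandit: arm 0 pays 1 w.p. 0.5, arm 1 pays 1 w.p. 0.8.
    p = (0.5, 0.8)[arm]
    return 1.0 if rng.random() < p else 0.0

def on_policy_estimate(target_probs, n, rng):
    # Run the target policy itself and average the observed rewards.
    total = 0.0
    for _ in range(n):
        arm = 0 if rng.random() < target_probs[0] else 1
        total += reward(arm, rng)
    return total / n

def importance_sampling_estimate(target_probs, behavior_probs, n, rng):
    # Data are collected by the behavior policy; each reward is reweighted
    # by target_probs[arm] / behavior_probs[arm]. Assumption: behavior_probs
    # is known exactly and is positive wherever target_probs is positive.
    total = 0.0
    for _ in range(n):
        arm = 0 if rng.random() < behavior_probs[0] else 1
        weight = target_probs[arm] / behavior_probs[arm]
        total += weight * reward(arm, rng)
    return total / n

rng = random.Random(0)
target = (0.0, 1.0)    # always pull arm 1
behavior = (0.9, 0.1)  # rarely pulls arm 1, so the weights are large
print(on_policy_estimate(target, 1000, rng))
print(importance_sampling_estimate(target, behavior, 1000, rng))
```

With these made-up numbers the weight on arm 1 is 10, so the
off-policy estimate is much noisier than the on-policy one for the same
sample size, and it breaks down entirely if the behavior policy never
pulls arm 1 or its probabilities are not known.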
Cheers,
Csaba