Hello all, I have successfully applied RL agents to the GRF environment, but I have a question about the benchmarks published in the paper: are the reported results from training runs or from evaluation runs? It will be very difficult to beat the benchmarks if they come from training runs, since off-policy algorithms do not show the same rewards during training as they do at evaluation time. When I evaluate the trained policy separately, I do get results close to the benchmarks. A clarification would be very helpful for my research.
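For context, this is roughly how I run that separate evaluation pass (a minimal sketch; `policy(obs)` is a stand-in for my trained agent, the scenario name is just an example, and I'm assuming the standard `gfootball.env.create_environment` entry point):

```python
import numpy as np
import gfootball.env as football_env

def evaluate(policy, num_episodes=10):
    # Pure evaluation: no learning updates, just rollouts of the
    # already-trained policy, averaging episodic returns.
    env = football_env.create_environment(
        env_name='academy_empty_goal_close',  # example scenario
        representation='simple115',
        render=False)
    returns = []
    for _ in range(num_episodes):
        obs = env.reset()
        done, episode_return = False, 0.0
        while not done:
            action = policy(obs)  # placeholder for the trained agent's action selection
            obs, reward, done, _ = env.step(action)
            episode_return += reward
        returns.append(episode_return)
    env.close()
    return np.mean(returns)
```

These evaluation returns are what I am comparing against the published numbers.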
As we all know, the benchmarks for Atari games are based on evaluation runs rather than training rewards; moreover, this article
interpreted the GRF benchmarks as training episodic mean rewards.
Please clarify this for me, as I am in the final stages of my paper.