BeNeRL Reinforcement Learning Seminar: Claas A. Voelcker from UT Austin (Feb. 12, this Thursday)

Zhao Yang

Feb 9, 2026, 2:09:20 PM
to Reinforcement Learning Mailing List
Dear colleagues,

Our next BeNeRL Reinforcement Learning Seminar (February 12) is coming up:
Speaker: Claas A. Voelcker (https://cvoelcker.de/), postdoctoral researcher at UT Austin.
Title: On-policy value learning at 10000 frames per second
Date: February 12, 16.00-17.00 (Amsterdam time zone)
Please find full details about the talk below this email and on the website of the seminar series: https://www.benerl.org/seminar-series

The goal of the online BeNeRL seminar series is to invite RL researchers (mostly advanced PhD students or early postgraduate researchers) to share their work. In addition, we invite the speakers to briefly share their experience with large-scale deep RL experiments and their style/approach for getting these to work.
 
We would be very glad if you could forward this invitation within your group and to other colleagues who might be interested (also outside the BeNeRL region). Hope to see you on February 12!

Kind regards,
Zhao Yang & Thomas Moerland
VU Amsterdam & Leiden University
——————————————————————
Upcoming talk:
Date: February 12, 16.00-17.00 (Amsterdam time zone)
Speaker: Claas A. Voelcker (https://cvoelcker.de/)
Title: On-policy value learning at 10000 frames per second
Abstract: When samples are cheap and fast to collect, the RL community relies on the policy-gradient theorem to obtain agents which can reliably train on massive amounts of data. However, since their inception, zeroth-order algorithms such as REINFORCE, TRPO, and PPO have been plagued by high variance, which makes them hard to tune. In the off-policy regime, where sample efficiency is the goal, stable and efficient value-driven methods have been explored, but these require replay buffers and specialized architectures to stabilize off-policy learning. What if we could bridge the two paradigms and bring stable value-driven learning to the on-policy sampling regime? In our new ICLR paper, Relative Entropy Policy Optimization, we explore how to achieve stable on-policy value function learning at 50000 frames per second. We will see that value learning is possible, and useful, without the use of massive replay buffers, by combining insights from both the on-policy and off-policy literature.
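Background note (not taken from the paper; standard textbook forms only): the policy-gradient theorem, the Monte Carlo REINFORCE estimator whose variance the abstract refers to, and a one-step temporal-difference (SARSA-style) update of the kind that value-driven, replay-based methods build on can be written as

\[
\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\big],
\qquad
\hat{g}_{\mathrm{REINFORCE}} = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t,
\quad G_t = \sum_{k \ge t} \gamma^{k-t} r_k,
\]

\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[r_t + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\big].
\]

The Monte Carlo return G_t is the main source of the high variance mentioned above, while the bootstrapped TD target is what off-policy methods typically stabilize with replay buffers and specialized architectures.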
Bio: Claas is a postdoc working at the intersection of reinforcement learning and robotics at the University of Texas at Austin with Peter Stone and Amy Zhang. He holds a PhD from the University of Toronto and the Vector Institute, where he was advised by Profs. Amir-massoud Farahmand and Igor Gilitschenski. Outside of research, Claas is also a core organizer at Queer in AI, an affinity group that builds community for queer researchers and industry practitioners. His research focuses on stabilizing brittle deep reinforcement learning approaches by understanding training dynamics and using techniques from model-based RL and representation learning. He is driven by the question of how we can learn to accurately predict the value of taking actions, a central task in RL. Beyond that, he investigates how we can do better science in RL by thinking about which problems we should benchmark our exciting advances on.
Link to project page: https://cvoelcker.de/projects/reppo/