RISS Today: "Reliable Methods for Agent Evaluation" by Shuvom Sadhuka (MIT)

Krikamol Muandet

Apr 8, 2026, 8:03:31 AM

to Machine Learning News

We’re thrilled to announce the upcoming seminar at the Rational Intelligence Seminar Series (RISS), on April 8, 2026 (TODAY). RISS seeks to advance the understanding of rationality, efficiency and reliability in machine learning systems. These seminars serve as a forum for discussions and dissemination of results.

Join us to engage in lively discussions in the session, “Reliable Methods for Agent Evaluation” delivered by Shuvom Sadhuka – PhD Student at MIT.

Abstract
As AI systems become more widely adopted, the need for reliable evaluation techniques is growing. In this talk, I will present recent work on building evaluation methods with statistical guarantees. I will introduce e-valuator, a method for monitoring agent trajectories. Agents execute sequences of actions (e.g., reasoning steps or tool calls) and receive feedback from verifiers, such as judge LLMs or process-reward models, which score these actions. These heuristic verifier scores, while informative, do not provide guarantees on correctness when used to decide whether an agent’s actions will yield a successful output. We thus frame the problem of distinguishing successful trajectories (that is, sequences of actions that will lead to a correct response to the user’s prompt) from unsuccessful trajectories as a sequential hypothesis testing problem. E-valuator builds on tools from e-processes to develop a sequential hypothesis test that remains statistically valid at every step of an agent's trajectory, enabling online monitoring of agents over arbitrarily long sequences of actions. I will also briefly discuss recent work on evaluating models with unlabeled and labeled data.
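
For readers unfamiliar with e-processes, the following is a minimal, generic sketch of an anytime-valid sequential test. It is not the e-valuator method from the talk. It assumes, purely for illustration, that a verifier emits a pass/fail signal at each step, that the null hypothesis is "the trajectory is successful", and that the pass probabilities under each hypothesis are known. The function name `monitor_trajectory` and all parameter values are hypothetical.

```python
from typing import Iterable, Optional


def monitor_trajectory(
    step_passes: Iterable[bool],
    p_pass_success: float = 0.9,  # assumed pass rate under H0 (successful trajectory)
    p_pass_failure: float = 0.5,  # assumed pass rate under H1 (unsuccessful trajectory)
    alpha: float = 0.05,          # target false-alarm probability
) -> Optional[int]:
    """Return the 1-based step at which the trajectory is flagged, else None.

    The running product of likelihood ratios is a nonnegative martingale
    under H0 with initial value 1. By Ville's inequality, the probability
    that it ever reaches 1/alpha under H0 is at most alpha, so the test
    may be checked after every step without inflating the error rate.
    """
    threshold = 1.0 / alpha
    e_value = 1.0
    for step, passed in enumerate(step_passes, start=1):
        if passed:
            # A pass is more likely under H0, so the evidence against H0 shrinks.
            e_value *= p_pass_failure / p_pass_success
        else:
            # A fail is more likely under H1, so the evidence against H0 grows.
            e_value *= (1.0 - p_pass_failure) / (1.0 - p_pass_success)
        if e_value >= threshold:
            return step
    return None


if __name__ == "__main__":
    # Hypothetical verifier outputs for one trajectory.
    print(monitor_trajectory([False, True, False, False]))
```

In this toy setup, a trajectory whose steps all pass is never flagged, while repeated failures push the running product past 1/alpha and trigger a stop. A real verifier would produce continuous scores whose distributions have to be estimated, which is where the guarantees discussed in the talk become relevant.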

Speaker Bio

Shuvom Sadhuka is a fourth-year PhD student in Computer Science at MIT, where he is advised by Bonnie Berger. He is broadly interested in evaluation and uncertainty quantification methods, with applications in biomedical settings. His work has been supported by an NSF Graduate Research Fellowship and a Hertz Fellowship.

Logistics

Date: April 8, 2026
Time: 14:30 CET
Zoom: https://cispa-de.zoom-x.de/j/61708401597
Meeting ID: 617 0840 1597

We look forward to your participation. For more information about the seminar series, please visit https://ri-lab.org/riss/.

Best regards,
The Rational Intelligence Lab
CISPA Helmholtz Center for Information Security