[AI-EVAL] 2023 December “AI Evaluation” Digest

AI Evaluation

Dec 29, 2023, 2:52:52 PM
to AI Evaluation

Our last digest of 2023 brings yet more news on AI evaluation! This year has been full of activity in this increasingly recognised area, with the number of events, papers, initiatives, benchmarks and platforms growing very notably. Given this trend, we expect the area of AI evaluation to become even more hectic in the future, and because of this we are planning some changes to the digest for 2024. Stay tuned!


  • Cognitive and capability-oriented evaluations are now commonplace!

    • Have we built machines that think like people? (arxiv) Evaluating cognitive capabilities in multimodal models, covering intuitive physics, causal reasoning, theory of mind, etc.

    • Running cognitive evaluations on large language models: The do's and the don'ts (arxiv) A good compendium of what everybody knows, or not? At least what everybody should know!

    • AAAI tutorial on Measurement Layouts for Capability-oriented AI Evaluation (link), to be held at AAAI 2024 in Vancouver.

    • And relatedly, the Animal AI environment 3.0 is out! (link)

    • GRASP: Grounding and Situated Physics evaluation of multimodal language models (https://arxiv.org/pdf/2311.09048.pdf)

  • Of course, more benchmarks, competitions and prizes:

    • MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI (arxiv)

    • AI-MO Prize: Artificial Intelligence Mathematical Olympiad Prize (website): XTX Markets has launched a $10 million AI-MO Prize for AI models that can solve difficult International Mathematical Olympiad (IMO)-level mathematical problems.

    • GPQA: A Graduate-Level Google-Proof Q&A Benchmark (arxiv): comes with data on humans answering and validating the questions.

    • MLCommons competition on benchmarking optimizers (website).

    • AlignBench: guess what, it’s an Alignment Benchmark (in Chinese) https://arxiv.org/abs/2311.18743

    • LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models (https://arxiv.org/abs/2311.18232)

    • LLMEval: a benchmark with a large number of scoring annotations, which also makes it useful for prediction. https://arxiv.org/pdf/2312.07398.pdf (questions only in Chinese)

  • Evaluation and Safety getting more and more intertwined:

    • CyberSecEval: A benchmark for evaluating the cybersecurity risks of large language models (meta.com)

    • Hashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation (arxiv)

  • News from recent events: EMNLP & NeurIPS

    • ROBBIE: Robust Bias Evaluation of Large Generative Language Models (arxiv)

    • More on compositional benchmarks: https://aclanthology.org/2023.conll-1.19/ 

    • The Emergent Abilities Mirage paper (see our Sep 2023 digest) (https://openreview.net/forum?id=ITw9edRDlD) received one of the NeurIPS 2023 Outstanding Main Track awards!

    • Another edition of the NeurIPS Datasets and Benchmarks Track was held at NeurIPS 2023 (link), with these two papers receiving awards:

      • ClimSim: A large multi-scale dataset for hybrid physics-ML climate emulation (link)

      • DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models (link)


Contributors to this month’s digest: Jose H. Orallo, Wout Schellaert, Nando Martínez-Plumed


How to contribute: Feel free to reach out to wsc...@vrain.upv.es if you want to get involved, or if you have news to share that you are not comfortable posting as a standalone post. 


Getting the digest: Once a month if you join https://groups.google.com/g/ai-eval. For previous issues, just check here.