[AI-EVAL] 2023 August "AI Evaluation" Digest


Wout Schellaert

Aug 25, 2023, 7:31:46 AM
to AI Evaluation
Dear all,

The second monthly digest is here. Enjoy!

2023 August “AI Evaluation” Digest


  • AI evaluation is getting more attention from psychology and other fields, as shown by this comment paper in Nature Reviews Psychology: “Baby steps in evaluating the capacities of large language models”.

  • Anthropic, Google, Microsoft and OpenAI have established the Frontier Model Forum, an industry body committed to the safe and responsible development of advanced AI systems. One objective is to enable independent and standardised evaluation of capabilities and safety. There will be a strong focus initially on developing and sharing a public library of technical evaluations and benchmarks for frontier AI models. (blog)

  • This paper (arxiv) statistically infers skills in language models by combining factor analysis, scaling laws, and partition graphs.

  • Evaluating human-AI systems will become increasingly common. This paper in AI Magazine discusses the “minimum necessary rigor” in empirical human-AI evaluation. (paper)

  • MosaicML introduces a new multi-metric, multi-benchmark LLM evaluation leaderboard (webpage).

  • Benchmarks are getting solved faster: a blog post by Contextual AI.

  • A survey, taxonomy, and discussion on the relevance of instance-level difficulty (ACM).

  • A selection of evaluation-related work at ICML 2023:

    • RankMe: Assessing the Downstream Performance of Pretrained Self-Supervised Representations by Their Rank (PMLR)

    • In or Out? Fixing ImageNet Out-of-Distribution Detection Evaluation (arxiv)

    • Distributional Offline Policy Evaluation with Predictive Error Guarantees (arxiv)

    • How many perturbations break this model? Evaluating robustness beyond adversarial accuracy (arxiv)

  • A selection of evaluation-related work at UAI 2023:

    • Composing Efficient, Robust Tests for Policy Selection (openreview)

    • TCE: A Test-Based Approach to Measuring Calibration Error (openreview)

    • Validation of Composite Systems by Discrepancy Propagation (openreview)


Contributors to this month’s digest: Jose Hernandez-Orallo, Wout Schellaert, Lexin Zhou.


How to contribute: Feel free to reach out to wsc...@vrain.upv.es if you want to get involved, or if you have news to share that you are not comfortable posting as a standalone post. 


Getting the digest: Once a month if you join https://groups.google.com/g/ai-eval. For previous issues, just check here.

Jose H. Orallo

Sep 30, 2023, 2:31:04 PM
to ai-...@googlegroups.com

Dear all,

Our monthly digest is here. Enjoy!

2023 September “AI Evaluation” Digest

  • Some media coverage and introductory pieces about the state of AI evaluation:

    • “AI hype is built on high test scores. Those tests are flawed.” (MIT Technology Review). A review of the state of AI evaluation for a general audience, echoing all the “wrongs” of AI evaluation.

    • “A test of artificial intelligence” (Nature). The Turing Test, not again please! But there are also some interesting insights in this piece, beyond that.

  • Administration, policy and tech:

    • Governor of California Gavin Newsom signed an executive order outlining California’s strategy towards a responsible process for evaluation and deployment of AI. (source)

    • David Krueger and Yarin Gal were announced as the first research directors for the UK Frontier AI Task Force, together with a host of heavyweight external advisory board members (press release). The task force is working extensively on “AI evals”, which are mostly AI testing and risk evaluation efforts, including red teaming.

    • Anthropic’s Responsible Scaling Policy, with a focus on evaluations aimed at catching early warning signs (source).

    • A Test and Evaluation (T&E) methodology from scale.com (source). More use of the term “AI evals”, especially for testing and red teaming.

  • Language models:

    • “AGIBench: A Multi-granularity, Multimodal, Human-referenced, Auto-scoring Benchmark for Large Language Models” (arxiv). A new benchmark for large language models from BenchCouncil. Despite the unfortunate name, a very interesting feature is that it is human-referenced, with five difficulty levels per instance.

    • “Efficient Benchmarking (of Language Models)” (arxiv). Shows that something well known in ML evaluation also happens in HELM: when aggregate metrics are computed over batteries of datasets, leaderboards can change with the removal of a single dataset (a small illustrative sketch of this effect appears after this list).

    • LLM Reversal Curse: https://owainevans.github.io/reversal_curse.pdf. LLMs trained on sentences of the form “A is B” underperform on questions of the form “B is A”. This is called the Reversal Curse. The authors frame this as a basic failure of logical deduction in LLMs.

  • Foundations and evaluation methodology:

    • “Inferring Capabilities from Task Performance with Bayesian Triangulation” (arxiv). Introduces the concept of measurement layouts to infer AI capabilities, illustrating them in the Animal AI evaluation environment. Highly recommended! (this comment is biased).

  • More specific papers:

    • Human Uncertainty in Concept-Based AI Systems (arxiv). Mostly relates to training, but labels are always relevant to evaluation! It was presented at AIES in August.

    • Federated benchmarking of medical AI (NatMachIntell). Introduces MedPerf, an open platform for benchmarking AI models in the medical domain in a federated way.

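As a side note on the “Efficient Benchmarking” item above, the leaderboard-sensitivity effect is easy to reproduce with a toy example. The sketch below uses entirely hypothetical models, datasets and scores (none of them taken from HELM or the paper) to show how a mean-aggregated ranking over a battery of datasets can flip when a single dataset is dropped.

```python
# Toy illustration of leaderboard sensitivity to dataset removal.
# All model names, dataset names, and scores below are hypothetical.

from statistics import mean

# Hypothetical per-dataset scores for two models on a four-dataset battery.
scores = {
    "model_A": {"qa": 0.82, "summarization": 0.71, "code": 0.65, "math": 0.30},
    "model_B": {"qa": 0.78, "summarization": 0.69, "code": 0.65, "math": 0.55},
}

def leaderboard(scores, exclude=None):
    """Rank models by their mean score, optionally dropping some datasets."""
    exclude = exclude or set()
    aggregated = {
        model: mean(v for d, v in per_dataset.items() if d not in exclude)
        for model, per_dataset in scores.items()
    }
    return sorted(aggregated.items(), key=lambda kv: kv[1], reverse=True)

print(leaderboard(scores))                    # model_B leads on the full battery
print(leaderboard(scores, exclude={"math"}))  # dropping one dataset flips the ranking
```

The point is not the particular numbers but the aggregation: a simple mean over datasets lets one outlier dataset decide the ranking, which is exactly the fragility the paper documents at HELM scale.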

Contributors to this month’s digest: Jose H. Orallo, Wout Schellaert.
