[AI-EVAL] 2023 July “AI Evaluation” Digest


Jose H. Orallo

Jul 28, 2023, 2:40:54 AM
to ai-...@googlegroups.com

Dear all,

This Google group has not seen much spontaneous activity in the past few months, but many of us still believe there is room and need for more conversation about the evaluation of AI systems, an increasingly important area for technical research, regulation and policy making. We are therefore launching a monthly digest, to be published on the last Friday of each month, bundling news, papers and everything else related to AI evaluation that has reached the members of this group.

The concept of the group remains the same, open for all to join and post, but you can now expect at least this monthly digest in terms of content.

So here it is, our first monthly digest. Enjoy!


2023 July “AI Evaluation” Digest


  • The European Commission has launched the TEFs: the Sectorial AI Testing and Experimentation Facilities (compute.dtu.dk, ec.europa.eu).

  • Apollo Research launches as a new AI research organisation dedicated to building a holistic evaluation suite (announcement).

  • Jack Clark advocates for the UK Foundation Model Taskforce to focus on evaluation (essay), and UK AI companies seem willing to give early and priority access to support this effort (source). Internal sources suggest that evaluation is likely to be a central aspect of the Taskforce.

  • Toby Shevlane and many others publish “Model evaluation for extreme risks” (paper).

  • Yan Zhuang et al. use adaptive testing for LLM evaluation (paper).

  • In “Lost in the Middle: How Language Models Use Long Contexts”, Nelson F. Liu et al. evaluate how LLMs use long input contexts, finding that performance varies significantly depending on where in the context the relevant information is located (paper). A small illustrative sketch of this kind of position check appears right after this list.

  • OpenAI commits 20% of their compute to superintelligence alignment, and makes AI evaluation performed by AI a central part of the programme: “AI systems to assist evaluation of other AI systems (scalable oversight)”. Does this mean human oversight is over?

  • Brittle evaluation and static benchmarks are mentioned among the challenges in the trending survey paper “Challenges and Applications of Large Language Models” (paper).

  • In “Multi-Dimensional Ability Diagnosis for Machine Learning Algorithms”, Qi Liu et al. publish work on the psychometric evaluation of classifiers (paper).

  • In a Science letter titled “How do we know how smart AI systems are?”, Melanie Mitchell gives a good summary of the rights and wrongs of AI evaluation (letter).

  • In tandem with the previous entry, Science also reported on AI evaluation in April with Ryan Burnell et al.’s “Rethink reporting of evaluation results in AI”, testifying to the increasing importance of AI evaluation to the broader public (report).
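
For those curious, here is a minimal sketch, in Python, of the kind of position-sensitivity check described in “Lost in the Middle”: place the answer-bearing document at different positions among distractor documents, query the model, and compare accuracy per position. This is not the authors’ code; query_model is a hypothetical placeholder for whatever LLM API you use, and the substring check is a crude stand-in for a proper answer-matching metric.

from collections import defaultdict

def build_prompt(question, gold_doc, distractors, gold_position):
    # Insert the gold (answer-bearing) document at the position under test.
    docs = list(distractors)
    docs.insert(gold_position, gold_doc)
    context = "\n\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(docs))
    return f"{context}\n\nQuestion: {question}\nAnswer:"

def position_sensitivity(examples, query_model, n_positions):
    # examples: iterable of (question, gold_doc, distractors, answer) tuples.
    # query_model: callable taking a prompt string and returning the model's answer string.
    correct = defaultdict(int)
    total = defaultdict(int)
    for question, gold_doc, distractors, answer in examples:
        for pos in range(n_positions):
            prompt = build_prompt(question, gold_doc, distractors, pos)
            prediction = query_model(prompt)
            total[pos] += 1
            correct[pos] += int(answer.lower() in prediction.lower())
    # Accuracy per gold-document position.
    return {pos: correct[pos] / total[pos] for pos in sorted(total)}

Plotting the returned accuracies against position should make visible the U-shaped pattern the paper reports: performance is strongest when the relevant information sits at the beginning or end of the context and weakest in the middle.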


Contributors to this month’s digest: Wout Schellaert, Jose Hernandez-Orallo, Lexin Zhou.


How to contribute: Feel free to reach out to wsc...@vrain.upv.es if you want to get involved, or if you have news to share that you are not comfortable posting as a standalone post. 


Getting the digest: You will receive it once a month if you join https://groups.google.com/g/ai-eval. For previous issues, just check here.

Peter Flach

Aug 4, 2023, 7:51:58 AM
to AI Evaluation
Thanks Jose, Wout and Lexin, that's really useful. For me personally the first three items are particularly interesting, but I had seen the Jack Clark essay already. 

I am still planning on submitting a fellowship proposal on AI measurement and benchmarking. In that context I am getting in touch with the AI Standards Hub (https://aistandardshub.org) which may be of interest to this group also. 

Best wishes, 

--Peter
