2023 October “AI Evaluation” Digest

AI Evaluation
Oct 27, 2023, 5:35:24 AM
to AI Evaluation

Dear all,

Our monthly digest is here! Please help share the digest and the mailing list with your collaborators!

2023 October “AI Evaluation” Digest

  • The JMLR family of journals launches a new venue: the Journal of Data-centric Machine Learning Research (DMLR), with an emphasis on, among other data-related topics, benchmark tooling and methods, data quality evaluation, metrics, and the methodology of empirical evaluations (blogpost).

  • DeepMind publishes “Evaluating social and ethical risks from generative AI” (blog, arxiv).

  • China announced its own Global AI Governance Initiative at the 3rd Belt and Road Initiative (BRI) Forum, two weeks before the UK’s AI Safety Summit. The document contains new talking points on AI safety, model evals, and national sovereignty (Twitter thread in English).

  • The European Lighthouse on Secure and Safe AI (ELSA) announces the ELSA Benchmarks platform.

  • Language, Common Sense, and the Winograd Schema Challenge (link), or why the Winograd Schema Challenge was not, and is not, a sufficient test of intelligence (essentially elaborating on this earlier paper).

  • A good piece on LLM evaluation and reasoning abilities (link), based on specialised and relatively new benchmarks, with a related Twitter thread by the author (link). It ties in well with Ida Momennejad et al.’s cognitive-science-inspired evaluation of planning capabilities and cognitive maps in language models (arxiv) and with Cohn’s evaluation of spatial reasoning (arxiv).

  • Anthropic’s ‘Challenges in Evaluating AI Systems’ (link) provides useful insights into the difficulties with evaluation as perceived at industry labs, including challenges with common benchmarks such as BIG-Bench and HELM.

  • According to the popular State of AI Report (link, p. 32), AI evaluation for LLMs is apparently so unreliable that people just follow the “vibes” of anecdotal evaluation.

  • A few other technical papers:

    • Anchor Points: Benchmarking Models with Much Fewer Examples (arxiv)

    • Benchmarking Cognitive Biases in Large Language Models as Evaluators (arxiv)

    • ‘Intriguing Properties of Generative Classifiers’ offers an interesting comparison of the errors made by humans and by vision systems (arxiv).

  • And we close with a bit of humour: Pretraining on the training set is all you need!


Contributors to this month’s digest: Jose H. Orallo, Wout Schellaert


How to contribute: Feel free to reach out to wsc...@vrain.upv.es if you want to get involved, or if you have news to share that you would rather not post as a standalone message.


Getting the digest: Once a month if you join https://groups.google.com/g/ai-eval. For previous issues, just check here.