2023 November “AI Evaluation” Digest


Wout Schellaert

Nov 25, 2023, 8:57:28 AM
to AI Evaluation

Dear all,


The monthly digest for November is here. Enjoy!


There has been a lot of news and attention around AI safety evaluation. Note that we include these items because of their impact on AI (evaluation) research, not necessarily as a statement of support for the respective policies.

  • Anthropic, Google, Microsoft, and OpenAI announced a new AI Safety Fund, with more than $10 million in funding and a primary focus on supporting new model evaluations (link).

  • The UK AI Safety Summit declaration emphasises evaluation and places the responsibility for evaluating systems on their developers (link).

  • Similarly, the White House announced an AI Safety Institute, which will be responsible for evaluating dangerous capabilities (link). NIST will play a central role in this, and is seeking collaborators for a “new consortium supporting development of innovative methods for evaluating artificial intelligence (AI) systems” (link).

  • Open Philanthropy seeks to fund the development of benchmarks assessing how well LLMs can perform real-world tasks corresponding to human professions (link). Grants are in the range of $0.3-3M over a period of 6 months to 2 years.


A selection of the many new papers on evaluating LLMs:

  • Very fresh: GAIA, a benchmark for General AI Assistants (arxiv). Largely text-based, but with significant attention to multimodality and tool use. There are three difficulty levels, roughly based on the number of steps needed to complete a task.

  • Cataloguing LLM Evaluations (link), a paper introduced together with Singapore's new AI evaluation sandbox (link).

  • Rethinking Benchmark and Contamination for Language Models with Rephrased Samples (arxiv), discussing issues with current decontamination methods and finding them to be insufficient.

  • Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation (arxiv), tying into the increased dependence on human preferences for evaluation.

  • TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs (link), a new benchmark for language models with categories and capability mapping, human data, and human-based difficulty levels (although the data has yet to be released).


And a miscellaneous collection of new papers:

  • Stanford releases HEIM: Holistic Evaluation of Text-to-Image Models (openreview, website), a sibling of the earlier HELM effort (Holistic Evaluation of Language Models), again with a large suite of models and tasks and with instance-level results published.

  • Evaluating machine-generated explanations: a “Scorecard” method for XAI measurement science (link).

  • Evaluating General-Purpose AI with Psychometrics (arxiv), a position paper advocating for the use of psychometrics in AI evaluation.

  • See Levels of AGI: Operationalizing Progress on the Path to AGI (arxiv) for a high-level discussion of generality, capabilities and performance in the context of “AGI”.


Contributors to this month’s digest: Wout Schellaert, Jose H. Orallo, Yael Moros, Nando Martínez-Plumed


How to contribute: Feel free to reach out to wsc...@vrain.upv.es if you want to get involved, or if you have news to share that you are not comfortable posting as a standalone post. 


Getting the digest: It arrives once a month if you join https://groups.google.com/g/ai-eval. For previous issues, just check here.