2023 October “AI Evaluation” Digest

AI Evaluation
Oct 27, 2023, 5:35:24 AM
to AI Evaluation

Dear all,

Our monthly digest is here! Please help share the digest and the mailing list with your collaborators!

2023 October “AI Evaluation” Digest

  • The JMLR family of journals launches a new venue: the Journal of Data-centric Machine Learning Research (DMLR), with an emphasis on, among other data-related topics, benchmark tooling and methods, data quality evaluation, metrics, and the methodology of empirical evaluations (blogpost).

  • DeepMind publishes “Evaluating social and ethical risks from generative AI” (blog, arxiv).

  • China announced its own Global AI Governance Initiative at the 3rd Belt and Road Initiative (BRI) Forum, two weeks before the UK’s AI Safety Summit. The document contains new talking points on AI safety, model evals, and national sovereignty (Twitter thread in English).

  • The European Lighthouse on Secure and Safe AI (ELSA) announces the ELSA Benchmarks platform.

  • Language, Common Sense, and the Winograd Schema Challenge (link), or why the Winograd Schema Challenge was not, and is not, a sufficient test of intelligence (essentially elaborating on this earlier paper).

  • A good piece on LLM evaluation and reasoning abilities (link), based on specialised and relatively new benchmarks, with a related Twitter thread by the author (link). It ties in well with Ida Momennejad et al.’s cognitive-science-inspired evaluation of planning capabilities and cognitive maps in language models (arxiv) and with Cohn’s evaluation of spatial reasoning (arxiv).

  • Anthropic’s ‘Challenges in Evaluating AI Systems’ (link) provides useful insights into the difficulties with evaluation as perceived at industry labs, including challenges with common benchmarks such as BIG-Bench and HELM.

  • According to the popular State of AI Report (link, p. 32), AI evaluation for LLMs is apparently so unreliable that people just follow the “vibes” of anecdotal evaluation.

  • A few other technical papers:

    • Anchor Points: Benchmarking Models with Much Fewer Examples (arxiv)

    • Benchmarking Cognitive Biases in Large Language Models as Evaluators (arxiv)

    • ‘Intriguing Properties of Generative Classifiers’ offers an interesting comparison of the errors made by humans and by vision systems (arxiv).

  • And we close with a bit of humour: Pretraining on the training set is all you need!


Contributors to this month’s digest: Jose H. Orallo, Wout Schellaert


How to contribute: Feel free to reach out to wsc...@vrain.upv.es if you want to get involved, or if you have news to share that you would rather not post as a standalone message.


Getting the digest: Once a month if you join https://groups.google.com/g/ai-eval. For previous issues, just check here.