Terence Tao said, "We haven't done many experiments ... large-scale studies where we take a thousand problems and just test them."
So I told Claude: You know my style. Suggest some innovative experiments I could run.
The first suggestion was cool! The Polya Audit. Polya's How to Solve It lists 20 heuristics (work backwards, induction, analogy, etc.). Mathematicians treat these as wisdom, but nobody has ever systematically measured which ones actually work, and on which problem types.
So I prompted Copilot, running Claude Sonnet 4.6, to run the LeanDojo Benchmark through an LLM n times, each time with a different Polya heuristic as the system prompt, and to compare success rates.
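Here's a minimal sketch of what that harness looks like. Everything here is illustrative: `try_prove` is a stand-in for the real model-plus-Lean-checker call, and the heuristic prompts are abbreviated placeholders, not the actual prompts used in the run.

```python
from collections import defaultdict

# Illustrative subset of Polya heuristics as system prompts (placeholders).
POLYA_HEURISTICS = {
    "baseline": "",
    "work_backwards": "Work backwards from the goal toward the hypotheses.",
    "auxiliary": "Introduce auxiliary elements (new lemmas or constructions) if helpful.",
    "analogy": "Look for an analogous, simpler problem you already know how to solve.",
}

def run_audit(problems, try_prove, n=3):
    """Attempt each problem n times under each heuristic prompt;
    return per-heuristic success rates (fraction of attempts proved)."""
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for name, prompt in POLYA_HEURISTICS.items():
        for problem in problems:
            for _ in range(n):
                attempts[name] += 1
                if try_prove(problem, system_prompt=prompt):
                    successes[name] += 1
    return {name: successes[name] / attempts[name] for name in POLYA_HEURISTICS}
```

In the real experiment, `try_prove` would wrap the LLM call and verify the candidate proof against Lean; for testing the bookkeeping, any stub with that signature works.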
Not surprisingly, different heuristics help different problems.
The size of each heuristic's impact also varies quite a bit.
Also not surprisingly, different models respond differently to the same heuristic.
The same heuristic on the same problem can affect models quite differently, too. For example, "Introduce Auxiliary Elements" cuts GPT's success rate by 25% but boosts Claude's by 14%!
So yes, different heuristics work for different problems, and different models respond differently to the same heuristic.
But finally, at least for LLMs, we can measure. We can find out which heuristics work for which problems, and which ones draw wildly varied responses versus which are more universally helpful (or harmful). And maybe use that to teach humans.
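One way to make "varied vs. universal" concrete: compare each heuristic's lift over a model's own baseline across models, and look at the spread. A sketch, where the "auxiliary" numbers echo the GPT -25% / Claude +14% result above and the rest are made-up placeholders:

```python
from statistics import stdev

# Success-rate lift over each model's own baseline, per heuristic.
# "auxiliary" echoes the GPT -25% / Claude +14% result; the rest are placeholders.
lifts = {
    "auxiliary":      {"gpt": -0.25, "claude": 0.14},
    "work_backwards": {"gpt":  0.06, "claude": 0.09},
}

def classify(lifts, spread_cutoff=0.10):
    """Label a heuristic 'varied' if models disagree strongly
    (large spread of lifts), else 'universal'."""
    return {
        name: "varied" if stdev(by_model.values()) > spread_cutoff else "universal"
        for name, by_model in lifts.items()
    }
```

With these numbers, "auxiliary" comes out "varied" and "work_backwards" comes out "universal"; the cutoff is arbitrary and would need tuning against real run data.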
Or maybe not.
