There have been some recent advances in AI relating to mathematics. Hence this email.
-
FrontierMath: This is a dataset created a few months ago to address problems with existing mathematics benchmarks (which are mostly based on competition problems): to better reflect research mathematics, to ensure none of the problems appear in any training data (they were freshly written and kept private), and to be hard enough to serve as a meaningful benchmark.
-
The problems fall into three tiers:
-
T1 (25%): Advanced, near top-tier undergraduate/IMO level
-
T2 (50%): Needs a serious graduate-level background
-
T3 (25%): Research problems demanding relevant research experience
-
When the benchmark was created, the best models, GPT-4o and Gemini, scored only about 2%.
-
Presumably referring to the hardest problems, Terence Tao said the dataset should "resist AIs for several years at least" and that "These are extremely challenging. I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages…"
-
A model announced two days ago, o3 from OpenAI, scored 25%, solving problems in all three tiers. It seems safe to say that, on these problems, it is comparable to top-level mathematicians working in an area different from their own expertise, and at least comparable to everyone else.
-
This is the latest in a series of "test-time compute" models, which do not produce output straight away but first run a long internal "chain of thought" involving search over intermediate steps and checking for consistency.
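-
To make this concrete, here is a toy sketch of the test-time-compute idea: sample many candidate reasoning chains, keep only those that pass a consistency check, and use a larger sampling budget to get a more reliable answer. OpenAI has not published o3's internal procedure, so this illustrates only the general recipe; sample_chain below is a made-up stand-in for a language model.

import random

def sample_chain(question: int) -> list[int]:
    """Stand-in for an LLM sampling a noisy step-by-step derivation:
    a chain of running totals that should end at question * 3."""
    total, chain = 0, []
    for _ in range(3):
        total += question + random.choice([-1, 0, 0, 0, 1])  # occasional slips
        chain.append(total)
    return chain

def consistent(question: int, chain: list[int]) -> bool:
    """Verifier: every step must add exactly `question` to the previous total."""
    return all(b - a == question for a, b in zip([0] + chain, chain))

def answer(question: int, budget: int = 32) -> int | None:
    """A larger test-time budget means more chances to find a verified chain."""
    for _ in range(budget):
        chain = sample_chain(question)
        if consistent(question, chain):
            return chain[-1]  # final entry of a chain that passed the check
    return None               # every sampled chain failed verification

print(answer(7))  # prints 21 with very high probability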
-
This is an expensive model, and solving each problem was computationally very costly, but a less expensive o3-mini is on the way. Also, Google recently released an experimental "Gemini 2.0 Flash Thinking" model ("Flash" means the smaller, faster, cheaper version), which seems very good and, as the name suggests, also works through a chain of thought.
-
The model o3 also did very well on some other benchmarks (again kept private, so not in training data), in particular the ARC-AGI benchmark, which is specifically designed to test learning from very few examples, something humans are much better at than the best LLMs (though on this benchmark o3 crossed the human level).
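-
For a feel of what an ARC-AGI task looks like, here is a toy example in the same spirit: grids of small integers, a couple of demonstration input/output pairs, and a new input for which the rule must be inferred and applied. The grids and the tiny hypothesis-space "solver" below are invented for illustration and are far simpler than the real tasks.

import numpy as np

# Two demonstration pairs; the hidden rule here is "flip the grid left-right".
train = [
    (np.array([[1, 0], [2, 0]]), np.array([[0, 1], [0, 2]])),
    (np.array([[3, 3, 0], [0, 4, 0]]), np.array([[0, 3, 3], [0, 4, 0]])),
]
test_input = np.array([[5, 0, 0], [0, 6, 7]])

# Brute-force search over a tiny space of candidate transformations.
candidates = {
    "identity": lambda g: g,
    "flip_lr": np.fliplr,
    "flip_ud": np.flipud,
    "rotate_90": np.rot90,
    "transpose": np.transpose,
}

for name, rule in candidates.items():
    if all(np.array_equal(rule(x), y) for x, y in train):
        print(name, "fits both demonstrations; predicted output:")
        print(rule(test_input))  # [[0, 0, 5], [7, 6, 0]]
        break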