There have been some recent advances in AI relating to mathematics. Hence this email.
-
FrontierMath: This is a dataset created a few months ago to address problems with existing mathematics benchmarks (which are mostly based on competition problems): to better reflect research mathematics, to ensure none of the problems appear in any training data (they were freshly written and kept private), and to be hard enough to serve as a meaningful benchmark.
-
The problems fall into three tiers:
-
T1 (25%): Advanced, near top-tier undergraduate/IMO level
-
T2 (50%): Needs a serious graduate-level background
-
T3 (25%): Research problems demanding relevant research experience
-
When the benchmark was created, the best models, GPT-4o and Gemini, scored only about 2%.
-
Presumably referring to the hardest problems, Terence Tao said the dataset should "resist AIs for several years at least" and that "These are extremely challenging. I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages…"
-
A model announced two days ago, o3 from OpenAI, scored 25%, solving problems in all three tiers. It seems safe to say that, on these problems, it is comparable to top-level mathematicians working in an area different from their own expertise, and at least comparable to everyone else.
-
This is the latest in a series of "test-time compute" models, which do not produce output straight away but first run a long internal "chain of thought" involving search over intermediate steps and checking for consistency.
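-
To make this concrete, here is a toy sketch of the test-time-compute idea: sample many candidate reasoning chains, keep only those that pass a consistency check, and use a larger sampling budget to get a more reliable answer. OpenAI has not published o3's internal procedure, so this illustrates only the general recipe; sample_chain below is a made-up stand-in for a language model.

import random

def sample_chain(question: int) -> list[int]:
    """Stand-in for an LLM sampling a noisy step-by-step derivation:
    a chain of running totals that should end at question * 3."""
    total, chain = 0, []
    for _ in range(3):
        total += question + random.choice([-1, 0, 0, 0, 1])  # occasional slips
        chain.append(total)
    return chain

def consistent(question: int, chain: list[int]) -> bool:
    """Verifier: every step must add exactly `question` to the previous total."""
    return all(b - a == question for a, b in zip([0] + chain, chain))

def answer(question: int, budget: int = 32) -> int | None:
    """A larger test-time budget means more chances to find a verified chain."""
    for _ in range(budget):
        chain = sample_chain(question)
        if consistent(question, chain):
            return chain[-1]  # final entry of a chain that passed the check
    return None               # every sampled chain failed verification

print(answer(7))  # prints 21 with very high probability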
-
This is an expensive model, and solving each problem was computationally very costly, but a less expensive o3-mini is on the way. Also, Google recently released an experimental "Gemini 2.0 Flash Thinking" model ("Flash" means the smaller, faster, cheaper version), which seems very good and, as the name suggests, also works through a chain of thought.
-
The model o3 also did very well on some other benchmarks (again kept private, so not in training data), in particular the ARC-AGI benchmark, which is specifically designed to test learning from very few examples, something humans are much better at than the best LLMs (though on this benchmark o3 crossed the human level).
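-
For a feel of what an ARC-AGI task looks like, here is a toy example in the same spirit: grids of small integers, a couple of demonstration input/output pairs, and a new input for which the rule must be inferred and applied. The grids and the tiny hypothesis-space "solver" below are invented for illustration and are far simpler than the real tasks.

import numpy as np

# Two demonstration pairs; the hidden rule here is "flip the grid left-right".
train = [
    (np.array([[1, 0], [2, 0]]), np.array([[0, 1], [0, 2]])),
    (np.array([[3, 3, 0], [0, 4, 0]]), np.array([[0, 3, 3], [0, 4, 0]])),
]
test_input = np.array([[5, 0, 0], [0, 6, 7]])

# Brute-force search over a tiny space of candidate transformations.
candidates = {
    "identity": lambda g: g,
    "flip_lr": np.fliplr,
    "flip_ud": np.flipud,
    "rotate_90": np.rot90,
    "transpose": np.transpose,
}

for name, rule in candidates.items():
    if all(np.array_equal(rule(x), y) for x, y in train):
        print(name, "fits both demonstrations; predicted output:")
        print(rule(test_input))  # [[0, 0, 5], [7, 6, 0]]
        break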