Fwd: AI progress : "FrontierMath problems" and o3


Siddhartha Gadgil

Dec 22, 2024, 4:28:18 AM
to Automated Mathematics India
Dear All,
       There have been some advances in AI relating to mathematics recently. Hence this email. 

regards,
Siddhartha

  • FrontierMath: This is a dataset created a few months ago to address shortcomings of existing mathematics benchmarks (which are mostly based on competition problems): to better reflect research mathematics, to ensure none of the problems appeared in training data (they were freshly created and kept private), and to be hard enough to serve as a meaningful benchmark.
  • The problems were of three levels:
    • T1 (25%) Advanced, near top-tier undergrad/IMO
    • T2 (50%) Needs serious grad-level background
    • T3 (25%) Research problems demanding relevant research experience
  • When this dataset was created, the best models, GPT-4o and Gemini, could score only 2%.
  • Presumably describing the hardest problems, Terence Tao said the dataset should "resist AIs for several years at least" and "These are extremely challenging. I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages…"
  • A model announced two days ago, o3 from OpenAI, scored 25%, solving problems in all tiers. It seems safe to say that it is comparable to top-level mathematicians working in an area outside their expertise, and at least comparable to everyone else.
  • This is one of a series of "test-time compute" models, which do not answer straight away but run a long internal "chain of thought" involving search and consistency checking.
  • This is an expensive model, and solving each problem was computationally very costly. However, a less expensive o3-mini is planned. Also, Google recently released an experimental "Gemini-2.0-Flash", which seems very good ("Flash" denotes the smaller, faster, cheaper version) and which also has a thinking mode.
  • The model o3 also did very well on some other benchmarks (again kept private, so not in training data), in particular the ARC-AGI benchmark, which is specifically designed to test learning from very few examples, something at which humans have been much better than the best LLMs (though on this benchmark o3 crossed human level).