I spent a lot of time creating test files such as mapleok.
There have been many new ones since the fork, see
axiom/src/input/*
It would be simple to ask an AI system to develop a suite of
test files for each of the various domains. It would also be
possible to ask the AI to use current textbooks, e.g. Calculus
to develop more comprehensive test suites.
Even better, the latest AI systems could probably tell you why
a test failed, pointing at lines of code.
Tim