Note: even though constraint-list=? handles differences in type variable naming, it still requires that the two constraint lists be in the same order.
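To make the ordering point concrete, here is a minimal sketch of an order-sensitive comparison that ignores variable names. It is written in Python purely for illustration and is not the course's constraint-list=?. The tuple encoding of types (("var", name), ("num",), ("fun", arg, result)) and the name same_up_to_renaming are assumptions made up for this example.

```python
# Hypothetical encoding (not from the assignment):
#   ("var", "a")      a type variable
#   ("num",)          the number type
#   ("fun", arg, res) a function type
# A constraint is a pair (lhs, rhs); a constraint list is a Python list of pairs.

def same_up_to_renaming(xs, ys):
    """Compare two constraint lists position by position, ignoring variable names."""
    fwd, bwd = {}, {}  # renaming from xs-names to ys-names, and its inverse

    def match(a, b):
        if a[0] == "var" and b[0] == "var":
            # Names must map consistently in both directions (a bijection).
            return (fwd.setdefault(a[1], b[1]) == b[1]
                    and bwd.setdefault(b[1], a[1]) == a[1])
        # Otherwise: same constructor, same arity, and matching children.
        return (a[0] == b[0] and len(a) == len(b)
                and all(match(p, q) for p, q in zip(a[1:], b[1:])))

    return len(xs) == len(ys) and all(
        match(l1, l2) and match(r1, r2)
        for (l1, r1), (l2, r2) in zip(xs, ys))


num = ("num",)
expected = [(("var", "a"), num),
            (("var", "b"), ("fun", ("var", "a"), num))]
renamed = [(("var", "t1"), num),
           (("var", "t2"), ("fun", ("var", "t1"), num))]

print(same_up_to_renaming(expected, renamed))        # same order, new names
print(same_up_to_renaming(expected, renamed[::-1]))  # same constraints, reversed
```

With a comparison like this, the renamed list matches but the reversed one does not, so the expected constraints in a test need to be written in the order your generator emits them.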
I ask because we have 40 some odd test cases for parse but we only
need 83 to get full credit, and it seems like testing all of these
with test cases could push us way over 400 points.
The assignment said you may assume syntactically valid input for the parser. You might have a handful of test cases there to verify that you get correct output for correct input, but 40 seems like a lot of redundancy if you're only testing valid input.
For generate-constraints you should probably have around one test case per production in the grammar to ensure code coverage and to verify that each produces the correct set of constraints.
For unify you should make sure to include simple tests for each of the different branches in the algorithm, and you should have cases that make sure your substitutions (both into the constraint list and into the substitution list) are working.
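As a rough illustration of what "one test per branch" can look like, here is a textbook-style unification sketch in Python with one check for each branch. It is an assumption about the general shape of the algorithm, not the assignment's unify: the tuple encoding of types, the helper names subst and occurs, the occurs check, and the use of TypeError are all made up for this example.

```python
# Hypothetical encoding (not from the assignment):
#   ("var", "a"), ("num",), ("fun", arg, res); a constraint is a pair (lhs, rhs).

def subst(t, name, repl):
    """Replace every occurrence of variable `name` inside type `t` with `repl`."""
    if t[0] == "var":
        return repl if t[1] == name else t
    return (t[0],) + tuple(subst(c, name, repl) for c in t[1:])

def occurs(name, t):
    """True if variable `name` appears anywhere inside type `t`."""
    if t[0] == "var":
        return t[1] == name
    return any(occurs(name, c) for c in t[1:])

def unify(constraints):
    """Return a dict from variable names to types, or raise TypeError."""
    todo = list(constraints)
    solved = []  # the substitution so far, as (name, type) pairs
    while todo:
        l, r = todo.pop(0)
        if l == r:                              # branch 1: identical sides
            continue
        if l[0] != "var" and r[0] == "var":     # branch 2: variable on the right
            l, r = r, l
        if l[0] == "var":                       # branch 3: variable on the left
            if occurs(l[1], r):
                raise TypeError("occurs check failed")
            # Substitute into the remaining constraints AND the substitution.
            todo = [(subst(a, l[1], r), subst(b, l[1], r)) for a, b in todo]
            solved = [(n, subst(t, l[1], r)) for n, t in solved]
            solved.append((l[1], r))
        elif l[0] == r[0] and len(l) == len(r):  # branch 4: same constructor
            todo = list(zip(l[1:], r[1:])) + todo
        else:                                    # branch 5: mismatch
            raise TypeError("constructor mismatch")
    return dict(solved)


num, a, b = ("num",), ("var", "a"), ("var", "b")
assert unify([(num, num)]) == {}                        # identical sides
assert unify([(a, num)]) == {"a": num}                  # variable on the left
assert unify([(num, a)]) == {"a": num}                  # variable on the right
assert unify([(("fun", a, num), ("fun", num, b))]) == {"a": num, "b": num}
assert unify([(a, b), (a, num)]) == {"a": num, "b": num}  # exercises both substitutions
```

The last check is the kind of case that catches substitution bugs: the second constraint only resolves if a was replaced in the remaining constraints, and a only ends up as num if the later binding for b was pushed back into the substitution.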
Most of the rest of the cases should be for infer-type to verify that the alpha-vary, generate-constraints, and unify functions all work together correctly to handle some fairly complex inference cases.