As part of my attempt to improve Tesseract's accuracy with a language
model, I needed a better DangAmbigs file. To get one, I wrote a utility
that generates such a file empirically from any collection of correct
texts and the corresponding OCR output (such as the UNLV test set and
Tesseract's output from the tests). I believe it should be useful to
other Tesseract users. You can find it here, with a more detailed
description:
http://www.cs.toronto.edu/~mreimer/tesseract.html
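
For anyone wondering what "empirically" could mean here, below is a rough
Python sketch of the general idea. It is not the utility's actual code, and
it does not write the DangAmbigs file format. It aligns a correct text with
its OCR output using the standard difflib module and counts which short
fragments were substituted for which. The function name count_confusions
and the max_len cutoff are hypothetical choices made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def count_confusions(truth, ocr, max_len=3):
    """Count (ocr_fragment, truth_fragment) substitutions between two texts.

    Illustrative only: a character-level diff stands in for whatever
    alignment the real utility performs. max_len is an assumed cutoff
    that skips long mismatches, which are unlikely to be simple
    character ambiguities.
    """
    counts = Counter()
    # autojunk=False disables difflib's popularity heuristic, which would
    # otherwise ignore very frequent characters in long texts.
    matcher = SequenceMatcher(None, ocr, truth, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "replace":
            continue  # only substitutions are of interest here
        wrong, right = ocr[i1:i2], truth[j1:j2]
        if len(wrong) <= max_len and len(right) <= max_len:
            counts[(wrong, right)] += 1
    return counts


if __name__ == "__main__":
    confusions = count_confusions("the modern corner", "the modem comer")
    for (wrong, right), n in confusions.most_common():
        print(repr(wrong), "->", repr(right), n)
```

Pairs that recur often across a large test set would be the candidates for
ambiguity entries; turning them into an actual DangAmbigs file is left to
the utility itself.
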
To the project owners: I'm willing to support this long-term, and
would be pleased to see it put on the external add-ons page or
included in the training files.