I have pushed this as far as appears to be possible with Tesseract
2.01. We now have a "language" configuration built from 10 training
pages, each containing 123 glyphs: the digits 0-9, the comma,
assorted punctuation marks, and 106 Japanese characters. This
is the limit of what mftraining will handle with this combination of
characters; adding another training page triggers a segfault. But
tess runs the 10-page language configuration without complaint.
Obviously, most of the text we get back for a financial statement fed
to tess is garbled, but the numbers and the known Japanese glyphs are
recognized with a good degree of accuracy. Cleaning the output will
give us what we need for our research purposes. Looks like we've just
barely managed to thread the needle.
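
For anyone curious what "cleaning the output" might look like, here is
a minimal sketch in Python, not the actual script we use. It assumes
the OCR text arrives as ordinary Unicode strings. KNOWN_GLYPHS,
clean_line and extract_numbers are hypothetical names, and the handful
of Japanese characters in the whitelist are placeholders, not the real
106-character training set.

```python
import re

# Hypothetical whitelist: digits, comma, period, plus a few sample
# Japanese glyphs. The real set would be the trained characters.
KNOWN_GLYPHS = set("0123456789,.") | set("売上高資産合計")


def clean_line(line, known=KNOWN_GLYPHS):
    """Replace every character outside the whitelist with a space,
    then collapse runs of whitespace into single spaces."""
    kept = "".join(ch if ch in known else " " for ch in line)
    return " ".join(kept.split())


def extract_numbers(line):
    """Return the integers in a cleaned line, treating commas as
    thousands separators."""
    return [int(tok.replace(",", "")) for tok in re.findall(r"\d[\d,]*", line)]


if __name__ == "__main__":
    raw = "売上高 xq# 1,234,567 ~~"
    cleaned = clean_line(raw)
    print(cleaned)
    print(extract_numbers(cleaned))
```

Anything the engine was never trained on is simply thrown away, which
is crude but matches the situation: only the digits and the known
glyphs can be trusted in the first place.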
Thanks to Ray Smith, to HP, to Google, and to everyone who has put in
effort on Tesseract. More to come, I'm sure, but even at this stage
this is great stuff.