Hi Will,
I feel your pain. The tesseract training process is hideously complex. What makes matters worse is:
(a) We are now keeping the svn repository up to date with our own code that we use at Google.
(b) We use a more automated training process based on rendering text from fonts. We haven't managed to open it up yet, but even if we had, it wouldn't be suitable for every application.
(c) The training process is in flux as we work on improving the accuracy and making it work for more languages, including Indic languages and RTL languages.
This all adds up to the documentation being inaccurate, and likely to remain that way. Fixing it is a bit of a waste of time, as it will be wrong again within a month. Having said that, what we really need to do is cut a 3.01 tarball, so we can write documentation for that, and then let the svn version be different and undocumented until we have the training process stabilized.
Now to Yiddish. I am aware that the current Hebrew is no good, and we have a plan in place to fix it in our training system, and in the recognition process. Those fixes *should* automatically make Yiddish work too, unless:
Is there anything fundamentally different about Yiddish compared to Hebrew in the way it is written or encoded, such as the rules for writing it RTL, that I should be aware of? (I am aware that the dictionary and language are totally different.) For instance, I understand Yiddish to be a dialect of German when spoken. Does that mean it has the same infinite word compounding that German has?
For your particular application, how important is BiDi handling? Or would pure RTL be good enough to get started?
Ray.