I haven't been working on this for very long. My objective is quite narrow: convert subtitles to text. I thought the stock tesseract would be a quick solution. I was wrong.
I did the ScrollView redirection only today because I'd hit a dead end: I suspected the box calculation/assignment wasn't working and needed to visualize what was happening. Creating HTML would be way better!
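To make the HTML idea concrete, here is a rough sketch of what such a view could look like. It is not what I have running, just an illustration. It assumes the usual box-file line layout (`symbol left bottom right top page`, with y measured from the bottom of the image); the function names `parse_box_line` and `boxes_to_html` and the file name `line.png` are made up for the example.

```python
import html


def parse_box_line(line):
    """Parse one box-file line: '<symbol> <left> <bottom> <right> <top> <page>'."""
    # Split from the right so a symbol that is itself whitespace survives.
    parts = line.rstrip("\n").rsplit(" ", 5)
    if len(parts) != 6:
        raise ValueError(f"malformed box line: {line!r}")
    symbol = parts[0]
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return symbol, left, bottom, right, top, page


def boxes_to_html(box_lines, image_path, image_height):
    """Return an HTML page that draws every box as an outline over the image."""
    divs = []
    for line in box_lines:
        if not line.strip():
            continue  # skip blank lines
        symbol, left, bottom, right, top, _page = parse_box_line(line)
        # Box files measure y from the bottom edge; CSS measures from the top.
        css_top = image_height - top
        divs.append(
            f'<div class="box" title="{html.escape(symbol, quote=True)}" '
            f'style="left:{left}px;top:{css_top}px;'
            f'width:{right - left}px;height:{top - bottom}px"></div>'
        )
    return (
        "<!DOCTYPE html>\n<html><head><meta charset=\"utf-8\"><style>\n"
        ".page{position:relative;display:inline-block}\n"
        ".box{position:absolute;outline:1px solid red}\n"
        "</style></head><body><div class=\"page\">\n"
        f'<img src="{html.escape(image_path, quote=True)}" alt="">\n'
        + "\n".join(divs)
        + "\n</div></body></html>\n"
    )


if __name__ == "__main__":
    # Hypothetical sample: one character box on a 100-pixel-high image.
    sample = ["字 10 20 50 60 0"]
    print(boxes_to_html(sample, "line.png", image_height=100))
```

Hovering over a red outline shows the symbol assigned to that box, so a wrong box calculation or assignment should be visible at a glance. The image height has to be supplied by the caller because the box file itself doesn't carry it.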
The software is, umm, quite a challenging beast and needs a massive cleanup.
To support some odd fonts with no Linux equivalents (I had to buy them from a font foundry in Taiwan), I replaced text2image.
So in all I've made a few big changes:
- replaced text2image with a new program on macOS which can use the special licensed fonts and creates the image, box file and .gt.txt files
- replaced the Makefile (!) used for training with a readable, maintainable program that runs the training
- added new text and retrained chi_tra.traineddata
- created a "reviewer" which publishes data to a website that shows the original image and the output text, and lets the user make corrections (to generate additional training data)
I'm focusing on Chinese right now but ultimately will test a lot of other languages. Given the experience so far, I foresee problems with some extended Latin languages. In particular, I worry about Hungarian and Vietnamese; lots of accents.
I haven't published any of this work so far. I'm not averse to doing so, but I'm a bit retentive on code style (I just can't look at the 'blob of barf' style in much of the code); I figured people would be upset/disturbed by changes to the style...