Training LSTM with symbol boxes


Maxim Kizub

Jul 17, 2025, 11:57:50 AM
to tesseract-ocr
Hello.
I need to OCR text with a mix of Latin and Cyrillic letters plus emoji-like icons.
The text is in a printed font, not handwritten. I capture the line images as screenshots.
The stock "rus+eng" model performs poorly, probably because of the mix of scripts and because many words are not in the dictionary. And because of the icons, of course.

After some attempts to fine-tune 'rus.traineddata' I gave up and decided to train a new 'language' from scratch. I removed all Cyrillic glyphs that look similar to Latin letters (like O, H, T, etc.; I just replaced them in the ground-truth text), added the icons, and trained the new language on about 3000 short lines. But performance became even worse. I cannot provide more samples, so I decided to improve LSTM training by adding exact boxes for the glyphs. After I marked the boxes, the accuracy of the trained model rose dramatically, and it is completely acceptable now.
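To illustrate the glyph replacement in the ground-truth text, here is a minimal Python sketch. The mapping table and the function name `normalize_homoglyphs` are only an illustration (not the exact list used for training), and which letters count as "similar" is an assumption that depends on the font.

```python
# Hypothetical mapping: Cyrillic letters that look like Latin ones -> Latin.
# The exact set is an assumption; adjust it to the font being recognized.
CYRILLIC_TO_LATIN = {
    "\u0410": "A",  # Cyrillic capital A
    "\u0412": "B",  # Cyrillic capital Ve
    "\u0415": "E",  # Cyrillic capital Ie
    "\u041a": "K",  # Cyrillic capital Ka
    "\u041c": "M",  # Cyrillic capital Em
    "\u041d": "H",  # Cyrillic capital En
    "\u041e": "O",  # Cyrillic capital O
    "\u0420": "P",  # Cyrillic capital Er
    "\u0421": "C",  # Cyrillic capital Es
    "\u0422": "T",  # Cyrillic capital Te
    "\u0425": "X",  # Cyrillic capital Ha
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
    "\u0443": "y",  # Cyrillic small u
    "\u0445": "x",  # Cyrillic small ha
}
_TABLE = str.maketrans(CYRILLIC_TO_LATIN)


def normalize_homoglyphs(text: str) -> str:
    """Replace look-alike Cyrillic letters with their Latin counterparts."""
    return text.translate(_TABLE)
```

Running every ground-truth line through such a function before training means that each visually identical shape maps to a single label. Letters without a Latin look-alike are left unchanged.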

BUT. The LSTM trained with glyph boxes stopped producing spaces in the recognized text. It reports something like "HelloWorld" instead of "Hello World", even if there is a huge gap between the words. OK, I revised the box files and added boxes for spaces. It did not help; Tesseract still does not recognize spaces between words. I then duplicated the training data, so it has both symbol boxes (with spaces) and line boxes (one box for the whole line, as the LSTM box generation originally produces). Now the Tesseract trainer complains about every sample and reports a huge character error rate, probably because of the spaces (the glyphs are detected correctly).
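For reference, a small Python sketch for checking whether a box file really contains space entries. It assumes the usual box-file line layout `<symbol> <left> <bottom> <right> <top> <page>` with single-space separators, and it assumes that a tab symbol marks the end of a text line; the helper names `parse_box_line` and `count_symbols` are hypothetical.

```python
def parse_box_line(line: str):
    """Split one box-file line into (symbol, left, bottom, right, top, page).

    Splitting from the right keeps a symbol that is itself a space or a tab.
    """
    symbol, *nums = line.rstrip("\r\n").rsplit(" ", 5)
    return (symbol, *map(int, nums))


def count_symbols(lines, symbol=" "):
    """Count box entries whose symbol equals `symbol`, skipping blank lines."""
    return sum(
        1
        for ln in lines
        if ln.strip("\r\n") and parse_box_line(ln)[0] == symbol
    )
```

A possible use is to compare `count_symbols(lines)` for each box file against the number of spaces in the matching ground-truth line, to confirm that the space boxes survived editing.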

So, how can I train an LSTM with glyph boxes to recognize spaces between words? I cannot use line boxes because of the poor recognition performance, and I cannot use the new traineddata because it misses spaces and does something wrong internally; it seems to have overfitted when 'deciding' whether or not to add a space.




