Training Tesseract 4 on real images

Sim Tov

Oct 8, 2020, 4:07:02 AM
to tesseract-ocr
Hello,

I would like to train Tesseract 4 to recognize certain scripts/languages based on real images rather than synthetic ones. Here are my questions:

1. Is there a tool, preferably a cross-platform (Windows/Linux) GUI, that helps create .box files from scanned images? How do I get the coordinates of the text lines?

2. Is there a YouTube or other video tutorial that describes how to prepare .tiff/.box files from real scans?

3. Which gives better recognition: training on real images or training on synthetic images?

4. How many text lines from real scans do I need to get good recognition?

Thank you very much!
ST

Murtuza Dahodwala

Jan 8, 2021, 3:32:41 AM
to tesseract-ocr
I would also like to know how to train on real images that contain more than a single line of text.