Training Tesseract 4 on real images

Sim Tov

Oct 8, 2020, 4:07:02 AM

to tesseract-ocr

Hello,

I would like to train Tesseract 4 to recognize certain scripts/languages based on real images rather than synthetic ones. Here are my questions:

1. Is there a tool, preferably a cross-platform (Windows/Linux) GUI, that helps create .box files from scanned images? For example, how do I get the coordinates of the text lines?

2. Is there a YouTube or other video tutorial that describes how to prepare .tiff/.box files from real scans?

3. Which gives better recognition: training on real images or training on synthetic images?

4. How many text lines from real scans do I need to get good recognition?

Thank you very much!

Murtuza Dahodwala

Jan 8, 2021, 3:32:41 AM

to tesseract-ocr

I would also like to know how to train on real images that contain more than a single line of text.
