Hello,
I would like to train Tesseract 4 to recognize certain scripts/languages based on real images rather than synthetic ones. Here are my questions:
1. Is there a tool, preferably a cross-platform (Windows/Linux) GUI, that helps create .box files from scanned images? In particular, how do I get the coordinates of the text lines?
2. Is there a YouTube or other video tutorial that shows how to prepare .tiff/.box files from real scans?
3. Which gives better recognition accuracy: training on real images or training on synthetic images?
4. Roughly how many text lines from real scans do I need to get good recognition accuracy?
Thank you very much!
ST