to tesseract-ocr
I have a few questions regarding the fine-tuning process. I'm building an app that is able to recognize data from the following documents:
- ID card
- Driving license
- Passport
- Receipts
All of them use different fonts (especially receipts), and it is hard to find exactly the same fonts, so I will have to train the model on many similar-looking fonts.
So my questions are:
1. Should I train a separate model for each document type for better performance and accuracy, or is it fine to train a single `eng` model on a set of fonts similar to the ones used on these types of documents?
2. How many pages of training data should I generate per font? I think `tesstrain.sh` generates around 4k pages by default. Also, are there any suggestions on how to generate training data that is as close as possible to the real input data?
3. How many iterations should I use? For example, when a font has a high error rate and I want to reach a `98% - 99%` accuracy rate.
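To be concrete about what I mean by accuracy: I am assuming character-level accuracy, i.e. 1 minus the character error rate (edit distance between the OCR output and the ground truth, divided by the ground-truth length). Here is a minimal sketch of how I would measure it. The helper names `levenshtein` and `char_accuracy` are my own and not part of Tesseract, and the sample strings are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(truth: str, ocr: str) -> float:
    """Assumed metric: 1 - CER, clamped at 0 when OCR output is very noisy."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - levenshtein(truth, ocr) / len(truth))


if __name__ == "__main__":
    # Hypothetical receipt line: the trailing zero is misread as the letter O.
    print(char_accuracy("TOTAL 12.50", "TOTAL 12.5O"))
```

With this definition, a `98% - 99%` target means at most 1 to 2 wrong characters per 100 ground-truth characters.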
Also, maybe some of you have experience with these types of documents and know which fonts are commonly used on them?
I know that the MRZ in passports and ID cards uses the `OCR-B` font, but what about the rest of the document?