1. Using synthetic data:
What can you do if you do not have data that is confirmed to be accurate?
The only way around that I know of is to use synthetic data. That is: you generate images from the texts using the text2image script, then train on those. The accuracy of the resulting model will not be perfect, because real data is messier than synthetic data. But you can try different methods to get better accuracy:
(a) by training from an existing network: that is, you can cut the top layer of a working model and continue training from there.
(b) by configuring the text2image script to add noise to the synthetic data so that it looks more like the actual images.
(c) by using a larger dataset,
etc.
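To make (b) concrete, here is a minimal sketch of driving text2image with degradation enabled. The flag names are taken from Tesseract's text2image tool (check `text2image --help` for your version); the corpus file, font, and output paths are placeholders you would replace with your own:

```python
import os
import shutil
import subprocess

def text2image_cmd(text_file, outputbase, font="Arial", fonts_dir="/usr/share/fonts"):
    """Build a text2image invocation that renders training text into a
    .tif/.box pair, with degradation turned on so the synthetic pages
    look closer to real, noisy scans."""
    return [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={outputbase}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
        "--degrade_image=true",  # add scanner-style noise/blur
        "--exposure=-1",         # vary the ink level
    ]

cmd = text2image_cmd("corpus.txt", "out/eng.Arial.exp0")
print(" ".join(cmd))

# Only actually run it if the binary and corpus are present.
if shutil.which("text2image") and os.path.exists("corpus.txt"):
    subprocess.run(cmd, check=True)
```

Generating several such .tif/.box pairs with different fonts and exposure values is one easy way to get the larger, more varied dataset mentioned in (c).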
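For (a), "cutting the top layer" is what lstmtraining's --append_index/--net_spec options do: the network is cut at the given layer index and the layers in --net_spec are appended, then training continues from the existing weights. The flag names below are real lstmtraining options, but the layer index, net spec (the output size must match your charset), iteration count, and file paths are illustrative placeholders:

```python
import os
import shutil
import subprocess

# eng.lstm would first be extracted from eng.traineddata, e.g. with:
#   combine_tessdata -e eng.traineddata eng.lstm
def lstmtraining_cmd(base_lstm="eng.lstm",
                     traineddata="eng.traineddata",
                     train_list="train_files.txt",
                     model_out="out/finetuned"):
    """Build an lstmtraining invocation that continues from a working
    model, replacing its top layer with a freshly initialized one."""
    return [
        "lstmtraining",
        f"--continue_from={base_lstm}",
        f"--traineddata={traineddata}",
        f"--train_listfile={train_list}",
        f"--model_output={model_out}",
        "--append_index=5",            # cut the network at layer 5...
        "--net_spec=[Lfx256 O1c111]",  # ...and append a new top layer
        "--max_iterations=3000",
    ]

cmd = lstmtraining_cmd()
print(" ".join(cmd))

# Only actually run it if the binary and file list are present.
if shutil.which("lstmtraining") and os.path.exists("train_files.txt"):
    subprocess.run(cmd, check=True)
```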
2) The hOCR hack:
- I haven't tried this method myself, but I read on GitHub that Shree has some kind of hack (a script) that uses the hOCR output from tesseract:
a) First, he OCRs the images with the standard model into hOCR format.
b) He then breaks the hOCR output down into box, tif, and text files.
c) He then compares the text files with the images, and manually corrects the faulty ones.
This one also requires a lot of manual work, because the standard model will miss a lot of characters.
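I do not have Shree's actual script, but the core of step b) is just parsing the hOCR (which is plain HTML): each line element carries a bbox in its title attribute plus the recognized text, and from those you can crop the tif and write per-line ground-truth files. A rough stdlib-only sketch of that extraction:

```python
import re
from html.parser import HTMLParser

class HocrLines(HTMLParser):
    """Collect (x0, y0, x1, y1, text) for each ocr_line in an hOCR file."""
    def __init__(self):
        super().__init__()
        self.lines = []
        self._depth = None  # span-nesting depth inside the current line
        self._bbox = None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if self._depth is None:
            if "ocr_line" in a.get("class", ""):
                m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", a.get("title", ""))
                self._bbox = tuple(map(int, m.groups())) if m else None
                self._depth = 0
                self._buf = []
        elif tag == "span":
            self._depth += 1  # a nested ocrx_word span

    def handle_data(self, data):
        if self._depth is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._depth is None or tag != "span":
            return
        if self._depth > 0:
            self._depth -= 1  # closing a nested word span
        else:                 # closing the line span itself
            text = " ".join("".join(self._buf).split())
            if text and self._bbox:
                self.lines.append((*self._bbox, text))
            self._depth = None

sample = """
<div class='ocr_page'>
 <span class='ocr_line' title='bbox 10 10 200 40'>
  <span class='ocrx_word' title='bbox 10 10 80 40'>Hello</span>
  <span class='ocrx_word' title='bbox 90 10 200 40'>world</span>
 </span>
</div>
"""
p = HocrLines()
p.feed(sample)
for x0, y0, x1, y1, text in p.lines:
    # The text would go into a per-line .gt.txt file, and the bbox would
    # be used to crop the matching strip out of the page tif.
    print(x0, y0, x1, y1, text)
```

The manual step c) is then just eyeballing each cropped line image against its text file and fixing the mismatches.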
3) Alternatively, you can try other OCR engines such as EasyOCR. Some people say EasyOCR is better at OCRing these kinds of images, while tesseract is better for scanned documents.