Hi, a few things I would try (I've never trained on cursive fonts):
- I would use a stable tesseract version (4.1 right now)
- 0.7 is not a very good score for a text this clean
- I think 6000 lines is not much; it's hard to tell if it is enough, since this is not a classic font
- data pre-processing may help, but the sample looks perfectly clean; it seems already processed
- How much testing data did you use? 20%? Real-world accuracy will always be a little worse than test accuracy, because you pick the model that best fits the test dataset. But do not trust your gut on this difference; it is very hard to estimate informally. Make sure the real documents are processed in the same way as the training/test data
- do some data augmentation: bold, noise, stretch, skew, blur, tiny rotations, etc. to generate more data (not too much, maybe 3 to 5 times more), and keep the original data too. If you use Python you can use imgaug.
- if you can find the font (there are online tools to identify a font from a sample), add some synthetic data too, again with augmentation
- small labeling errors are not a big problem if you have a lot of data and you do not overfit too much. In this case you can first train a model on the current data, then use it to flag the samples whose predictions do not match the gt.txt files. It will likely find most of the mislabeled data. Fix them and then, of course, retrain on the corrected data. If this is English text you could even run a spell check on the gt.txt files to find some errors.
- restrict the output charset only to the characters you need
- there is some "noise/dust" around the text; it is probably just JPEG compression artifacts. I would apply a simple threshold and save the files as PNG. Noise should not be a problem if it is present in both the training data and the prediction data, but maybe you are getting this extra noise only because you saved the files to disk, and at runtime you won't have it. Maybe tesseract will remove it for you, but if you want to eliminate a source of doubt, just threshold them.
- check the boxes of the recognized text to understand what is going on (see ocr_boxes.py or maybe the hocr output)
- Your text has long/tall "legs" (ascenders/descenders): the body is 35px but the line goes up to 120px with the legs. So I think it is important to understand how your lines are cropped. The input height for the LSTM is 48px(*), so if you feed it lines 120px tall they are going to be downscaled a lot, and the core part of the text will suffer most. So maybe (just speculating) it is better to trim the "legs" and the top a little (see the example). In any case I'd try to understand exactly what images are fed to the NN at training time and at prediction time.
- your text is aligned extremely well; it does not look like output from a scanner. Is this really scanned text?
- as this is English text, consider doing a dictionary spell check/fix.
- maybe also consider training from scratch using only a lot of synthetic data with very similar fonts, then fine-tuning with real data (if you have enough time)
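To illustrate the augmentation tip: below is a minimal numpy-only sketch of the kind of perturbations I mean (noise, stretch, skew); imgaug does all of this, and much more, for you. Every function name and parameter value here is just an illustrative guess, not a recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(img, scale=10):
    # additive Gaussian noise, clipped back to valid gray levels
    noisy = img.astype(np.int16) + rng.normal(0, scale, img.shape).astype(np.int16)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def stretch(img, factor=1.1):
    # horizontal stretch/squeeze by nearest-neighbour resampling of the columns
    h, w = img.shape
    new_w = int(w * factor)
    cols = (np.arange(new_w) / factor).astype(int).clip(0, w - 1)
    return img[:, cols]

def skew(img, shift=3):
    # crude skew: shift each row sideways proportionally to its height
    h, w = img.shape
    out = np.full_like(img, 255)  # white background
    for y in range(h):
        s = int(shift * y / h)
        out[y, s:] = img[y, : w - s]
    return out

def augment(line_img, n_copies=4):
    # keep the original plus a few randomly perturbed copies (the 3-5x above)
    copies = [line_img]
    for _ in range(n_copies):
        out = add_noise(line_img)
        if rng.random() < 0.5:
            out = stretch(out, factor=rng.uniform(0.9, 1.1))
        if rng.random() < 0.5:
            out = skew(out, shift=int(rng.integers(1, 4)))
        copies.append(out)
    return copies

# a fake 48x200 white text line, just for demonstration
line = np.full((48, 200), 255, dtype=np.uint8)
augmented = augment(line)
```

With imgaug you would build an `iaa.Sequential` of augmenters instead, which also handles interpolation properly; the point is only to randomize each copy a little while keeping the originals in the training set.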
(*) According to this page: the input height for the "fast" models is 36 or 48; I suppose it is 48 for all the "best" models.
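The threshold-and-save-as-PNG idea is basically a one-liner with numpy. A hedged sketch (the 180 cutoff is an assumption, pick yours from a histogram of your scans; PNG saving would then be e.g. Pillow's `Image.fromarray(clean).save("line.png")`):

```python
import numpy as np

def threshold(gray, cutoff=180):
    """Map every pixel to pure black or pure white.

    cutoff=180 is just a guess; tune it on a histogram of your images.
    Anything lighter than the cutoff (faint JPEG "dust") becomes background.
    """
    return np.where(gray > cutoff, 255, 0).astype(np.uint8)

# fake line image: dark text (40) on a light background (230) plus a faint speck (200)
img = np.full((48, 100), 230, dtype=np.uint8)
img[20:30, 10:90] = 40    # the "text"
img[5, 50] = 200          # a "dust" speck from JPEG compression
clean = threshold(img)
# save losslessly, e.g. with Pillow: Image.fromarray(clean).save("line.png")
```

Whatever cutoff you choose, apply the exact same thresholding at training time and at prediction time, for the reason given above.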
Lorenzo