Hi Simon,
if I understand correctly how Tesseract works, it follows these steps:
- it segments the image into lines of text
- it then takes each individual line and slides a small window, 1 px wide I think, over it from one end to the other. At each step the model outputs a prediction; being a bidirectional LSTM, it has some memory of the previous and following pixel columns.
- all these per-step predictions are then converted into characters using beam search
Please correct me if I got any of that wrong. The first thing I wonder about, looking at your picture, is the segmentation step. Do you want to read only the "< 0,05 A" block? Is the segmentation step able to isolate it? That is the first thing I would try to understand.
Also, your sample image for "<" has a very different angle from the one before "0,05".
In this case I would try a custom segmentation: look for rectangular boxes of a certain height, aspect ratio, etc., then crop them out (maybe dropping the rectangular border and the black vertical lines) and feed them to Tesseract. This of course requires custom programming.
This might give good results even without fine-tuning. I would try it manually with GIMP first.
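As a toy sketch of what that custom segmentation could look like, here is a connected-component pass over a binary image (nested lists of 0/1) that collects bounding boxes and keeps only wide, box-like ones. On real scans you would use something like OpenCV's findContours instead; the thresholds here are invented and would need tuning to the diagrams' actual box sizes.

```python
from collections import deque

def find_boxes(img, min_h=2, min_aspect=1.5):
    """Return bounding boxes (x0, y0, x1, y1) of dark (1) connected regions,
    keeping only wide, box-like ones. Thresholds are invented examples."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if img[y][x] == 1 and not seen[y][x]:
                # BFS flood fill to measure this component's extent
                q = deque([(y, x)])
                seen[y][x] = True
                y0 = y1 = y
                x0 = x1 = x
                while q:
                    cy, cx = q.popleft()
                    y0, y1 = min(y0, cy), max(y1, cy)
                    x0, x1 = min(x0, cx), max(x1, cx)
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and img[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((ny, nx))
                bh, bw = y1 - y0 + 1, x1 - x0 + 1
                if bh >= min_h and bw / bh >= min_aspect:
                    boxes.append((x0, y0, x1, y1))  # crop these and feed to Tesseract
    return boxes

# A wide rectangle outline plus a stray speck that gets filtered out
img = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # isolated speck, too small to keep
]
print(find_boxes(img))  # -> [(1, 1, 8, 4)]
```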
Also, I suppose you are not going to encounter a lot of wild fonts in these kinds of diagrams. The more fonts you use, the harder the training, so I would focus on very few fonts. I would start with exactly one font and train on that, to quickly check that my training setup/pipeline works and that the training results carry over to the diagrams later. If the model's error rate is good on the individual text lines but bad on the real images, it might be a segmentation problem that training cannot fix. Or the problem might be the surrounding box, which I suppose is not present in your generated data.
Ideally, I would use real crops from these diagrams rather than images from text2image.
Also, distinguishing 0 from O is very hard with many fonts. Often you have domain knowledge that can help you fix these errors in post-processing: for example, "0,O5" can easily be spotted and corrected. You can, for example, assume that each box contains only one kind of data and guess the most likely reading from that, or from the box sequence, etc.
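As a sketch of that kind of post-processing, here is one possible rule: if a token looks mostly numeric, assume any O is a misread 0. The "mostly numeric" heuristic is an invented example, not a complete rule set for your diagrams.

```python
def fix_numeric_token(token):
    """If a token looks mostly numeric, treat stray O/o as misread zeros.
    The digits >= letters heuristic is an invented example rule."""
    digits = sum(c.isdigit() for c in token)
    letters = sum(c.isalpha() for c in token)
    if digits > 0 and digits >= letters:   # e.g. "0,O5" is meant to be a number
        return token.replace("O", "0").replace("o", "0")
    return token

print(fix_numeric_token("0,O5"))  # -> 0,05
print(fix_numeric_token("OHM"))   # left alone: no digits
```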
I got good results with 20k samples (real-world scanned docs, multiple fonts), so 10k seems reasonable. I also assume your output character set is very small: the digits, a few capital letters, and a couple of symbols (no %, ^, &, {, etc.).
Lorenzo