I am trying to use Tesseract to OCR screenshots from various Windows applications. The text is essentially a random mix of letters and digits, not the dictionary words and sentences Tesseract was primarily designed to handle.
Here is my setup:
- Tesseract 5.0.0 for Windows, from the UB Mannheim build
- image cleaning and resizing with OpenCV (I have put a lot of effort into getting this as good as I can)
- parameters --psm 6 --oem 1 (I have also tried --oem 0 and --oem 3 with much the same results)
- config file contents:
language_model_penalty_non_dict_word 0.0
language_model_penalty_chartype 0.0
language_model_penalty_case 0.0
language_model_penalty_non_freq_dict_word 0.0
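For completeness, the same parameters can be passed inline with -c instead of a config file. Something like the following (the load_system_dawg/load_freq_dawg settings are additional dictionary-related parameters that I understand can be disabled this way, though I have not verified that they change these particular results):

```shell
tesseract screenshot.png stdout --psm 6 --oem 1 ^
  -c language_model_penalty_non_dict_word=0 ^
  -c language_model_penalty_chartype=0 ^
  -c language_model_penalty_case=0 ^
  -c language_model_penalty_non_freq_dict_word=0 ^
  -c load_system_dawg=0 -c load_freq_dawg=0
```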
Tesseract is performing reasonably well for my needs, but there are a couple of problems I can't resolve. They appear to be related to Tesseract functionality that decides what a given character is based not only on its pixel layout, but also on the context in which the character occurs.
Issue #1
Occasionally Tesseract inserts extra characters into its output, seemingly when it is unsure how to choose between alternatives:
For the above image, Tesseract produces the following output:
10of3
As you can see, Tesseract inserts the digit 0 in front of the lower-case letter o in the output. It also ignores the white space in the image.
Others have reported this issue, for example the thread below:
Issue #2
As shown in the example above, Tesseract sometimes ignores white space that, at least to my eye, is big enough not to be missed.
Issue #3
Tesseract has a hard time with random strings of letters and digits mixed in no particular order. It has a tendency to output a digit when the previous character was a digit, and a letter when the previous character was a letter.
Others have reported this issue, for example the thread below:
Suggestion:
At least for my situation, the best solution would be a definitive Tesseract option to interpret individual characters without reference to their context. Since my data comes from screenshots, it is very clean and consistent, and I would think a character-by-character mode would work well.