load_system_dawg 0
load_freq_dawg 0
load_punc_dawg 0
tessedit_char_whitelist 0123456789-
I've run some more tests and found that the recognition problem
occurs only when the minus sign is at the beginning or at the end
of the word; a minus sign placed between digits is recognized just fine.
Here are some test results. To make the test images I used the
MS Paint 'Draw Text' tool with the Arial font at 26pt.
Test image 1
Box file:
4 35 1 59 25 0
0 65 1 82 25 0
0 84 1 101 25 0
Test image 2
Box file:
~ 40 8 50 12 0
2 51 1 68 26 0
0 70 1 87 26 0
0 89 1 106 26 0
Test image 3
Box file:
~ 35 8 45 12 0
3 46 1 63 26 0
0 65 1 82 26 0
0 84 1 101 26 0
Test image 4
Box file:
- 42 6 53 12 0
4 50 0 70 25 0
0 72 0 89 25 0
0 91 0 108 25 0
Test image 5
Box file:
~ 40 7 50 11 0
5 51 0 68 25 0
0 70 0 87 25 0
0 89 0 106 25 0
Test image 6
Box file:
4 39 1 60 26 0
5 55 1 67 26 0
0 69 1 86 26 0
0 88 1 105 26 0
Test image 7
Box file:
~ 43 7 53 11 0
7 54 0 71 25 0
0 73 0 90 25 0
0 92 0 109 25 0
Test image 8
Box file:
~ 38 7 48 11 0
3 49 0 66 25 0
0 68 0 85 25 0
0 87 0 104 25 0
Test image 9
Box file:
~ 39 7 49 11 0
9 50 0 67 25 0
0 69 0 86 25 0
0 88 0 105 25 0
Test image 10
Box file:
4 40 0 61 25 0
3 56 0 68 25 0
0 70 0 87 25 0
0 89 0 106 25 0
Note that only the 4th test image ('-400') was recognized correctly. That intrigued me, and after some experimenting I found that this specific case is extremely fragile: moving the minus sign by even one pixel from its position produces incorrect recognition results. So it was pure luck that it was recognized correctly. As an example, here are the recognition results for a modified picture with the minus sign moved one pixel to the left.
Test image 11
Box file:
~ 41 7 51 11 0
4 52 0 70 25 0
0 72 0 89 25 0
0 91 0 108 25 0
And, as a relief, here's an example of correct recognition. As I said before, a minus sign between digits is recognized perfectly every time.
Test image 12
Box file:
0 10 0 26 25 0
2 29 0 45 25 0
- 49 7 59 11 0
0 60 0 76 25 0
2 79 0 95 25 0
- 99 7 109 11 0
9 110 0 126 25 0
2 129 0 145 25 0
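For anyone who wants to compare these box files programmatically, here is a minimal Python sketch that parses the six-column box format (symbol, left, bottom, right, top, page) and joins the symbols back into the recognized string. The names `Box`, `parse_box_file` and `recognized_text` are my own for illustration and are not part of Tesseract.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One line of a box file: symbol plus its bounding box and page."""
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_file(text: str) -> List[Box]:
    """Parse box-file text, skipping blank or malformed lines."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue
        char, *nums = parts
        left, bottom, right, top, page = (int(n) for n in nums)
        boxes.append(Box(char, left, bottom, right, top, page))
    return boxes


def recognized_text(boxes: List[Box]) -> str:
    """Join the symbols in file order to get the recognized string."""
    return "".join(b.char for b in boxes)


# Box file of test image 12, copied from above.
sample = """0 10 0 26 25 0
2 29 0 45 25 0
- 49 7 59 11 0
0 60 0 76 25 0
2 79 0 95 25 0
- 99 7 109 11 0
9 110 0 126 25 0
2 129 0 145 25 0"""

boxes = parse_box_file(sample)
print(recognized_text(boxes))  # 02-02-92
```

With this, a leading '~' or a missing '-' can be spotted by checking the first character of `recognized_text(...)` for each test image.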
So, after converting them to grayscale, it can be seen that negative numbers are darker than positive numbers.
After that I find the brightest pixel in the image and, if it is darker than some given value, I treat the number as negative. In that case I fill the left side of the image with the background color, which erases the minus sign, so the remaining digits are recognized by Tesseract-OCR correctly.
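The brightness check and the masking step can be sketched roughly like this. It is only an illustration on a plain list-of-rows grayscale image (0 = black, 255 = white); the function names, the threshold value, the masked width and the background value are all assumptions that would have to be tuned for real images.

```python
from typing import List

Image = List[List[int]]  # grayscale rows, 0 = black, 255 = white


def is_negative(image: Image, threshold: int) -> bool:
    """True when even the brightest pixel is darker than `threshold`."""
    brightest = max(max(row) for row in image)
    return brightest < threshold


def blank_left(image: Image, width: int, background: int) -> Image:
    """Return a copy with the leftmost `width` columns set to `background`."""
    return [
        [background] * min(width, len(row)) + row[width:]
        for row in image
    ]


# Tiny made-up images: the "negative" one is darker overall.
dark = [[10, 120, 10, 120], [10, 120, 120, 10]]
bright = [[10, 200, 10, 200], [10, 200, 200, 10]]

print(is_negative(dark, 150))    # True
print(is_negative(bright, 150))  # False
print(blank_left(dark, 2, 10))   # [[10, 10, 10, 120], [10, 10, 120, 10]]
```

The masked image, not the original, is what would then be passed on to the OCR step.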
Before recognition I also do some further image processing to improve recognition quality.
And finally, after recognition, I just prepend the minus sign to the result if the number was detected as negative.
'-100'
'100'
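The last step, putting the sign back after recognition, might look like the following. `restore_sign` is a hypothetical helper name, and the `negative` flag is assumed to come from whatever brightness check was done before OCR.

```python
def restore_sign(ocr_text: str, negative: bool) -> str:
    """Trim OCR output and prepend '-' when the image was flagged negative."""
    digits = ocr_text.strip()
    if negative and digits and not digits.startswith("-"):
        return "-" + digits
    return digits


print(restore_sign("100\n", True))   # -100
print(restore_sign("100\n", False))  # 100
```

The `startswith` guard only avoids a double sign in case the minus happened to survive the masking.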
I don't think it's a very elegant or foolproof solution, but at the moment it works fine for me and correctly recognizes 100% of my test dataset. Still, I'd be glad to hear any thoughts you have on detecting a leading minus sign. Thank you for reading.
Regards, Pavel Shcherbakov.