Trouble with recognizing minus sign using Tesseract 3.02.02


Павел Щербаков

Jun 4, 2015, 3:44:44 PM
to tesser...@googlegroups.com

Hello,

I've run into a problem recognizing numbers with a leading minus sign, such as '-100', '-200', etc. My tests with various samples show that they are almost always recognized as '400', '100' and the like, and almost never with the leading minus sign itself. I need help figuring out how to improve the recognition quality.

Here are some facts and things I've tried to get the correct output:

1. I'm using Tesseract 3.02.02 through the C API functions (TessBaseAPICreate, TessBaseAPIInit3, TessBaseAPISetVariable, TessBaseAPIProcessPages, TessBaseAPIClear, TessBaseAPIEnd, TessBaseAPIDelete) with Delphi 7 on Windows 7 64-bit.
2. The image format is .bmp (I haven't tried any other formats, as I believe that bmp has no quality loss).
3. Before recognition, I process my image with some decolorization and contrast enhancement to improve the recognition quality. An example of a processed image is attached.
4. I set the character whitelist to '-0123456789' (I've also tried '--0123456789', '-0123456789 ' and even '-0' and '-' as a test, but the minus wasn't recognized even once).
5. I've tried increasing the size of the image, but the result is still the same.
6. I've tried stretching the image height (still '400') or width (terrible results like '111 0' or '11 -21').
7. I've tried to "extend" the minus sign by drawing it myself in MS Paint, but it was not recognized correctly either (only once did I draw a "perfect" minus sign that was recognized correctly, but unfortunately I couldn't repeat it later). I've also tried to recognize text added with the MS Paint 'Draw Text' tool (Calibri, Arial Black, Times New Roman, 20-72pt, black text on white background), still with no result. This makes me think that the image itself could be OK and that something is wrong with the way I use Tesseract.
8. Nonetheless, unsigned numbers and words (using another whitelist, of course) are recognized just fine with the same code.

If I've missed any hints or steps that would give a proper result, please let me know.

Regards, Pavel Shcherbakov.

Павел Щербаков

Jun 4, 2015, 4:12:54 PM
to tesser...@googlegroups.com
Ah, I forgot to mention: I also set 'tessedit_pageseg_mode' to 8 (single word) before recognition.

I've also tried to recognize it with the command-line application from cmd (tesseract-dll D:\Users\Psijic\Desktop\PosTracker\test.bmp D:\Users\Psijic\Desktop\PosTracker\tempout -psm 8 nobatch qty).
Contents of qty config file:
load_system_dawg 0
load_freq_dawg 0
load_punc_dawg 0
tessedit_char_whitelist 0123456789-

I still got the same results. I feel really stuck here.
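For readers who want to script the same experiment, here is a minimal Python sketch that renders the config text shown above and assembles the argument list in the same order as the command. The helper names `render_config` and `build_command`, the executable name "tesseract" and the short file names are assumptions made for illustration; the sketch only builds strings and does not run Tesseract or decide where the config file has to live.

```python
def render_config(variables):
    """Render Tesseract config text: one 'name value' pair per line."""
    return "".join(f"{name} {value}\n" for name, value in variables.items())


def build_command(executable, image_path, output_base, psm, configs):
    """Build the argument list in the same order as the command in the post."""
    return [executable, image_path, output_base, "-psm", str(psm), *configs]


# The variables from the 'qty' config file quoted in the post.
QTY_CONFIG = {
    "load_system_dawg": 0,
    "load_freq_dawg": 0,
    "load_punc_dawg": 0,
    "tessedit_char_whitelist": "0123456789-",
}

config_text = render_config(QTY_CONFIG)
# "tesseract", "test.bmp" and "tempout" are placeholder names (assumptions).
command = build_command("tesseract", "test.bmp", "tempout", 8, ["nobatch", "qty"])
print(config_text, end="")
print(" ".join(command))
```

The resulting list could be passed to `subprocess.run` once the config text has been saved where the installed Tesseract looks for config files.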

Павел Щербаков

Jun 5, 2015, 5:08:18 PM
to tesser...@googlegroups.com

I've run some more tests and found that the recognition problem occurs only when the minus sign is at the beginning or at the end of the word. A minus sign placed between digits is recognized just fine.
Here are some test results. To make the test images I used the MS Paint 'Draw Text' tool, Arial font, 26pt.

 
Test image 1


Box file:

4 35 1 59 25 0
0 65 1 82 25 0
0 84 1 101 25 0

Visualization of the box file:


Test image 2



Box file:

~ 40 8 50 12 0
2 51 1 68 26 0
0 70 1 87 26 0
0 89 1 106 26 0

Visualization:


Test image 3



Box file:

~ 35 8 45 12 0
3 46 1 63 26 0
0 65 1 82 26 0
0 84 1 101 26 0

Visualization:


Test image 4



Box file:

- 42 6 53 12 0
4 50 0 70 25 0
0 72 0 89 25 0
0 91 0 108 25 0

Visualization:


Test image 5



Box file:

~ 40 7 50 11 0
5 51 0 68 25 0
0 70 0 87 25 0
0 89 0 106 25 0

Visualization:


Test image 6



Box file:

4 39 1 60 26 0
5 55 1 67 26 0
0 69 1 86 26 0
0 88 1 105 26 0

Visualization:


Test image 7



Box file:

~ 43 7 53 11 0
7 54 0 71 25 0
0 73 0 90 25 0
0 92 0 109 25 0

Visualization:


Test image 8



Box file:

~ 38 7 48 11 0
3 49 0 66 25 0
0 68 0 85 25 0
0 87 0 104 25 0

Visualization:


Test image 9



Box file:

~ 39 7 49 11 0
9 50 0 67 25 0
0 69 0 86 25 0
0 88 0 105 25 0

Visualization:


Test image 10



Box file:

4 40 0 61 25 0
3 56 0 68 25 0
0 70 0 87 25 0
0 89 0 106 25 0

Visualization:


Note that only the 4th test image ('-400') was recognized correctly. That intrigued me, and after some experimenting I found that this specific case is extremely fragile: moving the minus sign by even one pixel causes incorrect recognition. So it was only by chance that it was recognized correctly. As an example, here are the recognition results for a modified picture with the minus sign moved one pixel to the left.


Test image 11



Box file:

~ 41 7 51 11 0
4 52 0 70 25 0
0 72 0 89 25 0
0 91 0 108 25 0

Visualization:


And, as a relief, here's an example of correct recognition. As I said before, a minus sign between digits is recognized perfectly every time.


Test image 12



Box file:

0 10 0 26 25 0
2 29 0 45 25 0
- 49 7 59 11 0
0 60 0 76 25 0
2 79 0 95 25 0
- 99 7 109 11 0
9 110 0 126 25 0
2 129 0 145 25 0

Visualization:



I've also tried other image formats (jpg, png, tif, even gif), but none of them gave me correct results.
I suppose that retraining is the only reliable way to fix this inaccuracy, but I would appreciate any opinion.
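
In the box files above, a misread leading minus usually shows up as a '~' whose box is only about 4 pixels tall next to digits about 25 pixels tall. One possible workaround, sketched in Python and not tested against Tesseract itself, is to post-process the box file and treat a short, non-digit leading glyph as a minus sign. The function names and the 0.35 height ratio are assumptions chosen for illustration.

```python
from statistics import median


def parse_box_file(text):
    """Parse box lines of the form 'char left bottom right top page'."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        left, bottom, right, top, page = map(int, parts[1:])
        boxes.append((parts[0], left, bottom, right, top, page))
    return boxes


def read_signed_number(boxes, max_height_ratio=0.35):
    """Join the box characters, mapping a short non-digit leading glyph to '-'."""
    if not boxes:
        return ""
    heights = [top - bottom for _, _, bottom, _, top, _ in boxes]
    typical = median(heights)  # typical glyph height in this word
    chars = [box[0] for box in boxes]
    # Assumed threshold: a leading glyph much shorter than the digits is a minus.
    if not chars[0].isdigit() and heights[0] <= typical * max_height_ratio:
        chars[0] = "-"
    return "".join(chars)


# Box file of test image 2 from the post.
TEST_IMAGE_2 = """\
~ 40 8 50 12 0
2 51 1 68 26 0
0 70 1 87 26 0
0 89 1 106 26 0
"""

print(read_signed_number(parse_box_file(TEST_IMAGE_2)))  # prints -200
```

This only helps when the minus sign gets its own box. In test images 6 and 10 the minus was merged into a digit (the first box is read as '4'), so the sketch would return '4500' and '4300' there.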

Павел Щербаков

Jun 7, 2015, 6:01:09 AM
to tesser...@googlegroups.com

I didn't go deep into retraining Tesseract-OCR because, at the moment, it's hard for me to collect enough images of the original quality to cover all possible variations. So in the end I solved my problem with a trick instead.

As things stand, positive and negative numbers in my data input are colored in different colors:



So, after converting them to grayscale, it can be seen that negative numbers are darker than positive numbers.



After that I find the brightest pixel in the image; if it's darker than some given threshold, I treat the number as negative. In that case I fill the left side of the image with the background color, so that the rest can be recognized by Tesseract-OCR correctly.
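
The detection and masking step described above can be sketched roughly as follows in plain Python. The image is modeled as a list of rows of grayscale values (0 is black, 255 is white) with light text on a dark background. The function names, the threshold of 160 and the number of blanked columns are assumptions for illustration, not the values actually used.

```python
def is_negative(gray, threshold):
    """Return True when even the brightest pixel is darker than the threshold."""
    brightest = max(max(row) for row in gray)  # assumes a non-empty image
    return brightest < threshold


def blank_left_columns(gray, columns, background):
    """Return a copy with the leftmost `columns` pixels of every row set to background."""
    return [[background] * min(columns, len(row)) + row[columns:] for row in gray]


# Toy 3x6 image: a dim "minus" in the two leftmost columns, then a "digit".
negative_image = [
    [0,   0,   0, 0,   120, 0],
    [120, 120, 0, 120, 120, 0],
    [0,   0,   0, 0,   120, 0],
]

if is_negative(negative_image, threshold=160):
    # Paint over the area where the minus sign sits before running OCR.
    cleaned = blank_left_columns(negative_image, columns=2, background=0)
else:
    cleaned = negative_image
```

Returning a copy instead of editing in place keeps the original pixels available in case the threshold needs tuning later.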



Before recognition I also do some more image processing to additionally increase recognition quality.


And finally, after recognition, I just prepend the minus sign to the result if the number was detected as negative.


'-100'

'100'
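
The final step is small enough to show in a few lines of Python. This is only a sketch: `restore_sign` is a hypothetical helper name, and stripping surrounding whitespace from the OCR output is an added assumption.

```python
def restore_sign(ocr_text, negative):
    """Prepend '-' to the recognized digits when the image was flagged as negative."""
    digits = ocr_text.strip()  # OCR output often ends with newlines
    if negative and digits:
        return "-" + digits
    return digits


print(restore_sign("100\n\n", negative=True))   # prints -100
print(restore_sign("100\n\n", negative=False))  # prints 100
```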


I know it's not a very elegant or foolproof solution, but at the moment it works fine for me and recognizes 100% of my test dataset. Still, I'd be glad to hear any thoughts about detecting a leading minus sign. Thank you for reading.


Regards, Pavel Shcherbakov.
