Tesseract is ignoring numbers

Anish Radhakrishnan Nair

Jun 3, 2015, 2:10:07 AM

to tesser...@googlegroups.com

I need to read text from screenshots of speed test results and extract the upload and download speeds. Most of the images I have tested are of very high quality, and I binarize them and correct skew where necessary, but accuracy is still only around 60%. The biggest issue is that, after preprocessing, some images in which the numbers are clearly distinguishable are still not read well. As an example, I have attached a preprocessed test image and the result of running Tesseract OCR on it.

The OCR result for this picture, as a single line, is:

000003 4G 15:41 4 83% - / OOKLA SPEEDTEST PWG DOWNLOAD UPLOAD 49 ms Mbps Mbps L,» SHARE ‘ ”‘ “\ ‘ 5M I” 1°“ \\ I 2M 20M ‘ I I 1M 0M | , ‘ ‘ ‘ 1,3,!Ht‘u‘z‘gssz‘:}::;\ ..;~,-. ~‘ ‘ ' 'mmW" 50 ,

Note how "Mbps" shows up but the numbers in front of it are completely ignored. How can I improve this result?
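
The failure described above can be detected automatically. The following is only a minimal sketch in Python: it assumes that a correct OCR result places each speed directly before "Mbps", and both the `find_speeds` helper and the `expected_text` sample are hypothetical, not part of the original setup.

```python
import re

# A decimal number immediately followed by the unit, e.g. "12.34 Mbps".
# Assumption: a correct OCR result puts the speed directly before "Mbps".
SPEED_RE = re.compile(r"(\d+(?:\.\d+)?)\s*Mbps", re.IGNORECASE)


def find_speeds(ocr_text):
    """Return every number that directly precedes 'Mbps' in the OCR text."""
    return [float(value) for value in SPEED_RE.findall(ocr_text)]


# Excerpt of the actual OCR output quoted above: the units survive,
# but no number precedes them.
ocr_excerpt = "DOWNLOAD UPLOAD 49 ms Mbps Mbps"

# Hypothetical text showing what a successful read might look like.
expected_text = "DOWNLOAD 12.34 Mbps UPLOAD 5.6 Mbps"

print(find_speeds(ocr_excerpt))    # []
print(find_speeds(expected_text))  # [12.34, 5.6]
```

An empty list flags an image where the digits were dropped, which makes it easier to measure whether a preprocessing change actually helps.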

Rick Leir

Jun 8, 2015, 12:03:13 PM

to tesser...@googlegroups.com

This is not really an answer, but I would experiment with a higher-resolution image, and maybe with masking the image using GraphicsMagick. The mask would cover the 'ms' label and both 'Mbps' labels. Good luck!
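
A rough sketch of how the two experiments could be scripted from Python. It only builds a GraphicsMagick command line; the `gm_preprocess_cmd` helper, the file names, and the mask coordinates are all hypothetical and would have to be measured on the real screenshots. The `-fill`, `-draw` and `-resize` options of `gm convert` are used here, with the mask drawn before the resize so that its coordinates refer to the original image.

```python
def gm_preprocess_cmd(src, dst, scale_percent=300, mask_boxes=()):
    """Build a `gm convert` command that paints white rectangles over the
    given regions and then enlarges the image.

    mask_boxes is a sequence of (x0, y0, x1, y1) pixel boxes expressed in
    the coordinates of the original, unscaled image.
    """
    cmd = ["gm", "convert", src]
    if mask_boxes:
        cmd += ["-fill", "white"]
    for x0, y0, x1, y1 in mask_boxes:
        cmd += ["-draw", f"rectangle {x0},{y0} {x1},{y1}"]
    cmd += ["-resize", f"{scale_percent}%", dst]
    return cmd


# Hypothetical file names and box coordinates, for illustration only.
cmd = gm_preprocess_cmd(
    "speedtest.png",
    "speedtest_masked.png",
    scale_percent=300,
    mask_boxes=[(300, 420, 380, 450)],
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once GraphicsMagick is installed, and the output image fed to Tesseract as before. White is used as the fill on the assumption that the binarized background is white; swap the color if the preprocessed image is inverted.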
