Tesseract is ignoring numbers


Anish Radhakrishnan Nair

Jun 3, 2015, 2:10:07 AM
to tesser...@googlegroups.com

I need to read text from screenshots of speed test results and extract the upload and download speeds. Most of the images I have tested are of very high quality, and I binarize them and correct skew where necessary, but accuracy is still only around 60%. The biggest issue is that, even after preprocessing, numbers that are clearly distinguishable in some images are not read well. As an example, I have attached a preprocessed test image and the result of running Tesseract OCR on it.


The result of running OCR on this picture, as a single line, is:
000003 4G 15:41 4 83% - / OOKLA SPEEDTEST PWG DOWNLOAD UPLOAD 49 ms Mbps Mbps L,» SHARE ‘ ”‘ “\ ‘ 5M I” 1°“ \\ I 2M 20M ‘ I I 1M 0M | , ‘ ‘ ‘ 1,3,!Ht‘u‘z‘gssz‘:}::;\ ..;~,-. ~‘ ‘ ' 'mmW" 50 ,

Note that "Mbps" is recognized but the numbers are ignored completely. How can I improve this result?

Rick Leir

Jun 8, 2015, 12:03:13 PM
to tesser...@googlegroups.com
This is not really an answer, but I would experiment with a higher-resolution image, and maybe with masking the image using GraphicsMagick. The mask would cover the 'ms', the first 'Mbps', and the second 'Mbps'. Good luck!
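
A rough sketch of that masking and upscaling idea, in Python, for anyone who wants to try it. It only builds the command lines and hands them to the `gm` and `tesseract` executables, which are assumed to be installed and on the PATH. The function names, the box coordinates, and the `masked.png` temp file are hypothetical. The digit whitelist is an extra assumption that was not part of the suggestion above, and some Tesseract versions may ignore it. Older 3.x releases spell the page segmentation flag `-psm` instead of `--psm`.

```python
import subprocess


def gm_mask_command(src, dst, boxes, scale_percent=200):
    """Build a GraphicsMagick command that paints white rectangles over the
    label regions (e.g. 'ms', 'Mbps') and then enlarges the image.

    boxes is a list of (x0, y0, x1, y1) pixel coordinates in the source image.
    """
    cmd = ["gm", "convert", src, "-fill", "white"]
    for x0, y0, x1, y1 in boxes:
        # Drawing happens before the resize, so coordinates refer to the
        # original image size.
        cmd += ["-draw", f"rectangle {x0},{y0} {x1},{y1}"]
    cmd += ["-resize", f"{scale_percent}%", dst]
    return cmd


def tesseract_command(image, psm=6, whitelist="0123456789."):
    """Build a Tesseract command that prints recognized text to stdout.

    The whitelist restricts output to digits and the decimal point; this is
    an assumption added for illustration, not something from the thread.
    """
    return [
        "tesseract", image, "stdout",
        "--psm", str(psm),
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


def ocr_numbers(src, boxes, tmp="masked.png"):
    """Mask and upscale src, then OCR the result and return the raw text."""
    subprocess.run(gm_mask_command(src, tmp, boxes), check=True)
    result = subprocess.run(
        tesseract_command(tmp), check=True, capture_output=True, text=True
    )
    return result.stdout
```

The box coordinates have to be found by hand for each screenshot layout, for example `ocr_numbers("speedtest.png", [(120, 300, 200, 340)])` with made-up values. If the layout is stable, cropping to just the number regions may work as well as masking.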