Tesseract mixing up digits in a low-res font

Michał Śmielak

Mar 28, 2021, 1:44:50 PM
to tesseract-ocr
Hi everyone, new user here.
I have an issue with Tesseract (run through its R package, if that makes any difference).
I am trying to read data from a camera trap photo - an example:
pic (1).jpg
The photo is low-res (640x480 px), but as you can see, the numbers are easily readable by a human. I managed to improve it a bit by cropping the relevant part of the picture, inverting the image, upscaling, etc., and I ended up with something like this:

date_49.jpg

You would think this is an easy thing to read, so I created a subset of these outputs, merged them into a TIFF, and manually fixed the automatically detected boxes. Still, Tesseract keeps misreading some digits. For instance, the date image above is read as 10/16/2015, while this time:
time_55.jpg
is read as 16:66:04.
I find this very weird, as these numbers are embedded into the photo by the camera trap itself and are very consistent. The size is always the same and the digits are identical, yet the same digit is read by the software in different ways, and sometimes not read at all. And 8 is always read as something else.
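
A reading like 16:66:04 can never be a real time, so one cheap safeguard is to parse every OCR result as a date and time and flag the ones that fail. Below is a minimal sketch in Python rather than R; the function name `parse_stamp` and the month/day/year order are my assumptions, not something the camera documents. It only catches impossible values, so a plausible misread (say, a 3 read as a 5) would still pass.

```python
from datetime import datetime

# Assumed layout of the stamp: MM/DD/YYYY and HH:MM:SS (24-hour clock).
STAMP_FORMAT = "%m/%d/%Y %H:%M:%S"

def parse_stamp(date_text, time_text):
    """Return a datetime if the OCR output is a valid date and time, else None."""
    combined = f"{date_text.strip()} {time_text.strip()}"
    try:
        return datetime.strptime(combined, STAMP_FORMAT)
    except ValueError:
        # Impossible values such as minute 66 or month 16 end up here.
        return None

if __name__ == "__main__":
    print(parse_stamp("10/16/2015", "16:66:04"))  # rejected: minute 66 is invalid
```

Photos whose stamp comes back as `None` can then be set aside for a second pass or a manual check.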

I would appreciate any advice on how to fix this. My training data was 140 dates and 140 times, and still, when I generated boxes (I used jTessBoxEditor for that), one image would be read fine and the next would be read as letters that are not even similar. Could the "pixelated" font be the issue? The digits are originally 8 px high.

Alternatively, can you advise me on a method to read these values correctly?
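
One alternative I can imagine, since the digits are always the same size and shape, is plain template matching: compare each cropped digit against one stored bitmap per digit and pick the closest. This is only a sketch of the idea in Python. The 3x5 bitmaps and the names `TEMPLATES`, `mismatches` and `classify` are made up for illustration; real templates would have to be cropped from the photos, and each digit would need to be segmented and binarised first.

```python
# Hypothetical 3x5 bitmaps for illustration only ('#' = ink, '.' = background).
# Real templates would be cropped from the camera trap photos themselves.
TEMPLATES = {
    "0": ("###", "#.#", "#.#", "#.#", "###"),
    "1": (".#.", "##.", ".#.", ".#.", "###"),
    "6": ("###", "#..", "###", "#.#", "###"),
    "8": ("###", "#.#", "###", "#.#", "###"),
}

def mismatches(a, b):
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(
        pixel_a != pixel_b
        for row_a, row_b in zip(a, b)
        for pixel_a, pixel_b in zip(row_a, row_b)
    )

def classify(glyph, templates=TEMPLATES):
    """Return the label of the template with the fewest differing pixels."""
    return min(templates, key=lambda label: mismatches(glyph, templates[label]))

if __name__ == "__main__":
    # An "8" with one corrupted pixel in the bottom-right corner.
    noisy_eight = ("###", "#.#", "###", "#.#", "##.")
    print(classify(noisy_eight))
```

Note that in this toy font a 6 and an 8 differ by a single pixel, which may be the same kind of ambiguity that trips up a general-purpose OCR model on an 8 px font.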

Thanks in advance everyone.
Michal


Michał Śmielak

Mar 29, 2021, 9:01:23 PM
to tesseract-ocr

OK, so if anyone is interested: I ended up creating a custom font based on the actual digits that I extracted from the photo, then used this custom font for training, and it worked 100%. It took me a couple of days of tweaking.
I described it in detail here: https://msmielak.github.io/post/2021-03-29-extracting-date-and-time-from-photo-using-ocr-engine-tesseract/