OCR for digits only - improving recognition of decimal point

Giorgos Papageorgiou

Jul 13, 2021, 3:38:31 AM
to tesseract-ocr
I am having trouble getting tesseract to recognise a column of numbers, which I naively assumed would be a straightforward problem. Most of the errors come from misrecognition of the decimal point: tesseract either skips it or mistakes it for a digit. I call tesseract 4.1.1 with the options "-c tessedit_char_whitelist=-.0123456789 --psm 4 -l eng --oem 2" and I want the output as a column of numbers in tabular form. After pre-processing, my image looks something like this:
[attached image: 20.jpg]
which is then recognised as:

2.565
2597
2.614
2528
2.441
2564
2.530
24479
2.601
2.601
2.569
24555
2.437
2.531
2.592
2.385
2.618
2.738
2.766
24473
2.624
2.611
2.749
2.730

I can't afford to lose decimal points, and their position is not fixed (so I can't drop "." or "-" from the list of allowed characters). Can someone advise whether this is a pre-processing issue or a tesseract issue, and how I could improve the OCR here?

Thanks

Zdenko Podobny

Jul 13, 2021, 3:51:23 AM
to tesser...@googlegroups.com
Use the legacy engine for this type of input:

tesseract digits.jpg - --oem 0
Estimating resolution as 769
2.565
2.597
2.614
2.528
2.441
2.564
2.530
'2.479
2.601
2.601
2.569
2.555

2.437
2.531
2.592
'2.385
2.618
2.738
2.766
2.473
2.624
2.611
2.749
2.730

Zdenko
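
Editor's sketch: for a batch of images, the command above could be driven from a script. This is a minimal, hedged example and not something posted in the thread. The helper names `build_tesseract_command` and `ocr_digits` are hypothetical; it assumes the `tesseract` executable is on PATH and that the installed `eng` traineddata includes the legacy model (the `tessdata_fast` and `tessdata_best` sets are LSTM-only, so `--oem 0` will not work with them).

```python
import shutil
import subprocess


def build_tesseract_command(image_path, oem=0, psm=None, whitelist=None):
    """Build an argv list for the tesseract CLI (hypothetical helper).

    The output base "-" tells tesseract to write the text to stdout.
    """
    cmd = ["tesseract", image_path, "-", "--oem", str(oem)]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    if whitelist:
        cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return cmd


def ocr_digits(image_path, **options):
    """Run tesseract on one image and return the recognised lines."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, **options),
        capture_output=True,  # diagnostics such as "Estimating resolution" go to stderr
        text=True,
        check=True,  # raise CalledProcessError if tesseract exits non-zero
    )
    return result.stdout.splitlines()
```

Calling `ocr_digits("digits.jpg")` would run the same command as in the reply above; passing `psm=4` and `whitelist="-.0123456789"` reproduces the options from the original question with the legacy engine instead of `--oem 2`.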



Giorgos Papageorgiou

Jul 14, 2021, 3:05:45 AM
to tesseract-ocr
Hi Zdenko,

Thanks for the suggestion. Although still not perfect, "--oem 0" produced the best results so far, and I was able to correct the 15 or so errors manually (this was one of five hundred images that needed digitising). I am still puzzled as to why these errors occur, but I guess it will have to do.

Thanks again

Giorgos
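
Editor's sketch: with five hundred images to correct by hand, it may help to flag suspect lines automatically before review. The example below is a hedged illustration, not a method from the thread. The function name `flag_suspect_lines` is hypothetical, and it assumes every valid value contains exactly one decimal point with digits on both sides and an optional leading minus sign. That holds for the sample outputs above but is an assumption about the rest of the data. It makes no assumption about where the decimal point falls.

```python
import re

# Assumption: a valid value is an optional "-", digits, ".", digits.
NUMBER = re.compile(r"-?\d+\.\d+")


def flag_suspect_lines(lines):
    """Return (line_number, text) pairs for lines that need manual review."""
    suspects = []
    for lineno, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue  # blank lines between blocks are not errors
        if not NUMBER.fullmatch(text):
            suspects.append((lineno, text))
    return suspects


sample = ["2.565", "2597", "'2.479", "", "2.437", "24479"]
for lineno, text in flag_suspect_lines(sample):
    print(f"line {lineno}: {text!r}")
```

This only catches a missing decimal point or a stray character such as the leading apostrophe; a digit misread as another digit still passes the check and needs a different safeguard.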
