OCR for digits only - improving recognition of decimal point

Giorgos Papageorgiou

Jul 13, 2021, 3:38:31 AM

to tesseract-ocr

I am having trouble getting tesseract to recognise a column of numbers, in what I naively assumed would be a straightforward problem. Most of the errors come from misrecognition of the decimal point: tesseract either skips it or mistakes it for a digit. I call tesseract 4.1.1 with the options " -c tessedit_char_whitelist=-.0123456789 --psm 4 -l eng --oem 2" and I want to get a column of numbers in tabular form. After pre-processing my image, I have something of the sort:

which is then recognised as:

2.565
2597
2.614
2528
2.441
2564
2.530
24479
2.601
2.569
24555
2.437
2.531
2.592
2.385
2.618
2.738
2.766
24473
2.624
2.611
2.749
2.730

I can't afford to lose decimal points, and their position follows no fixed pattern, so I can't drop "." or "-" from the list of allowed characters. Can someone advise whether this is a pre-processing issue or a tesseract issue, and how I could improve the OCR here?
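The symptom described above (a dropped decimal point, or one read as a digit) can at least be detected automatically. The following is a minimal sketch, not part of the original post: it assumes that every value in this particular column contains exactly one decimal point, and the names `VALUE_PATTERN` and `find_suspect_lines` are hypothetical.

```python
import re

# Assumption: each value is an optional minus sign, digits, exactly one
# decimal point, then more digits (for example "2.565" or "-0.5").
VALUE_PATTERN = re.compile(r"-?\d+\.\d+")


def find_suspect_lines(ocr_text):
    """Return (line_number, text) pairs that do not match VALUE_PATTERN."""
    suspects = []
    for number, line in enumerate(ocr_text.splitlines(), start=1):
        text = line.strip()
        # Blank lines are ignored; anything else must look like a decimal.
        if text and not VALUE_PATTERN.fullmatch(text):
            suspects.append((number, text))
    return suspects


# A few lines taken from the recognised output quoted above.
sample = "2.565\n2597\n2.614\n2528\n2.441\n24479\n"
print(find_suspect_lines(sample))
```

This only flags lines for review. It cannot tell where the missing decimal point belongs, and it would wrongly flag genuine integers if the column contained any.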

Thanks

Zdenko Podobny

Jul 13, 2021, 3:51:23 AM

to tesser...@googlegroups.com

Use the legacy engine for this type of input:

tesseract digits.jpg - --oem 0
Estimating resolution as 769
2.565
2.597
2.614
2.528
2.441
2.564
2.530
'2.479
2.601
2.601
2.569
2.555

2.437
2.531
2.592
'2.385
2.618
2.738
2.766

2.473
2.624
2.611
2.749
2.730

Zdenko
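For a batch of images, the command above could be wrapped in a small script. This is a rough sketch under stated assumptions: `build_tesseract_command` and `run_tesseract` are hypothetical helper names, and passing the page segmentation mode and character whitelist from the original question together with `--oem 0` is an assumption, since the command shown above uses `--oem 0` alone.

```python
import subprocess


def build_tesseract_command(image_path, oem=0, psm=None, lang=None,
                            whitelist=None):
    """Build the argument list for a tesseract call that prints to stdout."""
    # "-" as the output base makes tesseract write the text to stdout.
    command = ["tesseract", image_path, "-", "--oem", str(oem)]
    if psm is not None:
        command += ["--psm", str(psm)]
    if lang is not None:
        command += ["-l", lang]
    if whitelist is not None:
        command += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return command


def run_tesseract(image_path, **options):
    """Run tesseract and return the recognised text (stdout only)."""
    command = build_tesseract_command(image_path, **options)
    result = subprocess.run(command, capture_output=True, text=True,
                            check=True)
    return result.stdout


# With the defaults this reproduces the command shown above.
print(" ".join(build_tesseract_command("digits.jpg")))
```

Calling `run_tesseract("digits.jpg")` requires the `tesseract` binary on the PATH. As far as I know, `--oem 0` also needs a traineddata file that includes the legacy model, which the `tessdata_fast` and `tessdata_best` files do not.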

On Tue, Jul 13, 2021 at 9:38, Giorgos Papageorgiou <gpap...@gmail.com> wrote:

Giorgos Papageorgiou

Jul 14, 2021, 3:05:45 AM

to tesseract-ocr

Hi Zdenko,

Thanks for the suggestion. Although still not perfect, "--oem 0" produced the best results yet, and I was able to correct the 15 or so errors manually (this was one of five hundred images that needed digitising). I'm still puzzled as to why these errors occur, but I guess it will have to do.
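Part of the manual clean-up could perhaps be scripted. The legacy-engine output earlier in the thread contains stray apostrophes (for example '2.479) and blank lines, and the sketch below strips both. It is only an illustration: `clean_column` is a hypothetical name, and the allowed character set is assumed to be the whitelist from the original question.

```python
# Assumption: only these characters can legitimately appear in a value.
ALLOWED = set("-.0123456789")


def clean_column(ocr_text):
    """Drop characters outside ALLOWED and skip lines left empty."""
    values = []
    for line in ocr_text.splitlines():
        kept = "".join(ch for ch in line if ch in ALLOWED)
        if kept:
            values.append(kept)
    return values


# A few lines from the legacy-engine output, including stray apostrophes.
legacy_output = "2.565\n'2.479\n2.601\n\n2.437\n'2.385\n"
print(clean_column(legacy_output))
```

The values are kept as strings so that trailing zeros such as 2.530 survive. The input should be tesseract's stdout only, because a diagnostic line containing digits would otherwise be reduced to a bogus number. Misread digits and duplicated lines still need a manual check.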