Tesseract sometimes performs wrong auto-correction: how to disable it?


Youcef

Apr 25, 2018, 10:59:34 AM
to tesseract-ocr
Hi,


Tesseract seems to post-process its predictions.

Here is what I get after running OCR on images (same font, same size, all generated with text2image):

- an image containing "12345678I" => `123456781`
- an image containing "GLOTHUVFI" => `GLOTHUVFI`
- an image containing "12345678H" => `12345678H`
- an image containing "GLOTHUVFH" => `GLOTHUVFH`
- an image containing "12345678A" => `123456784`
- an image containing "GLOTHUVFA" => `GLOTHUVFA`
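
To make the pattern in these results explicit, here is a minimal sketch that compares each rendered string with its OCR output and reports the differing positions. The pairs are copied from the list above; the function names are only illustrative and not part of Tesseract.

```python
# Pairs of (text rendered into the image, text returned by Tesseract),
# copied from the results listed above.
RESULTS = [
    ("12345678I", "123456781"),
    ("GLOTHUVFI", "GLOTHUVFI"),
    ("12345678H", "12345678H"),
    ("GLOTHUVFH", "GLOTHUVFH"),
    ("12345678A", "123456784"),
    ("GLOTHUVFA", "GLOTHUVFA"),
]


def substitutions(expected, actual):
    """Return (index, expected_char, actual_char) for each differing position.

    Only handles equal-length strings, which is the case for every pair above.
    """
    if len(expected) != len(actual):
        raise ValueError("strings differ in length; a real diff would be needed")
    return [
        (i, e, a)
        for i, (e, a) in enumerate(zip(expected, actual))
        if e != a
    ]


def summarize(results):
    """Map each misread input string to its list of character substitutions."""
    return {
        expected: substitutions(expected, actual)
        for expected, actual in results
        if expected != actual
    }


for expected, subs in summarize(RESULTS).items():
    for index, want, got in subs:
        print(f"{expected}: position {index}: {want!r} read as {got!r}")
```

Only the two digit-prefixed strings ending in "I" and "A" are reported, each with a single substitution in the last position.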

It looks like Tesseract doesn't like a word made of several digits with a single letter at the end. If that letter looks like a digit ("I" and "A" look like "1" and "4" respectively), it replaces it with the closest digit.
I have tried tuning the following parameters, without any change in the result:

- segment_penalty_dict_frequent_word
- language_model_penalty_chartype
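
For anyone who wants to experiment with parameters like these, here is a minimal sketch of how overrides can be passed on the command line with `-c name=value`. It only builds the command; it does not run Tesseract. The file name `sample.png` and the helper name are hypothetical. `load_system_dawg` and `load_freq_dawg` are existing Tesseract parameters that control dictionary loading, but using them here is an assumption: nothing in this thread confirms that they change the behaviour described above.

```python
import shlex


def build_tesseract_command(image_path, overrides, lang="eng", psm=7):
    """Build a tesseract CLI call that prints the recognized text to stdout.

    `overrides` maps parameter names to values; each one becomes a
    `-c name=value` option. `--psm 7` treats the image as a single text line.
    """
    cmd = ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]
    for name, value in overrides.items():
        cmd += ["-c", f"{name}={value}"]
    return cmd


# Assumption: turning off the word-list dictionaries is one thing to try;
# it is not a confirmed fix for the substitutions reported in this thread.
NO_DICTIONARY = {"load_system_dawg": 0, "load_freq_dawg": 0}

# "sample.png" is a placeholder for one of the text2image outputs.
command = build_tesseract_command("sample.png", NO_DICTIONARY)
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, capture_output=True, text=True)` on a machine where Tesseract is installed, and the same helper accepts the two parameters listed above, so runs with different overrides can be compared.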

Thanks for any help.

Regards

ShreeDevi Kumar

Apr 25, 2018, 12:49:22 PM
to tesser...@googlegroups.com
Which version of tesseract are you using?

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/4722674d-27a1-4b8e-8c5a-9e07dbe3ca7d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Youcef

Apr 26, 2018, 4:05:30 AM
to tesseract-ocr

I'm using the master branch with the tessdata_fast models.

Message has been deleted

shree

Apr 30, 2018, 10:20:44 AM
to tesseract-ocr

Youcef

May 3, 2018, 5:03:59 AM
to tesseract-ocr
Hi Shree,

Thank you

Clark Knøsen

May 6, 2018, 4:29:41 AM
to tesseract-ocr
I see the same behavior with Tesseract 4.0, installed with the tessdata_best traineddata from this repo:

# printf "deb https://notesalexp.org/tesseract-ocr/$(lsb_release -sc)/ $(lsb_release -sc) main\ndeb https://notesalexp.org/tesseract-ocr/tessdata_best/ stretch main\n" >> /etc/apt/sources.list

ilochray

Dec 17, 2018, 11:10:06 AM
to tesseract-ocr
I am experiencing the same issue.  Did you ever find a resolution for this?