German "Straße" is often "StraBe" (tesseract 4.0)

Thomas Güttler

May 24, 2018, 7:35:55 AM
to tesseract-ocr
I use tesseract 4.0 via Docker (tesseractshadow/tesseract4re).

Very often tesseract detects "StraBe" instead of "Straße".

Yes, I use -l=deu

The word "Straße" is very common in German. It means "street".

Since "StraBe" makes no sense I would like to improve this.

What do you suggest?


shree

May 24, 2018, 7:41:30 AM
to tesseract-ocr
Please try with script/Latin traineddata to see if you get better results.

I have added your comment to the issue at https://github.com/tesseract-ocr/langdata/pull/54

Greg Dunkel

May 24, 2018, 9:39:36 AM
to tesser...@googlegroups.com
A work-around could be easily implemented with a sed script.
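
For illustration, here is a minimal sketch of such a post-processing step, written in Python rather than sed. It rests on an assumption that is not part of tesseract: a capital "B" sitting directly between two lowercase letters is taken to be a misread "ß". The function name `fix_eszett` is hypothetical. The heuristic misses a word-final "ß" (as in "Fuß") and would wrongly change camel-case names, so treat it as a stopgap only.

```python
import re

# Assumption: a capital "B" directly between two lowercase letters
# is an OCR misreading of "ß" (e.g. "StraBe" -> "Straße").
_MISREAD_ESZETT = re.compile(r"(?<=[a-zäöü])B(?=[a-zäöü])")


def fix_eszett(text: str) -> str:
    """Replace a mid-word capital B with ß in OCR output."""
    return _MISREAD_ESZETT.sub("ß", text)


if __name__ == "__main__":
    print(fix_eszett("HauptstraBe 5, Berlin"))
```

A "B" at the start of a word or inside an all-caps word is left alone, because the pattern requires a lowercase letter on both sides.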

Thomas Güttler

May 25, 2018, 5:37:39 AM
to tesseract-ocr
On Thursday, May 24, 2018 at 15:39:36 UTC+2, gdunkel wrote:
A work-around could be easily implemented with a sed script.

Yes, you could use sed, awk, or Python for post-processing. I know how to use regexes.

I would like to improve tesseract, since this helps several people, not just me.


Thomas Güttler

May 25, 2018, 6:02:29 AM
to tesseract-ocr
Hi Shree,

What do you mean by "script/Latin traineddata"? I am new to tesseract and use version 4.0 via Docker.
Most internet pages are about tesseract 3.0.x.

I am unsure where to start.

Maybe it is better to use 3.0.x?

Regards,
  Thomas

Quan Nguyen

May 27, 2018, 10:10:38 AM
to tesseract-ocr

Thomas Güttler

May 30, 2018, 6:22:19 AM
to tesseract-ocr
I found the root cause of the problem: the correct option is "-l deu", but I used "-l=deu".
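
For anyone calling tesseract from a script, a small sketch of how to pass the language as two separate arguments, so the "-l=deu" form cannot creep in. The helper names `build_tesseract_command` and `run_ocr` and the file name `scan.png` are hypothetical; the sketch assumes the `tesseract` binary is on the PATH and the German traineddata is installed.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="deu"):
    """Build the argument list: tesseract IMAGE OUTPUTBASE -l LANG."""
    # "-l" and the language code are separate list items, i.e. "-l deu".
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="deu"):
    """Run tesseract; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)


if __name__ == "__main__":
    # Hypothetical input file; with "stdout" as output base the text is printed.
    print(" ".join(build_tesseract_command("scan.png", "stdout")))
```

Passing the arguments as a list avoids shell quoting issues and keeps the option and its value clearly separated.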

