German mutated vowel/umlaut ü/Ü

Stefan Greiner

Feb 12, 2016, 2:11:56 AM

to tesseract-ocr

Tesseract 3.04 (953523b)

Since switching to 3.04 with the current German language file "deu.traineddata", lowercase ü is always recognised as uppercase Ü.

Are there any parameters to fix this? The other characters are recognised properly.

An example source screenshot is attached.

Tesseract_screen_002.jpg

Marco Atzeri

Feb 12, 2016, 2:39:03 AM

to tesser...@googlegroups.com

On my Cygwin build, Tesseract 3.04 recognises it correctly.

tesseract 3.04.00
leptonica-1.72
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.6.20 :
libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3

See attached output.

Which system are you using?

Marco

umlaut.txt

umlaut.png

Stefan Greiner

Feb 12, 2016, 4:37:10 AM

to tesseract-ocr

I'm using the Tess4J API
http://tess4j.sourceforge.net/docs/index.html

Version 3.0 - 25 December 2015:

Upgrade to Tesseract 3.04 (953523b)
Include Lept4J library
Incorporate slf4j and logback libraries for logging
Make GhostScript calls thread safe

The system is Windows 10 Pro 64-bit (German)
Intel Xeon E5-1630 v3, 3.70 GHz
32 GB RAM

I'll recheck the config parameters.

Stefan Greiner

Aug 27, 2016, 6:17:56 AM

to tesseract-ocr

Still getting the same problem.

OCR-Text:

VERKEHR

Um 14 Uhr sperrte die Nord-
SÜd-Verbindung (A9) wieder
auf.

Nach dem Brand musste ein
30 Tonnen schwerer Teil der
Zwischendecke abgetragen
und neu betoniert werden.

1472148887349_imageProcessedWithMarks.png

Stefan Greiner

Aug 27, 2016, 6:19:22 AM

to tesseract-ocr

Now running Tess4J version 3.2.1 (29 May 2016).
The system is the same.

Stefan Greiner

Aug 27, 2016, 6:23:21 AM

to tesseract-ocr

Another example:

OCR-Text:

FRANKREICH

Frankreichs Oberstes Verwaltungsgericht
urteilt Über das Burkini-Verbot.

1472155035999_imageProcessedWithMarks.png

Quan Nguyen

Aug 27, 2016, 9:56:40 AM

to tesseract-ocr

I tried the command line and got a similar result. You may want to apply post-corrections (with a regex, perhaps) to compensate, if possible.

Stefan Greiner

Aug 27, 2016, 2:30:11 PM

to tesseract-ocr

I'll try to add a spell checker with a grammar function. Maybe that will help.

I was thinking about a regex, but I didn't find a good rule.

Thank you for your input.
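
As a rough illustration of the spell-checker idea, here is a minimal Python sketch of a word-list lookup. Everything in it is an assumption made for the example: `KNOWN_WORDS` is a tiny hypothetical word list (a real one would be loaded from a German dictionary), and `fix_word` / `fix_text` are made-up helper names, not part of Tesseract or Tess4J. A word containing Ü is only changed when the word itself is unknown and its ü variant is known.

```python
import re

# Hypothetical mini word list, for illustration only.
# A real word list would be loaded from a German dictionary file.
KNOWN_WORDS = {"Süd", "über", "Übung", "Verbindung"}

# A run of Unicode letters: word characters minus digits and underscore.
WORD = re.compile(r"[^\W\d_]+")


def fix_word(word, known=KNOWN_WORDS):
    """Return the ü variant of a word if only that variant is a known word."""
    if "Ü" not in word or word in known:
        return word  # nothing to fix, or the word is already valid
    candidate = word.replace("Ü", "ü")
    return candidate if candidate in known else word


def fix_text(text, known=KNOWN_WORDS):
    """Apply fix_word to every run of letters in the OCR text."""
    return WORD.sub(lambda m: fix_word(m.group(0), known), text)


print(fix_text("Nord-SÜd-Verbindung"))  # -> Nord-Süd-Verbindung
print(fix_text("urteilt Über das"))     # -> urteilt über das
print(fix_text("Übung"))                # -> Übung (known word, left alone)
```

One limitation of this sketch: it has no notion of sentence position, so a legitimate "Über" at the start of a sentence would also be lower-cased unless the caller handles that case separately.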

Quan Nguyen

Aug 27, 2016, 8:55:49 PM

to tesseract-ocr

You can be more confident in certain scenarios. Take the case of "SÜd": if the letter Ü is immediately surrounded by lower-case letters, it is most likely lower-case itself. A regex can be applied in that particular context.
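
A minimal Python sketch of such a context rule follows. It is one possible reading of the suggestion, not a tested fix: it treats a capital Ü as a misread ü only when it sits inside a word (a letter precedes it) and a lower-case letter follows it, which covers "SÜd". The function name `fix_midword_umlaut` is made up for this example.

```python
import re

# Capital Ü that follows a letter inside a word and precedes a lower-case letter.
# [^\W\d_] matches any Unicode letter; [a-zäöüß] lists the German lower-case letters.
MIDWORD_CAPITAL_UE = re.compile(r"(?<=[^\W\d_])Ü(?=[a-zäöüß])")


def fix_midword_umlaut(text: str) -> str:
    """Replace a mid-word capital Ü with ü when a lower-case letter follows it."""
    return MIDWORD_CAPITAL_UE.sub("ü", text)


print(fix_midword_umlaut("SÜd-Verbindung (A9)"))       # -> Süd-Verbindung (A9)
print(fix_midword_umlaut("urteilt Über das Verbot."))  # unchanged: Ü starts the word
print(fix_midword_umlaut("MÜNCHEN"))                   # unchanged: all-caps word
```

A word-initial Ü, as in "urteilt Über das", is deliberately left alone here, because capitalised nouns and sentence starts legitimately begin with Ü; that case needs a dictionary or sentence-position check rather than a pure regex.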

Stefan Greiner

Aug 29, 2016, 6:23:39 AM

to tesseract-ocr

Good input. Thank you very much.
