German mutated vowel/umlaut ü/Ü


Stefan Greiner

Feb 12, 2016, 2:11:56 AM
to tesseract-ocr
Tesseract 3.04 (953523b)

Since switching to 3.04 with the current German language file "deu.traineddata", every lowercase ü is recognised as an uppercase Ü.

Are there any parameters to fix this? All other characters are recognised properly.

An example source screenshot is attached.
Tesseract_screen_002.jpg

Marco Atzeri

Feb 12, 2016, 2:39:03 AM
to tesser...@googlegroups.com
On my Cygwin build, Tesseract 3.04 recognises it correctly.

tesseract 3.04.00
leptonica-1.72
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.6.20 :
libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3

See attached output.


Which system are you using?

Marco
umlaut.txt
umlaut.png

Stefan Greiner

Feb 12, 2016, 4:37:10 AM
to tesseract-ocr
I'm using the Tess4J API:
http://tess4j.sourceforge.net/docs/index.html

Version 3.0 - 25 December 2015:

  • Upgrade to Tesseract 3.04 (953523b)
  • Include Lept4J library
  • Incorporate slf4j and logback libraries for logging
  • Make GhostScript calls thread safe

System: Windows 10 Pro 64-bit (German)
Intel Xeon E5-1630 v3, 3.70 GHz
32 GB RAM

I'll recheck the config parameters.

Stefan Greiner

Aug 27, 2016, 6:17:56 AM
to tesseract-ocr

Still getting the same problem.

OCR-Text:

VERKEHR

Um 14 Uhr sperrte die Nord-
SÜd-Verbindung (A9) wieder
auf.

Nach dem Brand musste ein
30 Tonnen schwerer Teil der
Zwischendecke abgetragen
und neu betoniert werden.
1472148887349_imageProcessedWithMarks.png

Stefan Greiner

Aug 27, 2016, 6:19:22 AM
to tesseract-ocr

Running on Tess4J version 3.2.1 (29 May 2016).
The system is the same.

Stefan Greiner

Aug 27, 2016, 6:23:21 AM
to tesseract-ocr
Another example:

OCR-Text:

FRANKREICH

Frankreichs Oberstes Verwaltungsgericht
urteilt Über das Burkini-Verbot.
1472155035999_imageProcessedWithMarks.png

Quan Nguyen

Aug 27, 2016, 9:56:40 AM
to tesseract-ocr
I tried the command line and got a similar result. You may want to apply post-corrections (with a regex, perhaps) to compensate, if possible.

Stefan Greiner

Aug 27, 2016, 2:30:11 PM
to tesseract-ocr
I'll try adding a spell checker with a grammar function. Maybe that will help.

I was thinking about a regex, but I didn't find a good rule.

Thank you for your input.

Quan Nguyen

Aug 27, 2016, 8:55:49 PM
to tesseract-ocr
You can be more confident in certain scenarios. Take the case of "SÜd": if the letter Ü sits inside a word, directly next to lower-case letters, it was most likely meant to be lower-case. A regex can be applied in that particular context.
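
A minimal sketch of such a context rule, written in Python for brevity (the same lookbehind/lookahead pattern should port to java.util.regex for Tess4J). The function name `fix_umlaut_case` and the exact rule are assumptions for illustration: an Ü is lowered only when it follows another letter and precedes a lower-case letter. A word-initial Ü, as in "Über", is deliberately left alone, because a regex cannot tell a misread "über" from a correctly capitalised noun or sentence start.

```python
import re
import unicodedata

# Letters that may surround the Ü, including the German umlauts and ß.
_LETTER = "A-Za-zÄÖÜäöüß"
_LOWER = "a-zäöüß"

# Assumed rule: an Ü that follows any letter and is followed by a
# lower-case letter sits inside a mixed-case word, so it is treated as
# a misrecognised ü. Word-initial Ü and all-caps words never match.
_MISREAD_UMLAUT = re.compile(f"(?<=[{_LETTER}])Ü(?=[{_LOWER}])")


def fix_umlaut_case(text: str) -> str:
    """Lower-case an Ü that appears inside a mixed-case word."""
    # Normalise to NFC so that "U" + combining diaeresis becomes one "Ü".
    text = unicodedata.normalize("NFC", text)
    return _MISREAD_UMLAUT.sub("ü", text)


if __name__ == "__main__":
    print(fix_umlaut_case("Um 14 Uhr sperrte die Nord-\nSÜd-Verbindung (A9) wieder auf."))
    print(fix_umlaut_case("urteilt Über das Burkini-Verbot."))
```

With this rule, "SÜd-Verbindung" is corrected, while "urteilt Über das Burkini-Verbot." passes through unchanged and would still need a dictionary or spell checker to resolve.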

Stefan Greiner

Aug 29, 2016, 6:23:39 AM
to tesseract-ocr
Good input. Thank you very much.
