German special characters

Stefan Greiner

Aug 24, 2016, 1:08:38 PM

to tesseract-ocr

When doing OCR with Tesseract via the latest Tess4J API and deu.traineddata, the dots of Ü, Ö and Ä are often picked up during recognition of the line above.

Adding example 1471881362044_imageProcessedWithMarks.png
The OCR is done in the red rectangles.

OCR-Output:

I_D_as Ergebnis der Online-Umfrage der
Ärztekammer wurde heute veröffent

Does anybody have an idea how to fix it?

Thank you in advance.

1471881362044_imageProcessedWithMarks.png

Stefan Greiner

Aug 24, 2016, 1:15:38 PM

to tesseract-ocr

Another example:

ocr-text:

SKI ALPIN
Verletzu_nggspech
bei den DSV-Damen

1471881870054_imageProcessedWithMarks.png

Stefan Greiner

Aug 27, 2016, 6:08:04 AM

to tesseract-ocr

One more example

OCR-Text:

l_-'_Ür die Helfer wird es immer schwieriger,
Überlebende zu finden.

1472134382041_imageProcessedWithMarks.png

Stefan Greiner

Aug 27, 2016, 6:09:18 AM

to tesseract-ocr

Does anybody have an idea what I could try to fix this or get better results?

Quan Nguyen

Aug 27, 2016, 10:03:59 AM

to tesseract-ocr

If the stock language data proves inadequate for your requirements, you may want to consider training Tesseract.

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract

Stefan Greiner

Aug 27, 2016, 2:27:08 PM

to tesseract-ocr

I was hoping that there is a parameter to limit the line height.
Something like: anything more than x pixels or x % below the baseline is not part of the line.
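
If no such parameter turns up, one workaround might be to shrink the rectangle itself before it is handed to the OCR call. The following is only a minimal sketch of that idea, not a Tesseract or Tess4J feature: `LineBoxTrimmer`, `trimBottom`, the 15 % fraction and the rectangle coordinates are hypothetical and chosen purely for illustration.

```java
import java.awt.Rectangle;

// Hypothetical helper: cuts a fraction off the bottom of a line rectangle so
// that umlaut dots belonging to the next line fall outside the OCR region.
final class LineBoxTrimmer {

    // Returns a copy of 'box' whose height is reduced by 'fraction' (0 <= fraction < 1).
    // The top edge, left edge and width are left unchanged.
    static Rectangle trimBottom(Rectangle box, double fraction) {
        if (fraction < 0.0 || fraction >= 1.0) {
            throw new IllegalArgumentException("fraction must be in [0, 1)");
        }
        int cut = (int) Math.round(box.height * fraction);
        int newHeight = Math.max(1, box.height - cut); // never collapse to zero height
        return new Rectangle(box.x, box.y, box.width, newHeight);
    }

    public static void main(String[] args) {
        // Made-up coordinates standing in for one of the red rectangles.
        Rectangle line = new Rectangle(10, 100, 400, 40);
        Rectangle trimmed = trimBottom(line, 0.15);
        System.out.println("trimmed height: " + trimmed.height); // prints "trimmed height: 34"
    }
}
```

The trimmed rectangle could then be passed to Tess4J in place of the original one, for example through `doOCR(BufferedImage, Rectangle)`. Cutting too much would clip descenders such as g or p, so the fraction would have to be tuned to the font and line spacing.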

Does anybody know what this parameter does?

tessedit_pageseg_mode

Default: 5

Values: Page seg mode: 0=osd only, 1=auto+osd, 2=auto, 3=col, 4=block, 5=line, 6=word, 7=char (Values from PageSegMode enum in publictypes.h)
