German special characters

98 views
Skip to first unread message

Stefan Greiner

unread,
Aug 24, 2016, 1:08:38 PM8/24/16
to tesseract-ocr
When doing OCR via Teeract with the latest Tess4j Api using deu.traineddata. Often the dots from Ü Ö Ä are used in the validation oft the line above.

Adding example 1471881362044_imageProcessedWithMarks.png
The OCR is done in the red rectangles.

OCR-Output:

I_D_as Ergebnis der Online-Umfrage der
Ärztekammer wurde heute veröffent

Anybody an idea to fix it?

Thany you in advance.
1471881362044_imageProcessedWithMarks.png

Stefan Greiner

unread,
Aug 24, 2016, 1:15:38 PM8/24/16
to tesseract-ocr
Another example:

ocr-text:

SKI ALPIN
Verletzu_nggspech
bei den DSV-Damen

1471881870054_imageProcessedWithMarks.png

Stefan Greiner

unread,
Aug 27, 2016, 6:08:04 AM8/27/16
to tesseract-ocr
One more example

OCR-Text:

l_-'_Ür die Helfer wird es immer schwieriger,
Überlebende zu finden.
1472134382041_imageProcessedWithMarks.png

Stefan Greiner

unread,
Aug 27, 2016, 6:09:18 AM8/27/16
to tesseract-ocr

Has somebody an idea what I could try to fix it or get better results?

Quan Nguyen

unread,
Aug 27, 2016, 10:03:59 AM8/27/16
to tesseract-ocr
If the stock language data proves not adequate to you requirements, you may want to consider training Tesseract.

Stefan Greiner

unread,
Aug 27, 2016, 2:27:08 PM8/27/16
to tesseract-ocr
I was hoping that there's a paramter to limit line size.
Something like xy pixel or % below the baseline it isn't part of the line.

Does anybody know what the parameter does?

tessedit_pageseg_mode

Standard: 5

Values: Page seg mode: 0=osd only, 1=auto+osd, 2=auto, 3=col, 4=block, 5=line, 6=word, 7=char (Values from PageSegMode enum in publictypes.h)
Reply all
Reply to author
Forward
0 new messages