Block detection in document header

Peter

Aug 12, 2018, 11:05:11 AM

to tesseract-ocr

Hi everyone,

does anyone know a way to make tesseract treat large horizontal distances as separators between blocks?

Consider the attached document (it's just a random example from the web; tesseract shows the same behavior on similar documents). Tesseract consistently fails to recognize the two separate blocks in the header and instead reads the words line by line, straight across both blocks.

The output then looks like this:
COUR EUROPEENNE EUROPEAN COURT
des of
DROITS DE L’HOMME HUMAN RIGHTS

Where it should clearly look like this:
COUR EUROPEENNE
des
DROITS DE L’HOMME

EUROPEAN COURT
of
HUMAN RIGHTS

Looking at the detected blocks (see blocks.png), it becomes clear that tesseract does not recognize the two header blocks as separate, even though they are clearly distinguishable by eye.

Is there a way to tweak tesseract's block/paragraph detection to be more sensitive to this and correctly separate the header blocks?
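
To make the expected grouping concrete, here is a minimal Python sketch of the behavior described above, done as post-processing on word bounding boxes rather than inside tesseract. It is not a tesseract option or API. It assumes word boxes are already available (tesseract's TSV output has left, top, width and height columns for each word), and the names Word, split_columns and column_text, the min_gap and line_tol thresholds, and all pixel coordinates below are made up for illustration.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    left: int
    top: int
    width: int
    height: int


def split_columns(words, min_gap):
    """Group words into columns separated by a horizontal gap of at least min_gap pixels."""
    columns = []
    right_edge = None  # rightmost x covered by the current column so far
    for w in sorted(words, key=lambda w: w.left):
        if right_edge is None or w.left - right_edge >= min_gap:
            columns.append([])  # gap is wide enough: start a new column
            right_edge = w.left + w.width
        else:
            right_edge = max(right_edge, w.left + w.width)
        columns[-1].append(w)
    return columns


def column_text(column, line_tol=10):
    """Rebuild the text of one column; words whose tops differ by <= line_tol share a line."""
    lines = []
    for w in sorted(column, key=lambda w: (w.top, w.left)):
        if lines and abs(w.top - lines[-1][0].top) <= line_tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    return "\n".join(
        " ".join(w.text for w in sorted(line, key=lambda w: w.left)) for line in lines
    )


# Hypothetical coordinates that mimic the header layout; not measured from the image.
WORDS = [
    Word("COUR", 100, 50, 100, 30), Word("EUROPEENNE", 220, 50, 230, 30),
    Word("EUROPEAN", 800, 50, 200, 30), Word("COURT", 1020, 50, 130, 30),
    Word("des", 250, 100, 50, 30), Word("of", 950, 100, 40, 30),
    Word("DROITS", 100, 150, 120, 30), Word("DE", 240, 150, 40, 30),
    Word("L'HOMME", 300, 150, 150, 30),
    Word("HUMAN", 850, 150, 130, 30), Word("RIGHTS", 1000, 150, 130, 30),
]

columns = split_columns(WORDS, min_gap=100)
result = "\n\n".join(column_text(c) for c in columns)
print(result)
```

The sketch only splits on gaps that stay empty over the full height of the word set passed in, so it would have to be applied to the header region alone, not to the whole page.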

This problem has been haunting me for a while now. Tesseract is such a powerful tool, and does such a great job on far more complex tasks, that I just cannot accept that it can't get this right.

Thanks in advance for your help,

best,

Peter

PS:
Find below the version I'm using. I do not think this is a version-specific problem, though; the issue is the same with version 3.
tesseract 4.0.0-beta.3-199-gba757
leptonica-1.76.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
Found SSE

CouncilOfEurope-12July2005-1.jpg

blocks.png