Hi everyone,
does anyone know a way to make tesseract treat large horizontal gaps as separators between blocks?
Consider the attached document (it's just a random example from the web; tesseract shows the same behavior on similar documents). Tesseract consistently fails to recognize the two separate blocks in the header and instead reads the words line by line, straight across both blocks.
The output then looks like this:
COUR EUROPEENNE EUROPEAN COURT
des of
DROITS DE L’HOMME HUMAN RIGHTS
It should instead look like this:
COUR EUROPEENNE
des
DROITS DE L’HOMME
EUROPEAN COURT
of
HUMAN RIGHTS
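To make the expected behavior concrete, here is a minimal Python sketch of the kind of gap-based splitting I have in mind. It does not call tesseract and is not meant to describe how tesseract works internally. The word coordinates, the `gap_threshold` value and the function names are all made up for illustration; in practice the boxes would have to come from some word-level OCR output.

```python
from typing import Dict, List, Tuple

# (text, left, top, width) in pixels. All values are hypothetical.
Word = Tuple[str, int, int, int]


def split_into_columns(words: List[Word], gap_threshold: int) -> List[List[Word]]:
    """Sweep words from left to right and start a new column whenever the
    horizontal gap to everything seen so far exceeds gap_threshold."""
    columns: List[List[Word]] = []
    right_edge = None  # rightmost pixel covered by the current column
    for word in sorted(words, key=lambda w: w[1]):
        _, left, _, width = word
        if right_edge is None or left - right_edge > gap_threshold:
            columns.append([])
            right_edge = left + width
        else:
            right_edge = max(right_edge, left + width)
        columns[-1].append(word)
    return columns


def read_columns(words: List[Word], gap_threshold: int) -> List[str]:
    """Return text lines column by column instead of straight across the page."""
    lines: List[str] = []
    for column in split_into_columns(words, gap_threshold):
        rows: Dict[int, List[str]] = {}
        # Words sharing the same top coordinate are treated as one text line.
        for text, _left, top, _width in sorted(column, key=lambda w: (w[2], w[1])):
            rows.setdefault(top, []).append(text)
        lines.extend(" ".join(row) for row in rows.values())
    return lines


# Made-up boxes that roughly mimic the two header blocks.
header_words: List[Word] = [
    ("COUR", 0, 0, 60), ("EUROPEENNE", 70, 0, 130),
    ("EUROPEAN", 400, 0, 110), ("COURT", 520, 0, 70),
    ("des", 85, 30, 30), ("of", 485, 30, 20),
    ("DROITS", 0, 60, 70), ("DE", 80, 60, 25), ("L’HOMME", 115, 60, 85),
    ("HUMAN", 420, 60, 75), ("RIGHTS", 505, 60, 75),
]

for line in read_columns(header_words, gap_threshold=100):
    print(line)
```

With these invented boxes the left block is read completely before the right one, which is the order shown above. A real page would need a tolerance when grouping words into lines, since the top coordinates would not match exactly.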
Looking at the blocks, it becomes clear that tesseract does not recognize the two header blocks as separate, even though they are clearly distinguishable.
Is there a way to tweak tesseract's block/paragraph detection to be more sensitive to this and correctly separate the header blocks?
This problem has been haunting me for a while now. Tesseract is such a powerful tool and does such a great job with far more complex tasks that I just cannot accept that it can't get this right.
Thanks in advance for your help,
best,
Peter
PS:
Find below the version I'm using. I do not think this is a version problem, though, since the issue is the same with version 3.
tesseract 4.0.0-beta.3-199-gba757
leptonica-1.76.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
Found SSE