Problems with PDF output from tesseract

che

Mar 24, 2020, 10:19:52 AM
to tesseract-ocr
Hello,

I am using the following versions of the software:

 tesseract 4.0.0
 leptonica-1.76.0
 libjpeg 9c : libpng 1.6.37 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.1 : libopenjp2 2.3.0
 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found SSE

I am trying to convert a .tif into a PDF from within a Python script:

pdf = pytesseract.image_to_pdf_or_hocr(result, lang='deu+tur+kur', extension='pdf', config='--psm 6')

The text layer "underneath" the image looks like this (extracted with pdftotext -layout xyz.pdf):

...
Nach langer Abstinenz ist Apple fulminan              t  au f  de n  Mo ni  to rm ar  kt   zu rü ck ge ke hr t:  Da s  Pr o
...

If I use "pure" text conversion instead:

text = pytesseract.image_to_string(result, lang='deu+tur+kur', config='--psm 6')

The output is correct (it matches the .tif):

...
Nach langer Abstinenz ist Apple fulminant auf den Monitormarkt zurückgekehrt: Das Pro
...

The text is needed for search operations, so the extra whitespace is quite annoying.
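
As far as I can tell, the two results differ only in whitespace, not in the recognized characters. Here is a minimal sketch of that check, using the two lines quoted above. The helper name strip_whitespace is my own and not part of pytesseract, and comparing whitespace-free strings is only a diagnostic (or at best a crude stopgap for searching), not a fix for the PDF text layer:

```python
import re


def strip_whitespace(text):
    """Remove every whitespace character (spaces, tabs, newlines) from text."""
    return re.sub(r"\s+", "", text)


# Line as extracted from the PDF text layer with pdftotext -layout
pdf_line = (
    "Nach langer Abstinenz ist Apple fulminan              t  au f  de n  "
    "Mo ni  to rm ar  kt   zu rü ck ge ke hr t:  Da s  Pr o"
)

# Same line as returned by pytesseract.image_to_string
txt_line = (
    "Nach langer Abstinenz ist Apple fulminant auf den Monitormarkt "
    "zurückgekehrt: Das Pro"
)

# If only the spacing differs, both collapse to the same string.
same_characters = strip_whitespace(pdf_line) == strip_whitespace(txt_line)
print(same_characters)
```

For these two lines the comparison holds, so the characters themselves are recognized identically in both modes; only the word spacing in the PDF output is broken.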

Is this a fault in tesseract, or did I do something wrong?

Thanks in advance!