Problems with PDF output from tesseract

che

Mar 24, 2020, 10:19:52 AM

to tesseract-ocr

Hello,

I am using the following version of the software:

tesseract 4.0.0
leptonica-1.76.0
libjpeg 9c : libpng 1.6.37 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.1 : libopenjp2 2.3.0
Found AVX512BW
Found AVX512F
Found AVX2
Found AVX
Found SSE

I am trying to convert a .tif into a PDF from within a Python script:

pdf = pytesseract.image_to_pdf_or_hocr(result, lang='deu+tur+kur', extension='pdf', config='--psm 6')

The text layer "underneath" the picture comes out as follows (extracted with pdftotext -layout xyz.pdf):

...

Nach langer Abstinenz ist Apple fulminan t au f de n Mo ni to rm ar kt zu rü ck ge ke hr t: Da s Pr o

...
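A minimal sketch of the workflow described above, for anyone who wants to reproduce it. The helper names (`ocr_to_pdf`, `write_pdf`, `pdftotext_command`, `extract_text_layer`) are hypothetical and only for illustration. It assumes pytesseract is installed and that `pdftotext` from poppler is on the PATH.

```python
import subprocess
from pathlib import Path


def write_pdf(pdf_bytes, pdf_path):
    """Write the PDF bytes returned by pytesseract to disk."""
    path = Path(pdf_path)
    path.write_bytes(pdf_bytes)
    return path


def ocr_to_pdf(image, pdf_path, lang="deu+tur+kur", config="--psm 6"):
    """Run the same pytesseract call as in the post and save the result."""
    import pytesseract  # imported here so the other helpers work without it

    pdf = pytesseract.image_to_pdf_or_hocr(
        image, lang=lang, extension="pdf", config=config
    )
    return write_pdf(pdf, pdf_path)


def pdftotext_command(pdf_path):
    """Build the pdftotext call; the trailing '-' sends the text to stdout."""
    return ["pdftotext", "-layout", str(pdf_path), "-"]


def extract_text_layer(pdf_path):
    """Return the hidden text layer of the PDF as a string."""
    result = subprocess.run(
        pdftotext_command(pdf_path), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_text_layer(ocr_to_pdf(result, "xyz.pdf"))` should then give the text that a PDF search would see, which is where the extra spaces show up.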

If I use "pure" text conversion instead:

text = pytesseract.image_to_string(result, lang='deu+tur+kur', config='--psm 6')

the output is correct (it matches the .tif):

...

Nach langer Abstinenz ist Apple fulminant auf den Monitormarkt zurückgekehrt: Das Pro

...
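To check that the two outputs really differ only in the inserted spaces, and not in the recognized characters, the two sample lines can be compared with all whitespace removed. This is only a sketch; the function names are made up for illustration.

```python
import re


def strip_whitespace(s):
    """Remove all whitespace so only the recognized characters remain."""
    return re.sub(r"\s+", "", s)


def differs_only_by_whitespace(pdf_text, plain_text):
    """True if the strings are not identical but match once whitespace is dropped."""
    return pdf_text != plain_text and (
        strip_whitespace(pdf_text) == strip_whitespace(plain_text)
    )


# The two sample lines quoted in this post.
pdf_line = (
    "Nach langer Abstinenz ist Apple fulminan t au f de n Mo ni to rm ar kt "
    "zu rü ck ge ke hr t: Da s Pr o"
)
plain_line = (
    "Nach langer Abstinenz ist Apple fulminant auf den Monitormarkt "
    "zurückgekehrt: Das Pro"
)

print(differs_only_by_whitespace(pdf_line, plain_line))  # prints True
```

For these two lines the characters are identical, so only the word boundaries in the PDF text layer are off.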

The text is needed for search operations, so the added whitespace is quite annoying.

Is this a fault in tesseract, or did I do something wrong?

Thanks in advance!
