Hello,
I am using the following versions of the software:
tesseract 4.0.0
leptonica-1.76.0
libjpeg 9c : libpng 1.6.37 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.1 : libopenjp2 2.3.0
Found AVX512BW
Found AVX512F
Found AVX2
Found AVX
Found SSE
I am trying to convert a .tif into a PDF from within a Python script:
pdf = pytesseract.image_to_pdf_or_hocr(result, lang='deu+tur+kur', extension='pdf', config='--psm 6')
The text layer "underneath" the image comes out as follows (extracted with pdftotext -layout xyz.pdf):
...
Nach langer Abstinenz ist Apple fulminan t au f de n Mo ni to rm ar kt zu rü ck ge ke hr t: Da s Pr o
...
If I use "pure" text conversion instead:
text = pytesseract.image_to_string(result, lang='deu+tur+kur', config='--psm 6')
the output is correct (it matches the .tif):
...
Nach langer Abstinenz ist Apple fulminant auf den Monitormarkt zurückgekehrt: Das Pro
...
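For what it's worth, the two outputs seem to differ only in whitespace. Here is a minimal sketch of how I checked that on the sample line above; the variable and function names are just my own choices for illustration, not part of pytesseract:

```python
import re

# Sample lines copied from the two outputs quoted above.
pdf_layer_text = (
    "Nach langer Abstinenz ist Apple fulminan t au f de n Mo ni to rm ar kt "
    "zu rü ck ge ke hr t: Da s Pr o"
)
plain_text = (
    "Nach langer Abstinenz ist Apple fulminant auf den Monitormarkt "
    "zurückgekehrt: Das Pro"
)


def strip_whitespace(text):
    """Remove every whitespace character so only the recognized glyphs remain."""
    return re.sub(r"\s+", "", text)


# True means both outputs contain the same characters in the same order.
same_characters = strip_whitespace(pdf_layer_text) == strip_whitespace(plain_text)
print(same_characters)
```

If that holds for the whole page, the recognition itself would appear to be the same in both cases, and only the spacing in the PDF text layer (or the way pdftotext reads it) is off.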
The text is needed for search operations, so the added spaces are quite annoying.
Is this a bug in Tesseract, or am I doing something wrong?
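As a stopgap for the search use case, I am considering a whitespace-tolerant search over the extracted text. This is only a sketch of a workaround, not a fix for the cause, and the helper names are hypothetical (my own, not from any library):

```python
import re


def whitespace_tolerant_pattern(query):
    """Build a regex that matches the query's characters with any whitespace between them."""
    chars = [re.escape(ch) for ch in query if not ch.isspace()]
    return re.compile(r"\s*".join(chars))


def find_ignoring_whitespace(query, text):
    """Return the (start, end) span of the first match in text, or None if absent."""
    match = whitespace_tolerant_pattern(query).search(text)
    return match.span() if match else None


# Sample line as extracted from the PDF text layer.
fragmented = "ist Apple fulminan t au f de n Mo ni to rm ar kt zu rü ck ge ke hr t: Da s Pr o"
span = find_ignoring_whitespace("Monitormarkt", fragmented)
print(span is not None)
```

The downside is that such a pattern also matches across real word boundaries (for example, "aufden" would match "auf den"), so it can produce false positives. I would much prefer a correct text layer in the PDF.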
Thanks in advance!