tesseract 4 on Debian Bullseye


Rich M

Jan 19, 2022, 11:03:59 PM
to tesseract-ocr
Hi,

I'm fairly new to tesseract. I had written a bash script on Debian Buster (the previous release) using tesseract 3, and it worked very well. I've since upgraded my OS to the next stable release, Bullseye, which also upgraded tesseract to version 4. Since the upgrade, tesseract no longer produces usable output, and I need help troubleshooting the issue.

Basically the important line of the script is
tesseract PDFIn001.tiff PDFOut001 -l eng pdf

The terminal then shows:
Tesseract Open Source OCR Engine v4.1.1 with Leptonica

The resulting PDF file is 2.4kB and appears to be empty or corrupted.

With the previous Debian release, I didn't need to install any "training". Is that what I'm missing?

Thanks,
Rich

I don't recall seeing the terminal message about Leptonica before the upgrade.
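
On the question of missing "training": a quick way to see which language data tesseract can find is its --list-langs option. The snippet below is only a sketch. The helper name has_lang is made up for illustration, and it assumes the language codes are printed one per line after a header line.

```shell
#!/bin/bash
# Sketch: check whether a language pack (e.g. eng) is visible to tesseract.
# has_lang is a hypothetical helper name, not part of tesseract itself.
has_lang() {
  local lang=$1
  # --list-langs prints a header line, then one language code per line.
  # stderr is merged in because older versions may print the list there.
  tesseract --list-langs 2>&1 | grep -qx "$lang"
}

# Usage:
#   tesseract -v
#   has_lang eng && echo "eng is installed" || echo "eng is missing"
```

If eng shows up in the list, missing language data is unlikely to be the cause of the empty PDF.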

Zdenko Podobny

Jan 20, 2022, 1:18:20 AM
to tesser...@googlegroups.com
Please provide the details needed to reproduce the problem: the input image, the output pdf, and the tesseract details (tesseract -v).

Zdenko


Thu, Jan 20, 2022 at 5:03, Rich M <rama...@gmail.com> wrote:

Rich M

Jan 21, 2022, 7:48:36 PM
to tesseract-ocr
Sure. I'll need to find a test file that doesn't contain private information.

Before I saw your response, I ran my script on a file that I had converted to a searchable PDF last year, and the output was very poor. Out of curiosity, I changed the intermediate image format from .tiff to .png, and the result was very good. I'm wondering if the problem lies with convert (ImageMagick).

Rich
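
To put a rough number on the .tiff versus .png difference, one option is to count the words tesseract recognizes in each rendering of the same page. This is a sketch only. The helper name ocr_word_count and the file names in the usage lines are placeholders, and a word count is a crude proxy for OCR quality.

```shell
#!/bin/bash
# Sketch: count the words tesseract recovers from one image.
# ocr_word_count is a hypothetical helper name.
ocr_word_count() {
  local image=$1
  # Using "stdout" as the output base makes tesseract print the recognized text.
  tesseract "$image" stdout -l eng 2>/dev/null | wc -w | tr -d ' '
}

# Usage (placeholder file names):
#   echo "tiff: $(ocr_word_count PDFIn001.tiff) words"
#   echo "png:  $(ocr_word_count PDFIn001.png) words"
```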

Rich M

Jan 23, 2022, 10:44:27 PM
to tesser...@googlegroups.com
> Please provide details for reproducing problem: input image, output pdf, tesseract details (tesseract -v)

tesseract-ocr: Installed: 4.1.1-2.1
convert (provided by imagemagick): Installed: 8:6.9.11.60+dfsg-1.3. The problem could also lie with convert, although I get the same results when I convert the PDF with GIMP instead.
My OS is Linux, Debian Bullseye (stable).

I run the script with:
$ ./PDF2SearchablePDF.sh ShockDataMeasurementsLessonsLearned.pdf

The source PDF
ShockDataMeasurementsLessonsLearned.pdf

Split PDF pg 1
PDFIn001.pdf

Split PDF pg1 converted to .tiff with convert (imagemagick)
PDFIn001.tiff

Pg 1 after processing with tesseract
PDFIn001Searchable.pdf

Bash script:
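###
#!/bin/bash
# Split a PDF into pages, rasterize each page to TIFF, OCR it, and merge the results.
SourcePDF=$1
mkdir -p PDFIn PDFOut TIFFIn
# Write one single-page PDF per page: PDFIn/PDFIn001.pdf, PDFIn/PDFIn002.pdf, ...
pdfseparate "$SourcePDF" PDFIn/PDFIn%03d.pdf
#pdfseparate InputDoc02.pdf PDFIn/PDFIn%03d.pdf
echo "$SourcePDF"

for FIL in PDFIn/PDFIn*.pdf
do
  Base=$(basename "$FIL" .pdf)
  # Rasterize the page at 300 dpi
  convert -density 300 "$FIL" "TIFFIn/${Base}.tiff"
  #gs -q -dNOPAUSE -r300x300 -sDEVICE=tiff32nc -sOutputFile="TIFFIn/${Base}.tiff" "$FIL" -c quit
  # OCR the page; tesseract appends .pdf to the output base name
  tesseract "TIFFIn/${Base}.tiff" "PDFOut/${Base}" -l eng pdf
done

# Merge the per-page searchable PDFs into one document
pdfunite PDFOut/PDFIn*.pdf OutputPDF.pdf
###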



Zdenko Podobny

Jan 24, 2022, 1:27:27 AM
to tesser...@googlegroups.com
Please send the file that is relevant to tesseract: the tiff ;-).
The first thing you always need to check is the tesseract input. The input of your script (the pdf) is not important at this stage.

Zdenko
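
As a rough illustration of checking the tesseract input, the TIFF's basic properties can be printed with identify from ImageMagick. This is a sketch under assumptions: tiff_summary is a made-up helper name, and the field labels are the ones identify -verbose typically prints, which may vary between versions. A Type that includes an alpha channel would be one thing to look for, since ImageMagick can produce one when rasterizing a PDF with a transparent background. Whether that is the cause here is not established in this thread.

```shell
#!/bin/bash
# Sketch: print the image properties most relevant to OCR input.
# tiff_summary is a hypothetical helper name; identify ships with ImageMagick.
tiff_summary() {
  local image=$1
  # Keep only a few top-level fields from the verbose report.
  identify -verbose "$image" | grep -E '^ *(Geometry|Resolution|Colorspace|Type):'
}

# Usage:
#   tiff_summary TIFFIn/PDFIn001.tiff
```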


Mon, Jan 24, 2022 at 4:44, Rich M <rama...@gmail.com> wrote:

Rich M

Jan 24, 2022, 9:43:29 AM
to tesser...@googlegroups.com
Looking at the PDF-to-TIFF conversion, the issue may lie with convert (provided by imagemagick). With a different command-line PDF-to-image converter, tesseract seems to work better.
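
The message does not say which converter was used instead. As one possibility, pdftoppm from poppler-utils (the Debian package that also provides pdfseparate and pdfunite) can rasterize a page directly to PNG. The sketch below assumes that tool. The helper name ocr_page and the PNGIn directory are made up for illustration, and the other names follow the script earlier in the thread.

```shell
#!/bin/bash
# Sketch: rasterize one single-page PDF with pdftoppm instead of convert, then OCR it.
# ocr_page and the PNGIn directory are hypothetical names.
ocr_page() {
  local pdf=$1
  local base
  base=$(basename "$pdf" .pdf)
  mkdir -p PNGIn PDFOut
  # -singlefile writes exactly PNGIn/<base>.png, with no page-number suffix
  pdftoppm -r 300 -png -singlefile "$pdf" "PNGIn/${base}" || return 1
  # tesseract appends .pdf to the output base name
  tesseract "PNGIn/${base}.png" "PDFOut/${base}" -l eng pdf
}

# Usage:
#   for f in PDFIn/PDFIn*.pdf; do ocr_page "$f"; done
#   pdfunite PDFOut/PDFIn*.pdf OutputPDF.pdf
```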