"Error in selectDefaultPdfEncoding: type selection failure" on Tesseract 5.1.0 in Ubuntu

Lucas L.

Jun 6, 2022, 11:53:32 AM
to tesseract-ocr
Hi, I'm trying to upgrade Tesseract from 4.1.1 to 5.1 on the Ubuntu 20.04 VMs we use to OCR documents. Both versions were built from source on that VM. 4.1.1 worked, but 5.1 throws an error that I can't find mentioned anywhere else online:

sudo -u userx tesseract --loglevel ALL --oem 1 -l eng /opt/.../pdfprocessor/test/ocr-working/1/ocrIn_1.tif /opt/.../pdfprocessor/test/test pdf
Error in selectDefaultPdfEncoding: type selection failure
Error during processing.

I have tried the training data from both "tessdata" and "tessdata_best" and got the same error. Any help would be appreciated.

Thanks,
Lucas LeBlanc

Zdenko Podobny

Jun 6, 2022, 12:21:25 PM
to tesser...@googlegroups.com
Can you please share ocrIn_1.tif, plus which tessdata version you use and the output of 'tesseract -v'?

Zdenko


On Mon, Jun 6, 2022 at 17:53, Lucas L. <infinit...@gmail.com> wrote:

Lucas L.

Jun 6, 2022, 12:46:30 PM
to tesseract-ocr
It seems to be specific to the document in question. However, I'm afraid I can't post the document because it contains sensitive information. I can try to scrub the information with an image editing tool and see whether the error still occurs.

Lucas L.

Jun 6, 2022, 12:47:31 PM
to tesseract-ocr
Oh yeah, here's the output of tesseract -v:

tesseract 5.1.0
 leptonica-1.79.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4


Lucas L.

Jun 6, 2022, 1:00:45 PM
to tesseract-ocr
No luck, sadly. When I edited the image in IrfanView to block out the sensitive parts and tried to OCR it again, the error didn't occur. I'm not sure what changed in the TIFF file. Any ideas on what kind of image metadata could cause this "selectDefaultPdfEncoding" error?

The only difference I can see between the two files is that the original has a 16 BPP color depth. Both use LZW compression.
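
For anyone comparing a failing TIFF against a working one, the following is a minimal sketch that reads a few header tags using only the Python standard library. It assumes a classic (non-BigTIFF) file and looks only at the first image directory; the function name `read_tiff_tags` and the choice of tags are illustrative, not part of Tesseract or Leptonica. It only reports metadata and does not establish which value, if any, triggers the error.

```python
import struct

# Tag numbers as defined in the TIFF 6.0 specification.
TAGS = {
    256: "ImageWidth",
    257: "ImageLength",
    258: "BitsPerSample",
    259: "Compression",
    262: "PhotometricInterpretation",
    277: "SamplesPerPixel",
}
# struct codes for the TIFF field types handled here: 1=BYTE, 3=SHORT, 4=LONG.
TYPE_FORMATS = {1: "B", 3: "H", 4: "I"}


def read_tiff_tags(data: bytes) -> dict:
    """Return selected tags from the first IFD of a classic TIFF (hypothetical helper)."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("unsupported TIFF variant")
    (count,) = struct.unpack_from(order + "H", data, ifd_offset)
    tags = {}
    for i in range(count):
        entry = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, ftype, n = struct.unpack_from(order + "HHI", data, entry)
        if tag not in TAGS or ftype not in TYPE_FORMATS:
            continue
        fmt = order + str(n) + TYPE_FORMATS[ftype]
        if struct.calcsize(fmt) <= 4:
            pos = entry + 8  # small values are stored inline in the entry
        else:
            (pos,) = struct.unpack_from(order + "I", data, entry + 8)
        values = struct.unpack_from(fmt, data, pos)
        tags[TAGS[tag]] = values[0] if n == 1 else list(values)
    return tags
```

Passing the bytes of each file to `read_tiff_tags` and comparing the two dictionaries shows whether BitsPerSample, SamplesPerPixel, or PhotometricInterpretation differ. A Compression value of 5 means LZW. Image viewers typically report bits per pixel, which corresponds to BitsPerSample multiplied by SamplesPerPixel rather than to a single tag.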

Lucas L.

Jun 6, 2022, 4:47:12 PM
to tesseract-ocr
OK, I have a sample document to share now. I've pulled out one page, with no identifying information on it, from a document exhibiting this error.
While doing this, I noticed that when the same original document (they usually come in as PDFs) is split into TIFFs by other applications (e.g., Foxit), the resulting files don't seem to run into issues. The TIFFs are not invalid when I look at them on my personal PC. However, when the document goes through our pipeline and is split into TIFFs in preparation for OCR, Tesseract throws the "selectDefaultPdfEncoding" error mentioned above. Unfortunately, unless I know exactly what about this document is causing this, I won't be able to address it in our pipeline.

ocrIn_4.tif

Zdenko Podobny

Jun 7, 2022, 1:27:08 AM
to tesser...@googlegroups.com
Can you please create an issue at https://github.com/tesseract-ocr/tesseract/issues?

I confirm a problem with recent tesseract and leptonica, so it should be fixed for the next release...

Zdenko


On Mon, Jun 6, 2022 at 22:47, Lucas L. <infinit...@gmail.com> wrote:

Lucas L.

Jun 7, 2022, 10:02:38 AM
to tesseract-ocr
Sure, I will write that up. Thanks for helping, zdenop. Would you happen to know which is the most recent version that does not exhibit this issue so I can switch to that?

Lucas L.

Jun 7, 2022, 10:05:46 AM
to tesseract-ocr
Also, I feel compelled to mention that I think I have seen this on the same document on some of my VMs that have not been updated yet and still run 4.1.1, also built from source. Sorry for the spam; I wish I could edit. I think it may be tied to leptonica specifically, or to something else in the environment. The same version of Tesseract was working before I updated Ubuntu to version 20.04, which leads me to think the cause is some kind of dependency.
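
One way to test the dependency hypothesis is to compare the `tesseract -v` output from a working VM with the output from a failing one. The sketch below assumes the output has the same layout as the listing earlier in this thread; the function names `parse_versions` and `diff_versions` are hypothetical helpers, and capturing the text from each VM is left to the reader.

```python
import re

# Components of interest, as they appear in the `tesseract -v` listing in this thread.
_PATTERN = re.compile(
    r"\b(tesseract|leptonica|libtiff|libjpeg|libpng)[ -](\d+(?:\.\d+)*[a-z]?)"
)


def parse_versions(text: str) -> dict:
    """Extract component versions from `tesseract -v` output (assumed layout)."""
    versions = {}
    for name, version in _PATTERN.findall(text):
        versions.setdefault(name, version)  # keep the first occurrence only
    return versions


def diff_versions(a: dict, b: dict) -> dict:
    """Return {component: (version_in_a, version_in_b)} for entries that differ."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

An empty result from `diff_versions` would suggest the difference lies elsewhere, for example in the step of the pipeline that splits the PDFs into TIFFs.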