Doing OCR on pdfs with embedded CID fonts

Kristóf Horváth

Apr 2, 2019, 7:32:07 AM

to tesseract-ocr

I just tried to run doOCR on a PDF that has embedded CID fonts, and it gave me the following error:

6329 [pool-2-thread-1] INFO org.ghost4j.Ghostscript - **** Error: can't process embedded font stream,
6329 [pool-2-thread-1] INFO org.ghost4j.Ghostscript - attempting to load the font using its name.
6329 [pool-2-thread-1] INFO org.ghost4j.Ghostscript - Output may be incorrect.

Some of the CID fonts are correctly embedded and have font names that I recognize, but the PDF also contains fonts with names such as Fd64459.

I figured it has to do with the fonts, although Ghostscript's website says:

NOTE: care must be exercised since poor or incorrect output may result from inappropriate CIDFont substitution. We therefore strongly recommend embedding CIDFonts in your Postscript and PDF files if at all possible.

So if we try to do OCR on this PDF, it won't produce anything, because Ghostscript detects the faulty CID fonts and throws an error.

So my first question is: is my assessment correct?
My second question is: if a PDF has CID fonts but the situation is not as bad, meaning Ghostscript can process it but will produce incorrect output, does Tesseract handle this in any way? To put it another way, can I be sure that Tesseract will not give me false output, and that it will also throw this error or something similar?

Shree Devi Kumar

Apr 2, 2019, 11:28:15 AM

to tesser...@googlegroups.com

Tesseract does not take PDFs as direct input. You have to convert the PDF to images and provide those to Tesseract.

However, there are many third-party applications that take PDFs as input and use Tesseract as a backend to do OCR.
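
The conversion step described above can be sketched roughly as follows. This is only an illustration in Python, assuming the Ghostscript (`gs`) and `tesseract` command-line tools are installed and on the PATH. The function names, the 300 dpi resolution, the `png16m` device, and the `page-%03d.png` naming pattern are choices made for this example, not something prescribed in this thread.

```python
import subprocess
from pathlib import Path


def ghostscript_command(pdf_path, out_dir, dpi=300):
    """Build a Ghostscript command that renders each PDF page to a PNG."""
    # Assumed naming pattern: page-001.png, page-002.png, ...
    out_pattern = str(Path(out_dir) / "page-%03d.png")
    return [
        "gs",
        "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=png16m",           # 24-bit colour PNG output device
        f"-r{dpi}",                  # rendering resolution in dpi
        f"-sOutputFile={out_pattern}",
        str(pdf_path),
    ]


def tesseract_command(image_path, lang="eng"):
    """Build a Tesseract command that prints the recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_pdf(pdf_path, out_dir, lang="eng"):
    """Render the PDF to images, then OCR each image in page order."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_command(pdf_path, out_dir), check=True)
    texts = []
    for image in sorted(Path(out_dir).glob("page-*.png")):
        result = subprocess.run(
            tesseract_command(image, lang),
            check=True, capture_output=True, text=True,
        )
        texts.append(result.stdout)
    return texts
```

Calling `ocr_pdf("input.pdf", "pages")` would return one string of recognized text per page. Tesseract only ever sees the rendered images here, so any font problem has to be caught at the Ghostscript stage.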

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/d6741e58-5894-4b52-a39d-e684243b6498%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Kristóf Horváth

Apr 4, 2019, 3:08:27 AM

to tesseract-ocr

Okay, thanks. That means I have to figure out whether Tess4J takes care of that.
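
Since Tesseract only receives the rendered image, one possible way to notice an unreliable render is to scan the Ghostscript output for the messages quoted at the start of this thread. The sketch below is a rough Python illustration of that idea, not how Tess4J handles it. The function name and the marker strings are assumptions, and it presumes the Ghostscript log output has already been captured as text.

```python
# Marker strings taken from the log lines quoted in this thread (assumed to be stable).
FONT_WARNING_MARKERS = (
    "can't process embedded font stream",
    "Output may be incorrect",
)


def font_warnings(ghostscript_output):
    """Return the lines of Ghostscript output that mention a font problem."""
    return [
        line.strip()
        for line in ghostscript_output.splitlines()
        if any(marker in line for marker in FONT_WARNING_MARKERS)
    ]


log = (
    "6329 [pool-2-thread-1] INFO org.ghost4j.Ghostscript - **** Error: can't process embedded font stream,\n"
    "6329 [pool-2-thread-1] INFO org.ghost4j.Ghostscript - attempting to load the font using its name.\n"
    "6329 [pool-2-thread-1] INFO org.ghost4j.Ghostscript - Output may be incorrect.\n"
)

# Treat the OCR result as suspect when any font warning was logged.
suspect = bool(font_warnings(log))
```

If `suspect` is true, the OCR text for that document could be flagged for manual review instead of being trusted as-is.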

