Why is there no selectable text in the PDF output file?


Eike Stepper

Jan 26, 2020, 2:33:31 PM
to tesseract-ocr
I just installed Tesseract OCR for the first time and had some success with the following command:

  C:\Users\FooBar\AppData\Local\Tesseract-OCR>tesseract img001.tif stdout -l deu
  
It correctly outputs the recognized German text to the console. Then I tried this command:

  C:\Users\FooBar\AppData\Local\Tesseract-OCR>tesseract img001.tif img001 -l deu pdf
  
It creates a PDF file. I can open it with PDF-XChange and see the image. 
But there's no selectable text. With Notepad++ I checked that it does not contain any German text.

Am I doing something wrong?

Cheers
/Eike

P.S.: I hope I didn't post this twice. My first attempt seems to have disappeared into limbo...

Eike Stepper

Jan 27, 2020, 2:21:50 AM
to tesseract-ocr
I found out that my attempt to find the text in the PDF with Notepad was too naive.
Instead I installed the "A-PDF Text Extractor" tool and indeed I could extract all the recognized text from the PDF.
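
An editorial aside on why the Notepad++ check came up empty: PDF page content is usually stored in compressed (Flate) streams, and text is often written as encoded glyph codes, so recognized words do not show up as readable strings in the raw file. The following is a minimal Python sketch, not a real PDF parser. It inflates every stream it can and looks for a `BT ... ET` text block that contains a `Tj` or `TJ` operator. The function names are made up for illustration, and the regular expressions are heuristics that can misfire on unusual files.

```python
import re
import zlib

# Heuristic: capture everything between "stream<EOL>" and "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

# A text block starts with BT, ends with ET, and shows text via Tj or TJ.
TEXT_BLOCK_RE = re.compile(rb"\bBT\b.*?\b(?:Tj|TJ)\b.*?\bET\b", re.DOTALL)


def inflated_streams(pdf_bytes):
    """Yield each stream body, inflated when it is valid zlib data."""
    for match in STREAM_RE.finditer(pdf_bytes):
        body = match.group(1)
        try:
            # decompressobj ignores the end-of-line bytes after the data.
            yield zlib.decompressobj().decompress(body)
        except zlib.error:
            # Not Flate-compressed (for example a JPEG image): keep raw bytes.
            yield body


def has_text_layer(pdf_bytes):
    """Return True if any stream appears to contain a text-showing block."""
    return any(TEXT_BLOCK_RE.search(s) for s in inflated_streams(pdf_bytes))
```

Assuming the file produced by the command above, `has_text_layer(open("img001.pdf", "rb").read())` should indicate whether a text layer is present, regardless of whether a particular viewer lets you select it.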

But the text is not selectable with the broadly used PDF viewer "PDF-XChange", 
nor does Windows Search find any of the recognized words in the PDF.

So my question remains, am I doing something wrong?

Shree Devi Kumar

Jan 27, 2020, 3:20:03 AM
to tesseract-ocr
Not all viewers work alike. Try with the free Adobe Acrobat Reader or the viewer in Chrome.

When I last checked, most readers/viewers could select and search text in Tesseract-generated PDFs. The selection highlighting is often misplaced, but if you copy and paste, all the recognized text should be there.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/78d57772-13ca-4aa4-891e-0d0880e7dc01%40googlegroups.com.



Eike Stepper

Jan 27, 2020, 11:03:14 AM
to tesseract-ocr
Thank you for the reply. I can confirm that the Foxit PDF viewer allows me to select the recognized text.
Unfortunately I can't just change to a different viewer ;-(

PDF-XChange has no trouble making text selectable when it was recognized and embedded by ABBYY FineReader.
Do you think the Tesseract OCR team would be interested in making Tesseract's PDF creation more flexible/compatible?

As a Java-only programmer I could only contribute to such an effort by providing example PDFs that were created
by ABBYY FineReader and perhaps by finding the structural differences between the two types of PDFs.
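
As a rough illustration of what "finding the structural differences" could look like in practice, here is a hedged Python sketch that profiles two PDFs by counting a few markers (font names, `/ToUnicode` maps, text render modes, text blocks) and reports where the counts differ. The marker list and all function names are assumptions chosen for illustration, not something Tesseract or ABBYY FineReader prescribes, and the byte-level regular expressions are heuristics rather than a proper PDF parser.

```python
import re
import zlib
from collections import Counter

STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

# Hypothetical set of markers worth comparing; extend as needed.
MARKERS = {
    "fonts (/BaseFont)": rb"/BaseFont\s*/[^\s/<>\[\]()]+",
    "ToUnicode maps": rb"/ToUnicode\b",
    "text render modes": rb"\b[0-7]\s+Tr\b",
    "text blocks (BT)": rb"\bBT\b",
}


def searchable_bytes(pdf_bytes):
    """Return the object dictionaries plus every stream that inflates as zlib."""
    inflated = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            inflated.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            pass  # not Flate data (for example a JPEG image): skip it
    # Drop the raw stream bodies so compressed bytes cannot match by accident.
    dictionaries = STREAM_RE.sub(b"", pdf_bytes)
    return dictionaries + b"\n" + b"\n".join(inflated)


def profile(pdf_bytes):
    """Count every distinct match of each marker."""
    data = searchable_bytes(pdf_bytes)
    return {name: Counter(re.findall(rx, data)) for name, rx in MARKERS.items()}


def differences(pdf_a, pdf_b):
    """Return only the markers whose counts differ between the two files."""
    pa, pb = profile(pdf_a), profile(pdf_b)
    return {name: (pa[name], pb[name]) for name in MARKERS if pa[name] != pb[name]}
```

Calling `differences()` on the bytes of a Tesseract-generated PDF and a FineReader-generated PDF of the same page would give a starting list of things to look at, such as different fonts or render modes. It cannot say which difference, if any, is the one PDF-XChange trips over.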

Thad Guidry

Jan 27, 2020, 11:16:17 AM
to tesser...@googlegroups.com
Have you tried gImageReader (it uses Tesseract 4) with the hOCR/PDF dropdown option, and inspected the output panel?
You can also highlight and select text on the image and then see what rows are affected in the output panel.


Eike Stepper

Jan 27, 2020, 11:24:03 AM
to tesseract-ocr
Thank you, Thad. Do you think that gImageReader would work well on my Windows box? What would be the goal of this exercise?

Cheers
/Eike

Thad Guidry

Jan 27, 2020, 11:50:48 AM
to tesser...@googlegroups.com
I use it all the time on my Windows 10 PC.

You can save the PDF it creates and compare the two to see if it works better.
If so, then it might be a configuration issue.




Thad Guidry

Jan 27, 2020, 12:08:36 PM
to tesser...@googlegroups.com

Eike Stepper

Jan 27, 2020, 1:02:47 PM
to tesseract-ocr
Ok, I installed gImageReader and I can confirm that the hOCR data is 100% accurate.
When I open the exported PDF in PDF-XChange it finds selectable text.
The selection bounds are majorly off, though:

[Attached screenshot: Unbenannt.PNG]


I think this might really be a problem in PDF-XChange.

I contacted the vendor and asked for their opinion.

When I get feedback I'll update this thread...


Thanks again for your care ;-)

