Lost spaces in some PDF renderers


DJArty

Feb 19, 2018, 9:03:02 AM
to tesseract-ocr
The attached PDF was OCRed by ocrmypdf using tesseract 4.00.00alpha on the following system:
Linux 4.13.0-32-generic #35~16.04.1-Ubuntu SMP  x86_64 x86_64 x86_64 GNU/Linux

In some PDF viewers (Evince, Chrome, Opera) everything is fine, but in others (Firefox, Alfresco Share, pdfjs) the spaces between words are lost.

For example, the text "Test PDF from LibreOffice" becomes one long word, "TestPDFfromLibreOffice", after copy/paste.
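
To make the symptom easy to check across viewers, here is a minimal sketch in Python. It assumes the text copied from each viewer has been pasted into a string by hand; the function name `spaces_lost` is hypothetical and is not part of tesseract, ocrmypdf, or pdfjs.

```python
def spaces_lost(expected: str, extracted: str) -> bool:
    """Return True if `extracted` has the same characters as `expected`
    but fewer whitespace-separated words (i.e. spaces were dropped)."""
    # Remove all whitespace so only the visible characters are compared.
    def strip_ws(s: str) -> str:
        return "".join(s.split())

    same_characters = strip_ws(expected) == strip_ws(extracted)
    fewer_words = len(extracted.split()) < len(expected.split())
    return same_characters and fewer_words


# Hypothetical usage: strings pasted from a viewer that keeps spaces
# and from one that loses them.
good_viewer_text = "Test PDF from LibreOffice"
bad_viewer_text = "TestPDFfromLibreOffice"
print(spaces_lost(good_viewer_text, bad_viewer_text))
```

This only confirms that the recognized characters are identical and that the word boundaries alone are missing; it says nothing about which side is responsible.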

You can load the PDF into the pdfjs demo here: https://mozilla.github.io/pdf.js/web/viewer.html

If I use some other commercial OCR engines on the source PDF, the OCRed PDF has normal spaces in all PDF viewers (including pdfjs).

So this is a two-sided problem: the tesseract devs say it's a pdfjs problem, and the pdfjs devs say it's a tesseract problem.

Is it possible to solve this "spaces" problem via some options for tesseract (ocrmypdf) that force space recognition (as in other OCR engines)?
Or can someone help me understand the root cause, so that I can give the pdfjs devs more information?

Testpdfsandwich.pdf