Lost spaces in some PDF renderers

Feb 19, 2018, 9:03:02 AM

to tesseract-ocr

The attached PDF was OCRed by ocrmypdf using tesseract 4.00.00alpha.

Linux 4.13.0-32-generic #35~16.04.1-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux

In some PDF viewers (Evince, Chrome, Opera) everything is fine, but in others (Firefox, Alfresco Share, pdfjs) the spaces between words are lost.

So the text "Test PDF from LibreOffice" comes out as one big word, "TestPDFfromLibreOffice", after copy/paste.
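To make the symptom easy to check on text pasted from any viewer, here is a rough sketch in Python. The function name `spaces_lost` is only an illustration and is not part of tesseract, ocrmypdf, or pdfjs. It assumes you already have the copied text as a string and know the phrase you expect to find.

```python
def spaces_lost(expected: str, extracted: str) -> bool:
    """Return True if `expected` appears in `extracted` only with its spaces removed.

    Hypothetical helper for illustration; not part of any OCR or PDF tool.
    """
    # Collapse newlines, tabs, and repeated spaces so that line wrapping
    # in the pasted text is not mistaken for a missing space.
    normalized = " ".join(extracted.split())
    if expected in normalized:
        return False  # the phrase survived with its spaces intact
    # Look for the same phrase with every space stripped out.
    collapsed = "".join(expected.split())
    return collapsed in normalized


# Text as pasted from a viewer that loses the spaces
print(spaces_lost("Test PDF from LibreOffice", "TestPDFfromLibreOffice"))     # True
# Text as pasted from a viewer that keeps them
print(spaces_lost("Test PDF from LibreOffice", "Test PDF from LibreOffice"))  # False
```

This only confirms that the words were extracted correctly and the spaces alone went missing. It says nothing about which side (the PDF text layer or the viewer) is at fault.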

If I run the source PDF through other (commercial) OCR engines, the resulting OCRed PDF has normal spaces in all PDF viewers (including pdfjs).

So this is a two-sided problem: the tesseract devs say it's a pdfjs problem, and the pdfjs devs say it's a tesseract problem.

Is it possible to solve this "spaces" problem with some options for tesseract (ocrmypdf) that force space recognition (as in other OCR engines)?