Searchable PDF output with oversized font

Ryan Johnson

Sep 17, 2014, 1:26:02 PM

to tesser...@googlegroups.com

Hi all,

I've been having problems with tesseract-ocr since upgrading to Ubuntu 14.04 LTS. Whether I use hocr output or tesseract's built-in searchable PDF output, the text layer comes out in an oversized font that fills the page too quickly and does not follow the text in the image.

I scan the pages as TIFFs at 300 dpi, then clean them up with ScanTailor, which also outputs TIFFs at 300 dpi with slightly altered dimensions. After that I run the OCR. The text layer is there, but it is not aligned with the image: because the font is too large, the text is cut off before the end of the page, and the missing text does not come up in a search.
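One thing that may be worth ruling out here (an assumption, not something established in this thread) is that the resolution stored in the ScanTailor TIFFs differs from the 300 dpi they were scanned at, since a PDF writer has to rely on that value to scale the page. A minimal sketch, assuming libtiff's `tiffinfo` is installed and prints its usual `Resolution: X, Y pixels/inch` line; `parse_resolution` and `check_page` are hypothetical helper names:

```shell
#!/bin/bash
# Extract "X Y" from the Resolution line of `tiffinfo` output read on stdin.
parse_resolution() {
    sed -n 's/^[[:space:]]*Resolution:[[:space:]]*\([0-9.]*\),[[:space:]]*\([0-9.]*\).*/\1 \2/p' | head -n 1
}

# Report a page whose resolution is missing or differs from the expected DPI.
# An empty result also occurs if tiffinfo itself is not available.
check_page() {
    local page="$1" expected="${2:-300}" res
    res=$(tiffinfo "$page" 2>/dev/null | parse_resolution)
    if [ -z "$res" ]; then
        echo "$page: no resolution metadata found"
    elif [ "$res" != "$expected $expected" ]; then
        echo "$page: resolution is $res, expected $expected"
    fi
}
```

It could be run over the cleaned pages with `for page in *page*.tif; do check_page "$page"; done`; no output means every page reports 300 dpi in both directions.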

I'm using the stock tesseract package for Ubuntu 14.04. I tried following the instructions to build the training packages, but the build errored out.

Version info:

tesseract --version
tesseract 3.03
 leptonica-1.70
  libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0

Here is a sample of my script for the ocr process using the output from ScanTailor:

#!/bin/bash
# Run OCR on multiple TIFF pages and create a new PDF with the
# extracted text in a hidden layer. Requires tesseract and gs.
# NOTE: hocr2pdf is no longer required as of tesseract-ocr 3.03
# Usage: ./makeit output.pdf

set -e
output="$1"
dir=$(pwd)

# OCR each page individually and convert into PDF
for page in "$dir"/*page*.tif
do
    base="${page%.tif}"
#    tesseract "$page" "$base" -l isl hocr
    # The "pdf" config makes tesseract write "$base.pdf"; it appends the extension itself
    tesseract "$page" "$base" -l isl pdf     # I have also tried adding -psm 4 here
#    Tesseract now outputs searchable pdf on its own
#    hocr2pdf -i "$page" -o "$base.pdf" < "$base.hocr"
done

# combine the pages into one PDF
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" "$dir"/*page*.pdf
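
Because tesseract appends the extension to the output base itself, a mismatch between the file names the loop produces and the names the gs glob expects is easy to miss. The following is a minimal sketch, not part of the original script, that lists pages with no matching PDF before the merge step; `missing_pdfs` is a hypothetical helper name:

```shell
#!/bin/bash
# Print every *page*.tif in the given directory that has no PDF next to it.
missing_pdfs() {
    local dir="$1" page
    for page in "$dir"/*page*.tif; do
        [ -e "$page" ] || continue                  # glob matched nothing
        [ -e "${page%.tif}.pdf" ] || printf '%s\n' "$page"
    done
}
```

Calling `missing_pdfs "$dir"` just before the gs line would show which pages, if any, were skipped or written under a different name.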

If anybody could please point out any error I have made or provide a solution to this problem I would be very grateful. I am trying to get a copy of a document to a professor of mine, where the original electronic version of the document was lost. Searchable text is a desirable attribute of the final result for her.

Regards

Chris

Nov 23, 2014, 6:30:13 AM

to tesser...@googlegroups.com

Hi Ryan,
I ran into the same problem. Have you solved it?

Best regards,

Chris

ShreeDevi Kumar

Nov 23, 2014, 11:12:12 AM

to tesser...@googlegroups.com

Have you tried a version compiled from the latest source on git?

If you post a couple of sample images, I can give it a try and let you know what results I get.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/3bd841a9-075c-4467-b37c-74024f7ecc5b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Chris

Nov 25, 2014, 11:30:40 AM

to tesser...@googlegroups.com

Hi,
No, I have only tried the Ubuntu version.

Here are the samples:
https://drive.google.com/file/d/0B2kkT1CBqTPCRE1veGtQT3NvSTg/view?usp=sharing

# Glob directly instead of parsing ls, so file names with spaces survive
for page in "$1"_out_*.tif; do
    tesseract -l deu -psm 3 "$page" "$page" hocr
    hocr2pdf -i "$page" -s -o "$page.pdf.bak" < "$page.hocr"
#    rm -rf "$page"
done

pdftk "$1"_out_*.tif.pdf.bak cat output "$1.tmp.pdf"

Thank you,

Chris

ShreeDevi Kumar

Nov 25, 2014, 12:44:57 PM

to tesser...@googlegroups.com

Hi Chris,

I opened the PDFs in Adobe Reader and in Foxit Reader on Windows 7. The page flickers briefly with large text but then seems to display normally, and at 100% zoom the output also looks regular.

Tesseract now has a 'pdf' option, so you don't need to run hocr2pdf. Try the following:

tesseract -l deu -psm 3 "$page" "$page" pdf

If you also need hocr, you can give the command as

tesseract -l deu -psm 3 "$page" "$page" hocr pdf
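
As a rough sketch of how the two commands above could be wrapped for a batch run: the function below passes the config names through unchanged and can print the command instead of running it. `ocr_page`, `LANG_CODE` and `DRY_RUN` are hypothetical names introduced for illustration, and stripping `.tif` from the output base is a choice made here so the result is named `page.pdf` rather than `page.tif.pdf`.

```shell
#!/bin/bash
# Run tesseract on one page with the given config names ("pdf" or "hocr pdf").
# With DRY_RUN set, only print the command that would be executed.
ocr_page() {
    page="$1"; shift
    base="${page%.tif}"                 # tesseract appends .pdf/.hocr itself
    if [ -n "${DRY_RUN:-}" ]; then
        echo "tesseract $page $base -l ${LANG_CODE:-deu} -psm 3 $*"
    else
        tesseract "$page" "$base" -l "${LANG_CODE:-deu}" -psm 3 "$@"
    fi
}
```

For example, `DRY_RUN=1 ocr_page test_out_1.tif hocr pdf` only prints the command, which makes it easy to check the file names before OCRing a whole directory.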

I'll test later with the git version of tesseract and post the pdfs for you.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

ShreeDevi Kumar

Nov 25, 2014, 1:09:10 PM

to tesser...@googlegroups.com, tesser...@googlegroups.com

I don't have access to my PC with tesseract, so I tested using the beta4 of VietOCR - http://sourceforge.net/projects/vietocr/files/vietocr/4.0%20Beta/

(Use Command - Bulk OCR with pdf as the output format)

The generated PDF files are smaller in size and don't display the oversized text.

VietOCR uses Tesseract 3.03 RC (r1127), so I would expect your Ubuntu version to behave the same. Try the pdf config rather than hocr2pdf.

Files attached.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

test.pdf_out_4.tif.pdf

test.pdf_out_3.tif.pdf

test.pdf_out_2.tif.pdf

test.pdf_out_1.tif.pdf
