Hi Chris,

I opened the PDFs in Adobe Reader as well as Foxit Reader on Windows 7. The page flickers with large-size text but then seems to display normally; at 100% zoom the output also looks normal.

Tesseract now has a 'pdf' option, so you don't need to run hocr2pdf. Try the following:

tesseract -l deu -psm 3 "$page" "$page" pdf

If you also need hocr, you can give the command as:

tesseract -l deu -psm 3 "$page" "$page" hocr pdf

I'll test later with the git version of tesseract and post the PDFs for you.

ShreeDevi
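For reference, the 'pdf' option suggested above could be wrapped in a small loop along these lines. This is only a sketch: the function name ocr_pages and the TESSERACT_BIN variable are made up for illustration, while the language (deu), the -psm 3 setting and the _out_*.tif naming are taken from this thread.

```shell
#!/bin/bash
# Sketch: OCR every page image into a searchable PDF using tesseract's
# built-in 'pdf' output, so no separate hocr2pdf step is needed.
# Assumptions: ocr_pages and TESSERACT_BIN are hypothetical names.

TESSERACT_BIN="${TESSERACT_BIN:-tesseract}"

# ocr_pages PREFIX: run OCR on each PREFIX_out_*.tif file.
# tesseract appends .pdf to the output base, so page.tif yields page.tif.pdf.
ocr_pages() {
    local prefix="$1" page
    for page in "${prefix}"_out_*.tif; do
        # Skip the literal pattern when the glob matches nothing.
        [ -e "$page" ] || continue
        "$TESSERACT_BIN" -l deu -psm 3 "$page" "$page" pdf || return 1
    done
}
```

Calling ocr_pages scan would then process scan_out_1.tif, scan_out_2.tif and so on; the resulting per-page PDFs can be merged afterwards with pdftk or gs as in the scripts quoted below.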
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Nov 25, 2014 at 10:00 PM, Chris <christia...@gmail.com> wrote:

Hi,

No, I have only tried with the Ubuntu version.
Here are the samples:
https://drive.google.com/file/d/0B2kkT1CBqTPCRE1veGtQT3NvSTg/view?usp=sharing

for page in "$1"_out_*.tif; do
    tesseract -l deu -psm 3 "$page" "$page" hocr
    hocr2pdf -i "$page" -s -o "$page.pdf.bak" < "$page.hocr"
    # rm -rf $page
done
pdftk "$1"_out_*.tif.pdf.bak cat output "$1.tmp.pdf"

Thank you,
Chris

On Sunday, November 23, 2014 5:12:12 PM UTC+1, shree wrote:

Have you tried with the version compiled from the latest source on git?

If you post a couple of sample images I can give it a try and let you know what results I get.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sun, Nov 23, 2014 at 5:00 PM, Chris <christia...@gmail.com> wrote:

Hi Ryan,

I have run into the same problem. Have you solved it?

Best regards,
Chris

On Wednesday, September 17, 2014 7:26:02 PM UTC+2, Ryan Johnson wrote:

Hi all,

I'm having problems with tesseract-ocr since upgrading to Ubuntu 14.04 LTS. When I use either hocr or the internal tesseract output for searchable PDFs, I get an oversized font that fills the page too quickly and does not follow the text in the image.

I scan the images as TIFFs at 300 dpi, then clean them up using ScanTailor, which also outputs TIFFs at 300 dpi, with slightly altered dimensions. After that I perform the OCR. The output is there, but the font is not aligned properly to the image. As stated above, the font is too large, so the text is cut off before the end of the page, and the missing text does not come up in a search.

I'm using the stock tesseract package for Ubuntu 14.04. I tried following the instructions to build the training packages, but the build errored out.

Version info:
tesseract --version
tesseract 3.03
 leptonica-1.70
  libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0

Here is a sample of my script for the OCR process using the output from ScanTailor:

#!/bin/bash
# Run OCR on multiple PDF files and create a new pdf with the
# extracted text in hidden layer. Requires tesseract, hocr2pdf, gs.
# NOTE: hocr2pdf is no longer required as of tesseract-ocr 3.03
# Usage: ./makeit output.pdf
set -e
output="$1"
dir=`pwd`
# OCR each page individually and convert into PDF
for page in "$dir"/*page*.tif
do
    base="${page%.tif}"
    # tesseract "$page" "$base" -l isl hocr
    tesseract "$page" "$base.pdf" -l isl # I have also tried adding -psm 4 here
    # Tesseract now outputs searchable pdf on its own
    # hocr2pdf -i "$page" -o "$base.pdf" < "$base.hocr"
done
# combine the pages into one PDF
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" "$dir"/*page*.pdf

If anybody could please point out any error I have made or provide a solution to this problem I would be very grateful. I am trying to get a copy of a document to a professor of mine, where the original electronic version of the document was lost. Searchable text is a desirable attribute of the final result for her.

Regards

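The OCR-and-merge steps described in this message can also be sketched as a pair of shell steps that request PDF output through tesseract's 'pdf' config (the option mentioned elsewhere in this thread) instead of putting .pdf into the output base. This is a hedged sketch, not a confirmed fix for the oversized-font problem: the names build_searchable_pdf, TESSERACT_BIN and GS_BIN are hypothetical, while the language (isl), the *page*.tif naming and the gs flags are copied from the script in this message.

```shell
#!/bin/bash
# Sketch of the OCR-and-merge pipeline using tesseract's 'pdf' config.
# Assumptions: build_searchable_pdf, TESSERACT_BIN and GS_BIN are
# hypothetical names introduced for illustration only.

TESSERACT_BIN="${TESSERACT_BIN:-tesseract}"
GS_BIN="${GS_BIN:-gs}"

# build_searchable_pdf DIR OUTPUT
# OCR each DIR/*page*.tif into a per-page PDF, then merge them into OUTPUT.
build_searchable_pdf() {
    local dir="$1" output="$2" page base
    for page in "$dir"/*page*.tif; do
        # Skip the literal pattern when the glob matches nothing.
        [ -e "$page" ] || continue
        base="${page%.tif}"
        # The output base carries no extension: tesseract appends .pdf itself.
        "$TESSERACT_BIN" "$page" "$base" -l isl pdf || return 1
    done
    # Combine the per-page PDFs into one file.
    "$GS_BIN" -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
        -sOutputFile="$output" "$dir"/*page*.pdf
}
```

A call such as build_searchable_pdf "$(pwd)" output.pdf mirrors the usage line of the original script. Keeping the output name free of the word "page" avoids the merged file being picked up by the *page*.pdf glob on a later run.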
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/3bd841a9-075c-4467-b37c-74024f7ecc5b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.