Lines disappear in resulting file

C.

Jan 8, 2015, 9:30:20 AM

to tesser...@googlegroups.com

If I run a simple "tesseract 1.tif 2 pdf", all vertical and horizontal lines (and graphics with small lines) in the source file disappear in the resulting PDF file (Ubuntu Server 12.04, tesseract 3.03).

Is that the intended behavior?

ShreeDevi Kumar

Jan 8, 2015, 10:02:31 AM

to tesser...@googlegroups.com

I don't think that's the intended behavior. What version of tesseract are you using? Please post a sample image for testing.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

C.

Jan 8, 2015, 10:53:31 AM

to tesser...@googlegroups.com

tesseract 3.03, example is attached (5.tif is the original, 5.tig the result).

5.pdf

5.tif

C.

Jan 8, 2015, 10:54:44 AM

to tesser...@googlegroups.com

Sorry, I meant: 5.pdf is the resulting file.

ShreeDevi Kumar

Jan 9, 2015, 12:33:01 AM

to tesser...@googlegroups.com

I am using the git version; output and messages are attached. The PDF seems to have all the lines.

User@HP ~/tesseract-ocr/testing
$ tesseract 5.tif 5 pdf
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
OSD: Weak margin (5.78), horiz textlines, not CJK: Don't rotate.
Page 2
Too few characters. Skipping this page
OSD: Weak margin (0.00) for 0 blob text block, but using orientation anyway: 0
Empty page!!
Too few characters. Skipping this page
OSD: Weak margin (0.00) for 0 blob text block, but using orientation anyway: 0
Empty page!!
Warning in pixReadMemTiff: tiff page 2 not found

User@HP ~/tesseract-ocr/testing
$ tesseract -v
tesseract 3.04.00
leptonica-1.71
libgif 5.1.0 : libjpeg 8d : libpng 1.6.14 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.2
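
When results differ between machines, as they do here between 3.03 and 3.04.00, it can help to read the version number from a script. This is only a minimal sketch: it assumes the first line of the version output has the form shown above, `tesseract_version` is a hypothetical helper name, and because some builds may print the version to stderr, the usage lines merge both streams.

```shell
# Extract the version number from `tesseract -v` style output on stdin.
# Assumes the first line looks like "tesseract 3.04.00".
tesseract_version() {
  awk 'NR == 1 && $1 == "tesseract" { print $2 }'
}

# Usage: only runs when a tesseract binary is on the PATH.
if command -v tesseract >/dev/null 2>&1; then
  tesseract -v 2>&1 | tesseract_version
fi
```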

ShreeDevi

5.pdf

C.

Jan 9, 2015, 3:23:25 AM

to tesser...@googlegroups.com

First of all: thanks for your help.

Concerning my problem: I did a complete reinstall of Ubuntu Server 14.04 and installed tesseract 3.03 from the repos again, and the failure still occurs. As 3.03 does not seem to be that old, I did not install a newer version from GitHub and, to be honest, do not want to.

Is this a known bug?

ShreeDevi Kumar

Jan 9, 2015, 3:28:53 AM

to tesser...@googlegroups.com

As far as I know, pdf creation is a new addition and the issues were ironed out only recently. There have been over 100 commits to the code since 3.03 rc.

If you want the new functionality, you can try compiling the code from https://code.google.com/p/tesseract-ocr/source/checkout

Instructions are at https://code.google.com/p/tesseract-ocr/wiki/Compiling

ShreeDevi

C.

Jan 9, 2015, 7:07:51 AM

to tesser...@googlegroups.com

I tried to compile the version you mentioned (after installing the dependencies listed in the README), but make stops with the following error:

./.libs/libtesseract.so: undefined reference to `l_generateCIDataForPdf'
./.libs/libtesseract.so: undefined reference to `l_CIDataDestroy'
collect2: error: ld returned 1 exit status
make[2]: *** [tesseract] Fehler 1

ShreeDevi Kumar

Jan 9, 2015, 7:15:06 AM

to tesser...@googlegroups.com

You should uninstall the old version fully and then build the version from git. The build is possibly picking up some older libraries.

Also, this needs leptonica 1.71. I am not sure whether the documentation mentions it.
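
Before rebuilding, the leptonica requirement can be checked from the shell. The following is a rough sketch rather than an official procedure: it assumes leptonica registers a pkg-config module named `lept` and that GNU `sort -V` is available; `version_at_least` is a hypothetical helper.

```shell
# Return success if version $1 is at least version $2 (relies on GNU `sort -V`).
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Minimum leptonica version mentioned in this thread.
required=1.71

# Assumption: leptonica is registered with pkg-config under the name "lept".
if installed=$(pkg-config --modversion lept 2>/dev/null); then
  if version_at_least "$installed" "$required"; then
    echo "leptonica $installed is new enough"
  else
    echo "leptonica $installed is older than $required; upgrade it before building"
  fi
else
  echo "leptonica not found via pkg-config"
fi
```

If an old copy is reported even after upgrading, an older library elsewhere on the system may still be the one the linker finds.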

ShreeDevi

ShreeDevi Kumar

Jan 9, 2015, 7:16:03 AM

to tesser...@googlegroups.com

Please see https://code.google.com/p/tesseract-ocr/issues/detail?id=1278

ShreeDevi

C.

Jan 9, 2015, 12:34:25 PM

to tesser...@googlegroups.com

I did not manage to reinstall it cleanly, so I reinstalled the server again and installed only the latest version of tesseract from source.

Now "tesseracting" works fine again: all lines are shown in the resulting PDF file. So it has to be a bug in tesseract 3.03.

I hope the latest version reaches the Ubuntu repos soon (because I had some problems with the TESSDATA_PREFIX setting after compiling).

C.

Jan 9, 2015, 5:52:43 PM

to tesser...@googlegroups.com

After rebooting the server tesseract complains as follows:

Error opening data file /usr/local/tesseract-ocr/tessdata/deu.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'deu'
Tesseract couldn't load any languages!
Could not initialize tesseract.

I manually copied deu.traineddata to that folder and chmod'ed it to 777, but that only works until the next reboot.

I think I'll give up on Tesseract soon and stay with OCR in Acrobat Pro...

ShreeDevi Kumar

Jan 9, 2015, 10:31:42 PM

to tesser...@googlegroups.com

It looks like the reboot is resetting some variables, in particular the TESSDATA_PREFIX environment variable.

You can try giving the path on the command line. See the following batch file as a sample:

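---------
#!/bin/bash
# Usage: <this script> FILE_PATTERN LANGUAGE

# Page Segmentation Modes
# 3 = Fully automatic page segmentation, but no OSD. (Default)
# 4 = Assume a single column of text of variable sizes.
# 6 = Assume a single uniform block of text.
PSM=3

MYFILE=$1
# Named OCR_LANG rather than LANG so the shell's locale variable is not overwritten.
OCR_LANG=$2
PDF=pdf
MYOUTPUTFILE=$MYFILE-merged
# Timestamp (not used further below).
now=$(date +"%y%m%d-%H%M")

# -f: do not complain if no merged text file exists from an earlier run.
rm -f "$MYOUTPUTFILE.txt"

for f in *"$MYFILE"*.tif
do
  echo "Starting OCR for $f file with -l $OCR_LANG at $(date) , please wait..."
  # --tessdata-dir gives the data location on the command line instead of relying on TESSDATA_PREFIX.
  tesseract --tessdata-dir /home/shree/tesseract-ocr "$f" "$f-$OCR_LANG" -l "$OCR_LANG" -psm "$PSM" "$PDF"
  cat "$f-$OCR_LANG.txt" >> "$MYOUTPUTFILE.txt"
done
echo "OCR done"

# Merge the per-image PDFs into a single file with Ghostscript.
gswin32c -dPDFA -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sProcessColorModel=DeviceCMYK -sPDFACompatibilityPolicy=2 -sOutputFile="$MYOUTPUTFILE.pdf" *"$MYFILE"*-"$OCR_LANG".pdf
echo "pdf merged"
---------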
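
If you would rather keep using TESSDATA_PREFIX, the error message earlier in the thread says it must point to the parent directory of "tessdata". A small sketch for locating that directory follows. `find_tessdata_parent` is a hypothetical helper, and the candidate directories other than the one from the error message are guesses that will differ per installation.

```shell
# Print the first directory that contains tessdata/<lang>.traineddata.
# $1 is the language code, the remaining arguments are candidate parent directories.
find_tessdata_parent() {
  lang=$1
  shift
  for dir in "$@"; do
    if [ -f "$dir/tessdata/$lang.traineddata" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
  done
  return 1
}

# Candidate directories are assumptions; adjust them to the local installation.
if parent=$(find_tessdata_parent deu /usr/local/tesseract-ocr /usr/local/share /usr/share/tesseract-ocr); then
  export TESSDATA_PREFIX="$parent"
  echo "TESSDATA_PREFIX set to $TESSDATA_PREFIX"
else
  echo "deu.traineddata not found in any candidate directory"
fi
```

An export like this only lasts for the current shell session. Putting the export line into a shell startup file such as ~/.profile is one common way to have it set again after a reboot.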

ShreeDevi

Mika Koistinen

May 27, 2016, 8:29:02 AM

to tesseract-ocr

It looks like I have a related problem when trying to create hOCR files for single-word images. The result for the single word disappears from the hOCR output, but I can find it in the txt file when I run without the hOCR parameter.

I am using

tesseract 3.05.00dev

leptonica-1.73

libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8

myimage.tif does not work (it only works if I use psm 5, 6, 7, 8, 9 or 10 and plain text output)

However, this image works (both txt and hocr formats)

ERROR message:
Too few characters. Skipping this page
OSD: Weak margin (0.00) for 1 blob text block, but using orientation anyway: 0
Empty page!!

Tom Morris

Jun 2, 2016, 6:49:31 PM

to tesseract-ocr

The "too few characters. Skipping this page" message explains what's going on.

How are you requesting hOCR output? If you are using the default `hocr` config file, it not only enables hOCR output, but it also changes the page segmentation mode to 1, which is what's causing the problem.

You can remove this line:

tessedit_pageseg_mode 1

or change it to a more appropriate page segmentation mode like

tessedit_pageseg_mode 6
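
The first option can be tried on a copy, so the installed `hocr` config stays untouched. This is a minimal sketch: `strip_pageseg_mode` is a hypothetical helper, and the paths in the usage comment are placeholders.

```shell
# Copy a Tesseract config file ($1) to a new file ($2),
# dropping any line that sets tessedit_pageseg_mode.
strip_pageseg_mode() {
  sed '/^[[:space:]]*tessedit_pageseg_mode[[:space:]]/d' "$1" > "$2"
}

# Usage (placeholder paths, adjust to the local tessdata directory):
#   strip_pageseg_mode /path/to/tessdata/configs/hocr /path/to/tessdata/configs/hocr_nopsm
#   tesseract myimage.tif out -psm 6 hocr_nopsm
```

With the line removed, the page segmentation mode given on the command line should no longer be overridden by the config file.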

Tom
