Lines disappear in resulting file

C.

Jan 8, 2015, 9:30:20 AM
to tesser...@googlegroups.com
If I run a simple "tesseract 1.tif 2 pdf", all vertical and horizontal lines (and graphics with thin lines) in the source file disappear in the resulting PDF file (Ubuntu server 12.04, tesseract 3.03).

Is that the intended behavior?

ShreeDevi Kumar

Jan 8, 2015, 10:02:31 AM
to tesser...@googlegroups.com
I don't think that's the intended behavior. What version of tesseract are you using? Please post a sample image for testing.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

C.

Jan 8, 2015, 10:53:31 AM
to tesser...@googlegroups.com
tesseract 3.03, example is attached (5.tif is the original, 5.tig the result).
5.pdf
5.tif

C.

Jan 8, 2015, 10:54:44 AM
to tesser...@googlegroups.com
Sorry, I meant: 5.pdf is the resulting file.

ShreeDevi Kumar

Jan 9, 2015, 12:33:01 AM
to tesser...@googlegroups.com
I am using the git version; output and messages are attached. The PDF seems to have all the lines.

User@HP ~/tesseract-ocr/testing
$ tesseract 5.tif 5 pdf
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
OSD: Weak margin (5.78), horiz textlines, not CJK: Don't rotate.
Page 2
Too few characters. Skipping this page
OSD: Weak margin (0.00) for 0 blob text block, but using orientation anyway: 0
Empty page!!
Too few characters. Skipping this page
OSD: Weak margin (0.00) for 0 blob text block, but using orientation anyway: 0
Empty page!!
Warning in pixReadMemTiff: tiff page 2 not found

User@HP ~/tesseract-ocr/testing
$ tesseract -v
tesseract 3.04.00
 leptonica-1.71
  libgif 5.1.0 : libjpeg 8d : libpng 1.6.14 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.2



5.pdf

C.

Jan 9, 2015, 3:23:25 AM
to tesser...@googlegroups.com
First of all: thanks for your help.

Concerning my problem: I did a complete reinstall of the Ubuntu 14.04 server and installed tesseract 3.03 from the repos again, and the failure still occurs! As 3.03 does not seem to be that old, I did not and, to be honest, do not want to install a newer version from GitHub.

Is this a known bug?

ShreeDevi Kumar

Jan 9, 2015, 3:28:53 AM
to tesser...@googlegroups.com
As far as I know, pdf creation is a new addition and the issues were ironed out only recently. There have been over 100 commits to the code since 3.03 rc. 

If you want the new functionality, you can try compiling the code from https://code.google.com/p/tesseract-ocr/source/checkout



C.

Jan 9, 2015, 7:07:51 AM
to tesser...@googlegroups.com
I tried to compile the version you mentioned (after installing the dependencies listed in the README), but make stops with the following error:

./.libs/libtesseract.so: undefined reference to `l_generateCIDataForPdf'
./.libs/libtesseract.so: undefined reference to `l_CIDataDestroy'
collect2: error: ld returned 1 exit status
make[2]: *** [tesseract] Error 1

ShreeDevi Kumar

Jan 9, 2015, 7:15:06 AM
to tesser...@googlegroups.com
You should fully uninstall the old version and then build the version from git. The build is possibly linking against some older libraries.

Also, this needs Leptonica 1.71. I am not sure whether the documentation mentions it.
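
The two undefined symbols carry Leptonica's `l_` prefix, so one quick check before rebuilding is whether the Leptonica that the build finds is at least 1.71. A minimal sketch, assuming GNU `sort` with its `-V` (version sort) option and assuming Leptonica's pkg-config module is named `lept`; `version_ge` is a hypothetical helper name, not part of any tool:

```shell
#!/bin/sh
# version_ge A B: succeed if version A >= version B (hypothetical helper).
# Relies on GNU sort's -V option, which orders version strings numerically.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example use (assumes pkg-config knows a module named "lept"):
#   installed=$(pkg-config --modversion lept)
#   version_ge "$installed" 1.71 || echo "Leptonica $installed is older than 1.71"
```

If the reported version is older, or pkg-config finds a different copy than expected, that would be consistent with the build picking up an old library.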


C.

Jan 9, 2015, 12:34:25 PM
to tesser...@googlegroups.com
I did not manage a clean reinstall, so I reinstalled the server again and installed only the latest version of tesseract from source.

Now "tesseracting" works fine again: all lines are shown in the resulting PDF file. So it has to be a bug in tesseract 3.03.

I hope the latest version reaches the Ubuntu repos soon (because I had some problems with the TESSDATA_PREFIX setting after compiling).

C.

Jan 9, 2015, 5:52:43 PM
to tesser...@googlegroups.com
After rebooting the server, tesseract complains as follows:

Error opening data file /usr/local/tesseract-ocr/tessdata/deu.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'deu'
Tesseract couldn't load any languages!
Could not initialize tesseract.

I manually copied deu.traineddata to that folder and chmod'ed it to 777, but that only works until the next reboot.

I think I'll give up on Tesseract soon and stay with OCR in Acrobat Pro...

ShreeDevi Kumar

Jan 9, 2015, 10:31:42 PM
to tesser...@googlegroups.com
It looks like the reboot is resetting the TESSDATA_PREFIX environment variable.

You can try giving the path on the command line instead. See the following batch file as a sample:

---------
# Usage: <script> <file-name-pattern> <language>
# Page Segmentation Modes
# 3 = Fully automatic page segmentation, but no OSD. (Default)
# 4 = Assume a single column of text of variable sizes.
# 6 = Assume a single uniform block of text.
PSM=3
MYFILE=$1
OCRLANG=$2   # not LANG: that name is the locale variable read by other programs
PDF=pdf
MYOUTPUTFILE=$MYFILE-merged

rm -f "$MYOUTPUTFILE.txt"   # -f: no error if the file does not exist yet
for f in *"$MYFILE"*.tif
do
  echo "Starting OCR for $f file with -l $OCRLANG at $(date), please wait..."
  # Pass the data path on the command line instead of relying on TESSDATA_PREFIX
  tesseract --tessdata-dir /home/shree/tesseract-ocr "$f" "$f-$OCRLANG" -l "$OCRLANG" -psm "$PSM" "$PDF"
  cat "$f-$OCRLANG.txt" >> "$MYOUTPUTFILE.txt"
done
echo "OCR done"

# Merge the per-file PDFs with Ghostscript (gswin32c is the Windows console build)
gswin32c -dPDFA -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sProcessColorModel=DeviceCMYK -sPDFACompatibilityPolicy=2 -sOutputFile="$MYOUTPUTFILE.pdf" *"$MYFILE"*-"$OCRLANG".pdf
echo "pdf merged"

---------
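
If the variable really is lost at reboot, another option is to set it from a login script so that every new session defines it. A minimal sketch, assuming a Bourne-style shell; `set_tessdata_prefix` is a hypothetical helper name, and, following the error message quoted earlier, the prefix is taken to be the parent directory of `tessdata`:

```shell
#!/bin/sh
# set_tessdata_prefix PREFIX LANG_CODE (hypothetical helper)
# Exports TESSDATA_PREFIX only if PREFIX/tessdata/LANG_CODE.traineddata exists,
# so a wrong path is reported immediately instead of at OCR time.
set_tessdata_prefix() {
    prefix=$1
    lang_code=$2
    if [ -f "$prefix/tessdata/$lang_code.traineddata" ]; then
        TESSDATA_PREFIX=$prefix
        export TESSDATA_PREFIX
        return 0
    fi
    echo "missing $prefix/tessdata/$lang_code.traineddata" >&2
    return 1
}

# Example use, with the directory taken from the error message in this thread:
#   set_tessdata_prefix /usr/local/tesseract-ocr deu
```

Calling the function from a file such as `~/.profile` would make the setting survive a reboot for login shells; services started outside a login shell would need the variable set in their own environment.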




Mika Koistinen

May 27, 2016, 8:29:02 AM
to tesseract-ocr
It looks like I have a related problem when trying to create hOCR files for single-word images. The result for a single word disappears, although I can find it in the txt output when I run without the hocr parameter.

I am using 

tesseract 3.05.00dev
 leptonica-1.73
  libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8

myimage.tif does not work (it only works if I use psm 5, 6, 7, 8, 9 or 10 and plain text output).

However, this image works (in both txt and hocr formats).

Error message:

Too few characters. Skipping this page
OSD: Weak margin (0.00) for 1 blob text block, but using orientation anyway: 0
Empty page!!


Tom Morris

Jun 2, 2016, 6:49:31 PM
to tesseract-ocr
The "too few characters. Skipping this page" message explains what's going on.

How are you requesting hOCR output? If you are using the default `hocr` config file, it not only enables hOCR output, but it also changes the page segmentation mode to 1, which is what's causing the problem.

You can remove this line:

tessedit_pageseg_mode 1

or change it to a more appropriate page segmentation mode, such as:

tessedit_pageseg_mode 6
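
Rather than editing the stock `hocr` file, the same effect can probably be had with a separate config file. A sketch, assuming tesseract accepts a path to a config file as the trailing argument and that `tessedit_create_hocr` is the variable that turns on hOCR output; `write_hocr_config` and the file name `hocr-block` are made up for illustration:

```shell
#!/bin/sh
# write_hocr_config FILE (hypothetical helper)
# Writes a config that enables hOCR output but selects page segmentation
# mode 6 (single uniform block of text) instead of mode 1.
write_hocr_config() {
    cat > "$1" <<'EOF'
tessedit_create_hocr 1
tessedit_pageseg_mode 6
EOF
}

# Example use (requires tesseract; file names are placeholders):
#   write_hocr_config ./hocr-block
#   tesseract myimage.tif myimage ./hocr-block
```

This leaves the installed `hocr` config untouched, so other jobs that depend on its default behavior are not affected.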


Tom
