tesseract on cygwin


Simon Eigeldinger

Jul 22, 2015, 1:41:27 PM
to tesser...@googlegroups.com
Hi,

Sorry for starting a new thread, but I deleted all the other mails.

I just updated the package for the German and English languages to include
osd.traineddata to make the error go away.
The other 2 files are unchanged at the moment.
Bleeding-edge code doesn't compile every day, I know that.
That's why I just dropped the error here.
But I guess Zdenko was writing to the other guy.

greetings,
simon

--
Simon Eigeldinger
Follow me on Twitter: http://www.twitter.com/domasofan/
E-Mail: simon.ei...@vol.at
MSN: simon_ei...@hotmail.com
ICQ: 121823966
Jabber: doma...@andrelouis.com

ShreeDevi Kumar

Jul 22, 2015, 10:56:41 PM
to tesser...@googlegroups.com

Excellent instructions, Simon. 

I am downloading it and will give it a try under Windows 8.

I would suggest that you add 'Tesseract for Windows' as a heading on the instructions page too. Thanks!



>> Did you manage to build the training tools?
>> Zdenko


Zdenko,

http://domasofan.spdns.eu/tesseract/tesseract-core-20150721.exe only has tesseract.exe, so the training tools are NOT included in this.




ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/55AFD5BF.3070508%40vol.at.
For more options, visit https://groups.google.com/d/optout.

Simon Eigeldinger

Jul 23, 2015, 12:16:35 PM
to tesser...@googlegroups.com
Hi,

I just fixed the how-to file.


greetings,
simon

ShreeDevi Kumar

Jul 24, 2015, 1:10:57 AM
to tesser...@googlegroups.com
Simon,

I gave the Cygwin-compiled Windows binary a try. It runs fine and I was able to create the txt and hocr output.

I am getting errors when creating the PDF, and also when I use a GIF as input. Just FYI, in the past I have used MSYS2 on the same PC for building tesseract; I am not sure whether that is causing a conflict.

Here is the log, in case you can help me pinpoint the problem. If you have a debug version compiled, I'll download that and give it a try. Thanks!


Microsoft Windows [Version 6.3.9600]
(c) 2013 Microsoft Corporation. All rights reserved.

C:\Users\User>cd Downloads

C:\Users\User\Downloads>cd TESS

C:\Users\User\Downloads\TESS>tess eurotext eng

C:\Users\User\Downloads\TESS>tesseract test/eurotext.tif test/eurotext-eng-txt -l eng
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
Warning in pixReadMemTiff: tiff page 1 not found

C:\Users\User\Downloads\TESS>tesseract test/eurotext.tif test/eurotext-eng-hocr -l eng hocr
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
Warning in pixReadMemTiff: tiff page 1 not found

C:\Users\User\Downloads\TESS>tesseract test/eurotext.tif test/eurotext-eng-pdf -l eng pdf
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
Error in fopenWriteStream: stream not opened
Error in pixWrite: stream not opened
Error in fopenReadStream: file not found
Error in extractG4DataFromFile: stream not opened to file
Error in l_generateG4Data: datacomp not extracted
Error in pixGenerateCIData: g4 data not made
Error in l_generateCIDataForPdf: file test/eurotext.tif format is 4; unreadable
Error during processing.


C:\Users\User\Downloads\TESS>tesseract test/phototest.gif phototest.gif -l eng
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Warning in pixReadMemGif: writing to a temp file, not directly to memory
Error in fopenWriteStream: stream not opened
Error in l_binaryWrite: stream not opened
Error in fopenReadStream: file not found
Error in pixRead: image file not found: /tmp/leptonica/791198_5264_mem.gif
Error in pixReadMemGif: pix not read
Error in pixReadMem: gif: no pix returned
Error during processing.

C:\Users\User\Downloads\TESS>
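
One possible reading of the log above, which nothing in this thread confirms, is that the temporary file under /tmp/leptonica could not be created, since "fopenWriteStream: stream not opened" comes right before the "file not found" errors. A minimal Python sketch for checking whether a directory accepts temporary files follows. The function name `can_write_in` is hypothetical, and how /tmp maps to a real Windows folder for this binary is an assumption.

```python
import os
import tempfile


def can_write_in(directory):
    """Return True if a file can be created, written, and read back in `directory`.

    The directory is created first if it does not exist yet.
    """
    try:
        os.makedirs(directory, exist_ok=True)
        # NamedTemporaryFile picks a unique name and deletes the file on close.
        with tempfile.NamedTemporaryFile(dir=directory, suffix="_mem.gif") as handle:
            handle.write(b"probe")
            handle.flush()
            handle.seek(0)
            return handle.read() == b"probe"
    except OSError:
        # Covers missing permissions, a file blocking the path, and similar failures.
        return False


if __name__ == "__main__":
    # /tmp/leptonica is the path printed in the log; the second entry is
    # whatever temp directory Python itself would use on this machine.
    for candidate in ("/tmp/leptonica", tempfile.gettempdir()):
        state = "writable" if can_write_in(candidate) else "NOT writable"
        print(candidate, state)
```

If the first path is reported as not writable while the second is, that would at least be consistent with the temp-file reading of the errors; it would not prove it.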



ShreeDevi

zdenko podobny

Jul 24, 2015, 2:18:05 AM
to tesser...@googlegroups.com
On Fri, Jul 24, 2015 at 7:10 AM, ShreeDevi Kumar <shree...@gmail.com> wrote:
C:\Users\User\Downloads\TESS>tesseract test/eurotext.tif test/eurotext-eng-pdf -l eng pdf
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
Error in fopenWriteStream: stream not opened
Error in pixWrite: stream not opened
Error in fopenReadStream: file not found
Error in extractG4DataFromFile: stream not opened to file
Error in l_generateG4Data: datacomp not extracted
Error in pixGenerateCIData: g4 data not made
Error in l_generateCIDataForPdf: file test/eurotext.tif format is 4; unreadable
Error during processing.

It looks like a Leptonica issue. Did you try to build and run the Leptonica progs (all those that have pdf in the name)?



Zdenko 

Simon Eigeldinger

Jul 24, 2015, 2:42:10 AM
to tesser...@googlegroups.com
Hi,

I never tried to give tesseract a PDF as input.
Cygwin has Leptonica 1.71 or 1.72 by default, so I used that for compiling.
Maybe Leptonica doesn't like PDF files, so it might complain.
So ShreeDevi Kumar might convert the PDF into an image or use a
normal image (tif, jpg, etc.).


greetings,
simon

zdenko podobny

Jul 24, 2015, 2:51:20 AM
to tesser...@googlegroups.com
It is not about input, but output.
PDF output is a key feature of the Leptonica 1.71 release (and tesseract 3.03/3.04), and I guess it was not tested on Cygwin yet.
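
To make the input/output distinction concrete, here is a small Python sketch that runs the same image through the three output types used in this thread (plain text, hocr, pdf). It relies only on the command-line form already shown in the logs, `tesseract IMAGE OUTPUTBASE -l LANG [CONFIG]`. The helper names are hypothetical, and whether the exit code reflects "Error during processing." is an assumption that should be checked against the printed messages.

```python
import subprocess


def build_tesseract_command(image, output_base, lang="eng", output_config=None):
    """Build the argument list: tesseract IMAGE OUTPUTBASE -l LANG [CONFIG]."""
    command = ["tesseract", image, output_base, "-l", lang]
    if output_config:  # "hocr" or "pdf"; left out for plain text output
        command.append(output_config)
    return command


def run_all_outputs(image, output_prefix, lang="eng"):
    """Run tesseract once per output type and return {type: exit code}."""
    results = {}
    for name, config in (("txt", None), ("hocr", "hocr"), ("pdf", "pdf")):
        command = build_tesseract_command(
            image, f"{output_prefix}-{name}", lang, config
        )
        results[name] = subprocess.run(command).returncode
    return results
```

For example, `run_all_outputs("test/eurotext.tif", "test/eurotext-eng")` would repeat the three runs from the log, with the input image unchanged and only the trailing output config differing.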

Zdenko


Simon Eigeldinger

Jul 24, 2015, 2:54:26 AM
to tesser...@googlegroups.com
hi,

I did test it 2 days ago and it seemed to work,
at least over here and on a Windows 7 machine in the office.
But I could check again.


greetings,
simon

Simon Eigeldinger

Jul 24, 2015, 3:48:53 AM
to tesser...@googlegroups.com
hi,

Sorry, I missed the point.
I just reproduced it:

$ tesseract testing\eurotext.tif testing\eurotext -l eng+deu pdf

Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Page 1
Error in fopenWriteStream: stream not opened
Error in pixWrite: stream not opened
Error in fopenReadStream: file not found
Error in extractG4DataFromFile: stream not opened to file
Error in l_generateG4Data: datacomp not extracted
Error in pixGenerateCIData: g4 data not made
Error in l_generateCIDataForPdf: file testing\eurotext.tif format is 4; unreadable
Error during processing.



The PDF comes out, but you can't open it.
Adobe Reader shows an error saying that it is corrupted.
I did another test without pdf.

$ tesseract testing\eurotext.tif testing\eurotext -l eng+deu

Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Page 1
Warning in pixReadMemTiff: tiff page 1 not found

It creates a text file which seems to contain everything, but shows the warning
message.
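
For the PDF that comes out but cannot be opened, a quick structural check can tell a clearly truncated file from one that at least has the expected start and end markers. The Python sketch below only looks for the `%PDF-` header and an `%%EOF` marker near the end of the file. The function name and the 1024-byte tail window are arbitrary choices for illustration, and passing the check does not mean a viewer will accept the file.

```python
def looks_like_complete_pdf(path, tail_bytes=1024):
    """Cheap structural check: '%PDF-' at the start, '%%EOF' near the end."""
    with open(path, "rb") as handle:
        header = handle.read(5)
        handle.seek(0, 2)  # move to the end to learn the file size
        size = handle.tell()
        handle.seek(max(0, size - tail_bytes))
        tail = handle.read()
    return header == b"%PDF-" and b"%%EOF" in tail
```

Running it on the output of the failing command (the file name depends on the output base given to tesseract) would show whether the file was cut off before the trailer was written.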

I put a newly compiled version on my fake website so people can play with
the training tools as well.
Now I am off for 2 weeks.
Have a nice time while I am not around.

greetings,
simon

marco....@gmail.com

Jul 26, 2015, 4:06:31 PM
to tesseract-ocr, simon.ei...@vol.at

on cygwin with the just built 3.04.00 package

$ tesseract -l eng+deu eurotext.tif eurotext  pdf

Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Page 1
Warning in pixReadMemTiff: tiff page 1 not found

the pdf is fine (just looks as bad as the original eurotext.tif picture)

Regards
Marco
(tesseract+leptonica cygwin package maintainer)
 

ShreeDevi Kumar

Jul 26, 2015, 10:55:03 PM
to tesser...@googlegroups.com, simon.ei...@vol.at
Thank you, Marco.

1. Is there a way to download just the tesseract package and its dependencies (like Simon had set up) for testing purposes, for those who do not have a Cygwin install?

2. The pdf output option (as far as I understand it) adds the OCRed text layer on top of a copy of the original image, so looking like the original image is intentional.

3. Are the training tools (text2image and other programs from training directory) included as part of this? If so, may I request you to also include the bash scripts in training directory - tesstrain.sh, tesstrain_util.sh and language-specific.sh. Training also requires langdata which is available in a separate repository - https://github.com/tesseract-ocr/langdata

Question for Zdenko, Jeff, Ray ...

Should Tesseract training tools be packaged separately from tesseract-ocr, since not everyone is interested in doing training?




ShreeDevi


Marco Atzeri

Jul 27, 2015, 2:20:21 AM
to tesser...@googlegroups.com


On 7/27/2015 4:54 AM, ShreeDevi Kumar wrote:
> Thank you, Marco.
>
> 1. Is there a way to download just the tesseract package and
> dependencies (like Simon had setup) for testing purposes for those who
> do not have a cygwin install?

possible:

The package is available on mirrors:
http://mirrors.kernel.org/sourceware/cygwin/x86_64/release/tesseract-ocr/

the setup.hint reports the dependencies, in this case
requires: cygwin libgcc1 libleptonica_3 libstdc++6 libtesseract-ocr_3
tesseract-ocr-eng


http://mirrors.kernel.org/sourceware/cygwin/x86_64/release/leptonica/libleptonica_3/
requires: cygwin libgif4 libjpeg8 libpng16 libtiff6 libwebp5 zlib0

and so on..
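
The "and so on" step, following each package's `requires:` line to the next setup.hint, can be sketched in Python as below. The two hint texts are copied from this message; joining the wrapped first `requires:` line into a single line is an assumption about how the file looks on the mirror, and both function names are hypothetical. Packages without an entry in the mapping are treated as having no further dependencies.

```python
def parse_requires(hint_text):
    """Return the package names listed on the 'requires:' line of a setup.hint."""
    for line in hint_text.splitlines():
        if line.startswith("requires:"):
            return line[len("requires:"):].split()
    return []


def collect_dependencies(package, hints):
    """Follow 'requires:' lines transitively; `hints` maps package -> hint text."""
    seen = []
    pending = [package]
    while pending:
        current = pending.pop()
        for dependency in parse_requires(hints.get(current, "")):
            if dependency not in seen:
                seen.append(dependency)
                pending.append(dependency)
    return sorted(seen)


# Hint lines as quoted in this message (first one re-joined onto one line).
EXAMPLE_HINTS = {
    "tesseract-ocr": "requires: cygwin libgcc1 libleptonica_3 libstdc++6 "
                     "libtesseract-ocr_3 tesseract-ocr-eng",
    "libleptonica_3": "requires: cygwin libgif4 libjpeg8 libpng16 libtiff6 "
                      "libwebp5 zlib0",
}

if __name__ == "__main__":
    for name in collect_dependencies("tesseract-ocr", EXAMPLE_HINTS):
        print(name)
```

With only these two hints loaded, the walk stops at the Leptonica image libraries; a complete list would need the setup.hint of every package it reaches.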

> 2. The pdf output option (as far as I understand it) adds the OCRed text
> layer on top of copy of the original image, so looking like the original
> image is by intention.

I guessed so.

>
> 3. Are the training tools (text2image and other programs from training
> directory) included as part of this? If so, may I request you to also
> include the bash scripts in training directory - tesstrain.sh,
> tesstrain_util.sh and language-specific.sh. Training also requires
> langdata which is available in a separate repository -
> https://github.com/tesseract-ocr/langdata

No, from what I see nothing is built or installed from the training
directory.
Is there a specific switch for that?
I am using a clean configure call.

ShreeDevi Kumar

Jul 27, 2015, 3:05:23 AM
to tesser...@googlegroups.com

Marco,

Please see https://github.com/tesseract-ocr/tesseract/wiki/Compiling for dependencies and instructions for compiling training tools.

They may not compile with 3.04.00; please see https://github.com/tesseract-ocr/tesseract/issues/61 (closed issue #61: "building tesseract under cygwin: training tools don't build").

- sent from my phone. excuse the brevity and typos.


Marco Atzeri

Jul 27, 2015, 5:45:33 AM
to tesser...@googlegroups.com
Hi ShreeDevi,

Which ICU libs are requested?

$ pkg-config --list-all|grep icu-
icu-i18n icu-i18n - International Components for Unicode: Internationalization library
icu-uc icu-uc - International Components for Unicode: Common and Data libraries
icu-io icu-io - International Components for Unicode: Stream and I/O Library
icu-le icu-le - International Components for Unicode: Layout library
icu-lx icu-lx - International Components for Unicode: Paragraph Layout library

$ pkg-config --libs icu-i18n
-licui18n -licuuc -licudata -lpthread -lm
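
To compare what each ICU module pulls in, the `pkg-config --libs` output can be reduced to plain library names. The Python sketch below does this for the output line shown above; `library_names` and `query_libs` are hypothetical helper names, and `query_libs` assumes `pkg-config` is on the PATH.

```python
import subprocess


def library_names(pkg_config_libs_output):
    """Extract library names from '-l' flags, e.g. '-licuuc' -> 'icuuc'."""
    return [
        flag[2:]
        for flag in pkg_config_libs_output.split()
        if flag.startswith("-l")
    ]


def query_libs(module):
    """Run 'pkg-config --libs MODULE' and return the library names it reports."""
    result = subprocess.run(
        ["pkg-config", "--libs", module],
        capture_output=True,
        text=True,
        check=True,
    )
    return library_names(result.stdout)
```

Calling `query_libs` for each module name from the `--list-all` output (icu-i18n, icu-uc, icu-io, icu-le, icu-lx) would show which of them link against icui18n.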


On 7/27/2015 9:05 AM, ShreeDevi Kumar wrote:
> Marco,
>
> Please see https://github.com/tesseract-ocr/tesseract/wiki/Compiling for
> dependencies and instructions for compiling training tools.
>
> They may not compile with 3.04.00 , please see
> https://github.com/tesseract-ocr/tesseract/issues/61
> Closed
>
> *building **tesseract**under **cygwin**: training tools don't build #61*
>
> - sent from my phone. excuse the brevity and typos.

ShreeDevi Kumar

Jul 27, 2015, 10:53:13 AM
to tesser...@googlegroups.com

Most probably icui18n

see issues 61 and 62

- sent from my phone. excuse the brevity and typos.
