Tesseract 2.00 Now available!

thera...@gmail.com

Jul 18, 2007, 6:12:50 PM
to tesseract-ocr
The code for Tesseract 2.00 is now checked in to subversion and the
tarballs are on the main site.
See http://code.google.com/p/tesseract-ocr/downloads/list.
Note that this version recognizes 6 languages. To be completely
language independent, there is *no* language data with the source, so
you have to download a separate language file to get it to work at
all.

The wiki (http://code.google.com/p/tesseract-ocr/w/list) has been
extensively updated with new release notes
(http://code.google.com/p/tesseract-ocr/wiki/ReleaseNotes), documentation on
training (http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract), and
documentation on testing
(http://code.google.com/p/tesseract-ocr/wiki/TestingTesseract).

Be aware that this version has substantial changes and therefore may
have broken the build on one of the systems for which we have no
direct testing. The extern "C" problem should be a thing of the past
however.

Ray.

slch...@gmail.com

Jul 18, 2007, 9:59:38 PM
to tesseract-ocr
Hi,

I have downloaded the latest Tesseract and tried it out. When I run
Tesseract.exe on Windows, it gives me this error: "the system
cannot execute the specified program". Do I have to run the training
first in order to run the tesseract and dlltest executables?

lohith

Jul 19, 2007, 6:13:14 AM
to tesseract-ocr
Hi,

While installing Tesseract from SVN, I found that the following
files are not checked in:
tessdata/tessconfigs/Makefile.in
tessdata/configs/Makefile.in

This was breaking ./configure.
After copying these files from tesseract-2.00.tar.gz, I was able to
build the system.

--lohith
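
For anyone hitting the same ./configure failure on an older checkout, the workaround described above can be sketched in Python. This is only an illustration: the directory names `tesseract-2.00` (extracted tarball) and `tesseract-ocr` (SVN working copy) are assumptions, and the helper `copy_missing_makefiles` is hypothetical.

```python
import shutil
from pathlib import Path

# The two files reported missing from SVN in this thread.
MISSING_FILES = [
    "tessdata/tessconfigs/Makefile.in",
    "tessdata/configs/Makefile.in",
]


def copy_missing_makefiles(tarball_dir, checkout_dir):
    """Copy each missing Makefile.in from the extracted tarball into the checkout.

    Files already present in the checkout are left untouched.
    Returns the list of relative paths that were actually copied.
    """
    copied = []
    for rel in MISSING_FILES:
        src = Path(tarball_dir) / rel
        dst = Path(checkout_dir) / rel
        if dst.exists() or not src.exists():
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        copied.append(rel)
    return copied


if __name__ == "__main__":
    # Directory names are assumptions; adjust them to your own layout.
    print(copy_missing_makefiles("tesseract-2.00", "tesseract-ocr"))
```

After the copy, ./configure can be re-run in the checkout as usual.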

rlo...@gmail.com

Jul 19, 2007, 11:13:12 AM
to tesseract-ocr
Thank you very much for your contribution to open source, and especially
to OCR.

I'm new to Tesseract and I'm a little bit confused. I have
already used OCR systems for Windows, such as Readiris and Omnipage
Pro. I want to have a flexible OCR system under Ubuntu 7.04, and I
have found Tesseract, which seems to be pretty interesting. I have
three general questions:

a) How does Tesseract compare to Omnipage and Readiris? Is it better,
easier to configure, ...?

b) Is there any 'simple' guide for novices?

c) Would you (or anybody) please give me some advice on how to process 50
pages that are already scanned and stored as JPG? These pages contain a mix
of Spanish and English text.

Again, thank you very much for your efforts,
kind regards,
Ricardo

thera...@gmail.com

Jul 19, 2007, 11:54:38 AM
to tesseract-ocr
Please see issue 43. http://code.google.com/p/tesseract-ocr/issues/detail?id=43
As a short-term solution (this version only) I will upload a set of
exes from VC++6. These should be more easily usable.
Ray.

thera...@gmail.com

Jul 19, 2007, 11:55:10 AM
to tesseract-ocr
Added. Thanks for pointing that out.
Ray.

thera...@gmail.com

Jul 19, 2007, 12:16:14 PM
to tesseract-ocr
a) In 1995, Tesseract was about as good as Omnipage, but since then a
lot of improvements have been made to Omnipage by myself and my
ex-colleagues at Caere/Scansoft/Nuance, while Tesseract sat on a shelf.
In 2000, Readiris was way behind on accuracy, but I have not tested it
recently and they may have caught up. IMHO, Omnipage and Finereader
are the best that you can buy today, and Tesseract is the best that
you can get for free.

b) Read all the wiki pages at http://code.google.com/p/tesseract-ocr/w/list.
Note that Tesseract currently lacks some important features, such as page
segmentation, and it has no graphical user interface (GUI).

c) You will have to convert from jpg to tif. Mixed languages may be a
problem. You will have to pick the most frequent language, but bias your
choice in favour of Spanish, as it has more accented characters, and with
English you will lose them.
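
The advice in (c) can be sketched as a small batch script. This is a rough illustration, not a tested recipe: it assumes the JPG pages have already been converted to TIFF with some external image tool (the conversion itself is not shown), that the Spanish language data is installed, and that the `-l spa` language option behaves as described in the release notes. The helper names `tif_name`, `ocr_command` and `ocr_pages` are made up for this example.

```python
import subprocess
from pathlib import Path


def tif_name(jpg_path):
    """Return the TIFF file name expected for a scanned JPG page."""
    return str(Path(jpg_path).with_suffix(".tif"))


def ocr_command(tif_path, lang="spa"):
    """Build the tesseract command line for one TIFF page.

    The output base is the image name without its extension, so
    page01.tif is expected to produce page01.txt.
    """
    output_base = str(Path(tif_path).with_suffix(""))
    return ["tesseract", tif_path, output_base, "-l", lang]


def ocr_pages(jpg_pages, lang="spa", dry_run=True):
    """Return the commands for all pages; run them when dry_run is False."""
    commands = [ocr_command(tif_name(page), lang) for page in jpg_pages]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: only print the command lines for two example pages.
    for cmd in ocr_pages(["page01.jpg", "page02.jpg"]):
        print(" ".join(cmd))
```

Spanish is used as the default here because of the point about accents; passing `lang="eng"` would switch the whole batch to English.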

74yrsold

Jul 21, 2007, 3:56:57 AM
to tesseract-ocr
Hi,
I tried on the Windows platform and got the same error message, "the system
cannot execute the specified program", displayed in the Command Prompt.
Unless this is fixed, I doubt that training will run either, though I have
not tested it yet.
Is any solution forthcoming?

74yrsold

Jul 21, 2007, 6:58:10 AM
to tesseract-ocr
Mr. Ray,
With reference to your assurance, viz.

"As a short-term solution (this version only) I will upload a
set of exes from VC++6. These should be more easily usable."

I am extremely thankful to you for the immediate implementation. I just
downloaded "tesseract-2.exe6" from the website:
http://code.google.com/p/tesseract-ocr/downloads/list.
Running tesseract.exe works fine, without any error message. Now I
shall test the training exe files.
With regards,
-sriranga(74yrsold)

Ray Smith

Jul 21, 2007, 10:46:31 AM
to tesser...@googlegroups.com
Try the new exe6 tarball that I put up yesterday...
Ray

withbl...@gmail.com

Jul 21, 2007, 12:53:47 PM
to tesser...@googlegroups.com
Dear Mr. Ray,
Today I downloaded exe6 and ran "tesseract phototest.tif photo". The
output was fine, without any error message. I had reinstalled XP after
formatting, without installing Visual Studio 6, and it still works fine!
I am now testing the training exes using eurotext.tiff, with the help of
the training wiki. In eurotext.box, the lines
u 185 601 206 624
, 197 496 205 507
, 206 496 214 508

were corrected and saved as
ü 185 601 206 624
,, 197 496 214 508
Then I re-ran "tesseract eurotext.tiff eurotext.txt" and noticed that the
output did not change, despite the corrections to the box file shown above.
In the training wiki I could not find the next step to take after
correcting the box file. At present eurotext.box sits alongside
eurotext.tiff. How can I test whether the corrected box file is in order?
I appreciate your quick response and implementation.
With regards,
-sriranga(74yrsold)
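
The box-file correction described in this message (relabelling one character and merging two adjacent boxes into one) can be sketched in Python. The sketch assumes each box line has the form `<text> <left> <bottom> <right> <top>`, as in the lines quoted in the message; the `Box` type and the helper names are made up for illustration and are not part of Tesseract.

```python
from typing import NamedTuple


class Box(NamedTuple):
    text: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_line(line):
    """Parse one box-file line such as 'u 185 601 206 624'."""
    text, *coords = line.split()
    left, bottom, right, top = (int(c) for c in coords)
    return Box(text, left, bottom, right, top)


def merge_boxes(a, b):
    """Merge two neighbouring boxes: join the text, take the bounding union."""
    return Box(
        a.text + b.text,
        min(a.left, b.left),
        min(a.bottom, b.bottom),
        max(a.right, b.right),
        max(a.top, b.top),
    )


def format_box(box):
    """Format a box back into a box-file line."""
    return f"{box.text} {box.left} {box.bottom} {box.right} {box.top}"


if __name__ == "__main__":
    # Relabel the misrecognised 'u' as 'ü', keeping its coordinates.
    u = parse_box_line("u 185 601 206 624")
    print(format_box(u._replace(text="ü")))
    # Merge the two comma boxes into a single ',,' box.
    first = parse_box_line(", 197 496 205 507")
    second = parse_box_line(", 206 496 214 508")
    print(format_box(merge_boxes(first, second)))
```

Run as a script, this prints the same character labels and coordinates as the corrected lines quoted in the message.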

withbl...@gmail.com

Jul 21, 2007, 1:45:24 PM
to tesser...@googlegroups.com
Ray,
Subject: New Revision: 73 TrainingTesseract
"How to use the tools provided to train Tesseract for a new language"
I find it a little difficult to follow the instructions in the absence of
examples, i.e. an example input and the expected output for each command line.
Is it possible to modify the training wiki accordingly, for the benefit of newbies?
With regards,
-sriranga(74yrsold)

Keith Beaumont

Jul 23, 2007, 8:12:54 AM
to tesser...@googlegroups.com
Ray,
Previous message from sriranga(74yrsold):
"Is it possible to modify the training wiki accordingly, for the benefit of newbies?"

Speaking as a newbie, yes PLEASE. I am completely lost when reading these notes. Sorry!!

Maybe we could have an email conversation. I say "how do you ...?" and you answer "blah blah",
and we keep going till I understand (may take a while!!! I'm a bit simple!!).
Then you can publish new instructions.
I would be using Windows .exe's ONLY!!

By the way, should I be using the forum for this request?

Ray Smith

Jul 25, 2007, 5:11:34 PM
to tesser...@googlegroups.com
Sriranga, Keith,

You guys are the testers for the documentation, so I am happy to help
you and to fix the deficiencies that I learn about along the way.
You won't get much help for the next 2-3 weeks though, as I will be
traveling and will not be checking my email much. The forum is a good
place to work through this though, as anyone else having trouble can
see the answers.

For now, this may help:
When you have prepared a training.box file matching a training.tif
file, the command line is:
tesseract training.tif junk nobatch box.train
You should not expect any interesting output at all in junk.txt. The
output goes to training.tr and that SHOULD change if you change the
content of training.box.
Training.tr is then the input to mftraining and cntraining. I will add
a diagram to illustrate the data flows soon.

Regards,
Ray.
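
The training step described above can be wrapped in a short Python sketch. It only builds and runs the command line quoted in the message, then checks that the .tr file appeared. The helper names are hypothetical, and the assumption that the .tr file lands next to the image (training.tr for training.tif) is taken from the description in the message rather than verified.

```python
import subprocess
from pathlib import Path


def box_train_command(tif_path, output_base="junk"):
    """Build the command line quoted in the thread for one training image."""
    return ["tesseract", str(tif_path), output_base, "nobatch", "box.train"]


def expected_tr_file(tif_path):
    """Name of the .tr file expected for an image, e.g. training.tif -> training.tr."""
    return Path(tif_path).with_suffix(".tr")


def run_box_train(tif_path):
    """Run the training pass and return the path of the resulting .tr file."""
    box_file = Path(tif_path).with_suffix(".box")
    if not box_file.exists():
        raise FileNotFoundError(f"missing box file: {box_file}")
    subprocess.run(box_train_command(tif_path), check=True)
    tr_file = expected_tr_file(tif_path)
    if not tr_file.exists():
        raise RuntimeError(f"{tr_file} was not produced")
    return tr_file


if __name__ == "__main__":
    # Print the command only; call run_box_train() to actually execute it.
    print(" ".join(box_train_command("training.tif")))
```

Since training.tr should change whenever training.box changes, comparing the .tr file before and after an edit is a simple way to confirm that a corrected box file was picked up.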

withbl...@gmail.com

Jul 26, 2007, 3:28:09 AM
to tesser...@googlegroups.com
Ray,
Thanks for the valuable guidance. In addition to the proposed diagram illustrating the data flows, it would be helpful to add an example input and the expected output for each command line.

With Regards.
-sriranga(74yrsold)

unknowner

Jul 27, 2007, 5:12:37 AM
to tesseract-ocr
Hello,

I tried everything, but I think I missed something. Can someone give a
step-by-step example of how to train Tesseract and which files I need?
Kind regards

withbl...@gmail.com

Jul 27, 2007, 7:15:49 AM
to tesser...@googlegroups.com
Hi Ray,
I have just now downloaded tesseractOCR.pdf, which is impressive, educational, and lucidly explained, especially the part on baselines. Based on that, I now feel that I can succeed with the Kannada (kan) language, one of the Indian languages. It is complex, and no one has developed OCR for Kannada or Telugu, whereas OCR for other Indian languages such as Hindi, Tamil, Marathi, Gujarati, and Bengali is generally available.
I appreciate your PDF.
With regards,
-sriranga(74yrsold)

alexandrino

Jul 30, 2007, 5:18:02 PM
to tesseract-ocr
Ray,

The newly uploaded application works fine, but I would like you to
upload the project with the new source files too.
Thanks,

Alexandrino.

withbl...@gmail.com

Jul 31, 2007, 7:01:02 AM
to tesser...@googlegroups.com
Alexandrino,
I could understand which "new application uploaded works fine"
Will you kindly explain in detail, since I am curious to know?
Have you succeeded in training, and if so, for which language?
Regards,
-sriranga(74 yrs old)

74yrsold

Jul 31, 2007, 11:34:56 AM
to tesseract-ocr
Alexandrino,
By mistake I omitted "not"; kindly read my message as: I could not
understand which "new application uploaded works fine".
-74yrs old

Keith Beaumont

Jul 31, 2007, 2:13:08 PM
to tesser...@googlegroups.com
Where is this tesseractOCR.pdf?

unknowner

Jul 31, 2007, 2:28:40 PM
to tesseract-ocr
PDF -> http://code.google.com/p/tesseract-ocr/downloads/list

withbl...@gmail.com

Aug 1, 2007, 2:12:51 AM
to tesser...@googlegroups.com
Hi Keith,
The PDF can be downloaded from http://code.google.com/p/tesseract-ocr/downloads/list
-74yrsold