3.01 code


Jimmy O'Regan

Oct 1, 2010, 5:26:21 PM
to tesser...@googlegroups.com
I've put the code of 3.01 on GitHub -
http://github.com/jimregan/tesseract-ocr. I'd intended to push the
merge into SVN today, but stupidly used the http address when building
the git repository instead of the https address, so I can't push back
directly without rewriting the references. That might be for the best
though, as I think it might be worth leaving 3.00 as is for a week or
two, before pushing out the update.

There might still be a few glitches in the build system.

--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

zdenko podobny

Oct 17, 2010, 8:20:11 AM
to tesser...@googlegroups.com
On Fri, Oct 1, 2010 at 11:26 PM, Jimmy O'Regan <jor...@gmail.com> wrote:
> I've put the code of 3.01 on GitHub -
> http://github.com/jimregan/tesseract-ocr. I'd intended to push the
> merge into SVN today, but stupidly used the http address when building
> the git repository instead of the https address, so I can't push back
> directly without rewriting the references. That might be for the best
> though, as I think it might be worth leaving 3.00 as is for a week or
> two, before pushing out the update.
>
> There might still be a few glitches in the build system.


I tried to compile the code on Windows with VC++2008. I solved several issues, but I got stuck on this:
ocrclass.h(30) : fatal error C1083: Cannot open include file: 'sys/time.h': No such file or directory
When I comment out that include, I get another problem:

..\ccutil\ocrclass.h(333) : error C3861: 'gettimeofday': identifier not found

I found on the internet that gettimeofday() is a Linux function that is not supported in VC++2008.
Is it possible to replace it with something else supported by VC++2008? Or is there another solution?

BR,

Zd.

Jimmy O'Regan

Oct 17, 2010, 8:53:00 AM
to tesser...@googlegroups.com

gettimeofday is a simple enough function - if there isn't an existing
Windows implementation around, it'll be easy enough to implement.
Makes me wonder /why/ anything in Tesseract would need to know the
time though - it's probably something that gives a slightly nicer idea
of how long processing took, but that isn't strictly necessary.

Ray Smith

Oct 18, 2010, 12:49:37 AM
to tesser...@googlegroups.com
Oops. That was changed so that the timeout is measured in real time instead of CPU time. (It used to use clock().)
Maybe fix it with an #ifdef that uses clock() on Windows if there is no equivalent to gettimeofday.
Ray.
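
A minimal sketch of the kind of #ifdef discussed here, for illustration only. It is not the code that went into Tesseract. Instead of falling back to clock(), it swaps in the Win32 call GetSystemTimeAsFileTime so that the timeout stays in real time on Windows too. The names portable_timeval, portable_gettimeofday and elapsed_ms are hypothetical and do not appear in the Tesseract source.

```cpp
// Sketch only: portable_timeval, portable_gettimeofday() and elapsed_ms()
// are hypothetical names, not part of the Tesseract source.
#include <cstddef>

#ifdef _WIN32
#include <windows.h>
// Windows has no <sys/time.h>, so declare a minimal stand-in structure.
struct portable_timeval {
  long tv_sec;   // whole seconds since the Unix epoch
  long tv_usec;  // remaining microseconds
};
#else
#include <sys/time.h>
typedef struct timeval portable_timeval;
#endif

// Fills *tv with the current wall-clock time. Returns 0 on success.
inline int portable_gettimeofday(portable_timeval* tv) {
#ifdef _WIN32
  // FILETIME counts 100 ns ticks since 1601-01-01 (UTC).
  FILETIME ft;
  GetSystemTimeAsFileTime(&ft);
  ULARGE_INTEGER ticks;
  ticks.LowPart = ft.dwLowDateTime;
  ticks.HighPart = ft.dwHighDateTime;
  // Number of 100 ns ticks between 1601-01-01 and 1970-01-01.
  const ULONGLONG kEpochOffset = 116444736000000000ULL;
  const ULONGLONG usec = (ticks.QuadPart - kEpochOffset) / 10;
  tv->tv_sec = static_cast<long>(usec / 1000000);
  tv->tv_usec = static_cast<long>(usec % 1000000);
  return 0;
#else
  return gettimeofday(tv, NULL);
#endif
}

// Real (wall-clock) time between two samples, in milliseconds.
inline long elapsed_ms(const portable_timeval& start,
                       const portable_timeval& end) {
  return static_cast<long>((end.tv_sec - start.tv_sec) * 1000L +
                           (end.tv_usec - start.tv_usec) / 1000L);
}
```

A timeout check would take one sample before processing starts and compare elapsed_ms() against the limit at each checkpoint. Unlike clock(), this also counts time the process spends waiting, which is the difference between real time and CPU time.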

AndrewC

Nov 9, 2010, 10:05:37 AM
to tesseract-dev
FYI: I downloaded the 3.01 code from the Git repository and have managed
to update the VS2008 project so that most of the projects compile
successfully with VS2008. The only two that do not compile are TESSDLL and
DLLTEST, due to the recent changes, I suspect. I will look into this next.
TESSERACT.EXE runs correctly and produces valid OCR output.

The only source code changes were:

ccutil/strngs.cpp - moved #include "assert.h" 5 lines down, below the other header files.
ccutil/ocrclass.h - gettimeofday() and header file changes. I added a gettimeofday() function from one I found at microsoft.com.
wordrec\lang_model.cpp - line 1721 changed to: priority /= sqrt((float)parent_vse->length); // Added (float) cast.
vs2008\tessdll.cpp - removed #include "applybox.h" as the file does not exist.
vs2008\tessdll.cpp - removed #include "varabled.h" as the file does not exist.
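
For readers wondering about the (float) cast in the list above: the likely reason, assuming length is an integer member, is that the C++ <cmath> shipped with VC++2008 provides float, double and long double overloads of sqrt but no integer one, so an integer argument is rejected as an ambiguous call. A minimal sketch, in which ExampleEntry and normalize_priority are hypothetical stand-ins and not names from the Tesseract source:

```cpp
#include <cmath>

// Hypothetical stand-in for the state entry; only the integer length matters.
struct ExampleEntry {
  int length;
};

// Divides priority by the square root of the parent's length.
inline float normalize_priority(float priority, const ExampleEntry* parent) {
  // The explicit cast selects the float overload of sqrt, so the compiler
  // does not have to choose between the float, double and long double ones.
  priority /= std::sqrt(static_cast<float>(parent->length));
  return priority;
}
```

The cast does not change the result on compilers that already accept an integer argument; it only makes the choice of overload explicit.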

Most changes were in the project settings: adding paths and file
references, deleting redundant files and other related issues. I am happy
to pass on the project files if this helps anyone.

Andrew.

On Oct 18, 3:49 pm, Ray Smith <theraysm...@gmail.com> wrote:
> Oops. That was changed to make the timeout in real time instead of CPU time.
> (It used to use clock()).
> Maybe fix it with an ifdef to use clock with windows if there is no
> equivalent to gettimeofday.
> Ray.

zdenko podobny

Nov 9, 2010, 1:20:08 PM
to tesser...@googlegroups.com
You could save a lot of time if you:
a) read the whole thread you are replying to
b) ask whether somebody has already done it before you (you would know if you did a) :-)

Some changes from 3.00 svn are not in 3.01 (including changes relevant to the Windows build), so it does not make sense to publish adapted project files for 3.01. Anyway, the most important problem is the use of the Linux system function gettimeofday()...

BR,

Zd.

Jimmy O'Regan

Nov 9, 2010, 2:45:44 PM
to tesser...@googlegroups.com
On 9 November 2010 18:20, zdenko podobny <zde...@gmail.com> wrote:
> You could save a lot of time if you:
> a) read whole thread you reply to
> b) ask if somebody did not do it before you (you could recognize it if you
> do a) :-)
>
> Some changes from 3.00 svn are not 3.01 (including changes relevant to
> windows build) so it does not make sense to publish adapted project files
> for 3.01. Anyway most important problem is usage of linux system
> function gettimeofday()...

http://doxygen.postgresql.org/gettimeofday_8c-source.html

zdenko podobny

Nov 28, 2010, 7:21:12 AM
to tesser...@googlegroups.com, tesser...@googlegroups.com
Just a notice, in case somebody has not noticed it yet:

In svn (http://code.google.com/p/tesseract-ocr/source/checkout, revision 527) there is 3.01 code that was built successfully on Linux (Mandriva Linux Cooker 64-bit) and Windows (XP SP3, VC++2008 Express). Additional 3.01 code is expected from Ray in the (near) future.

So please try it on other platforms/systems and report problems or submit patches as issues at http://code.google.com/p/tesseract-ocr/issues/list.

If you are willing to create a C wrapper (see http://code.google.com/p/tesseract-ocr/issues/detail?id=386, http://code.google.com/p/tesseract-ocr/issues/detail?id=362 and http://groups.google.com/group/tesseract-dev/browse_thread/thread/a348e5a6dbade5d7), this could be a good time for a first version ;-) so it can become part of the 3.01 final code.

Zd.

Ray Smith

Nov 29, 2010, 8:37:53 PM
to tesser...@googlegroups.com, tesser...@googlegroups.com
My merge with the latest Google code is now complete and committed.

The svn autotools are currently horribly broken for me (using make dist and then trying to build from the tar.gz distribution). I had to make the following patches in order for it to build, but when it did, it worked:

* make whines about missing .Plo files and missing .Po files in .libs/*. I had to copy them from my earlier version of 3.01, in which they were all created by make. I suspect this is a problem with the gettext system. I built my makefiles with no options to runautoconf and configure on Linux Lucid.

* libtool does not exist in the default distribution. I copied that from my earlier version of 3.01.

This version will not compile with any known version of leptonica! Only 1.67 and above are compatible at the source level, but the distribution of 1.67 builds a .so.0 which tesseract fails to find, even after removing the apt-get version of 1.64. leptonica 1.68 will be out soon to fix this problem, but in the meantime, I am uploading a .deb package of liblept 1.67 that outputs .so.1. To fix this temporarily, there are a couple of Debian packages in a debian directory that can be used to build on 64-bit Linux systems. I may fix this better tomorrow by removing the dependency on the function that needs 1.67. This probably also breaks the Windows build.

Ray.

zdenko podobny

Nov 30, 2010, 7:52:57 AM
to tesser...@googlegroups.com, tesser...@googlegroups.com
The Windows build should be fixed in r543. I did not notice any problem regarding leptonica (but I only ran OCR on a few of my test images) ;-)

Zd.

zdenko podobny

Dec 1, 2010, 5:17:42 PM
to tesser...@googlegroups.com
Ray,

will you also post information regarding the (changed/new) training process?
So far I have found the following:
- There are new files (eng.*cube*, ara.*cube*). How are they created, and how are they used during OCR?
- 'tesseract eurotext.tif eurotext nobatch box.train' creates eurotext.tr and eurotext.txt. What is in eurotext.txt?
- 'unicharset_extractor eurotext.box' creates a unicharset in a new format (can it be explained?), but the script type is always NULL (eng.unicharset from eng.traineddata uses 'Latin' and 'Common' there). Is it a bug, or is there a reason for it?
- 'mftraining' does not work for me as it did in the past. Can you provide an example of how to use it?
- Information is (still) missing about what should be in punc-dawg, number-dawg...
- When I quickly analyzed the content of eng.traineddata, it seemed to me that the format of some files (e.g. eng.unicharset) has changed. Is there a way to determine the supported version of tesseract from the language data? Or the version of the language data?

Zd.



zdenko podobny

Dec 4, 2010, 9:20:10 AM
to John Coppens, tesser...@googlegroups.com
For 1) - can you please create an issue (http://code.google.com/p/tesseract-ocr/issues/list) with an example image file and the config file you used?

Zd.

On Sat, Dec 4, 2010 at 2:06 PM, John Coppens <john.c...@gmail.com> wrote:
On Sat, 04 Dec 2010 09:09:33 +0100

I am using the latest SVN, version 3.01, on a Linux 2.6.29 kernel, AMD-64
processor. Here are the issues I mentioned, the most important one
first:

1) When OCRing in mode 3 (with block recognition) I get rubbish in the
text at seemingly random places, but mostly when switching columns.
Lines appear like:

<><><><><><><>
or
-,9-,9-,9-,9-,9
or
ííííííííííííííí

The original is perfectly clean, as it was generated from a PDF, at 300
dpi (also tried at 450 dpi).

These problems do not appear when using Pageseg_mode 6 (no block
detection). There everything is clean.

2) It seems that the  -psm <n>  command line option doesn't work. I
could only get different modes to work using a configuration file.

3) Though I specified -l spa (and have the data installed), I still get
recognition of certain non-Spanish characters, such as the cent (c +
|) sign. I believe this has been reported before.



Greetings,
John

Ray Smith

Dec 8, 2010, 8:52:28 PM
to tesser...@googlegroups.com, John Coppens
This problem is now resolved with revision 548.