Version 3.02 in alpha


Ray Smith

Feb 2, 2012, 1:55:57 PM
to tesser...@googlegroups.com, tesser...@googlegroups.com
Tesseract 3.02 is now available in svn for preliminary testing, currently Linux-only.

There are now 65 languages and some big improvements in layout analysis and character accuracy.
This version will, with luck, make it into Ubuntu LTS Precise Pangolin, so please test to see if your favorite issue is resolved.

Thanks and enjoy!

Ray.

zdenko podobny

Feb 3, 2012, 5:32:28 PM
to tesser...@googlegroups.com, tesser...@googlegroups.com
I just uploaded some fixes to the VC2008 build. The goal was to compile and run tesseract.exe ("tesseract.exe eurotext.tif eurotext" produced output :-) )

Please test it. Feel free to improve it.

I will continue to support the current "vs2008 structure". Once Tom finalizes his contribution [1], I will adapt it to version 3.02 and use it for the next tesseract release.

Zdenko

zdenko podobny

Feb 5, 2012, 12:00:38 PM
to tesser...@googlegroups.com, tesser...@googlegroups.com
Ray,

I got an 'Empty page!!' message from tesseract 3.02 for the attached image (created with 'convert -rotate 10 phototest.tif phototest-r.png'). Tesseract 3.01 was able to handle it [1]...

Zdenko


On Thu, Feb 2, 2012 at 7:55 PM, Ray Smith <thera...@gmail.com> wrote:
phototest-r.png

asmwarrior

Feb 8, 2012, 1:27:29 AM
to tesser...@googlegroups.com, tesser...@googlegroups.com
I'm a guy from the MinGW/MSYS world, but so far I have had no success in building tesseract under MSYS; see:
https://groups.google.com/d/topic/tesseract-ocr/7MwfC1JdXyA/discussion
I'm asking whether some developers can fix this in the next release. Thanks.

Asmwarrior
ollydbg from Codeblocks' forum

zdenko podobny

Apr 26, 2012, 4:59:35 PM
to tesser...@googlegroups.com, tesser...@googlegroups.com
On Thu, Feb 2, 2012 at 7:55 PM, Ray Smith <thera...@gmail.com> wrote:

Ray,

can you please clarify the status of the tesseract-3.02 release?

Ubuntu 12.04 LTS Precise Pangolin was released today [1] and it provides a tesseract-3.02 package. I analyzed it quickly ([2]) and it looks like 3.02 = r675 (the current revision at the moment is 724).


--
Zdenko

troplin

Apr 27, 2012, 6:07:01 AM
to tesser...@googlegroups.com, tesser...@googlegroups.com
Hello Zdenko,

I know that on Linux/Unix, it is usual to name a shared library libtesseract302.so, and have a symlink libtesseract.so -> libtesseract302.so.
On Windows, you don't have the possibility of symlinks, so usually you don't code the version into the name of the DLL. At least not minor versions.
Instead you embed the version number into the resources, and if a client wants to restrict itself to a specific version of a DLL, he can do that with an application manifest. That's the "Windows way".
The "lib" prefix is also a bit strange on Windows.
Personally I would prefer the name tesseract3.dll.

On Friday, February 3, 2012 at 23:32:28 UTC+1, Zdenko Podobný wrote:
> I just uploaded some fixes to the VC2008 build. The goal was to compile and run tesseract.exe ("tesseract.exe eurotext.tif eurotext" produced output :-) )
>
> Please test it. Feel free to improve it.
>
> I will continue to support the current "vs2008 structure". Once Tom finalizes his contribution [1], I will adapt it to version 3.02 and use it for the next tesseract release.
>
> Zdenko

Tom Powers

Apr 27, 2012, 6:22:27 AM
to tesser...@googlegroups.com
On Fri, Apr 27, 2012 at 3:07 AM, troplin <tro...@gmail.com> wrote:
> On Windows, you don't have the possibility of symlinks, so usually you don't
> code the version into the name of the DLL. At least not minor versions.
> Instead you embed the version number into the resources, and if a client
> wants to restrict itself to a specific version of a DLL, he can do that with
> an application manifest. That's the "Windows way".
> The "lib" prefix is also a bit strange on Windows.
> Personally I would prefer the name tesseract3.dll.

See [1] for my rationale behind the file names, including a discussion
of the use of property sheets to simplify use of version numbers. BTW
Windows 7 (and probably Vista) can create honest-to-god symlinks with
mklink [2], I use them all the time now.

[1] http://tesseract-ocr.googlecode.com/svn/trunk/vs2008/doc/building.html

[2] http://technet.microsoft.com/en-us/library/cc753194(v=ws.10).aspx

Tom Powers

Apr 27, 2012, 6:32:03 AM
to tesser...@googlegroups.com
On Thu, Apr 26, 2012 at 1:59 PM, zdenko podobny <zde...@gmail.com> wrote:
> Ubuntu 12.4 LTS Precise Pangolin was released today [1] and it provide
> tesseract-3.02 package. I analyzed it quickly ([2]) and it looks like 3.02
> = r675 (at the moment current revision is 724)
>
> [1] https://lwn.net/Articles/494683/
> [2] http://ch.archive.ubuntu.com/ubuntu/pool/universe/t/tesseract/tesseract_3.02.01-2.diff.gz

It's a bit disappointing that it's so out of date. But then again, I
forgot that there was a deadline, in the absence of any
reminders/warnings that it was coming up. Oh well, this doesn't
necessarily affect an "Official" Windows release since that was always
going to have to be separate. It would be nice to be in sync with the
Linux releases but I guess it's not a requirement.

-- Tom

troplin

Apr 27, 2012, 11:13:13 AM
to tesser...@googlegroups.com
Having the hard links is essentially like distributing two separate files. Nobody really cares about the disk space that is saved by using hard links.
I'm fine with that, but I don't really see any use behind it. It just complicates the build system.

Do you also provide these hard links in the Windows installer?



On Friday, April 27, 2012 at 12:22:27 UTC+2, Tom Powers wrote:

zdenko podobny

Apr 27, 2012, 12:28:49 PM
to tesser...@googlegroups.com
Well, it will not be a big issue if there is a clear statement of what to do (e.g. release r675 as 3.02 and plan a bugfix release (3.03 or 3.02.1?) within the next 1-2 months).
And of course it is important to know who will do the release ;-).

--
Zdenko

Zdenko Podobný

Apr 27, 2012, 3:53:23 PM
to tesser...@googlegroups.com
Just a remark: my message (February 3, 2012) is now outdated. The former VC++ solution was replaced by Tom Powers' new (IMHO much better) solution in revision 681. All improvements/suggestions/comments should be made against current svn.

Regarding the name: Tom suggested the naming for the VC++ build. This is not an issue for me, although I think it would be good to have the same name for multiplatform libraries... I remember a case where I needed to create a Python script that had to work on Linux and Windows and use a library via ctypes. I needed to detect the OS just because the library name differed by OS ;-)
BTW: When I tested the mingw+msys build (gcc) on Windows, it named the library "libtesseract-3.dll" automatically :-) So at the moment we have:
  • libtesseract.so.3.0.2 in case of linux (+ symlinks)
  • libtesseract-3.dll in case of mingw on Windows
  • libtesseract302.dll in case of VC++ on Windows
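
The OS detection mentioned above might look roughly like the following. This is only a minimal sketch, assuming the three file names from the list; the names `CANDIDATES`, `candidate_library_names` and `load_tesseract` are hypothetical, and the Linux symlink names are left out because they are not spelled out here.

```python
import ctypes
import platform

# File names taken from the list above. Anything not listed there
# (e.g. the exact Linux symlink names) is deliberately left out.
CANDIDATES = {
    "Linux": ["libtesseract.so.3.0.2"],
    "Windows": ["libtesseract302.dll", "libtesseract-3.dll"],
}


def candidate_library_names(system=None):
    """Return the library file names to try for the given OS name."""
    system = system or platform.system()
    return list(CANDIDATES.get(system, []))


def load_tesseract(system=None):
    """Try each candidate name in turn and return the first that loads."""
    for name in candidate_library_names(system):
        try:
            return ctypes.CDLL(name)
        except OSError:
            continue  # not installed under this name, try the next one
    raise OSError("no tesseract shared library found")
```

With a single name shared by all platforms, the per-OS table would collapse to one entry.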

Regarding the version in the name: based on my tests, I would suggest not using the same library name for different versions. I thought you had the same opinion [1]...
But I miss more version info in the language data files (you cannot use 3.02 files in 3.01, but you can use 3.01 files in 3.02, so you can only pray that TESSDATA_PREFIX does not point to a newer version...)

[1] http://code.google.com/p/tesseract-ocr/issues/detail?id=362#c84

Zdenko


On 27.04.2012 12:07, troplin wrote:
> Hello Zdenko,
>
> I know that on Linux/Unix, it is usual to name a shared library libtesseract302.so, and have a symlink libtesseract.so -> libtesseract302.so.
> On Windows, you don't have the possibility of symlinks, so usually you don't code the version into the name of the DLL. At least not minor versions.
> Instead you embed the version number into the resources, and if a client wants to restrict itself to a specific version of a DLL, he can do that with an application manifest. That's the "Windows way".
> The "lib" prefix is also a bit strange on Windows.
> Personally I would prefer the name tesseract3.dll.
>
> On Friday, February 3, 2012 at 23:32:28 UTC+1, Zdenko Podobný wrote:
>> I just uploaded some fixes to the VC2008 build. The goal was to compile and run tesseract.exe ("tesseract.exe eurotext.tif eurotext" produced output :-) )
>>
>> Please test it. Feel free to improve it.
>>
>> I will continue to support the current "vs2008 structure". Once Tom finalizes his contribution [1], I will adapt it to version 3.02 and use it for the next tesseract release.
>>
>> Zdenko
>>
>> [1] https://groups.google.com/group/tesseract-dev/browse_thread/thread/75be5c97eb4d1b3c
>>
>> On Thu, Feb 2, 2012 at 7:55 PM, Ray Smith <theray...@gmail.com> wrote:

Tom Powers

Apr 27, 2012, 4:10:26 PM
to tesser...@googlegroups.com
On Fri, Apr 27, 2012 at 8:13 AM, troplin <tro...@gmail.com> wrote:
> Having the hard links is essentially like distributing two separate files.
> Nobody really cares about the disk space that is saved by using hard links.
>>
>> On Fri, Apr 27, 2012 at 3:07 AM, troplin <trop...@gmail.com> wrote:
>> > On Windows, you don't have the possibility of symlinks, so usually you
>> > don't
>> > code the version into the name of the DLL. At least not minor versions.
>> > Instead you embed the version number into the resources, and if a client
>> > wants to restrict itself to a specific version of a DLL, he can do that
>> > with
>> > an application manifest. That's the "Windows way".
>> > The "lib" prefix is also a bit strange on Windows.
>> > Personally I would prefer the name tesseract3.dll.
>>
>> See [1] for my rationale behind the file names, including a discussion
>> of the use of property sheets to simplify use of version numbers. BTW
>> Windows 7 (and probably Vista) can create honest-to-god symlinks with
>> mklink [2], I use them all the time now.
>>
>> [1] http://tesseract-ocr.googlecode.com/svn/trunk/vs2008/doc/building.html
>>
>> [2] http://technet.microsoft.com/en-us/library/cc753194(v=ws.10).aspx

As I said before, Windows 7 & Vista symlinks are real symlinks *not*
"hardlinks".

And just as a reminder, according to [1], Windows XP End of Extended
Support is less than 2 years away now (April 8, 2014). Stories like
[2] are starting to show that Windows 7 may soon be more popular than
XP (finally).

[1] http://windows.microsoft.com/en-us/windows/products/lifecycle

[2] http://www.neowin.net/news/windows-xp-market-share-goes-down-in-february-2012

          -- Tom

troplin

Apr 30, 2012, 4:01:51 AM
to tesser...@googlegroups.com, zde...@gmail.com
On Friday, April 27, 2012 at 21:53:23 UTC+2, Zdenko Podobný wrote:
> Just a remark: my message (February 3, 2012) is now outdated. The former VC++ solution was replaced by Tom Powers' new (IMHO much better) solution in revision 681. All improvements/suggestions/comments should be made against current svn.
>
> Regarding the name: Tom suggested the naming for the VC++ build. This is not an issue for me, although I think it would be good to have the same name for multiplatform libraries... I remember a case where I needed to create a Python script that had to work on Linux and Windows and use a library via ctypes. I needed to detect the OS just because the library name differed by OS ;-)
> BTW: When I tested the mingw+msys build (gcc) on Windows, it named the library "libtesseract-3.dll" automatically :-) So at the moment we have:
>   • libtesseract.so.3.0.2 in case of linux (+ symlinks)
>   • libtesseract-3.dll in case of mingw on Windows
>   • libtesseract302.dll in case of VC++ on Windows
>
> Regarding the version in the name: based on my tests, I would suggest not using the same library name for different versions. I thought you had the same opinion [1]...
> But I miss more version info in the language data files (you cannot use 3.02 files in 3.01, but you can use 3.01 files in 3.02, so you can only pray that TESSDATA_PREFIX does not point to a newer version...)
>
> [1] http://code.google.com/p/tesseract-ocr/issues/detail?id=362#c84

Actually, it is the other way round. I would suggest a more consistent versioning scheme:
X.Y.Z where
- Changes in Z mean: no API/ABI changes, only internal changes.
- Changes in Y mean: the existing API/ABI does not break; no backwards-incompatible changes, but extensions (e.g. new functions) are allowed.
- Changes in X mean: potential API/ABI breakage.
But maybe the project is at too early a stage for this. It needs quite a bit of planning and discipline.

In the plugin that I am developing I use dynamic loading via dlopen()/LoadLibrary(), so I can handle multiple versions of the same DLL.
If there are no backwards-incompatible changes, new versions are automatically supported. However if the signature of some function changes, there will be problems and I have to change the plugin.

With the versioning scheme from above and a DLL name like libtesseract3.dll, I would be on the safe side. If the name changes with every version, I have to update the plugin with every update of tesseract, even if there are no relevant changes.
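
To illustrate why such a scheme helps a client like the plugin, here is a minimal sketch of a compatibility check. It assumes the first component signals potential API/ABI breakage, the second signals extensions only, and the third signals purely internal changes. The function names `parse_version` and `is_compatible` are hypothetical and not part of tesseract.

```python
def parse_version(text):
    """Split an "X.Y.Z" string into a tuple of three integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def is_compatible(built_against, installed):
    """Return True if a client built against one version can use another.

    Assumed rules: the first component must match exactly (it may break
    the API/ABI), the installed second component must not be older (it
    only adds things), and the third component is ignored (internal only).
    """
    built = parse_version(built_against)
    found = parse_version(installed)
    if found[0] != built[0]:
        return False
    return found[1] >= built[1]


print(is_compatible("3.0.2", "3.0.5"))  # third component only: compatible
print(is_compatible("3.1.0", "3.0.9"))  # installed library lacks extensions
print(is_compatible("3.0.2", "4.0.0"))  # first component changed
```

Under these rules only the first component would need to appear in the DLL name, which is what makes a name like libtesseract3.dll safe for the plugin.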

troplin

Apr 30, 2012, 4:06:02 AM
to tesser...@googlegroups.com


On Friday, April 27, 2012 at 22:10:26 UTC+2, Tom Powers wrote:
True, but our company theoretically still supports Windows 2000 ;-)
In my experience, Windows XP and Windows Server 2003 are still in heavy use.

troplin

May 4, 2012, 3:52:05 AM
to tesser...@googlegroups.com
Oh, and in the resources of the DLL the version is specified as 3.2.0.0. Wouldn't 3.0.2.0 be more correct?
And maybe it would also be good to encode the SVN revision into the last part of the version number, like 3.0.2.765.
Sorry if I'm being pedantic now, but I just want to make sure that it is possible to install the DLLs properly (without version conflicts) as "Side-by-Side Assemblies" (i.e. in WinSxS).
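
As a rough sketch of what I mean, a build script could assemble the four-part string like this. The helper name `resource_version` is hypothetical and not something the current build provides. The range check is there because each of the four fields in a Windows version resource is a 16-bit value.

```python
def resource_version(major, minor, patch, revision):
    """Build a four-part version string such as "3.0.2.765".

    Each field in a Windows version resource is 16 bits wide, so every
    part (including the SVN revision) has to fit into 0..65535.
    """
    parts = (major, minor, patch, revision)
    for part in parts:
        if not 0 <= part <= 0xFFFF:
            raise ValueError("version part out of 16-bit range: %r" % (part,))
    return ".".join(str(part) for part in parts)


print(resource_version(3, 0, 2, 765))
```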


On Friday, April 27, 2012 at 12:22:27 UTC+2, Tom Powers wrote: