--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
3.01 will start to hit the repository later today or tomorrow (I'm not
sure yet whether it will go to a branch). It'll be quite unstable for a while
- Google have been working quite actively on Tesseract - so bear with
us while we pass changes back and forth. Having a set date helped in
getting this release out, so we'll aim for that again (though much
sooner this time).
I'll be writing another mail on this topic on the tesseract-dev list.
If you're the type who'll be following SVN, you might want to at least
lurk there.
So... to finally get some answers to some frequently asked questions:
Q. Are Google actively working on Tesseract?
A. Yes.
Q. Will Tesseract support Arabic?
A. Yes. It's not clear yet whether support will come in 3.01 or not,
but many of the mechanisms will be present.
Q. Can something be done about x-height?
A. 3.01 has a completely rewritten, data-driven x-height estimation. I
don't know yet whether or not it's an improvement, but I assume so :)
Q. When will Tesseract 3.00 be released?
A. :)
I think the biggest issue was more that Arabic uses a connected script.
Tesseract release notes Sep 30 2010 - V3.00
* Preparations for thread safety:
   * Changed TessBaseAPI methods to be non-static
   * Created a class hierarchy for the directories to hold instance data,
     and began moving code into the classes.
   * Moved thresholding code to a separate class.
* Added major new page layout analysis module.
* Added HOCR output.
* Added Leptonica as main image I/O and handling. Currently optional,
but in future releases linking with Leptonica will be mandatory.
* Ambiguity table rewritten to allow definite replacements in place
of fix_quotes.
* Added TessdataManager to combine data files into a single file.
* Some dead code deleted.
* VC++6 no longer supported. It can't cope with the use of templates.
* Many more languages added.
* Doxygenation of most of the function header comments.
As well as a number of new languages, bugfixes, and man pages.
Windows binaries will follow shortly.
Thanks Zdenko!
Thanks for this.
The configure and compile steps seem to work without warnings or
errors. I'm having problems linking, though. I've posted the build
log[1]. I'm using Ubuntu Lucid (gcc 4.4.3).
Any idea what I am doing wrong?
Regards
Jeff
That is odd. Can you post your config_auto.h?
Thanks for the quick reply
Jeff
Ok - first question: are you deliberately building without Leptonica?
At least some of those error messages are mostly my fault - I made the
assumption that anyone who would be building without Leptonica would
also be using --graphics-disabled (that said, that doesn't explain all
of the errors).
If you are, I don't know if I'll be able to get a patch ready for you
before Monday.
If you're not, you'll need libleptonica-dev
> Thanks for the quick reply
np.
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com.
To unsubscribe from this group, send email to tesseract-oc...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en.
I think the number of people who want both source and binaries is
pretty low and we can safely assume they won't have a problem
downloading both files, so I don't see any need for it.
Zdenko,
Downloaded the Windows binaries and they work fine on WinXP. Congratulations! It would have been nice if you had
included the relevant source files for the Windows platform too, such as tesseract.sln for VS2008 C++.
--
No. I was starting with my 2.04 build environment and was going to
work from there.
> If you're not, you'll need libleptonica-dev
OK. Added. Still can't link, though. I've posted the build log[1] and
config_auto.h[2].
Thanks for any help
Jeff
[1] http://pastebin.com/ajadLtED
[2] http://pastebin.com/YdzfXb71
Ok... is the debian/rules file the same as the one currently in Debian?
Yup.
Aha. Ok, my guess is that it's related to the change to use shared
libraries. I'll stick the debian stuff in a branch on github
(http://github.com/jimregan/tesseract-ocr/tree/debian-3.00) - there
are a few other differences that I can tackle a little quicker (having
made them :), and that's where 3.01 is currently living, so it will
hopefully make that switch a little easier too.
The ./debian stuff you've got there isn't from sid or squeeze - it
looks as though you've got it from lenny. I started with the ./debian
stuff from sid (squeeze is identical in this case)
Regards
Jeff
I'm attaching my current ./debian as diff.gz against the base
tesseract-3.00 package.
Regards
Jeff
Doh!
Even still, I think the shared library stuff is the problem.
> On Tue, Oct 5, 2010 at 12:36 AM, Malky <malky.ua@gmail.com> wrote:
>>
>> I've compiled tesseract (and it works) but I don't know how to use the
>> language files from here:
>> https://code.google.com/p/tesseract-ocr/downloads/list
>>
>> I've unpacked language files into /usr/local/share/tessdata/ but I get
>> the error message "Error openning data file /usr/local/share/tessdata/
>> english.traineddata" (or any other language) if I use the -l option
>> even for english. I've tried different language files and the message
>> was the same (of course, different names). If I do not choose the -l
>> option it works (as English). So how do I actually choose a language?
>>
>
> It looks like a problem with paths. Can you please post the result of
> 'which tesseract'?
The traineddata files are gzipped: did you uncompress them? (gzip -d)
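For anyone who would rather script that step, here is a minimal Python sketch of the same thing `gzip -d` does. It assumes the downloads are named like `eng.traineddata.gz` and already sit in the tessdata directory; the function name `gunzip_traineddata` is made up for illustration and is not part of Tesseract.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_traineddata(tessdata_dir):
    """Decompress every *.traineddata.gz in tessdata_dir, like `gzip -d`.

    Returns the sorted list of decompressed file paths.
    (Hypothetical helper, not part of Tesseract.)
    """
    written = []
    for gz_path in sorted(Path(tessdata_dir).glob("*.traineddata.gz")):
        out_path = gz_path.with_suffix("")  # drop the trailing ".gz"
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        gz_path.unlink()  # gzip -d also removes the compressed original
        written.append(out_path)
    return written
```

Calling `gunzip_traineddata("/usr/local/share/tessdata")` (the directory from the message above) would need write permission there, so it may have to run as root.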
Well, I won't argue that the documentation needs improvement, but in
this case, I think the documentation *does* cover it. Heck, I went the
extra mile of putting names to each of the ISO codes in the manpage.
(Oh, BTW, 'osd' stands for 'orientation and script detection' -- it's
a pseudo language pack that contains a classifier for distinctive
features of each supported script, which can be used to 1) detect the
orientation of the page, 2) detect the script (which reduces the
language identification problem). The code to use it is available in
3.01.)
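To make the naming point concrete, here is a rough Python sketch of the lookup as it appears from the error message quoted earlier: the value given to -l is used as a file name stem inside the tessdata directory. That layout is inferred from the error message, not taken from the Tesseract source, and both function names are hypothetical.

```python
from pathlib import Path

SUFFIX = ".traineddata"


def traineddata_path(tessdata_dir, lang):
    """Path where the data for `lang` is expected (assumed layout)."""
    return Path(tessdata_dir) / (lang + SUFFIX)


def installed_languages(tessdata_dir):
    """Codes usable with -l: one per *.traineddata file in tessdata_dir."""
    return sorted(p.name[: -len(SUFFIX)]
                  for p in Path(tessdata_dir).glob("*" + SUFFIX))
```

Under that assumption, `-l english` looks for `english.traineddata` and fails, while `-l eng` finds `eng.traineddata`. Listing the directory with `installed_languages` shows which codes are actually available.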
Indeed.
I have another, related question. Debian requires each shared library
to have its own package. The only exception is when the SONAME will
only ever change simultaneously for all libraries. Can you confirm
this?
Otherwise I will have to separate:
libtesseract-api3 libtesseract-ccstruct3 libtesseract-ccutil3
libtesseract-classify3 libtesseract-cutil3 libtesseract-dict3
libtesseract-image3 libtesseract-main3 libtesseract-textord3
libtesseract-training3 libtesseract-viewer3 libtesseract-wordrec3
i.e. 12 library packages (and another 12 -dev packages).
Regards
Jeff
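As an illustration of where those package names come from, here is a small Python sketch of the usual Debian convention: library name plus SONAME version, with a hyphen in between when the name already ends in a digit. The SONAMEs used in the example (such as `libtesseract_api.so.3`) are assumptions, and `debian_package_name` is a hypothetical helper, not a Debian tool.

```python
import re


def debian_package_name(soname):
    """Map a SONAME such as 'libfoo.so.3' to a Debian package name.

    Hypothetical helper; only handles SONAMEs of the form lib<name>.so.<N>.
    """
    match = re.fullmatch(r"(lib[\w.+-]+?)\.so\.(\d+)", soname)
    if match is None:
        raise ValueError("unexpected SONAME: " + soname)
    name, soversion = match.groups()
    # Package names are lowercase and cannot contain underscores.
    name = name.lower().replace("_", "-")
    # If the name already ends in a digit, separate it from the soversion.
    sep = "-" if name[-1].isdigit() else ""
    return name + sep + soversion
```

With the assumed SONAME, `debian_package_name("libtesseract_api.so.3")` gives `libtesseract-api3`, matching the list above. The sketch only covers naming; whether the SONAMEs will always change together is something only upstream can answer.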
libtesseract-api3
libtesseract-ccstruct3
libtesseract-ccutil3
libtesseract-classify3
libtesseract-cutil3
libtesseract-dict3
libtesseract-image3
libtesseract-main3
libtesseract-textord3
libtesseract-training3
libtesseract-viewer3
libtesseract-wordrec3
The exception to this rule is the case that the *sonames* for all
shared libraries in the package will only ever change simultaneously.
Can you confirm this?
Regards
Jeff
That doesn't sound right. I'll check into it at the mentor summit.
From [1] (or the libpkg-guide package on a Debian-based distro):
There are packages like libc6, which contain multiple shared libraries
in one package. This is not encouraged. [2] It becomes more complex
and more difficult to handle complex upgrade patterns.
Regards
Jeff
[1] http://www.netfort.gr.jp/~dancer/column/libpkg-guide/libpkg-guide.html#naminglibpkg
[2] This is the case unless it is confident that shared libraries will
not change interfaces independently, or compatibility issues are
carefully handled. In general, when shared libraries are split, there
is no reason upstream will keep changes to interfaces synchronised.
I discussed this with Ray and he prefers having a single library, so
I'm going to make the multiple libraries a non-default option. This'll
surface in Tesseract 3.01, but I can certainly provide you with a
patch for 3.00 when I get a chance to write it up.
That would be excellent.
Thanks
Jeff
Have you been able to get anywhere with this?
Now squeeze is out, it would be good to get v3 into sid.
Regards
Jeff
On 20 October 2010 23:15, Jimmy O'Regan <jor...@gmail.com> wrote:
> On 21 October 2010 06:29, Jeffrey Ratcliffe <jeffrey....@gmail.com> wrote:
>> Debian requires that each shared library have its own package. At the
>> moment, that would require the following extra packages:
>
> That doesn't sound right. I'll check into it at the mentor summit.
I discussed this with Ray and he prefers having a single library, so
I'm going to make the multiple libraries a non-default option. This'll
surface in Tesseract 3.01, but I can certainly provide you with a
patch for 3.00 when I get a chance to write it up.