Re: [tesseract-dev] Re: Plans for 3.04 release


Jeff Breidenbach

Jul 14, 2015, 7:23:39 PM
to tesser...@googlegroups.com
Okay, things are looking better and better with the Debian packaging
as I iterate. I have a few questions on some advanced aspects.

Let's start with OpenCL. First, configure doesn't seem to know that
--enable-opencl requires the libtiff and OpenCL development libraries. Without
them, compilation fails. No problem for me, just mentioning it. More importantly,
what's the deal with OpenCL? Should I enable it or not? Are there any
negative consequences?

Next, I am investigating the Noto font situation. Still in progress, but I suspect
that Tesseract is using non-canonical font names. It is going to take a ton
of manual effort to figure this out completely. Stay tuned, no action required
now.

Third, what's up with doxygen? The tarball on GitHub doesn't include the doxygen
output, so I am going to build and ship it inside the package. I can't find any
doxygen documentation for Tesseract directly viewable on the web.

Finally, Shree was asking about other platforms. I've been asked to keep
a somewhat helpful eye on Android by a vision impaired user. Is there 
anyone in this group that ports Tesseract to Android? If so, are things
in good shape?

Oh, one more thing. I used to have edit rights to the old wiki, where I
tweaked documentation, especially for PDF output. Is that something
that I can get again? What do I need to do?

Thanks everyone.

Jeff

Jim O'Regan

Jul 14, 2015, 8:08:53 PM
to tesser...@googlegroups.com
On 15 July 2015 at 00:23, Jeff Breidenbach <breid...@gmail.com> wrote:
> Third, what's up with doxygen? The tarball on github doesn't have doxygen
> so I am going to build and ship it inside the package. i can't find any
> doxygen documentation for Tesseract directly viewable on the web.
>

It's here (for now): http://tesseract-ocr.github.io/

> Oh, one more thing. I used to have edit rights to the old wiki, where I
> tweaked documentation especially for PDF output. is that something
> that I can get again? What do i need to do?

The wiki is here now: https://github.com/tesseract-ocr/tesseract/wiki
As long as you're logged into GitHub, you should be able to write to
it (wikis are publicly writable by default).

If you'd prefer not to use Github directly, you can clone the wiki's repository:
git clone https://github.com/tesseract-ocr/tesseract.wiki.git
and send the change as a patch.
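
For anyone taking the patch route, here is a minimal sketch of turning a wiki edit into a patch that carries its own log message. It runs against a throwaway local repository standing in for the real clone, so the page name FAQ.md, the author identity and the commit messages are all made up for illustration.

```shell
#!/bin/sh
set -eu

scratch=$(mktemp -d)
repo="$scratch/tesseract.wiki"

# Stand-in for: git clone https://github.com/tesseract-ocr/tesseract.wiki.git
git init -q "$repo"
git -C "$repo" config user.name "Example Author"       # hypothetical identity
git -C "$repo" config user.email "author@example.com"  # hypothetical identity
git -C "$repo" config commit.gpgsign false

# Hypothetical page standing in for the existing wiki content.
printf '# FAQ\n' > "$repo/FAQ.md"
git -C "$repo" add FAQ.md
git -C "$repo" commit -q -m "Initial page"

# The edit to contribute, committed with a descriptive log message.
printf 'Q: How do I produce PDF output?\n' >> "$repo/FAQ.md"
git -C "$repo" commit -q -a -m "FAQ: add PDF output question"

# Export the last commit as a patch file that can be mailed and applied with git am.
git -C "$repo" format-patch -1 -o "$scratch/patches"
```

With a real clone, only the edit, the commit and the format-patch step are needed; the resulting 0001-*.patch file is what gets sent.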

--
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you

ShreeDevi Kumar

Jul 14, 2015, 11:22:28 PM
to tesser...@googlegroups.com

The wiki still links to the old downloads page for traineddata:

>>>> Other Languages

Tesseract has been trained for many languages, check for your language on the Downloads page. 

Zdenko, please tag the traineddata the same way you tagged tesseract to create the tarball (as was done for 3.02). It will be especially helpful for languages that need multiple files, e.g. eng, ara, hin, etc.

Thanks!

- sent from my phone. excuse the brevity.

--
You received this message because you are subscribed to the Google Groups "tesseract-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-de...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-dev.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-dev/CAHh9-xuYjbF5Ezj0%3DJVXgxLi1XwiHZNrrf-WRJPZQrjXWbHCwQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Ray Smith

Jul 15, 2015, 1:12:34 AM
to tesser...@googlegroups.com
On Tue, Jul 14, 2015 at 4:23 PM, Jeff Breidenbach <breid...@gmail.com> wrote:
> Okay, things are looking better and better with the Debian packaging
> as I iterate. I have a few questions on some advanced aspects.
>
> Let's start with opencl. First, configure doesn't seem to know that
> --enable-opencl requires libtiff and opencl development libraries. Without
> them compile fails. No problem for me, just mentioning it. More importantly,
> what's the deal with OpenCL? Should I enable it or not? Are there any
> negative consequences?

The plus side of OpenCL is a reasonable speed-up, but there is a downside: a slight loss of accuracy due to having to process in a slightly different way.

> Next, I am investigating the Noto font situation. Still in progress, but suspect
> that Tesseract is using non-canonical fontnames. It is going to take a ton
> of manual effort to figure this out completely. Stay tuned, no action required
> now.
>
> Third, what's up with doxygen? The tarball on github doesn't have doxygen
> so I am going to build and ship it inside the package. i can't find any
> doxygen documentation for Tesseract directly viewable on the web.
>
> Finally, Shree was asking about other platforms. I've been asked to keep
> a somewhat helpful eye on Android by a vision impaired user. Is there
> anyone in this group that ports Tesseract to Android? If so, are things
> in good shape?

There have been several ports to Android, and I believe 3.04 now contains sufficient ifdefs to work.
I don't think the person who did it is around any longer to maintain it, though.

> Oh, one more thing. I used to have edit rights to the old wiki, where I
> tweaked documentation especially for PDF output. is that something
> that I can get again? What do i need to do?
>
> Thanks everyone.
>
> Jeff


Jeff Breidenbach

Jul 15, 2015, 3:00:24 PM
to tesser...@googlegroups.com
Unfortunately, while 3.04.00 OpenCL compiles fine I am hitting problems at runtime.

#  tesseract testing/phototest.tif -
[DS] Profile file not available (tesseract_opencl_profile_devices.dat); performing profiling.
Segmentation fault (core dumped)




Jeff Breidenbach

Jul 15, 2015, 3:03:29 PM
to tesser...@googlegroups.com
(Configure flags for reference)

./configure --host=x86_64-linux-gnu --build=x86_64-linux-gnu --prefix=/usr --mandir=\${prefix}/share/man --infodir=\${prefix}/share/info CFLAGS="-g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wall -g -fPIC -DTESSDATA_PREFIX=/usr/share/tesseract-ocr/" CXXFLAGS="-g -O2 -fstack-protector-strong -Wformat -Wall -g -fPIC -DTESSDATA_PREFIX=/usr/share/tesseract-ocr/" LDFLAGS="-Wl,-z,defs -Wl,-z,relro,-lOpenCL,-ltiff" --enable-opencl

Jeff Breidenbach

Jul 17, 2015, 4:52:12 PM
to tesser...@googlegroups.com
Okay, we are starting to hit problems.

PROBLEM #1: GREEK

I tried to ship Ancient Greek from Nick White (http://ancientgreekocr.org). Their 
version number is 2.0. But we've already got a 'grc' marked 3.02.02 in Debian. 
You can't go backwards in version numbers.

I presume the root cause here was a rename from 'grc' to 'ell' at some point 
for Google supported languages. I think I have no choice but to call Nick's
work 3.04.00. (I'll double check with some experts). I have had no response
from emailing Nick so far.

PROBLEM #2: PDF

I was looking at a PDF problem report and noticed that Tesseract PDF output
is no longer validating. (It fails qpdf --check). As the author of the pdf module,
I'm biased, but producing corrupt data is a disaster and I think we need to cut
a new release once it is figured out. Most PDF viewers will recover and silently 
ignore, but this is no good at all. I wonder what happened.

PROBLEM #3: PACKAGING MISTAKE

I made a mistake during packaging resulting in a critical bug. This is totally
me, should be fast and easy to fix, nothing needed from upstream a.k.a.
the folks here. Just mentioning for completeness.

 

Jeff Breidenbach

Jul 17, 2015, 6:47:44 PM
to tesser...@googlegroups.com
Okay, everything is more or less under control except for this:

  tesseract phototest.tif - pdf > phototest.pdf

This activates both the text renderer and the pdf renderer.
Both write to stdout, where their output mixes together and causes chaos.
The same thing happens with this command.

   tesseract phototest.tif stdout pdf > phototest.pdf

What's happening is tesseractmain.cpp is setting tessedit_create_pdf without
disabling tessedit_create_txt. I'm not sure how we want to handle this, but
the current situation is no good.

PS. I'm punting on OpenCL for now because I just can't get it to work under
any circumstances.

Jeff Breidenbach

Jul 18, 2015, 1:04:20 AM
to tesser...@googlegroups.com
Well, this is easy. Please apply the following patch to fix. Thanks!


diff --git a/tessdata/configs/pdf b/tessdata/configs/pdf
index 0d5f0f1..cc75e69 100644
--- a/tessdata/configs/pdf
+++ b/tessdata/configs/pdf
@@ -1,2 +1,3 @@
+tessedit_create_txt 0
 tessedit_create_pdf 1
 tessedit_pageseg_mode 1


Tom Morris

Jul 18, 2015, 1:41:50 AM
to tesser...@googlegroups.com
If you create a Github pull request with the contents of the patch, it'll be super easy for a committer to merge.

Tom

Jeff Breidenbach

Jul 18, 2015, 2:07:19 AM
to tesser...@googlegroups.com
And the same problem affects most of the other configs...
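
One rough way to see which configs are affected is to look for files that switch on a tessedit_create_* renderer without mentioning tessedit_create_txt at all. The sketch below is only a heuristic under that assumption; the function name list_unguarded_configs and the sample files are hypothetical, and the sample contents are not copied from any release.

```shell
#!/bin/sh
set -eu

# Print the names of config files in directory $1 that switch on some
# tessedit_create_* renderer but never mention tessedit_create_txt.
list_unguarded_configs() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        if grep -Eq '^tessedit_create_[a-z_]+[[:space:]]+1' "$f" &&
           ! grep -Eq '^tessedit_create_txt[[:space:]]' "$f"; then
            basename "$f"
        fi
    done
}

# Demo on throwaway sample files (illustrative contents only).
demo=$(mktemp -d)
printf 'tessedit_create_txt 0\ntessedit_create_pdf 1\ntessedit_pageseg_mode 1\n' > "$demo/pdf"
printf 'tessedit_create_hocr 1\n' > "$demo/hocr"
printf 'tessedit_pageseg_mode 1\n' > "$demo/plain"

# Only the sample "hocr" file enables a renderer without touching txt.
unguarded=$(list_unguarded_configs "$demo")
echo "$unguarded"
```

To check a real installation, pass its configs directory instead of the demo directory; the exact location depends on how Tesseract was packaged.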

Jim O'Regan

Jul 18, 2015, 4:01:36 AM
to tesser...@googlegroups.com
On 18 July 2015 at 07:07, Jeff Breidenbach <breid...@gmail.com> wrote:
> And the same problem affects most of the other configs...
>

I've created an issue: https://github.com/tesseract-ocr/tesseract/issues/49
and fixed it for hOCR.

Jim O'Regan

Jul 18, 2015, 4:04:05 AM
to tesser...@googlegroups.com
On 18 July 2015 at 06:41, Tom Morris <tfmo...@gmail.com> wrote:
> If you create a Github pull request with the contents of the patch, it'll be
> super easy for a committer to merge.

git am is super easy too, but if I already have to delete a bunch of
headers and edit the subject, I'm going to go for maximum context
(like this: https://github.com/tesseract-ocr/tesseract/commit/fd429c32a0795552ec8423eb6adf55a69e02e3e6).

So if you don't want that, include a log message somewhere.

Nick White

Jul 20, 2015, 7:51:16 AM
to tesser...@googlegroups.com
Hi Jeff,

Sorry, I've been pretty absent on this list lately.

On Tue, Jul 14, 2015 at 04:23:38PM -0700, Jeff Breidenbach wrote:
> Finally, Shree was asking about other platforms. I've been asked to keep
> a somewhat helpful eye on Android by a vision impaired user. Is there
> anyone in this group that ports Tesseract to Android? If so, are things
> in good shape?

TextFairy is a free software android app which uses Tesseract, and
from what I've heard is very nice. I don't know if it supports
vision impairment usecases well, but at least when he was first
announcing it the developer was very responsive to feedback.
https://play.google.com/store/apps/details?id=com.renard.ocr

On Fri, Jul 17, 2015 at 01:52:12PM -0700, Jeff Breidenbach wrote:
> I tried to ship Ancient Greek from Nick White (http://ancientgreekocr.org).
> Their
> version number is 2.0. But we've already got a 'grc' marked 3.02.02 in Debian.
> You can't go backwards in version numbers.

The version numbers I use on the Ancient Greek training website
aren't tied to Tesseract version numbers at all. They are just based
on when I made significant enough changes for a new 'release'. The
hope was that each one would go into Tesseract soonish after I'd
made changes, but that hasn't really happened.

> I presume the root cause here was a rename from 'grc' to 'ell' at some point
> for Google supported languages. I think I have no choice but to call Nick's
> work 3.04.00. (I'll double check with some experts). I have had no response
> from emailing Nick so far.

Sorry again for not getting back to you on this sooner. 'ell' and
'grc' are different; Ancient Greek has many more diacritics than
modern Greek, and a different dictionary.

The 'grc' in Tesseract's 'tessdata' repository is probably from an
older version of my training, unless someone took it from the issue
I made and forgot to reply to it:
https://code.google.com/p/tesseract-ocr/issues/detail?id=1145

Is Google Code's issue tracker not used anymore? Should I submit a
patch to the repo with the newest grc.traineddata? The easiest thing
for Jeff and everybody is presumably for the newest version to be in
Tesseract ready for the new release. Sorry, I haven't been keeping
up with Tesseract development for a while, so am rusty on where
things are and how things are done now.

Nick

zdenko podobny

Jul 20, 2015, 8:52:24 AM
to tesser...@googlegroups.com
IMO there should be a split between Google-created/supported traineddata and community traineddata,
e.g. *_frak.traineddata and grc.traineddata should be removed from the tessdata repository.
There are also other community trainings [1], but they vary in quality and support.





Zdenko 

Nick White

Jul 20, 2015, 1:58:05 PM
to tesser...@googlegroups.com
On Mon, Jul 20, 2015 at 02:51:53PM +0200, zdenko podobny wrote:
> IMO there should be split between google created/supported traineddata and
> community traineddata
> e.g. *_frak.traineddata and grctraineddata. should be removed from tessdata
> repository.

FWIW I don't have strong feelings either way, as long as my grc
training is packaged for Debian (thanks again, Jeff!)

Nick

Jeff Breidenbach

Jul 20, 2015, 3:22:38 PM
to tesser...@googlegroups.com
Using some black magic, I was able to package ancient Greek using
the correct version number. I also promise to learn how to use GitHub 
one of these days and its mysterious yet intriguing "pull requests".


As expected, my available time for packaging is evaporating
as of today. But I think I had enough time to accomplish everything
important. Users in the Debian / Ubuntu family should be in pretty 
good shape. Thanks again for everyone's help.

Jim O'Regan

Jul 20, 2015, 4:56:02 PM
to tesser...@googlegroups.com
On 20 July 2015 at 20:22, Jeff Breidenbach <breid...@gmail.com> wrote:
> Using some black magic, I was able to package ancient Greek using
> the correct version number. I also promise to learn how to use GitHub
> one of these days and its mysterious yet intriguing "pull requests".

There's no real mystery. If you push a branch to a repository that was
forked from another, github offers to create an issue that connects
that branch to the original repository. The rest is UI niceties. One
of those is that you get an 'edit' button on every file when using the
website, whether you can write to it or not: if not, github forks the
repository and creates a new branch before it opens the file for
editing. Clicking save then brings up an offer to create a pull
request.

ShreeDevi Kumar

Jul 21, 2015, 9:42:56 AM
to tesser...@googlegroups.com
Jeff,

There was a request on the forum sometime last year asking whether it was possible to get both text and pdf output from the same run of tesseract. I think that activation of both text renderer and pdf renderer may have been in response to that.

Is it possible for the pdf renderer to also write the text as a separate output file (without invoking text renderer)?

Thanks!


ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


Jeff Breidenbach

Jul 21, 2015, 1:40:43 PM
to tesser...@googlegroups.com
If you want to create PDF and TEXT output at the same time, put this
in your config file. 

myconfig, put it inside tessdata with the other configs
========================================
tessedit_create_txt 1
tessedit_create_pdf 1

Then make sure that you invoke the command line such that 
Tesseract writes to files instead of stdout, e.g. 

    tesseract myimage.tif myoutput myconfig

This will read myimage.tif and myconfig, and produce myoutput.pdf and myoutput.txt

Feel free to add this to the FAQ if you feel it is helpful. This does the OCR process
once, then asks the text renderer to produce the text output, and the pdf renderer
to produce pdf output. Which is exactly what you want. 
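
The recipe above can be sketched as a small script. This is only an illustration: CONFIG_DIR is a hypothetical environment variable that you would point at your installation's config directory, and the OCR step is skipped unless CONFIG_DIR is set, tesseract is installed and myimage.tif exists.

```shell
#!/bin/sh
set -eu

# Where to write the config. Assumption: set CONFIG_DIR to the directory
# that holds your installation's other configs; a scratch directory is
# used otherwise so the sketch is safe to run as-is.
config_dir=${CONFIG_DIR:-$(mktemp -d)}
config="$config_dir/myconfig"

# One "name value" pair per line, enabling both renderers.
cat > "$config" <<'EOF'
tessedit_create_txt 1
tessedit_create_pdf 1
EOF

# Run OCR only if it can work: CONFIG_DIR was given, tesseract is
# installed, and the input image exists.
if [ -n "${CONFIG_DIR:-}" ] && command -v tesseract >/dev/null 2>&1 \
        && [ -f myimage.tif ]; then
    # A file base name (not "-" or "stdout") keeps the two outputs apart.
    tesseract myimage.tif myoutput myconfig
    ls -l myoutput.txt myoutput.pdf
fi
```

Giving an output base name rather than stdout matters here, since two renderers writing to the same stream is exactly the mix-up described earlier in the thread.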

ShreeDevi Kumar

Jul 21, 2015, 11:48:44 PM
to tesser...@googlegroups.com

Thanks. That's very helpful.

It would be good to add the instructions to the FAQ.

We could also add a 'pdftxt' config file with the appropriate settings as part of the package.

- sent from my phone. excuse the brevity.


ShreeDevi Kumar

Jul 22, 2015, 12:29:55 AM
to tesser...@googlegroups.com
I was looking to modify the FAQ and see that all documentation is under 
which has many links to googlecode 

I would suggest adding a new version of wiki documentation starting with 3.04.00 which links to files on github.  Also, the newer FAQ can limit itself to procedures starting with 3.04.00 

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

ShreeDevi Kumar

Jul 22, 2015, 12:38:29 AM
to tesser...@googlegroups.com
Please ignore earlier email. I found that new version of wiki is at 


ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Zdenko Podobný

Jul 22, 2015, 9:53:05 AM
to tesseract-dev, shree...@gmail.com
I do not think we should create multiple config files for all possible combinations...
Those who need txt and pdf at the same time can easily adjust this themselves or set it using tesseract command-line options...

ShreeDevi Kumar

Jul 22, 2015, 10:36:26 AM
to tesser...@googlegroups.com

Ok. I'll add to FAQ if that's ok.

- sent from my phone. excuse the brevity
