The way the path to tessdata directory is defined.


Dmitry Katsubo

Jun 6, 2013, 3:02:59 PM
to tesser...@googlegroups.com
Dear Tesseract developers,

The source code in mainblk.cpp reads:

* TESSDATA_PREFIX Environment variable overrules everything.
* Compiled in -DTESSDATA_PREFIX is next.
* An actual value of argv0 is used if not NULL, otherwise current directory.

But actually it is:

* TESSDATA_PREFIX Environment variable overrules everything.
* Compiled in -DTESSDATA_PREFIX overrides argv0.
* argv0 is used if not NULL, otherwise current directory.

I think it should be the other way around: more specific
(application-specific) values should override system-wide settings. In
particular:

* If argv0 is not NULL, it is used.
* Otherwise if TESSDATA_PREFIX environment variable is defined, it is used.
* Otherwise (if defined) compiled-in TESSDATA_PREFIX is used.
* Otherwise current directory is used.
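
The proposed order can be sketched as a small standalone function. This is only an illustration, not the code from the attached patch: the name `ResolveTessdataDir` and the `compiled_in` parameter (standing in for a `-DTESSDATA_PREFIX` build value) are hypothetical, and the environment value is passed in as an argument rather than read inside the function so that the logic is easy to test.

```cpp
#include <string>

// Hypothetical helper: picks the data directory in the proposed order.
// Any argument may be nullptr, meaning "not provided".
std::string ResolveTessdataDir(const char* argv0,
                               const char* env_prefix,
                               const char* compiled_in) {
  if (argv0 != nullptr) return argv0;              // 1. explicit argument
  if (env_prefix != nullptr) return env_prefix;    // 2. TESSDATA_PREFIX from the environment
  if (compiled_in != nullptr) return compiled_in;  // 3. compiled-in default
  return "./";                                     // 4. current directory
}
```

A caller would typically pass the result of `std::getenv("TESSDATA_PREFIX")` as the second argument.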

I ran into this when trying to use the system-wide library
(/usr/lib/libtesseract.so.3) with my own trained datasets in a multi-threaded
environment: setting the TESSDATA_PREFIX environment variable does not work
reliably there (the environment is shared by all threads in the process),
so the only correct way is to pass the directory to
TessBaseAPI::Init(), but it is ignored.

The attached patch solves this for me. It also makes sure that,
regardless of its source (argument, environment, or compiled-in value),
the directory is post-processed the same way.
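
One way to picture "post-processed the same way" is a single normalization step applied to whichever value was selected. The sketch below assumes the post-processing strips an optional trailing `tessdata` component, ensures a directory separator, and then appends `tessdata/`. Those details and the name `NormalizeDataDir` are assumptions for illustration and are not taken from the patch.

```cpp
#include <string>

// Hypothetical normalization, applied identically to a directory taken from
// the argument, the environment, or the compiled-in default.
std::string NormalizeDataDir(std::string dir) {
  const std::string sub = "tessdata";  // assumed data subdirectory name
  auto is_sep = [](char c) { return c == '/' || c == '\\'; };

  if (dir.empty()) dir = "./";
  // Drop one trailing separator so the suffix check below is uniform.
  if (dir.size() > 1 && is_sep(dir.back())) dir.pop_back();

  // If the last path component is exactly "tessdata", strip it;
  // it is appended again below.
  const std::size_t n = dir.size(), k = sub.size();
  if (n >= k && dir.compare(n - k, k, sub) == 0 &&
      (n == k || is_sep(dir[n - k - 1]))) {
    dir.erase(n - k);
  }

  if (dir.empty()) dir = "./";
  if (!is_sep(dir.back())) dir += '/';  // add a missing separator
  return dir + sub + "/";
}
```

With this shape, a prefix such as `/usr/share` and a full path such as `/usr/share/tessdata` end up as the same directory.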

--
With best regards,
Dmitry
mainblk.cpp.diff

zdenko podobny

Jun 7, 2013, 6:03:33 AM
to tesser...@googlegroups.com
Hello,

Generally, if you have a patch, create an issue [1]. Of course, if it changes the current behavior of Tesseract, also send an e-mail to the tesseract-dev forum. Sometimes it takes a while until your message is approved by a moderator, but it is the right place for development discussion.



Zdenko


--
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesser...@googlegroups.com
To unsubscribe from this group, send email to
tesseract-oc...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en




Dmitry Katsubo

Jun 7, 2013, 6:54:28 AM
to Sriranga(79yrs), tesser...@googlegroups.com
On 07.06.2013 11:35, Sriranga(79yrs) wrote:
> I would suggest also posting it under issues, together with your patch. Presumably it will work on Windows as well as on Linux?

I have submitted issue #938.
I can't guarantee 100% that it works on Windows, as I have no means to compile Tesseract under Windows. However, I didn't add anything new; I just reshuffled the order in which the values are checked, so it should work OK.

zdenko podobny

Jul 14, 2013, 4:56:22 PM
to tesser...@googlegroups.com, tesser...@googlegroups.com
I played a little with Dmitry Katsubo's patch. Based on it, I suggest implementing a "--tessdata-dir" option for the tesseract-ocr executable. It lets the user specify where tesseract-ocr should look for its data (languages and configs). For example, something like this should work after applying my patch from issue 938 [1]:
    tesseract --tessdata-dir /usr/src/tesseract-ocr/tessdata eurotext.tif stdout

Feel free to test it and comment.
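
For readers who want to see what handling such an option might involve, here is a minimal sketch of extracting the value from the command line. It is not the code from the patch in issue 938; the helper name `FindTessdataDirOption` is hypothetical, and only the option string "--tessdata-dir" comes from the proposal above.

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: returns the value that follows "--tessdata-dir",
// or an empty string if the option is absent or has no value after it.
std::string FindTessdataDirOption(int argc, const char* const argv[]) {
  // Start at 1 to skip the program name; stop early enough that
  // argv[i + 1] is always a valid element.
  for (int i = 1; i + 1 < argc; ++i) {
    if (std::strcmp(argv[i], "--tessdata-dir") == 0) return argv[i + 1];
  }
  return "";
}
```

An empty result would mean the caller falls back to the usual lookup (environment variable, compiled-in value, current directory).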

--

Shree Devi Kumar

Jul 15, 2013, 12:07:47 AM
to tesser...@googlegroups.com, tesser...@googlegroups.com
As I understand it, tessdata-prefix, as well as the tessdata-dir option that you are proposing now, specifies where tesseract-ocr should look for its data (languages and configs). Does it also, without explicitly mentioning it, specify where 'tesseract.exe' resides?

The reason I ask is this: I initially downloaded the 3.02 Windows package and installed it. Then I downloaded the latest svn through VS2008 and compiled it. They are in two different locations.

So, is setting tessdata-prefix or this new option enough to make sure that the correct tesseract executable is used?

Thanks,
Shree

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Nick White

Jul 15, 2013, 7:16:18 AM
to tesser...@googlegroups.com
Hi Shree,

On Mon, Jul 15, 2013 at 09:37:47AM +0530, Shree Devi Kumar wrote:
> As I understand tessdata-prefix as well as the tessdata-dir option that you are
> proposing now specify where tesseract-ocr should look for its data (languages
> and configs) - does it also - without explicitly mentioning it - specify where
> the 'tesseract.exe' resides?

The tessdata-prefix only tells Tesseract where to look for its data
files, that is, config and training files.

The actual directory of the tesseract executable will be chosen by
the operating system (so on unix it will be the first match on
$PATH, whereas on Windows it's IIRC a bit more complicated). If you
have several different versions of Tesseract installed it's safest
to just use the full path name to call it, e.g.
C:\My_Tesseract\bin\tesseract.exe

Nick

Shree Devi Kumar

Jul 15, 2013, 8:32:50 AM
to tesser...@googlegroups.com
Thanks, Nick!

I had been using the full paths. Your response validates that approach.

Regards,
Shree



zdenko podobny

Jul 16, 2013, 5:04:12 PM
to tesser...@googlegroups.com, tesser...@googlegroups.com
On Mon, Jul 15, 2013 at 6:07 AM, Shree Devi Kumar <shree...@gmail.com> wrote:
> As I understand it, tessdata-prefix, as well as the tessdata-dir option that you are proposing now, specifies where tesseract-ocr should look for its data (languages and configs). Does it also, without explicitly mentioning it, specify where 'tesseract.exe' resides?
>
> The reason I ask is this: I initially downloaded the 3.02 Windows package and installed it. Then I downloaded the latest svn through VS2008 and compiled it. They are in two different locations.
>
> So, is setting tessdata-prefix or this new option enough to make sure that the correct tesseract executable is used?

This is not about the "correct" tesseract executable. It is about the "correct" tesseract-ocr data.

Let's assume this scenario:
You have installed the latest official tesseract version in C:\Program Files\Tesseract-OCR\ (so your TESSDATA_PREFIX environment variable points to that directory).
You create an alternative eng.traineddata file (e.g. without dictionaries) and place it in C:\Program Files\Tesseract-OCR-dev\tessdata.

With the current executable, if you want to use your alternative eng data, you must change the TESSDATA_PREFIX environment variable to C:\Program Files\Tesseract-OCR-dev\. Then you call:
     tesseract eurotext.tif stdout -l eng
When you want to use the "official" data, you need to change the TESSDATA_PREFIX environment variable back to C:\Program Files\Tesseract-OCR\.

The proposed patch makes this simpler. You do not need to modify the TESSDATA_PREFIX environment variable. You can just use the "--tessdata-dir" option (which will have higher priority than the environment variable):
     tesseract --tessdata-dir "C:\Program Files\Tesseract-OCR-dev\tessdata" eurotext.tif stdout -l eng

Of course, the patch also modifies the behavior of the tesseract-ocr library regarding the priority of TESSDATA_PREFIX: at the moment the environment variable has top priority. The patch changes it so that the argument has top priority (for more detail, see the comments in the patch for ccutil/mainblk.cpp).

PS1: Maybe instead of "--tessdata-dir" we could use "--tessdata_prefix" to be consistent.

--
Zdenko

Shree Devi Kumar

Jul 17, 2013, 2:36:44 AM
to tesser...@googlegroups.com
Thank you, Zdenko and Nick, for the clarifications.

Shree
