Error opening traineddata files on Mac High Sierra


Firlefanz

Apr 10, 2018, 7:14:03 AM
to tesseract-ocr
I downloaded deu_frak.traineddata, Fraktur.traineddata and frk.traineddata to /usr/local/share/tessdata. But when I run

$ tesseract file.tiff -l Fraktur Fraktur

I get the error message

Error opening data file ./tessdata/Fraktur.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'Fraktur'
Tesseract couldn't load any languages!
Could not initialize tesseract.


Zdenko Podobny

Apr 10, 2018, 7:22:13 AM
to tesser...@googlegroups.com
First of all: your command is wrong. It should be constructed this way:

tesseract image output [options]

See tesseract --help for more details.

Next: the error message is clear:
Error opening data file ./tessdata/Fraktur.traineddata
You (or your installation) told tesseract to look for the traineddata in the current directory (./). Do you have it there?

Next, tesseract gave you a hint on how to fix the problem ("Please make sure the TESSDATA_PREFIX..."). Did you use it?
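
For reference, the original command rearranged to match that pattern might look like the sketch below. The file name and the output base name are the ones from the first post; the guard around the call is an addition so the snippet does nothing on a machine without tesseract or without the image.

```shell
# Pattern from `tesseract --help`: tesseract image output [options]
# Input image first, output base name second, options such as -l last.
image="file.tiff"   # input image from the original post
outbase="Fraktur"   # output base name; tesseract appends .txt to it
lang="Fraktur"      # needs a matching Fraktur.traineddata in the tessdata directory

cmd="tesseract $image $outbase -l $lang"
echo "$cmd"

# Only run the command when tesseract and the image are actually present.
if command -v tesseract >/dev/null 2>&1 && [ -f "$image" ]; then
  $cmd
fi
```

With the arguments in this order, the remaining error would come only from the traineddata lookup, not from the command line itself.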


Zdenko

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/e190c5c4-9099-4077-98a8-bf03802e509d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Firlefanz

Apr 10, 2018, 7:43:15 AM
to tesseract-ocr

Thank you for your reply. I used the command following this guide https://www.youtube.com/watch?v=QhJiOCwz-_I -- if it's wrong, then I will not follow this guide anymore.

Yes, I have Fraktur.traineddata in /usr/local/share/tessdata.

I do not know how to change "the TESSDATA_PREFIX environment variable".

Zdenko Podobny

Apr 10, 2018, 8:29:42 AM
to tesser...@googlegroups.com
If you followed someone's tutorial, you should complain to its author ;-).

I am not familiar with Mac, but on Linux you can do it (on the command line) this way:

 export TESSDATA_PREFIX=/usr/local/share/

Maybe it is similar on Mac. Try to google how to set an environment variable on Mac.
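
The same `export` syntax works in the bash shell that Terminal uses on macOS. Here is a small sketch, assuming the tessdata directory is /usr/local/share/tessdata (the usual Homebrew location on Intel Macs) and a Tesseract 3.x build, where the variable points at the parent of tessdata as the error message says. The helper name `has_traineddata` is made up for illustration.

```shell
# Hypothetical helper: succeeds if <prefix>/tessdata/<lang>.traineddata exists.
has_traineddata() {
  prefix="${1%/}"   # drop one trailing slash so the path joins cleanly
  lang="$2"
  [ -f "$prefix/tessdata/$lang.traineddata" ]
}

# Assumed location; adjust to wherever the tessdata directory actually lives.
export TESSDATA_PREFIX=/usr/local/share/

if has_traineddata "$TESSDATA_PREFIX" Fraktur; then
  echo "Fraktur.traineddata found under $TESSDATA_PREFIX"
else
  echo "Fraktur.traineddata not found under ${TESSDATA_PREFIX%/}/tessdata" >&2
fi
```

The variable set this way only lasts for the current terminal session.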

Zdenko

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.

Fanatico

Apr 10, 2018, 9:07:18 AM
to tesseract-ocr
Did you install it using brew or compile it yourself?

Try typing this in the terminal and post the result here:

echo $TESSDATA_PREFIX


Firlefanz

Apr 10, 2018, 10:47:04 AM
to tesseract-ocr
Nothing happens when I type echo $TESSDATA_PREFIX.
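
An empty result from `echo` means the variable is either unset or set to an empty string in that shell. A small sketch to tell the two cases apart, using standard POSIX parameter expansion; the function name `describe_prefix` is only illustrative.

```shell
# Report whether TESSDATA_PREFIX is unset, empty, or set to a value.
describe_prefix() {
  if [ -z "${TESSDATA_PREFIX+x}" ]; then
    echo "unset"      # the variable does not exist in this shell
  elif [ -z "$TESSDATA_PREFIX" ]; then
    echo "empty"      # the variable exists but holds an empty string
  else
    echo "set to $TESSDATA_PREFIX"
  fi
}

describe_prefix
```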

I thought about installing tesseract 4.0beta. Is there a step-by-step guide on how to do this? With brew install tesseract I cannot choose the version, i.e. it's 3.05.01.

Fanatico

Apr 10, 2018, 11:39:05 AM
to tesseract-ocr
Try this command in the console:
brew info tesseract

This should return some info, including the path where your tesseract is installed. Copy it and run this in your console:
export TESSDATA_PREFIX=[the path you just copied]

Try to run your command again. If it works, you can paste this line into your .bash_profile or run it in every new terminal you open.
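
A sketch of those steps, assuming a Homebrew install: `brew --prefix` prints the Homebrew root, and the tessdata directory normally sits under its `share` directory. The function name `persist_prefix` and the append-only-once check are illustrative additions, not part of the advice above.

```shell
# Hypothetical helper: append an export line to a profile file
# unless exactly that line is already present.
persist_prefix() {
  profile="$1"
  value="$2"
  line="export TESSDATA_PREFIX=$value"
  touch "$profile"
  # -q quiet, -x match the whole line, -F treat the pattern as a fixed string
  grep -qxF "$line" "$profile" || printf '%s\n' "$line" >> "$profile"
}

if command -v brew >/dev/null 2>&1; then
  # Assumption: Tesseract 3.x, which expects the parent of tessdata.
  TESSDATA_PREFIX="$(brew --prefix)/share"
  export TESSDATA_PREFIX
  persist_prefix "$HOME/.bash_profile" "$TESSDATA_PREFIX"
fi
```

Checking for the line before appending keeps .bash_profile from collecting duplicates when the snippet is run more than once.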

I made a step-by-step guide to building and using tesseract 4.0 on my Mac; you can see it here.

Obs.: Read everything before doing it

Firlefanz

Apr 11, 2018, 3:33:23 AM
to tesseract-ocr

It works! I am so relieved. Thank you all for the help.

I still have a couple of questions, since I've read several tutorials that each use different commands:

1. To convert my Fraktur PDF files to TIFF I use ImageMagick. Is this the right command? convert -density 300 test.pdf -depth 8 -strip -background white -alpha off test.tiff

2. For tesseract I then use the command: tesseract test.tiff outtest -l deu_frak
With this I get a txt version of the tiff.

3. Not that it matters too much (I'm over the moon that it works like this), but can I get the original PDF with search and copy functions as output instead of a txt?


ShreeDevi Kumar

Apr 11, 2018, 3:58:46 AM
to tesser...@googlegroups.com
1. Check the output tif and adjust the convert command if needed.

2. Depending on your tesseract version you could try -l frk also.

3. Yes, you can get a pdf as output.

Search the GitHub issues; there is a long discussion thread about the best ways to create pdf output.

Look for pdf and invisible pdf.
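
For point 3, a minimal sketch: tesseract accepts config names after the options, and the `pdf` config that normally ships in tessdata/configs writes a PDF with the recognized text as an invisible layer over the page image. The file names are the ones from the question; whether the `pdf` config is present depends on the installation, and the guard around the call is an addition.

```shell
image="test.tiff"   # input image from the question
outbase="outtest"   # output base name; the pdf config produces outtest.pdf
lang="deu_frak"     # language from the question; frk is the alternative mentioned above

# The trailing "pdf" is a config name, placed after the options.
cmd="tesseract $image $outbase -l $lang pdf"
echo "$cmd"

# Only run the command when tesseract and the image are actually present.
if command -v tesseract >/dev/null 2>&1 && [ -f "$image" ]; then
  $cmd
fi
```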

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.

To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.

ShreeDevi Kumar

Apr 11, 2018, 4:21:47 AM
to tesser...@googlegroups.com

Firlefanz

Apr 11, 2018, 7:02:15 AM
to tesseract-ocr

Thank you again. I think I'll stay with plain txt -- pdf looks too difficult to achieve.

Now, the next problem: Everything worked fine with my 1-page test PDF. I then tried the same with a 30 MB, 500-page PDF. After I started convert -density 300 test.pdf -depth 8 -strip -background white -alpha off test.tiff, it ran for 2 hours and then suddenly everything went black and I could not do anything. I guess my Mac is too weak to handle this. Is splitting the PDF into many parts the only option left?
With pdftk I used the command "pdftk test.pdf burst" to split the PDF into single pages. I then put around 50 pages in a new folder and used "pdftk *.pdf cat output test.pdf" to combine them. Is there a faster way to do this? I do not know which command would split the 500 pages automatically into bundles of 50.
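
One possible way to get bundles of 50 without bursting first is to let pdftk's `cat` operation take page ranges directly. This is a sketch, not a tested recipe for this particular file: the helper name `page_ranges` and the output names part_01.pdf, part_02.pdf, ... are made up for illustration, and the page count is read from `pdftk test.pdf dump_data`.

```shell
# Hypothetical helper: print "start-end" ranges covering pages 1..total
# in chunks of the given size, one range per line.
page_ranges() {
  total="$1"
  size="$2"
  start=1
  while [ "$start" -le "$total" ]; do
    end=$((start + size - 1))
    if [ "$end" -gt "$total" ]; then
      end="$total"          # last chunk may be shorter
    fi
    echo "$start-$end"
    start=$((end + 1))
  done
}

# Only touch files when pdftk and the input PDF are actually present.
if command -v pdftk >/dev/null 2>&1 && [ -f test.pdf ]; then
  # dump_data prints a line like "NumberOfPages: 500"
  pages=$(pdftk test.pdf dump_data | awk '/^NumberOfPages/ {print $2}')
  n=1
  for range in $(page_ranges "$pages" 50); do
    pdftk test.pdf cat "$range" output "$(printf 'part_%02d.pdf' "$n")"
    n=$((n + 1))
  done
fi
```

Each part can then go through the same convert and tesseract commands separately, which keeps the size of each intermediate TIFF down.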