Beginner question: could not initialize tesseract, missing eng.traineddata file in tessdata

Roparzh Hemon

Jan 16, 2021, 11:59:19 AM

to tesseract-ocr

Hello,

I am a complete beginner to Tesseract. I just installed it on my Ubuntu machine.

Here is a snippet from my terminal:

$ echo $TESSDATA_PREFIX

/home/mbalambala/tesseract/tessdata

$ tesseract Downloads/p1.pdf p1

Error opening data file /home/mbalambala/tesseract/tessdata/eng.traineddata

Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.

Failed loading language 'eng'

Tesseract couldn't load any languages!

Could not initialize tesseract.

$ ls /home/mbalambala/tesseract/tessdata

configs eng.user-words Makefile.am pdf.tiff

eng.user-patterns Makefile Makefile.in tessconfigs

So it seems I need an eng.traineddata file in my tessdata directory. How do I get one?

Adriana Camilleri

Jan 17, 2021, 4:37:22 AM

to tesser...@googlegroups.com

Run the following command inside the tessdata directory to get the eng.traineddata file: wget https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/fa3fd4fb-fb96-4420-8bc0-69e1e4e3798fn%40googlegroups.com.

Roparzh Hemon

Jan 19, 2021, 11:19:09 AM

to tesseract-ocr

I downloaded it as you suggested, and as the terminal output below shows, the file is now present in the correct place:

$ file /home/mbalambala/tesseract/tessdata/eng.traineddata

/home/mbalambala/tesseract/tessdata/eng.traineddata: HTML document, UTF-8 Unicode text, with very long lines

$ echo $TESSDATA_PREFIX

/home/mbalambala/tesseract/tessdata

but the error message stays exactly the same:

$ tesseract Downloads/p1.pdf p1

Error opening data file /home/mbalambala/tesseract/tessdata/eng.traineddata

Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.

Failed loading language 'eng'

Tesseract couldn't load any languages!

Could not initialize tesseract.

Whatever the real problem is, the error message is not detecting it.

Shree Devi Kumar

Jan 19, 2021, 11:30:46 AM

to tesseract-ocr

>wget https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata

That is not correct. You need to get the `raw` file.

https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata
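The `blob` URL returns GitHub's HTML page for the file, which is why `file` reported an HTML document earlier in the thread; the `raw` URL returns the model data itself. Below is a minimal Python sketch for catching this after a download, assuming it is enough to inspect the first bytes of the file. The helper name `looks_like_html` and the 512-byte probe size are illustrative choices, not part of Tesseract.

```python
def looks_like_html(path, probe_bytes=512):
    """Return True if the file starts like an HTML page instead of binary model data."""
    # Read only the beginning of the file; traineddata files can be many megabytes.
    with open(path, "rb") as handle:
        head = handle.read(probe_bytes)
    # HTML pages usually begin with optional whitespace followed by a doctype or <html> tag.
    head = head.lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))
```

If this returns True for eng.traineddata, the download fetched a web page rather than the model, and the file should be fetched again from the `raw` URL.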

--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Adriana Camilleri

Jan 19, 2021, 12:18:17 PM

to tesseract-ocr

My apologies... hope the error is now fixed.

Roparzh Hemon

Jan 19, 2021, 12:43:53 PM

to tesseract-ocr

Shree: your solution worked for me, thanks a lot.

Surya VaraPrasad Alla

Apr 22, 2024, 11:40:45 AM

to tesseract-ocr

Hello,

I am getting a similar error:

pytesseract.pytesseract.TesseractError: (1, "read_params_file: Can't open tessedit_char_blacklist=,;: Error: Tesseract (legacy) engine requested, but components are not present in external/tesstrain/data/eng_pcb/eng_pcb.traineddata!! Failed loading language 'eng_pcb' Tesseract couldn't load any languages! Could not initialize tesseract.")

tesseract --version:
tesseract -v
tesseract 4.1.1
leptonica-1.82.0
libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
Found AVX512BW
Found AVX512F
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8

I am using the best (float) tessdata files from: https://github.com/tesseract-ocr/tessdata_best/blob/main/eng.traineddata

I also tried some of the suggestions in https://github.com/ocrmypdf/OCRmyPDF/issues/209

I am looking for the source of the issue. Could someone help me understand it, so that I can continue working?

Zdenko Podobny

Apr 22, 2024, 12:43:54 PM

to tesser...@googlegroups.com

No, you are not using best float tessdata files from: https://github.com/tesseract-ocr/tessdata_best/blob/main/eng.traineddata

There is nothing like eng_pcb.traineddata there. (Read your error message.)

Zdenko

Surya VaraPrasad Alla

Apr 25, 2024, 5:35:19 AM

to tesseract-ocr

eng_pcb.traineddata is a traineddata file that I trained starting from eng.traineddata.

I did LSTM training to improve OCR detection rather than recognition. I used the tesstrain Git repo.

Final error: the legacy components could not be found in eng_pcb.traineddata.

Zdenko Podobny

Apr 25, 2024, 7:34:58 AM

to tesser...@googlegroups.com

If you used tesstrain, you trained the LSTM engine. Why do you then ask Tesseract to use the legacy engine?

Do you understand what you are doing?
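As a rough illustration of the point about engines, here is a minimal Python sketch that builds a `tesseract` command line requesting the LSTM engine only (`--oem 1`). It assumes the failing call asked for an engine mode that includes the legacy engine, and that the blacklist was passed as a bare argument, as the `read_params_file: Can't open tessedit_char_blacklist=,;:` part of the error suggests. The function name `build_lstm_command` and the image name `page.png` are hypothetical; the tessdata directory is taken from the path in the error message.

```python
import shlex


def build_lstm_command(image, lang, tessdata_dir, blacklist=""):
    """Build a tesseract command line that requests the LSTM engine only."""
    cmd = [
        "tesseract", image, "stdout",
        "--tessdata-dir", tessdata_dir,
        "-l", lang,
        "--oem", "1",  # 1 = LSTM only; 0 would request the legacy engine
    ]
    if blacklist:
        # Variables are set with -c NAME=VALUE. Without -c, the argument is
        # taken as the name of a config file, which tesseract then fails to open.
        cmd += ["-c", f"tessedit_char_blacklist={blacklist}"]
    return cmd


# Hypothetical inputs: page.png is a placeholder image name.
cmd = build_lstm_command(
    "page.png", "eng_pcb", "external/tesstrain/data/eng_pcb", blacklist=",;:"
)
print(shlex.join(cmd))
```

With pytesseract, the same options would normally go into the `config` string argument of calls such as `image_to_string`, with the language given through `lang`.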

Zdenko
