Beginner question : could not initialize tesseract, missing eng.traineddata file in tessdata


Roparzh Hemon

Jan 16, 2021, 11:59:19 AM
to tesseract-ocr

Hello,

 I am a complete beginner to Tesseract. I just installed it on my Ubuntu machine.
Here is a snippet from my Terminal : 

$ echo $TESSDATA_PREFIX
/home/mbalambala/tesseract/tessdata
$ tesseract Downloads/p1.pdf p1
Error opening data file /home/mbalambala/tesseract/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
$ ls /home/mbalambala/tesseract/tessdata
configs                    eng.user-words Makefile.am pdf.tiff 
eng.user-patterns Makefile              Makefile.in   tessconfigs 



So it seems I need to put an eng.traineddata file in my tessdata directory. How do I do this?
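
A quick way to confirm what the error is complaining about is to check whether `<lang>.traineddata` exists in the directory that `TESSDATA_PREFIX` points to. This is a minimal sketch, assuming the variable names the tessdata directory itself, as the error message asks; the helper names `traineddata_path` and `has_traineddata` are hypothetical and not part of Tesseract.

```python
import os
from pathlib import Path


def traineddata_path(lang, prefix=None):
    """Return the path where <lang>.traineddata is expected to be found."""
    # Assumption: TESSDATA_PREFIX names the tessdata directory itself,
    # which is what the error message in the post asks for.
    if prefix is None:
        prefix = os.environ.get("TESSDATA_PREFIX", "")
    return Path(prefix) / f"{lang}.traineddata"


def has_traineddata(lang, prefix=None):
    """True if the expected file exists and is not empty."""
    path = traineddata_path(lang, prefix)
    return path.is_file() and path.stat().st_size > 0


if __name__ == "__main__":
    # Reports the expected location and whether a usable file is there.
    print(traineddata_path("eng"), has_traineddata("eng"))
```

With the directory listing in the post, `has_traineddata("eng")` would return `False`, since only `eng.user-words` and `eng.user-patterns` are present.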



Adriana Camilleri

Jan 17, 2021, 4:37:22 AM
to tesser...@googlegroups.com
Run the following command in order to get the eng.traineddata file within the tessdata directory: wget https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/fa3fd4fb-fb96-4420-8bc0-69e1e4e3798fn%40googlegroups.com.

Roparzh Hemon

Jan 19, 2021, 11:19:09 AM
to tesseract-ocr

I downloaded it as you suggested, and as the terminal output below shows, the file is now present at the correct place :

$ file /home/mbalambala/tesseract/tessdata/eng.traineddata
/home/mbalambala/tesseract/tessdata/eng.traineddata : HTML document, UTF-8 Unicode text, with very long lines

$ echo $TESSDATA_PREFIX
/home/mbalambala/tesseract/tessdata

but the error message stays exactly the same :

$ tesseract Downloads/p1.pdf p1
Error opening data file /home/mbalambala/tesseract/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.


Whatever the real problem is, the error message is not detecting it.
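
The `file` output above is the clue: the downloaded eng.traineddata is an HTML document, not model data. A GitHub `/blob/` URL serves the web page that displays the file, so wget saved that page under the expected name, and Tesseract still cannot open it as a data file. The sketch below shows one way to spot this and to derive a direct-download URL. It assumes that GitHub serves the file contents when `/blob/` is replaced with `/raw/`, and both helper names are hypothetical.

```python
def blob_to_raw(url):
    """Rewrite a GitHub 'blob' page URL into a direct-download URL."""
    # Assumption: replacing the first "/blob/" path segment with "/raw/"
    # makes GitHub serve the file itself instead of the viewer page.
    return url.replace("/blob/", "/raw/", 1)


def looks_like_html(path):
    """Heuristic check: does the file start like an HTML page?"""
    with open(path, "rb") as f:
        head = f.read(512).lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))


if __name__ == "__main__":
    page = "https://github.com/tesseract-ocr/tessdata/blob/master/eng.traineddata"
    print(blob_to_raw(page))
```

Running it prints `https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata`. After downloading from that address, `looks_like_html` on the saved file should return `False` if the download is real model data.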

Shree Devi Kumar

Jan 19, 2021, 11:30:46 AM
to tesseract-ocr


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Adriana Camilleri

Jan 19, 2021, 12:18:17 PM
to tesseract-ocr
My apologies... hope the error is now fixed.

Roparzh Hemon

Jan 19, 2021, 12:43:53 PM
to tesseract-ocr
shree : your solution worked for me, thanks a lot.

Surya VaraPrasad Alla

Apr 22, 2024, 11:40:45 AM
to tesseract-ocr
Hello,

I am getting a similar error:

pytesseract.pytesseract.TesseractError: (1, "read_params_file: Can't open tessedit_char_blacklist=,;: Error: Tesseract (legacy) engine requested, but components are not present in external/tesstrain/data/eng_pcb/eng_pcb.traineddata!! Failed loading language 'eng_pcb' Tesseract couldn't load any languages! Could not initialize tesseract.")

tesseract --version:
tesseract -v
tesseract 4.1.1
 leptonica-1.82.0
  libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.4.8

I am using best float tessdata files from: https://github.com/tesseract-ocr/tessdata_best/blob/main/eng.traineddata

I also tried some of the suggestions in https://github.com/ocrmypdf/OCRmyPDF/issues/209

I am looking for the source of the issue. Could someone help me understand it, so that I can continue working?

Zdenko Podobny

Apr 22, 2024, 12:43:54 PM
to tesser...@googlegroups.com
No, you are not using the best float tessdata files from: https://github.com/tesseract-ocr/tessdata_best/blob/main/eng.traineddata
That repository contains no eng_pcb.traineddata. (Read your error message.)


Zdenko



Surya VaraPrasad Alla

Apr 25, 2024, 5:35:19 AM
to tesseract-ocr
eng_pcb.traineddata is a traineddata file that I trained starting from eng.traineddata.

I did LSTM training to improve the detection of the OCR rather than the recognition. I used the tesstrain Git repo.

Final error: the legacy components could not be found in eng_pcb.traineddata.

Zdenko Podobny

Apr 25, 2024, 7:34:58 AM
to tesser...@googlegroups.com
If you used tesstrain, you trained the LSTM engine. Why are you then asking Tesseract to use the legacy engine?
Do you understand what you are doing?

Zdenko
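
The error text points at two separate things. `read_params_file: Can't open tessedit_char_blacklist=,;:` suggests the variable assignment reached Tesseract as a config file name, which is what happens when it is not preceded by `-c`. The legacy-engine message means the selected engine mode needs components that a traineddata file produced by LSTM training does not contain. Below is a minimal sketch of a pytesseract config string that requests the LSTM engine with `--oem 1` and sets the variable with `-c`. The helper `build_config` and the image name `board.png` are hypothetical, and whether the blacklist is honored by the LSTM engine in 4.1.1 is not confirmed here.

```python
def build_config(oem=1, psm=None, variables=None):
    """Build a Tesseract option string for pytesseract's `config` argument."""
    parts = ["--oem", str(oem)]  # 1 selects the LSTM engine only
    if psm is not None:
        parts += ["--psm", str(psm)]
    for name, value in (variables or {}).items():
        # Each variable needs its own -c; a bare NAME=VALUE argument is
        # treated by Tesseract as the name of a config file to open.
        parts += ["-c", f"{name}={value}"]
    # Assumption: no option value contains whitespace, so plain joining is safe.
    return " ".join(parts)


config = build_config(variables={"tessedit_char_blacklist": ",;:"})
print(config)

# Usage with pytesseract (not run here; needs pytesseract, Pillow, and the model):
# import pytesseract
# from PIL import Image
# text = pytesseract.image_to_string(Image.open("board.png"), lang="eng_pcb", config=config)
```

The script prints `--oem 1 -c tessedit_char_blacklist=,;:`. If the original call passed `--oem 0` or `--oem 2`, changing it to `--oem 1` is the part that addresses the legacy-engine error.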

