I cannot use traineddata downloaded from Data Files


坂本聖

Dec 7, 2019, 11:34:26 AM
to tesseract-ocr
Hi,
I want to use Tesseract to recognize Chinese text. First, I ran
sudo apt install tesseract-ocr-chi-sim
After that, chi_sim.traineddata appeared in /usr/share/tesseract-ocr/4.00/tessdata, and the language shows up in the list below (I also installed chi_tra and jpn).

$ tesseract --list-langs
List of available languages (5):
chi_sim
chi_tra
eng
jpn
osd


Tesseract works with the packaged data, but I want more accurate OCR, so I want to use the chi_sim.traineddata downloaded from Data Files (the tessdata repository on GitHub) instead.
I removed the packaged data with
sudo apt remove tesseract-ocr-chi-sim
then put the new chi_sim.traineddata in /usr/share/tesseract-ocr/4.00/tessdata and ran Tesseract again. It fails like this:

$ tesseract 0.jpeg output -l chi_sim
Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/chi_sim.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'chi_sim'
Tesseract couldn't load any languages!
Could not initialize tesseract.


Next, I passed the directory explicitly with --tessdata-dir, but it fails in the same way:

$ tesseract 0.jpeg output -l chi_sim --tessdata-dir /usr/share/tesseract-ocr/4.00/tessdata
Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/chi_sim.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'chi_sim'
Tesseract couldn't load any languages!
Could not initialize tesseract.


Then I set TESSDATA_PREFIX to /usr/share/tesseract-ocr/4.00/tessdata/ and tried again, with the same result:

$ export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata/
$ tesseract 0.jpeg output -l chi_sim
Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/chi_sim.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'chi_sim'
Tesseract couldn't load any languages!
Could not initialize tesseract.


Yet when I list the available languages, chi_sim still appears:

$ tesseract --list-langs
List of available languages (5):
chi_sim
chi_tra
eng
jpn
osd


Why can't I use the traineddata downloaded from https://github.com/tesseract-ocr/tessdata/blob/master/chi_sim.traineddata? Did I make a mistake?

NY C

Dec 7, 2019, 9:06:58 PM
to tesseract-ocr
Try setting the TESSDATA_PREFIX environment variable.
  1. Go to Control Panel -> System -> Advanced System Settings -> Advanced tab -> Environment Variables... button
  2. In the System variables list, scroll down to TESSDATA_PREFIX. If its value is not right, select it and click Edit...


On Sunday, December 8, 2019 at 12:34:26 AM UTC+8, 坂本聖 wrote:

Zdenko Podobny

Dec 8, 2019, 9:15:31 AM
to tesser...@googlegroups.com
How did you download the files from the repository?
Please check whether the files in /usr/share/tesseract-ocr/4.00/tessdata/ have the same size as those in the repository.
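
One way to carry out this check, as a minimal sketch: the helper below (check_traineddata is a hypothetical name, not part of Tesseract) reports whether the file exists, whether the current user can read it, its size in bytes, and whether it begins with HTML markup, which would suggest that a web page was saved instead of the binary data. It assumes a POSIX-like shell with wc, head, and grep available.

```shell
# Hypothetical helper (not part of Tesseract): report basic facts about a
# traineddata file so they can be compared with the repository listing.
check_traineddata() {
    file_path="$1"
    if [ ! -e "$file_path" ]; then
        echo "missing: $file_path"
        return 1
    fi
    if [ ! -r "$file_path" ]; then
        echo "not readable by $(id -un): $file_path"
        return 1
    fi
    # Size in bytes, with any padding spaces from wc removed.
    size_bytes=$(wc -c < "$file_path" | tr -d ' ')
    echo "size_bytes=$size_bytes"
    # A web page saved by mistake starts with HTML markup, not binary data.
    if head -c 512 "$file_path" | LC_ALL=C grep -qiE '<!DOCTYPE html|<html'; then
        echo "looks like an HTML page, not traineddata"
        return 1
    fi
    echo "ok: $file_path"
}
```

For the file in this thread it could be called as check_traineddata /usr/share/tesseract-ocr/4.00/tessdata/chi_sim.traineddata, and the reported size compared with the size shown in the repository.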

Zdenko


On Sat, Dec 7, 2019 at 17:34, 坂本聖 <eclipse.alg...@gmail.com> wrote:
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/e93f49e3-978e-458d-8f97-1e0266a318c8%40googlegroups.com.

坂本聖

Dec 8, 2019, 9:46:48 AM
to tesseract-ocr
Thanks for your advice. However, I am using Ubuntu on WSL (Windows Subsystem for Linux), and I have already set TESSDATA_PREFIX by running $ export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata/
Tesseract still does not work.
With the traineddata installed by sudo apt install tesseract-ocr-chi-sim, Tesseract works; it fails only with the data downloaded from Data Files.
Is it not possible to use Tesseract on WSL (Ubuntu)?

On Sunday, December 8, 2019 at 11:06:58 UTC+9, NY C wrote:

坂本聖

Dec 8, 2019, 9:55:36 AM
to tesseract-ocr
Thanks for your advice.
I downloaded the file by clicking the "Download" button at https://github.com/tesseract-ocr/tessdata/blob/master/chi_sim.traineddata.
I then moved chi_sim.traineddata to /usr/share/tesseract-ocr/4.00/tessdata/ and confirmed that the file (42.3 MB) is there.
Tesseract still does not work.
As I said, Tesseract works with the file installed by sudo apt install tesseract-ocr-chi-sim, but the data downloaded from Data Files does not.
I cannot understand why.
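
An alternative to the browser download, sketched under assumptions: a GitHub URL containing /blob/ points to a web page about the file, while the same URL with /raw/ serves the file contents directly. Nothing in this thread confirms that the download method was the problem here; the sketch only shows a way to fetch the file from the command line. blob_to_raw_url is a hypothetical helper, and the curl step is left commented out because it needs network access.

```shell
# Hypothetical helper: turn a GitHub "blob" page URL into the corresponding
# "raw" URL, which serves the file contents instead of an HTML page.
blob_to_raw_url() {
    # Only the first "/blob/" in the URL is rewritten.
    printf '%s\n' "$1" | sed 's#/blob/#/raw/#'
}

blob_url="https://github.com/tesseract-ocr/tessdata/blob/master/chi_sim.traineddata"
raw_url=$(blob_to_raw_url "$blob_url")
echo "$raw_url"

# Download step (commented out: requires network access).
# curl -L follows redirects; -o names the output file.
# curl -L -o chi_sim.traineddata "$raw_url"
```

The downloaded file would still need to be moved into the tessdata directory (with sudo, since that directory is owned by root) and be readable by the user running Tesseract.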

On Sunday, December 8, 2019 at 23:15:31 UTC+9, zdenop wrote:

Zdenko Podobny

Dec 8, 2019, 11:43:52 AM
to tesser...@googlegroups.com
What is the output of:
 tesseract --version

Zdenko


On Sun, Dec 8, 2019 at 15:55, 坂本聖 <eclipse.alg...@gmail.com> wrote:

坂本聖

Dec 8, 2019, 7:15:15 PM
to tesseract-ocr
Here is the output:

$ tesseract --version
tesseract 4.0.0-beta.1
leptonica-1.75.3
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

 Found AVX2
 Found AVX
 Found SSE
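
The version line shows a pre-release build (4.0.0-beta.1). Whether that is related to the loading failure is not established in this thread, but it is a detail worth noting when using data files taken from the repository. A minimal sketch for flagging such builds follows; is_prerelease_version is a hypothetical helper, and the suffix patterns it matches are an assumption about how pre-releases are labeled.

```shell
# Hypothetical helper: succeed if a version string carries a pre-release
# suffix such as "-alpha", "-beta.1", or "-rc1".
is_prerelease_version() {
    case "$1" in
        *-alpha*|*-beta*|*-rc*) return 0 ;;
        *) return 1 ;;
    esac
}

# Version string copied from the output above.
reported_version="4.0.0-beta.1"
if is_prerelease_version "$reported_version"; then
    echo "pre-release build: $reported_version"
else
    echo "release build: $reported_version"
fi
```

On a live system the string could be taken from the first line of tesseract --version instead of being hard-coded.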

On Monday, December 9, 2019 at 1:43:52 UTC+9, zdenop wrote: