Training a language not in Tesseract whose script/letters are almost the same as Vietnamese

haru...@gmail.com

Mar 28, 2019, 2:32:30 PM

to tesseract-ocr

I am trying to train a language currently not present in Tesseract.

I am working with Python on Ubuntu 16.04 LTS, with Tesseract version 3.04.01 (installed with sudo apt install tesseract-ocr; it works perfectly for English).

I tested with the following command:

tesseract procssed_image.png stdout -l vie

The output is 90% correct, except for some characters that are not in the Vietnamese language.

Then I created the bazaar config file (in /usr/share/tesseract-ocr/tessdata/configs/):

load_system_dawg     F
load_freq_dawg          F
user_words_suffix      user-words

I also created a text file with my custom list of words (around 150 words, one word per line) and named it vie.user-words.
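For readers following along, the config file and word list described above could be staged roughly as below. This is only a sketch: the staging directory name and the two word-list entries are placeholders, not the poster's actual data, and it assumes the files are later copied into the tessdata directory so that the word list sits next to the language data as vie.user-words.

```shell
# Stage the files in a scratch directory (hypothetical path). Copying them into
# the real tessdata directory, e.g. /usr/share/tesseract-ocr/tessdata, needs sudo.
stage=./tessdata_staging
mkdir -p "$stage/configs"

# Config file "bazaar": the same three settings quoted in the post.
cat > "$stage/configs/bazaar" <<'EOF'
load_system_dawg     F
load_freq_dawg       F
user_words_suffix    user-words
EOF

# Word list, one word per line. These two entries are placeholders only.
printf '%s\n' 'placeholderword1' 'placeholderword2' > "$stage/vie.user-words"

# Show what was written so it can be checked before installing.
ls -R "$stage"
```

The suffix in user_words_suffix and the file extension of the word list have to match, which is why the list is named vie.user-words.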

And then ran the following command:

tesseract procssed_image.png stdout -l vie bazaar

The result was the same.

Then I tried:

tesseract procssed_image.png stdout -l vie bazaar -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789àâêî

In tessedit_char_whitelist, I am trying to list all the characters present in my language, plus the other symbols present in the image file.

It shows the following errors and also prints the output (the result is the same as before):

read_params_file: Can't open c
read_params_file: Can't open tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789àâêî

Please tell me how to fix this issue. Thank you for your time.

Shree Devi Kumar

Mar 29, 2019, 12:50:19 AM

to tesser...@googlegroups.com

tesseract procssed_image.png stdout -l vie bazaar -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789àâêî

The config file (bazaar) should be listed last, after the options; see tesseract --help.

Check your command syntax.
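As an illustration of the ordering described above, here is a minimal sketch. The helper build_cmd is hypothetical and only prints the command line so the argument order can be inspected; the real call is attempted only if tesseract and the image file are present.

```shell
# Characters to allow, copied from the command quoted above.
whitelist='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789àâêî'

# build_cmd (hypothetical helper): prints the command line with the options
# (-l, -c) first and the config file name as the final argument.
build_cmd() {
  printf 'tesseract %s stdout -l vie -c tessedit_char_whitelist=%s %s\n' \
    "$1" "$whitelist" "$2"
}

cmd=$(build_cmd procssed_image.png bazaar)
echo "$cmd"

# Run the real command only when tesseract and the image are available.
if command -v tesseract >/dev/null 2>&1 && [ -f procssed_image.png ]; then
  tesseract procssed_image.png stdout -l vie \
    -c "tessedit_char_whitelist=$whitelist" bazaar
fi
```

With bazaar placed before -c, as in the quoted command, the remaining words are read as config file names, which matches the "read_params_file: Can't open" messages reported.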

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/55c9df9a-762f-43c3-9538-ba7d0c55dd20%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

haru...@gmail.com

Mar 30, 2019, 3:39:12 AM

to tesseract-ocr

Thank you for the response. I tried keeping bazaar at the end, and the command runs without any error. However, Tesseract is still not able to recognize the extra letters that I provided in tessedit_char_whitelist; the output is the same. The words in the image are already present in the vie.user-words file.

1. Is there anything wrong with the way I created that file?

2. How should I approach this issue? Do I need to provide any other files?

3. Or do I need to train the language separately from scratch?

Thanks.
