Custom words combining letters + digits

David Novak

Jul 10, 2019, 10:29:29 AM
to tesseract-ocr

Hello,

I have a custom word list that I'd like to add to (or effectively substitute for) the default word list for my language. Some of these words combine letters, digits, and punctuation, e.g.:
0.5KG
0.5L
1.1L
1.25KG
108G
4DOG

I'm using Tesseract 4.0. My approach so far:
 - unpack lang.traineddata
 - create cus.lstm-word-dawg (either from my word list alone or from the standard language list combined with my list)
 - create a new .traineddata from cus.lstm, cus.lstm-recoder, cus.lstm-unicharset, cus.lstm-word-dawg, cus.traineddata
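
For anyone following along, the three steps above roughly correspond to the sketch below. It assumes the standard Tesseract training tools combine_tessdata and wordlist2dawg are installed; the post does not give the exact invocations that were used, so the file names (lang.traineddata, mylist.txt, the cus prefix) and both helper function names are placeholders. The script only assembles and prints the command lines, it does not run them.

```python
import shlex


def merge_wordlists(*wordlists):
    """Merge word lists, dropping blanks and duplicates, keeping first-seen order."""
    seen = set()
    merged = []
    for words in wordlists:
        for word in words:
            word = word.strip()
            if word and word not in seen:
                seen.add(word)
                merged.append(word)
    return merged


def dawg_rebuild_commands(base_traineddata, prefix, wordlist_file):
    """Return argv lists for: unpack -> rebuild the LSTM word DAWG -> repack.

    Nothing is executed here; the caller decides how to run the commands.
    """
    return [
        # 1. Unpack all components of the base model as <prefix>.*
        ["combine_tessdata", "-u", base_traineddata, prefix + "."],
        # 2. Build <prefix>.lstm-word-dawg from the word list,
        #    using the unpacked LSTM unicharset
        ["wordlist2dawg", wordlist_file,
         prefix + ".lstm-word-dawg", prefix + ".lstm-unicharset"],
        # 3. Combine every <prefix>.* component into <prefix>.traineddata
        ["combine_tessdata", prefix + "."],
    ]


if __name__ == "__main__":
    for argv in dawg_rebuild_commands("lang.traineddata", "cus", "mylist.txt"):
        print(shlex.join(argv))
```

merge_wordlists covers the "standard list + my list" variant: read both files into lists, merge them, and write the result to the file passed to wordlist2dawg.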

This has practically no effect. Often, a word that is in the list is still misrecognized as some string that is not in the list.

I also tried adding these words with --user-words <mylist.txt>: either no effect, or the same result as with my approach.
I also tried -c language_model_penalty_non_dict_word=1.0 (I thought it would limit the output to words in cus.lstm-word-dawg): no effect.

I'm out of ideas after two weeks of trying. Any tips, please?

Thanks

Shree Devi Kumar

Jul 10, 2019, 10:57:10 AM
to tesser...@googlegroups.com
--user-words does not currently work in tesseract4.
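
One possible workaround outside Tesseract, not something proposed in this thread, is to post-correct the OCR output against the same word list. The sketch below uses Python's standard difflib; the function name snap_to_wordlist and the 0.6 similarity cutoff are illustrative assumptions and would need tuning on real data.

```python
import difflib


def snap_to_wordlist(token, wordlist, cutoff=0.6):
    """Return token if it is in wordlist, otherwise the closest entry
    whose similarity ratio is at least cutoff, otherwise token unchanged."""
    if token in wordlist:
        return token
    matches = difflib.get_close_matches(token, wordlist, n=1, cutoff=cutoff)
    return matches[0] if matches else token


if __name__ == "__main__":
    words = ["0.5KG", "0.5L", "1.1L", "1.25KG", "108G", "4DOG"]
    # Typical confusions: letter O read for digit 0, and the reverse
    for ocr_token in ["O.5KG", "4D0G", "XYZ"]:
        print(ocr_token, "->", snap_to_wordlist(ocr_token, words))
```

This only helps when the misread string is still close to the intended entry; short, similar entries such as 0.5L and 1.1L can be confused with each other, so the corrected output should be checked.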

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/5b015d58-9958-4c1f-a330-abdb001f7957%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

Krs Krs 0

Oct 8, 2019, 5:39:24 PM
to tesseract-ocr
What is the script for adding a language model to Tesseract?


On Wednesday, July 10, 2019 at 16:57:10 UTC+2, shree wrote:
--user-words does not currently work in tesseract4.