Improvement of language model / is my understanding correct?

Lars Fricke

Jan 22, 2019, 6:22:25 AM

to tesseract-ocr

Hello everyone,

I have a basic question about adapting Tesseract 4 to a modified language model. Assume I modify the contents of https://github.com/tesseract-ocr/langdata_lstm/tree/master/deu to better fit our text domain (I know that takes a lot of steps, but assume I have done it).

As I understand it, the LSTM is trained essentially on rendered variations of deu.training_text, so if I change that file, I need to retrain the whole network from scratch.

But what if I skip the retraining and only build a new traineddata file that contains the "old" LSTM network together with the modified dictionary files? Do I still get the effect that the LSTM recognizer (https://github.com/tesseract-ocr/tesseract/blob/master/src/lstm/lstmrecognizer.cpp) prefers the words in the modified dictionary by a factor of 2.25 over non-dictionary words? Would the effect be the same as with a custom dictionary, or do I get an additional benefit, e.g. by modifying https://github.com/tesseract-ocr/langdata_lstm/blob/master/deu/deu.bad_words, that I cannot get with a custom dictionary?
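
To make the "repack without retraining" idea concrete, here is a minimal sketch of the steps I have in mind, written as a small Python helper that only assembles the command lines. It assumes the stock combine_tessdata and wordlist2dawg tools are on the PATH and that the LSTM word list lives in the lstm-word-dawg component; the function name repack_commands, the domain_words.txt file and the unpacked/ directory are placeholders I made up for illustration.

```python
import shlex
from pathlib import Path
from typing import List


def repack_commands(traineddata: str, wordlist: str,
                    workdir: str = "unpacked") -> List[List[str]]:
    """Build (but do not run) the commands to swap the LSTM word dawg.

    Hypothetical helper: file names and workdir are illustrative only.
    """
    lang = Path(traineddata).stem          # "deu" for "deu.traineddata"
    prefix = f"{workdir}/{lang}."          # components become unpacked/deu.<name>
    word_dawg = prefix + "lstm-word-dawg"
    unicharset = prefix + "lstm-unicharset"
    return [
        # 1. Unpack all components of the existing traineddata file.
        ["combine_tessdata", "-u", traineddata, prefix],
        # 2. Compile the modified word list against the unpacked LSTM unicharset.
        ["wordlist2dawg", wordlist, word_dawg, unicharset],
        # 3. Overwrite only the word dawg; the LSTM network itself is untouched.
        #    Note: -o modifies the file in place, so work on a copy.
        ["combine_tessdata", "-o", traineddata, word_dawg],
    ]


if __name__ == "__main__":
    for cmd in repack_commands("deu.traineddata", "domain_words.txt"):
        print(shlex.join(cmd))
```

The helper prints the commands rather than executing them, so they can be checked first. Whether the recognizer then actually favours the new words is exactly what I am unsure about.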

Best Regards,

Lars
