Improvement of language model / is my understanding correct?


Lars Fricke

Jan 22, 2019, 6:22:25 AM
to tesseract-ocr
Hello everyone,

I have a basic question about adapting Tesseract 4 to a modified language model. Assume I modify the contents of https://github.com/tesseract-ocr/langdata_lstm/tree/master/deu to fit our text domain better (I know that takes a lot of steps, but assume I got it done).

As I understand it, the LSTM is basically trained on rendered variations of deu.training_text, so if I change that file, I need to retrain the whole network from scratch.

But what if I don't do that and instead only build a new traineddata file that contains the "old" LSTM network together with the modified dictionary files? Do I still get the effect that the LSTM recognizer https://github.com/tesseract-ocr/tesseract/blob/master/src/lstm/lstmrecognizer.cpp prefers the words in the modified dictionary by a factor of 2.25 over non-dictionary words? Would the effect be the same as with a custom dictionary, or do I get an additional benefit, e.g. by modifying https://github.com/tesseract-ocr/langdata_lstm/blob/master/deu/deu.bad_words, that I cannot get with a custom dictionary?
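
To make concrete what I mean by "a new traineddata file with the old network but modified dictionary files", here is a rough sketch of the steps as I understand them. It only uses the combine_tessdata and wordlist2dawg tools from the Tesseract training tools; the function names, the word list file my_words.txt and the working directory "unpacked" are placeholders I made up for illustration. I have not verified that this actually changes which words the recognizer prefers, which is exactly my question.

```python
import subprocess
from pathlib import Path


def dictionary_swap_commands(traineddata, wordlist, workdir="unpacked"):
    """Build the command lines for replacing the LSTM word dictionary.

    Nothing is executed here; the commands are returned as argument lists.
    """
    lang = Path(traineddata).stem          # "deu" for "deu.traineddata"
    prefix = f"{workdir}/{lang}."          # unpacked components get this prefix
    word_dawg = prefix + "lstm-word-dawg"
    unicharset = prefix + "lstm-unicharset"
    return [
        # 1. Unpack all components (network, unicharset, dawgs, ...).
        ["combine_tessdata", "-u", str(traineddata), prefix],
        # 2. Compile the modified word list into a new word dawg,
        #    using the unicharset that belongs to the existing network.
        ["wordlist2dawg", str(wordlist), word_dawg, unicharset],
        # 3. Overwrite only the word dawg inside the traineddata file.
        #    This modifies the file in place, so work on a copy.
        ["combine_tessdata", "-o", str(traineddata), word_dawg],
    ]


def run_dictionary_swap(traineddata, wordlist, workdir="unpacked"):
    """Run the steps in order; requires the Tesseract training tools on PATH."""
    Path(workdir).mkdir(parents=True, exist_ok=True)
    for cmd in dictionary_swap_commands(traineddata, wordlist, workdir):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    for cmd in dictionary_swap_commands("deu.traineddata", "my_words.txt"):
        print(" ".join(cmd))
```

The LSTM network component itself is never touched in this sketch, so no retraining is involved; only the word dawg is rebuilt from the modified word list.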

Best Regards,
Lars

