Detection Using LSTM Files


Ibr

Jun 5, 2017, 7:29:25 AM
to tesseract-ocr
Hi,

Assume that I have created 20 LSTM files for English, for example, each for a different font. When I run detection against an image with the command: tesseract image results -l eng --tessdata-dir ./tessdata --oem 1 does Tesseract check the image against all the LSTM files, or does it just take one of them and run detection against it?

I'm assuming that to make detection more accurate I should create many LSTM files for different fonts, because images can differ in font from each other; with an LSTM file for every possible font, detection would be more accurate. Is that correct?

Thanks

ShreeDevi Kumar

Jun 5, 2017, 9:36:04 AM
to tesser...@googlegroups.com
Comments from Ray regarding training text

For Latin-based languages, the existing model data provided has been trained on about 400000 textlines spanning about 4500 fonts. For other scripts, not so many fonts are available, but they have still been trained on a similar number of textlines. Instead of taking a few minutes to a couple of hours to train, Tesseract 4.00 takes a few days to a couple of weeks. 

> The text corpus is from *all* the www, taken several years ago, plus more recent data from wiki-something. The text is divided by language automatically, so there is a separate stream for each of the Devanagari-based languages (as there is for the Latin-based languages) and clipped to 1GB for each language. For each language, the text is frequency counted and cleaned by multiple methods, and sometimes this cleaning is too stringent automatically, or not stringent enough, so forbidden_characters and desired_characters are used as a guide in the cleanup process. There are other lang-specific numbers like a 1-in-n discard ratio for the frequency. For some languages, the amount of data produced at the end is very thin.
>
> The unicharset is extracted from what remains, and the wordlist that is published in langdata.
>
> For the LSTM training, I resorted to using Google's parallel infrastructure to render enough text in all the languages.
>
> However much or little corpus text there is, the rendering process makes 50000 chunks of 50 words to render in a different combination of font and random degradation, which results in 400000-800000 rendered textlines. The words are chosen to approximately echo the real frequency of conjunct clusters (characters in most languages) in the source text, while also using the most frequent words.
>
> This process is all done without significant manual intervention, but counts of the number of generated textlines indicates when it has gone badly, usually due to a lack of fonts, or a lack of corpus text. I recently stopped training chr, iku, khm, mya after discovering that I have no rendered textlines that contain anything other than digits and punctuation.
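
To make the numbers in Ray's comments concrete: 50000 chunks of 50 words is 2.5 million words, and 400000-800000 rendered textlines works out to an average of 8 to 16 textlines per chunk. The chunking step might look roughly like the sketch below. This is only an illustration, not Ray's actual pipeline: the helper names `make_chunks` and `assign_fonts` and the font names are hypothetical, and it weights whole words by corpus frequency, which is a simplification of the cluster-frequency matching he describes.

```python
import random
from collections import Counter


def make_chunks(corpus_words, n_chunks=50000, chunk_size=50, seed=0):
    """Sample n_chunks lists of chunk_size words, weighted by word frequency.

    Hypothetical helper: frequent words are drawn more often, so the
    sampled text roughly echoes the frequencies in the source corpus.
    """
    counts = Counter(corpus_words)
    vocab = list(counts)
    weights = [counts[word] for word in vocab]
    rng = random.Random(seed)
    return [rng.choices(vocab, weights=weights, k=chunk_size)
            for _ in range(n_chunks)]


def assign_fonts(chunks, fonts, seed=0):
    """Pair every chunk with a randomly chosen font name (hypothetical)."""
    rng = random.Random(seed)
    return [(rng.choice(fonts), chunk) for chunk in chunks]


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog the end".split()
    jobs = assign_fonts(make_chunks(words, n_chunks=5), ["FontA", "FontB"])
    for font, chunk in jobs:
        print(font, " ".join(chunk[:8]), "...")
```

Each (font, chunk) pair would then go to a text renderer that applies the random degradation; that part is left out here.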

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


ShreeDevi Kumar

Jun 5, 2017, 9:42:21 AM
to tesser...@googlegroups.com
> Assume that I have created 20 LSTM files for English, for example, each for a different font. When I run detection against an image with the command: tesseract image results -l eng --tessdata-dir ./tessdata --oem 1 does Tesseract check the image against all the LSTM files, or does it just take one of them and run detection against it?

The .lstmf files are created per font/image. lstmtraining processes all of them together to create one .lstm file for the language.

Maybe it keeps the .lstmf files internally. I do not know whether it checks against just one of them or creates a combined version to use for recognition.
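
A minimal sketch of the "all of them together" step: as far as I understand the 4.0 training tools, lstmtraining is given a text file that lists the .lstmf files, one path per line, through its --train_listfile option. The helper name `write_train_listfile` and the directory and file names below are hypothetical.

```python
from pathlib import Path


def write_train_listfile(lstmf_dir, listfile):
    """Write the path of every .lstmf file in lstmf_dir to listfile.

    One path per line, sorted for reproducibility.
    Returns the number of files listed.
    """
    paths = sorted(Path(lstmf_dir).glob("*.lstmf"))
    text = "".join(f"{path}\n" for path in paths)
    Path(listfile).write_text(text, encoding="utf-8")
    return len(paths)


if __name__ == "__main__":
    # Hypothetical layout: one .lstmf file per font/image in ./engtrain
    count = write_train_listfile("engtrain", "eng.training_files.txt")
    print(f"listed {count} .lstmf files")
```

The resulting list covers every font at once, so the fonts all feed one training run rather than 20 separate ones.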


ShreeDevi
