Fine-tuning LSTM for Japanese

Akira Hayakawa

May 28, 2017, 11:21:25 AM

to tesseract-ocr

I am new to Tesseract. My aim is to use it to analyze Japanese documents. My idea is to start from an existing model and fine-tune it on new words that weren't recognized correctly.

I am reading the Wiki and have some questions.

1)

In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune

you pass --training_text to tesstrain.sh:

training/tesstrain.sh --fonts_dir /usr/share/fonts \
  --training_text ../langdata/ara/ara.training_text \
  --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --lang ara \
  --linedata_only \
  --noextract_font_properties \
  --exposures "0" \
  --fontlist "Arial" \
  --output_dir ~/tesstutorial/aratest

but in https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

you don't. Why?

training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain

My understanding is:

1. tesstrain.sh uses the text2image command internally to generate images that are rendered in various fonts and reshaped.

2. --linedata_only splits the training text into lines and makes an image for each line.

3. --langdata_dir is required but --training_text isn't. If --training_text isn't given, tesstrain.sh uses the default $lang/$lang.training_text.

Am I correct?

2)

In the above example, I can't see why it needs --tessdata_dir, because that seems irrelevant to generating training data.

3)

In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune

It says the reader should lay out the projects like this:

./langdata
./langdata/eng
./langdata/ara
./tessdata
./tesseract
./tesseract/tessdata
./tesseract/tessdata/configs/
./tesseract/training
etc

and all the following examples are run from the tesseract directory. So I think the examples should pass ../tessdata as --tessdata_dir rather than ./tessdata. I mean the examples should be fixed.

4)

In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune

combine_tessdata -e ../tessdata/ara.traineddata \
  ~/tesstutorial/aratuned_from_ara/ara.lstm

This is explained as extracting the existing LSTM model for Arabic from tessdata, but how does that work?

Does the combine_tessdata command extract the LSTM model because the extension of the second parameter is .lstm?

Another question: why is the LSTM model bundled into the traineddata? I think the traineddata file contains both the legacy model and the LSTM model, and I wonder why they aren't separated. Even if the user only uses LSTM, are both models read? (Is that lightweight? Then it might be OK.)

ShreeDevi Kumar

May 28, 2017, 2:15:09 PM

to tesser...@googlegroups.com

Please see inline replies:

Yes, you are correct.
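
The fallback described in point 3 can be sketched roughly as follows. This is only an illustration of the behaviour confirmed above, not the actual tesstrain.sh code: the function name `resolve_training_text` is hypothetical, and the default path is assumed to be `$lang/$lang.training_text` under the langdata directory.

```shell
# Hypothetical helper (not part of Tesseract): choose the training text the way
# the thread describes tesstrain.sh doing it.
# Usage: resolve_training_text LANGDATA_DIR LANG [TRAINING_TEXT]
resolve_training_text() {
    langdata_dir=$1
    lang=$2
    training_text=${3:-}
    if [ -z "$training_text" ]; then
        # No --training_text given: fall back to <langdata>/<lang>/<lang>.training_text
        training_text="$langdata_dir/$lang/$lang.training_text"
    fi
    printf '%s\n' "$training_text"
}
```

For example, `resolve_training_text ../langdata eng` prints the default path for English, while a third argument is returned unchanged.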

> 2)

> In the above example, I can't see why it needs --tessdata_dir, because that seems irrelevant to generating training data.

Tesseract needs the eng and osd traineddata files during initialization. Their location can also be specified via TESSDATA_PREFIX.
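
A quick pre-flight check along these lines can save a failed run. This is a sketch only: `missing_traineddata` is a hypothetical helper, and it assumes the two files are named eng.traineddata and osd.traineddata and sit directly in the directory you pass as --tessdata_dir.

```shell
# Hypothetical pre-flight check (not part of Tesseract): print the name of each
# traineddata file mentioned above that is missing from a tessdata directory.
# Usage: missing_traineddata TESSDATA_DIR
missing_traineddata() {
    dir=$1
    for lang in eng osd; do
        if [ ! -f "$dir/$lang.traineddata" ]; then
            printf '%s\n' "$lang.traineddata"
        fi
    done
}
```

Running `missing_traineddata ../tessdata` prints nothing when both files are present.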

> 3)

> In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune

> It says the reader should lay out the projects like this:

> ./langdata
> ./langdata/eng
> ./langdata/ara
> ./tessdata
> ./tesseract
> ./tesseract/tessdata
> ./tesseract/tessdata/configs/
> ./tesseract/training
> etc

That will be the directory structure if you clone the tesseract, langdata and tessdata repositories.

Cloning the whole tessdata repo (over 1 GB) is not recommended; you can download just the traineddata files for the languages you need.

> and all the following examples are run from the tesseract directory. So I think the examples should pass ../tessdata as --tessdata_dir rather than ./tessdata. I mean the examples should be fixed.

./tessdata (in tesseract repo) does not have any traineddata files to begin with.

You can change the directories to match your directory configuration.

> 4)

> In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune

> combine_tessdata -e ../tessdata/ara.traineddata \
>   ~/tesstutorial/aratuned_from_ara/ara.lstm

> This is explained as extracting the existing LSTM model for Arabic from tessdata, but how does that work?
> Does the combine_tessdata command extract the LSTM model because the extension of the second parameter is .lstm?

Yes.

> Another question: why is the LSTM model bundled into the traineddata? I think the traineddata file contains both the legacy model and the LSTM model, and I wonder why they aren't separated. Even if the user only uses LSTM, are both models read? (Is that lightweight? Then it might be OK.)

The 4.0 code is in the alpha stage of testing and supports both the legacy engine and the new LSTM engine, so the traineddata file contains both models.

You can use combine_tessdata to keep only the LSTM model in the traineddata.
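
One possible way to do that is to unpack the file, delete the non-LSTM components, and recombine. The sketch below is an assumption-laden illustration rather than an official recipe: `is_lstm_component` and `make_lstm_only` are hypothetical helpers, LSTM components are assumed to be those whose suffix starts with "lstm", and whether the resulting file loads depends on the Tesseract version you run.

```shell
# Hypothetical helper: decide from its suffix whether an unpacked component
# belongs to the LSTM engine. Assumption: LSTM components are the ones whose
# suffix starts with "lstm" (lang.lstm, lang.lstm-word-dawg, ...).
is_lstm_component() {
    case "${1##*.}" in
        lstm*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Sketch of rebuilding an LSTM-only traineddata. Assumes combine_tessdata is on
# PATH and that WORKDIR is an empty scratch directory.
# Usage: make_lstm_only IN.traineddata LANG WORKDIR
make_lstm_only() {
    in=$1
    lang=$2
    work=$3
    # Unpack every component as WORKDIR/LANG.<suffix>
    combine_tessdata -u "$in" "$work/$lang." || return 1
    for f in "$work/$lang".*; do
        # Drop everything that is not an LSTM component
        is_lstm_component "$f" || rm -f "$f"
    done
    # Recombine what is left into WORKDIR/LANG.traineddata
    combine_tessdata "$work/$lang."
}
```

The same suffix matching is what makes `combine_tessdata -e ... ara.lstm` pull out the LSTM model: the component is selected by the extension of the output file name.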

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/2a55760b-371b-483d-b5e2-731110bc83a4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Akira Hayakawa

May 28, 2017, 11:49:35 PM

to tesseract-ocr

Thanks for the reply. I understand.

I have a couple of follow-up questions on this topic.

1)

Should training_text include only the text for the new round of training?

For example, if the LSTM net has already learned the line "I have a pen" and we want it to learn the line "I have a pineapple", should training_text contain only the pineapple line, with the pen line removed?

2)

In https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-tesstrain.sh

the files in langdata other than training_text are said to be optional.

I suppose these files are handled internally as hints. Am I right?

And what if these files are inconsistent with training_text? For example, the wordlist may contain fairly irrelevant words.

Should I delete the optional files if they are inconsistent?

3)

Closely related to 2).

When langdata doesn't have these optional files, does Tesseract generate them internally from training_text?

4)

Is there no way to fine-tune the legacy Tesseract engine?

5)

In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

There is a note:

NOTE Tesseract 4.00 will now run happily with a traineddata file that contains just lang.lstm. The lstm-*-dawgs are optional, and none of the other files are required or used with OEM_LSTM_ONLY as the OCR engine mode. No bigrams, unichar ambigs or any of the other files are needed or even have any effect if present.

Does this mean that if we use LSTM only (the legacy engine is going to be removed in a future release, right?), the optional files such as the wordlist are entirely unnecessary? That sounds natural to me because, as far as I understand, the LSTM net only learns a text line from a sequence of bytes or an image.

Btw, what does "dawgs" mean?

ShreeDevi Kumar

May 29, 2017, 1:14:54 AM

to tesser...@googlegroups.com, Ray Smith

Ray is the best person to answer your questions. I can only share my experience trying to train using Devanagari script.

Fine Tune will work if all you want to change is a font, with the same unicharset. This works well for Latin script based languages but not complex scripts.
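
For that simple case, a fine-tuning run might look like the sketch below. The wrapper `finetune` and the `DRY_RUN` variable are hypothetical conveniences; `--continue_from`, `--train_listfile`, `--model_output` and `--max_iterations` are lstmtraining options, but the required flag set differs between releases (newer ones also expect `--traineddata`), so check `lstmtraining --help` for your build. The default of 400 iterations is an arbitrary placeholder.

```shell
# Hypothetical wrapper: print the lstmtraining command when DRY_RUN is set,
# otherwise run it. All paths are supplied by the caller.
# Usage: finetune CONTINUE_FROM_LSTM TRAIN_LISTFILE MODEL_OUTPUT_PREFIX [MAX_ITERATIONS]
finetune() {
    # Rebuild the positional parameters as the full command line.
    set -- lstmtraining \
        --continue_from "$1" \
        --train_listfile "$2" \
        --model_output "$3" \
        --max_iterations "${4:-400}"
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

Setting `DRY_RUN=1` before calling `finetune` lets you inspect the command line before committing to a long training run.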

E.g., for Devanagari, consonants, vowel marks and combining marks together make up an 'akshara' glyph, and the unicharset in the language model contains these. If the new training text has additional akshara glyphs, fine-tune training fails with errors such as "Encoding of string failed!".
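
Before fine-tuning, it may help to check whether the new training text uses anything the existing unicharset lacks. The following is a rough sketch, not a Tesseract tool: `missing_unichars` is a hypothetical helper, it assumes the unicharset file has a count on its first line followed by one entry per line whose first space-separated field is the unichar, and it compares single characters only. For scripts whose entries are multi-codepoint clusters (such as aksharas) it gives only a coarse signal; it is more direct where one unichar is one character. It also needs a UTF-8 locale for non-ASCII text.

```shell
# Hypothetical pre-check (not part of Tesseract): list characters that occur in
# a training text but have no single-character entry in a unicharset file.
# Usage: missing_unichars UNICHARSET TRAINING_TEXT
missing_unichars() {
    known=$(mktemp) || return 1
    # Skip the count on line 1; the first field of each entry is the unichar.
    tail -n +2 "$1" | cut -d ' ' -f 1 | sort -u > "$known"
    # Split the text into single non-space characters, then drop the known ones.
    grep -o '[^[:space:]]' "$2" | sort -u | grep -v -x -F -f "$known"
    rm -f "$known"
}
```

Any character it prints is a candidate cause of an encoding failure during fine-tuning.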

For Devanagari, I have tried training by replacing the top layer, which adds the new akshara glyphs. However, for good accuracy, training has to continue until the error rate reaches 0.01%, which takes very long; I have not been able to reach that level in my training. Also, this may affect accuracy on the originally trained fonts. Currently, using --eval_listfile with a different set of images during training does not work.

The *-dawg files are DAWGs (directed acyclic word graphs), a way of compressing the wordlists. See https://tesseract-ocr.repairfaq.org/allaboutdawg.html

There is no way to finetune the legacy engine.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


ShreeDevi Kumar

May 29, 2017, 1:58:27 AM

to tesser...@googlegroups.com, Ray Smith

Also look at all three scripts used for training

https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh

https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain_utils.sh

https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain.sh

See also this comment in https://github.com/tesseract-ocr/tesseract/blob/8e79297dcefecdb929d753d28554fec51417ec39/ccutil/unicharcompress.cpp:

// Most simple scripts
// will encode a single index to a UTF8-string, but Chinese, Japanese, Korean
// and the Indic scripts will contain a many-to-many mapping.


Akira Hayakawa

May 29, 2017, 4:58:39 AM

to tesseract-ocr, thera...@gmail.com

> Fine Tune will work if all you want to change is a font, with the same unicharset. This works well for Latin script based languages but not complex scripts.

So you mean it is impossible for fine-tuning to learn a new word whose characters are new to the LSTM net?

If that's true, I am very disappointed, but I don't see what causes this limitation, because the LSTM net only learns the mapping between image and text.

Btw, if that's true, is there no way for an empty LSTM net to learn its first word (i.e. learning from scratch)?
