I have a question about making a traineddata (tesseract 4.0 LSTM)

이경준

Feb 28, 2018, 11:02:00 PM

to tesseract-ocr

I have a question about making a traineddata file (Tesseract 4.0 LSTM).

Tutorial Guide to lstmtraining

Creating Starter Traineddata

NOTE: This is a new step!

Instead of a unicharset and script_dir, lstmtraining now takes a traineddata file on its command-line, to obtain all the information it needs on the language to be learned. The traineddata must contain at least an lstm-unicharset and lstm-recoder component, and may also contain the three dawg files: lstm-punc-dawg, lstm-word-dawg, lstm-number-dawg. A config file is also optional. The other components, if present, will be ignored and unused.

There is no tool to create the lstm-recoder directly. Instead there is a new tool, combine_lang_model, which takes as input an input_unicharset and script_dir (script_dir points to the langdata directory) and optional word list files. It creates the lstm-recoder from the input_unicharset and creates all the dawgs, if wordlists are provided, putting everything together into a traineddata file.
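For illustration, the step quoted above might be sketched as follows. This is only a minimal sketch that assembles a combine_lang_model command line and prints it; it does not run the tool. The helper name `build_combine_lang_model_cmd` and every path and language code shown are hypothetical, and the flag names are as I understand them from the tool's usage text, so they should be checked against `combine_lang_model --help` on the installed version.

```python
from typing import List, Optional


def build_combine_lang_model_cmd(
    input_unicharset: str,
    script_dir: str,
    output_dir: str,
    lang: str,
    words: Optional[str] = None,
    numbers: Optional[str] = None,
    puncs: Optional[str] = None,
) -> List[str]:
    """Assemble (but do not run) a combine_lang_model command line.

    Hypothetical helper: it only builds the argument list described in the
    quoted passage (input_unicharset, script_dir, optional word lists).
    """
    cmd = [
        "combine_lang_model",
        "--input_unicharset", input_unicharset,  # unicharset to build from
        "--script_dir", script_dir,              # points to the langdata directory
        "--output_dir", output_dir,              # where the traineddata is written
        "--lang", lang,
    ]
    # Word lists are optional; per the passage, dawgs are only built
    # when the corresponding lists are provided.
    optional_lists = (("--words", words), ("--numbers", numbers), ("--puncs", puncs))
    for flag, path in optional_lists:
        if path is not None:
            cmd += [flag, path]
    return cmd


# Example with placeholder paths (assumptions, not real files).
example_cmd = build_combine_lang_model_cmd(
    input_unicharset="train_out/eng.unicharset",
    script_dir="langdata",
    output_dir="train_out",
    lang="eng",
    words="langdata/eng/eng.wordlist",
)
print(" ".join(example_cmd))
```

The resulting list could be passed to `subprocess.run` once the paths point at real files.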

In the passage above, I could not find how to create the 'lstm-unicharset', so I have no idea how to proceed.

I also have a question about this page: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

NOTE Tesseract 4.00 will now run happily with a traineddata file that contains just lang.lstm, lang.lstm-unicharset and lang.lstm-recoder. The lstm-*-dawgs are optional, and none of the other components are required or used with OEM_LSTM_ONLY as the OCR engine mode. No bigrams, unichar ambigs or any of the other components are needed or even have any effect if present. The only other component that does anything is the lang.config, which can affect layout analysis, and sub-languages.

If added to an existing Tesseract traineddata file, the lstm-unicharset doesn't have to match the Tesseract unicharset, but the same unicharset must be used to train the LSTM and build the lstm-*-dawgs files.

According to the note at the end of this wiki page, the mandatory components of a traineddata file are 'lang.lstm', 'lang.lstm-unicharset' and 'lang.lstm-recoder'.

But the earlier 'Creating Starter Traineddata' passage says the mandatory components are only 'lstm-unicharset' and 'lstm-recoder'.

Which statement is right?

Please help me.