How can Japanese (jpn) word segmentation be improved with fine-tuning?

Yudai Sano

Dec 1, 2021, 2:07:44 AM
to tesseract-ocr
Hi, tesseract-ocr group.

I have a question about the subject.

When I run OCR on Japanese text with best/jpn.traineddata, address and bank-name strings are split into several words, as in the following examples.

・Ex1
 - Document Text : 東京都渋谷区桜丘町
 - Word output : 東京, 都, 渋谷, 区, 桜丘, 町
・Ex2
 - Document Text : 三菱東京UFJ銀行
 - Word output : 三菱, 東京, UFJ, 銀行
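One way to reproduce a word listing like the ones above is Tesseract's TSV output, where word rows have level 5 and the recognized text is in column 12. The following is only a minimal sketch under that assumption; the helper name `list_words` and the image name `sample.png` are hypothetical.

```shell
#!/usr/bin/env bash
# list_words: read Tesseract TSV from stdin and print the words of each
# text line, separated by ", ". Only word rows (level 5) are used; the
# header row and block/paragraph/line rows are skipped.
list_words() {
    awk -F '\t' '
        $1 == 5 {
            key = $2 "/" $3 "/" $4 "/" $5   # page/block/paragraph/line
            if (!(key in words)) {
                order[++n] = key            # remember first-seen order
                words[key] = $12
            } else {
                words[key] = words[key] ", " $12
            }
        }
        END { for (i = 1; i <= n; i++) print words[order[i]] }
    '
}

# Usage (hypothetical image name):
#   tesseract sample.png stdout -l jpn tsv | list_words
```

Each printed line corresponds to one recognized text line, so a string that should be one word but shows up with commas has been split.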

I would like each of these strings to be output as a single word.
To that end I am fine-tuning the model, but fine-tuning only changes the recognized characters; the word breaks do not improve.

Is there a way to improve this situation?

The approach I have tried so far is described below.

・ Environment
- Tesseract version: 4.1.1
- OS: Ubuntu 18.04
- jpn.wordlist: a list of the strings I want recognized as single words
  (e.g. 東京都渋谷区桜丘町, 三菱東京UFJ銀行, etc.)
- jpn.training_text: a document generated at random from jpn.wordlist
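A training text of this kind could be produced along the following lines. This is a minimal sketch, not the exact script used: it assumes one entry per line in jpn.wordlist and GNU coreutils `shuf`, and the function name `make_training_text`, the line counts, and the single-space separator between entries are illustrative assumptions.

```shell
#!/usr/bin/env bash
# make_training_text WORDLIST LINES WORDS_PER_LINE
# Prints LINES lines, each made of WORDS_PER_LINE entries drawn at random
# (with repetition) from WORDLIST and separated by single spaces.
make_training_text() {
    # $1 = word list file, $2 = number of lines, $3 = entries per line
    shuf -r -n "$(( $2 * $3 ))" "$1" |
        awk -v n="$3" '{ printf "%s%s", $0, (NR % n ? " " : "\n") }'
}

# Usage (paths as in this post; counts are arbitrary):
#   make_training_text ./langdata/jpn/jpn.wordlist 1000 5 \
#       > ./langdata/jpn/jpn.training_text
```

Where the spaces fall in the generated text is worth checking, since they are the only word boundaries the rendered training lines contain.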

・ Command
# create training data
tesstrain.sh \
    --fonts_dir /usr/share/fonts \
    --lang jpn \
    --linedata_only \
    --noextract_font_properties \
    --langdata_dir ./langdata \
    --tessdata_dir /usr/local/share/tessdata \
    --output_dir ./output/jpn \
    --training_text ./langdata/jpn/jpn.training_text

# train
lstmtraining --model_output ./model_output/ \
    --traineddata /usr/local/share/tessdata/best/jpn.traineddata \
    --old_traineddata /usr/local/share/tessdata/best/jpn.traineddata \
    --continue_from ./output/jpn/jpn.lstm \
    --train_listfile ./output/jpn/jpn.training_files.txt \
    --max_iterations 200 \
    --debug_interval -1 \
    --append_index 5 --net_spec '[Lfx512 O1c1]' \
    --learning_rate 20e-4 &> ./output/jpn/train.log

# convert the checkpoint to a traineddata file
lstmtraining --stop_training \
    --continue_from ./model_output/_checkpoint \
    --traineddata /usr/local/share/tessdata/best/jpn/jpn.traineddata \
    --model_output ./model_output/jpn.traineddata

Please let me know if any information is missing.

Thank you.