Hi, tesseract-ocr group.
I have a question about the subject.
When I run OCR on Japanese text using best/jpn.traineddata, addresses and bank names are split into several words, as in the following examples.
・Ex1
- Document Text : 東京都渋谷区桜丘町
- Word output : 東京, 都, 渋谷, 区, 桜丘, 町
・Ex2
- Document Text : 三菱東京UFJ銀行
- Word output : 三菱, 東京, UFJ, 銀行
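For reference, word lists like the ones above can be read from Tesseract's TSV output, where rows with level 5 are words and the last column holds the text. A minimal sketch, assuming the standard 12-column TSV layout; tsv_words is a hypothetical helper name and input.png is a placeholder:

```shell
# Hypothetical helper: print one recognized word per line from Tesseract TSV on stdin.
# Column 1 is the level (5 = word) and column 12 is the recognized text.
tsv_words() {
  awk -F '\t' 'NR > 1 && $1 == 5 && $12 != "" { print $12 }'
}

# Usage (input.png is a placeholder file name):
#   tesseract input.png stdout -l jpn tsv | tsv_words
```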
I want each of these strings to be output as a single word instead of being split as in the examples above.
To that end I have tried fine-tuning, but only the recognized characters change; the word breaks do not improve.
Is there a way to improve this situation?
The setup and commands I have tried so far are described below.
・ Environment
- Tesseract version: 4.1.1
- OS: Ubuntu 18.04
- jpn.wordlist: list of the character strings I want recognized as one word
(e.g. 東京都渋谷区桜丘町, 三菱東京UFJ銀行, etc.)
- jpn.training_text: document randomly generated from jpn.wordlist
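As an illustration of the jpn.training_text step, here is a minimal sketch of generating such a file by sampling entries from jpn.wordlist at random. The helper name make_training_text, the line count, the entries-per-line count, and the single-space separator are all assumptions for illustration, not the exact script used; shuf -r requires GNU coreutils.

```shell
# Hypothetical helper: build a training text from a word list.
# $1 = word list file (one entry per line)
# $2 = number of lines to generate
# $3 = number of entries per line
make_training_text() {
  # Sample entries at random, with repetition, then join every $3 entries
  # into one line separated by single spaces (separator is an assumption).
  shuf -r -n "$(($2 * $3))" "$1" |
    awk -v n="$3" '{ printf "%s%s", $0, (NR % n == 0 ? "\n" : " ") }'
}

# Usage (the counts are illustrative):
#   make_training_text ./langdata/jpn/jpn.wordlist 1000 5 > ./langdata/jpn/jpn.training_text
```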
・ Command
# create training data
tesstrain.sh \
--fonts_dir /usr/share/fonts \
--lang jpn \
--linedata_only \
--noextract_font_properties \
--langdata_dir ./langdata \
--tessdata_dir /usr/local/share/tessdata \
--output_dir ./output/jpn \
--training_text ./langdata/jpn/jpn.training_text
# train
lstmtraining --model_output ./model_output/ \
--traineddata /usr/local/share/tessdata/best/jpn.traineddata \
--old_traineddata /usr/local/share/tessdata/best/jpn.traineddata \
--continue_from ./output/jpn/jpn.lstm \
--train_listfile ./output/jpn/jpn.training_files.txt \
--max_iterations 200 \
--debug_interval -1 \
--append_index 5 --net_spec '[Lfx512 O1c1]' \
--learning_rate 20e-4 &> ./output/jpn/train.log
# convert the checkpoint to traineddata
lstmtraining --stop_training \
--continue_from ./model_output/_checkpoint \
--traineddata /usr/local/share/tessdata/best/jpn/jpn.traineddata \
--model_output ./model_output/jpn.traineddata
Please let me know if I have left out any information.
Thank you.