Train Tesseract to improve half-width Japanese (Katakana) recognition

Li Xianglei

Nov 8, 2017, 12:02:38 PM
to tesseract-ocr
Hi all,

      I'm trying to use Tesseract to recognize Japanese text in images.
      I found that accuracy is poor for half-width Japanese (Katakana).
      I'm trying to improve the accuracy by fine-tuning.
      I have tried both [Fine Tuning for ± a few characters] and [Training Just a Few Layers].
      Both seem to improve the accuracy on half-width Japanese, but they do a lot of harm to the recognition of normal Japanese.
      Here is how I do the fine-tuning:

   1 Add half-width Japanese to lang/jpn/jpn.training_text (cloned from tesseract-ocr/langdata, which seems to be training data for v3).

   2 Create the training data with tesstrain.sh.

   3 Extract the LSTM model from /usr/local/tesseract/share/tessdata/jpn.traineddata (which is best/jpn.traineddata):

     combine_tessdata -e /usr/local/tesseract/share/tessdata/jpn.traineddata trainhalfwidth/jpn.lstm

   4 Run the fine-tuning:

     lstmtraining --model_output trainhalfwidth/jpnhw \
       --continue_from trainhalfwidth/jpn.lstm \
       --traineddata trainhalfwidth/jpn/jpn.traineddata \
       --old_traineddata /usr/local/tesseract/share/tessdata/jpn.traineddata \
       --train_listfile trainhalfwidth/jpn.training_files.txt \
       --max_iterations 3600 &> trainhalfwidth/basetrain.log

  Any advice? Thank you.

   # It seems Ray is working on the training data for LSTM. Is there any news so far?
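
One way to see the trade-off described above is to score the same checkpoint on two separate evaluation lists, one with normal Japanese and one with half-width Katakana. The sketch below only assembles the lstmeval commands (using its --model, --traineddata and --eval_listfile options) and runs them when RUN=1 is set. The checkpoint name assumes lstmtraining's usual <model_output>_checkpoint naming, and the two eval/*.training_files.txt lists are hypothetical files that would have to be generated separately; neither comes from the original post.

```shell
#!/bin/sh
# Assumed paths: the checkpoint written by step 4 and the traineddata used there.
MODEL=trainhalfwidth/jpnhw_checkpoint
TRAINEDDATA=trainhalfwidth/jpn/jpn.traineddata

# Print the lstmeval command line for one evaluation list file ($1).
eval_cmd() {
  printf 'lstmeval --model %s --traineddata %s --eval_listfile %s\n' \
    "$MODEL" "$TRAINEDDATA" "$1"
}

# Hypothetical eval lists: one for normal Japanese, one for half-width Katakana.
for list in eval/normal.training_files.txt eval/halfwidth.training_files.txt; do
  cmd=$(eval_cmd "$list")
  echo "$cmd"
  # Dry run by default; execute only when RUN=1 and lstmeval is installed.
  if [ "${RUN:-0}" = 1 ] && command -v lstmeval >/dev/null 2>&1; then
    $cmd
  fi
done
```

Comparing the two error rates before and after fine-tuning would show whether a gain on half-width text is being paid for with a loss on normal text.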

ShreeDevi Kumar

Nov 8, 2017, 12:21:45 PM
to tesser...@googlegroups.com
Does your training text include both half-width and normal Japanese?

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscribe@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/604e4981-9ca4-48be-980d-999df93f73ed%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Li Xianglei

Nov 8, 2017, 7:35:50 PM
to tesseract-ocr
Yes, I added half-width characters to the provided jpn.training_text and used the result as the new jpn.training_text.


Li Xianglei

Nov 9, 2017, 9:29:21 PM
to tesseract-ocr
Recently I modified tesstrain_utils.sh and the --max_pages=3 option for the text2image command.
Normal Japanese now seems to work happily, but the half-width characters are still recognized with poor accuracy.
Now I wonder how many characters I should add to jpn.training_text.
The wiki section [Fine Tuning for ± a few characters] says the ± should be repeated 20 times, but I tried about 20 repeats for every half-width character and it seemed to be of no use.
When I raised the count to 30 repeats, the results seemed to get better, but still not good enough.
Then I tried 150 repeats and the results got worse.
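
Before changing the repeat count again, it may help to confirm how often each half-width character really occurs in the training text. The following is a minimal sketch and not part of the original workflow: the HALFWIDTH list is only a small sample to extend, and the default file path is an assumption.

```shell
#!/bin/sh
# Sample of half-width Katakana to check; extend this list as needed.
HALFWIDTH="ｱ ｲ ｳ ｴ ｵ ｶ ｷ ｸ ｹ ｺ"

# Training text to inspect (assumed default path; pass another as $1).
TEXT="${1:-langdata/jpn/jpn.training_text}"

# Print "<char> <count>" for every character in HALFWIDTH, counting
# each occurrence in the file given as $1.
count_halfwidth() {
  for ch in $HALFWIDTH; do
    # grep -o prints one line per match; -F treats the character literally.
    n=$(grep -o -F -- "$ch" "$1" | wc -l | tr -d ' ')
    echo "$ch $n"
  done
}

if [ -f "$TEXT" ]; then
  count_halfwidth "$TEXT"
fi
```

Characters that report a count far below the intended number of repeats are the first thing to check when a given repeat level appears to have no effect.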


Li Xianglei

Nov 9, 2017, 9:36:38 PM
to tesseract-ocr
Correction to my previous message: I meant that I modified tesstrain_utils.sh and removed the --max_pages=3 option for the text2image command.

