Error while creating training data for Japanese


atuyosi....@gmail.com

Apr 3, 2017, 10:08:53 AM
to tesseract-ocr
Hi, 

I'm trying to create training data for Japanese (jpn.traineddata).

I ran 'tesstrain.sh' with the '--linedata_only' option, and the script finished with return code 0.
However, the log file contains some error messages (repeated 22 times).

```
$ ../tesseract-ocr/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang jpn --linedata_only   --noextract_font_properties --langdata_dir ../langdata   --tessdata_dir /usr/local/share --output_dir ~/work/jpntrain
```


---
[Sun Apr 2 07:42:30 UTC 2017] /usr/local/bin/tesseract /tmp/tmp.pwcwGMb5hs/jpn/jpn.IPAPMincho.exp0.tif /tmp/tmp.pwcwGMb5hs/jpn/jpn.IPAPMincho.exp0 lstm.train ../langdata/jpn/jpn.config
[Sun Apr 2 07:42:30 UTC 2017] /usr/local/bin/tesseract /tmp/tmp.pwcwGMb5hs/jpn/jpn.IPAGothic.exp0.tif /tmp/tmp.pwcwGMb5hs/jpn/jpn.IPAGothic.exp0 lstm.train ../langdata/jpn/jpn.config
Error opening data file /usr/local/share/tessdata/jpn_vert.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
Failed loading language 'jpn_vert'
---
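
Because the script still exits with 0 when these messages appear, counting them in the log can be useful. This is only a minimal sketch: the helper name `count_load_failures` and the log path in the usage comment are illustrative assumptions, not part of tesstrain.sh.

```shell
# Hypothetical helper: count "Failed loading language" messages per language.
# Prints one "<count> <language>" line for each language that failed to load.
count_load_failures() {
    log_file=$1
    # Split each line on single quotes so that field 2 is the language name.
    awk -F"'" '/^Failed loading language/ { n[$2]++ }
               END { for (lang in n) print n[lang], lang }' "$log_file"
}

# Example (the log path is an assumption):
# count_load_failures ~/work/jpntrain/tesstrain.log
```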

It seems that 'tesstrain.sh' requires 'jpn_vert.traineddata', but this file is not provided in the tessdata repository.

How can I get this file? Or can I substitute 'jpn.traineddata' for 'jpn_vert.traineddata'?


I found a 'jpn_vert' directory in the langdata repository, but it contains only some config files.


Thanks.

ShreeDevi Kumar

Apr 3, 2017, 10:37:59 AM
to tesser...@googlegroups.com
jpn.config in langdata/jpn loads jpn_vert as a sublanguage:

tessedit_load_sublangs jpn_vert

You can try training without that line.
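
One way to do that is to delete the line with sed while keeping a backup of the original config. This is a minimal sketch: the helper name `remove_sublangs` is hypothetical, and the path in the usage comment is the langdata location from the command earlier in this thread.

```shell
# Hypothetical helper: delete the tessedit_load_sublangs line from a
# Tesseract config file. The original is kept next to it as <file>.bak.
remove_sublangs() {
    config_file=$1
    sed -i.bak '/^tessedit_load_sublangs[[:space:]]/d' "$config_file"
}

# Example:
# remove_sublangs ../langdata/jpn/jpn.config
```

The backup makes it easy to restore the config later if jpn_vert.traineddata becomes available.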

Also look at the settings for jpn in training/language_specific.sh.

You may also need to change the following:


# The following fonts will be rendered vertically in phase I.
VERTICAL_FONTS=( \
    "TakaoExGothic" \ # for jpn
    "TakaoExMincho" \ # for jpn
    "AR PL UKai Patched" \ # for chi_tra
    "AR PL UMing Patched Light" \ # for chi_tra
    "Baekmuk Batang Patched" \ # for kor
    )
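
Before editing this list, it may help to check which of these fonts are actually installed. This is a rough sketch: the helper name `check_fonts` is hypothetical, it does a simple substring match on family names, and the usage comment assumes fontconfig's `fc-list` is available on the training machine.

```shell
# Hypothetical helper: report which of the given font names appear in a list
# of installed font families read from standard input (one entry per line).
check_fonts() {
    installed=$(cat)
    for font in "$@"; do
        # Case-insensitive fixed-string match against the family list.
        if printf '%s\n' "$installed" | grep -qiF -- "$font"; then
            echo "installed: $font"
        else
            echo "missing: $font"
        fi
    done
}

# Example, assuming fc-list is available:
# fc-list : family | check_fonts "TakaoExGothic" "TakaoExMincho"
```

Any font reported as missing would have to be installed or replaced in VERTICAL_FONTS before vertical rendering can use it.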


ShreeDevi


atuyosi....@gmail.com

Apr 4, 2017, 12:58:21 AM
to tesseract-ocr
Hi ShreeDevi,

Thank you for the quick response. 

I modified langdata/jpn/jpn.config and training/language_specific.sh as you suggested. 

Everything went well, and no errors occurred.

Thanks.

