How to train Tesseract with an ancient Greek character


易鑫

Apr 3, 2019, 12:37:29 AM
to tesseract-ocr
Hello everyone,

I want to recognize the content of a table image (see the attached file). It contains Chinese characters and some English letters. The most troublesome problem is that it also contains an ancient Greek character, "Φ".

I do not know how to train the model. I tried adding a Greek font, but it did not help. The first step already fails with an error.

This is my command:

src/training/tesstrain.sh --fonts_dir /usr/share/fonts --training_text ../training_data/chi_sim_tuned.txt \
  --langdata_dir ../langdata_lstm --tessdata_dir ./tessdata --lang chi_sim --linedata_only --noextract_font_properties --exposures "0" \
  --workspace_dir ~/share/workspace/tmp \
  --save_box_tiff \
  --fontlist "NSimSun" \
    "Times New Roman" \
    "Arial Unicode MS" \
    "SimSun" \
    "Noto Sans CJK SC" \
    "Noto Sans Mono CJK SC" \
    "GFS Artemisia" \
  --output_dir ~/tesstutorial/chi_sim_train \
  --overwrite

Can someone help me? Thanks in advance.




src_5.jpg

易鑫

Apr 3, 2019, 9:55:38 PM
to tesseract-ocr
Does anybody know how to solve this problem? Thanks.

易鑫 <yixinl...@gmail.com> wrote on Wed, Apr 3, 2019 at 12:37 PM:
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/840f88a5-05a5-48c3-8478-e5544bdee192%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Shree Devi Kumar

Apr 4, 2019, 10:05:30 AM
to tesser...@googlegroups.com
You don't need to add "GFS Artemisia", as it may not have the Chinese characters.

Just add the Greek character "Φ" to your training text.
I think all the fonts that you are using support it.
Verify in the generated tif files that it is getting rendered.
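
One way to act on the suggestion above is sketched below. It is only an illustration under stated assumptions, not part of the Tesseract tooling: it counts how often "Φ" occurs in the training text and appends it to some lines that lack it. The helper names, the `every=10` spacing, and the output file name `chi_sim_tuned_phi.txt` are all hypothetical; only the input path comes from the command earlier in the thread.

```python
from pathlib import Path

PHI = "\u03a6"  # GREEK CAPITAL LETTER PHI, the character from the question


def count_char(lines, ch=PHI):
    """Count how often `ch` occurs across all training-text lines."""
    return sum(line.count(ch) for line in lines)


def add_char_samples(lines, ch=PHI, every=10):
    """Append `ch` to every `every`-th non-empty line that lacks it."""
    out = []
    seen = 0  # number of non-empty lines visited so far
    for line in lines:
        if line.strip():
            seen += 1
            if seen % every == 0 and ch not in line:
                line = f"{line} {ch}"
        out.append(line)
    return out


if __name__ == "__main__":
    # Path taken from the tesstrain.sh command in the original post.
    src = Path("../training_data/chi_sim_tuned.txt")
    if src.exists():
        lines = src.read_text(encoding="utf-8").splitlines()
        print("occurrences before:", count_char(lines))
        updated = add_char_samples(lines)
        # Hypothetical output name, so the original file stays untouched.
        dst = src.with_name("chi_sim_tuned_phi.txt")
        dst.write_text("\n".join(updated) + "\n", encoding="utf-8")
        print("occurrences after:", count_char(updated))
```

The resulting file could then be passed to tesstrain.sh through --training_text in place of the original one. Whether the character is actually rendered still has to be checked in the generated tif files, since that depends on the fonts.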




--

____________________________________________________________
Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com

易鑫

Apr 7, 2019, 9:45:03 PM
to tesseract-ocr
Thanks a lot. I will try.

Shree Devi Kumar <shree...@gmail.com> wrote on Thu, Apr 4, 2019 at 10:05 PM:

易鑫

Apr 9, 2019, 2:44:38 AM
to tesseract-ocr
I have tried, but it still cannot recognize "Φ".

易鑫 <yixinl...@gmail.com> wrote on Mon, Apr 8, 2019 at 9:44 AM: