text2image fails: does it not support fonts with Chinese names?


bruce

Nov 6, 2018, 4:56:24 AM
to tesseract-ocr
I used the following command to find the fonts I can use to train my language:
text2image.exe --text=chi_sim.txt --outputbase=chi_sim.庞中华行书.exp0 --fonts_dir=C:\Windows\Fonts --find_fonts
and I got the following result:
  Font MStiffHeiPRC failed with 414359 hits = 100.00%
  Font MStiffHeiPRC failed with 414359 hits = 100.00%
  Font MStiffHeiPRC failed with 414359 hits = 100.00%
  Font MStiffHeiPRC failed with 414359 hits = 100.00%
  Font MStream PRC failed with 414359 hits = 100.00%
  Font MSung PRC failed with 414359 hits = 100.00%
  Font MSung PRC failed with 414359 hits = 100.00%
  庞中华行书 Light : 414361 hits = 100.00%, raw = 3440 = 100.00%
  Font 剑客毛笔行书 failed with 414357 hits = 100.00%
  Font 可可漫雪体 failed with 414360 hits = 100.00%
  Font 多米手写体 failed with 414253 hits = 99.97%
  Font 字体中国-锐博体V1 failed with 414359 hits = 100.00%
  Font 孙运和酷楷 failed with 414359 hits = 100.00%
  Font 建刚静心楷 failed with 414359 hits = 100.00%
  Font 张维镜手写楷书 Medium failed with 410014 hits = 98.95%
  Font 徐金如硬笔行楷X failed with 413042 hits = 99.68%
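
For readers who want to pick the usable fonts out of a long --find_fonts listing, here is a minimal Python sketch. It assumes the two line shapes seen in the output above ("Font NAME failed with ..." and "NAME : N hits = ...") and nothing more; the function name usable_fonts and the regular expressions are illustrative and not part of Tesseract.

```python
import re
from typing import List

# Line shapes are inferred from the output quoted in this thread only.
FAILED = re.compile(r"^Font (?P<name>.+?) failed with (?P<hits>\d+) hits = (?P<pct>[\d.]+)%")
PASSED = re.compile(r"^(?P<name>.+?) : (?P<hits>\d+) hits = (?P<pct>[\d.]+)%")


def usable_fonts(find_fonts_output: str) -> List[str]:
    """Return the font names from lines that are not marked as failed."""
    names = []
    for line in find_fonts_output.splitlines():
        line = line.strip()
        if FAILED.match(line):
            continue  # reported as failed, skip it
        match = PASSED.match(line)
        if match:
            names.append(match.group("name"))
    return names


sample = """
    Font MSung PRC failed with 414359 hits = 100.00%
    庞中华行书 Light : 414361 hits = 100.00%, raw = 3440 = 100.00%
    Font 剑客毛笔行书 failed with 414357 hits = 100.00%
"""
print(usable_fonts(sample))
```

Note that the only name reported without "failed" includes the style suffix "Light". Whether passing the name with or without that suffix matters to --font is not established in this thread.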



Then I ran this command:
text2image.exe --text=chi_sim.txt --outputbase=chi_sim.庞中华行书.exp0 --ptsize 36 --font "庞中华行书" --fonts_dir C:\Windows\Fonts
and got the following error:
  Could not find font named '庞中华行书'.
  Pango suggested font 'MingLiU'.
  Please correct --font arg.

Does text2image not support fonts with Chinese names? How can I use these fonts?

Zdenko Podobny

Nov 6, 2018, 5:11:00 AM
to tesser...@googlegroups.com
Hello,

Please see the bug report and suggested solution:

I guess the problem is in pango, but we would like to test it. Are you able to create a simple test case for this issue (provide a small chi_sim.txt and share the font, if possible)?

Zdenko



bruce

Nov 6, 2018, 8:55:57 PM
to tesseract-ocr
Hi zdenop, thank you for your reply.
My environment is:
  Windows 7 Professional 64-bit
  Tesseract version: https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v4.0.0.20181030.exe

https://drive.google.com/open?id=15C-v4ped8ssFGXW0pSKw6CMSQgW2s0WV

I tried all of the fonts with Chinese names, and they all gave the same error message. The link above contains just two of these fonts for you to test with.
I guess the --font parameter doesn't support Chinese characters?


Zdenko Podobny

Nov 8, 2018, 4:03:00 PM
to tesser...@googlegroups.com
What is the output of the command "chcp" (on the command line)?
 
Zdenko



bruce

Nov 9, 2018, 1:33:19 AM
to tesseract-ocr
Hi Zdenko,
I tried the command under two cmd window encodings (chcp 65001 and chcp 936) and got the same failure both times.
The results are attached:
chcp936.png
chcp65001.png


Zdenko Podobny

Nov 9, 2018, 3:44:41 AM
to tesser...@googlegroups.com
I wanted to know what the original output of chcp is ;-)

I think there are (at least) 2 issues:
  1. a console encoding problem (Windows only; on Linux it is correct)
  2. a font-related issue (at the moment I am not sure whether it is the font itself, pango, or text2image)
Regarding 1.:
When I run:
 text2image.exe --fonts_dir=i1252 --fontconfig_tmpdir=%temp% --list_available_fonts
I get this output:
  0: ĺ­tčż?ĺ'ŚéćĄ
  1: 庞中华行äą| Light

When I set chcp 65001, the result is still wrong:
  0: ĺ­™čż ĺ’Śé…·ćĄ·
  1: 庞中华行书 Light
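
The garbled names above look like UTF-8 bytes displayed through a single-byte code page. The following Python sketch reproduces that effect under the assumption that the code page involved is Windows-1250; the thread does not state which code page was active, and the helper name as_seen_in_console is hypothetical.

```python
def as_seen_in_console(text: str, console_encoding: str = "cp1250") -> str:
    """Encode text as UTF-8, then decode the bytes as a single-byte code page.

    cp1250 is an assumption; bytes with no mapping in that code page are
    shown as the Unicode replacement character.
    """
    return text.encode("utf-8").decode(console_encoding, errors="replace")


# The first character 孙 is the UTF-8 bytes E5 AD 99, which cp1250 renders
# as "ĺ", a soft hyphen, and "™".
print(as_seen_in_console("孙运和酷楷"))
```

This only illustrates how such mojibake can arise; it does not show where in text2image or the Windows console the misinterpretation happens.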

When the output is redirected to a file (text2image.exe --fonts_dir=i1252 --fontconfig_tmpdir=%temp% --list_available_fonts >font_list.txt), the font names are correct:
  0: 孙运和酷楷
  1: 庞中华行书 Light
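
Since the redirected file holds the correct names, it can be read back as UTF-8 without going through the console. A minimal Python sketch, assuming the "index: name" line shape shown above and the font_list.txt file name from the redirect; parse_font_list and read_font_list are illustrative helpers, not part of Tesseract.

```python
import re
from typing import List

# One entry per line, e.g. "  0: 孙运和酷楷" (shape taken from the listing above).
ENTRY = re.compile(r"^\s*\d+:\s*(?P<name>.+?)\s*$")


def parse_font_list(text: str) -> List[str]:
    """Extract font names from --list_available_fonts output."""
    names = []
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            names.append(match.group("name"))
    return names


def read_font_list(path: str = "font_list.txt") -> List[str]:
    """Read the redirected output, assuming it was written as UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return parse_font_list(handle.read())


print(parse_font_list("  0: 孙运和酷楷\n  1: 庞中华行书 Light\n"))
```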

When I pass the "wrong console output" as the font name, text2image is able to find and use the font:
 text2image.exe --fonts_dir=i1252  --fontconfig_tmpdir=%temp% --text i1252/chi_sim_test.txt --outputbase=chi_sim.test.exp0 --font="ĺ­™čż ĺ’Śé…·ćĄ·"
but it crashes the same way as on Linux (issue 2), as described in issue 1252:
ERROR: Illegal UTF8 encountered.
Index 0 char = 0xffffffa2
Index 1 char = 0xffffffd2
Index 2 char = 0xffffffd4
Index 3 char = 0xd
Index 4 char = 0xa
WARNING: Illegal UTF8 encountered

** (text2image.exe:22496): WARNING **: 09:33:51.804: Invalid UTF-8 string passed to pango_layout_set_text()
**
ERROR:c:\users\zdeno\.cppan\storage\src\81\8f\8aa5\pango\pango-glyph-item.c:319:pango_glyph_item_iter_next_cluster: assertion failed: (iter->start_char < iter->end_char

So one thing is to fix the Windows issue so that input from and output to the console are handled correctly (BTW, is it UTF-8 or UTF-16?), but that will not solve the issue that these fonts are still not usable in text2image.
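
The "Illegal UTF8 encountered" message above lists the bytes 0xa2 0xd2 0xd4 0x0d 0x0a, which are not a valid UTF-8 sequence. A quick way to check a training text before passing it to --text is sketched below in Python. The re-encoding helper assumes the file was saved as GBK (code page 936); that is a guess and is not confirmed anywhere in this thread. Both function names are hypothetical.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as error:
        return error.start
    return None


def reencode_to_utf8(data: bytes, source_encoding: str = "gbk") -> bytes:
    """Re-encode bytes to UTF-8; the GBK default is only an assumption."""
    return data.decode(source_encoding).encode("utf-8")


# The five bytes printed in the error message above.
reported = bytes([0xA2, 0xD2, 0xD4, 0x0D, 0x0A])
print(first_invalid_utf8_offset(reported))
```

If the check returns an offset, re-saving the text file as UTF-8 in an editor is an alternative to converting it in code. This addresses only the input text encoding, not the console handling of font names.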

 Zdenko



bruce

Nov 13, 2018, 1:29:45 AM
to tesseract-ocr
Hi zdenop,
The original output of chcp on my machine is "936".
As you said, I think it is a console encoding problem, but I don't know how to solve it.
In the end I worked around it another way: I used a program called "fontcreator" to change the fonts' names to English ones.
