Shapeclustering Not Responding


xyq...@gmail.com

Jul 17, 2018, 5:16:26 AM
to tesseract-ocr
Hi all,

I'm trying to train Tesseract. I've gone through the first few steps:
1. getting the TIFs
2. creating the box files
3. correcting the box files
4. training (tesseract [language].[fontname].exp[samplenumber].tif [language].[fontname].exp[samplenumber] box.train)
5. creating the unicharset file
6. creating the font_properties file
So I now have these files: .tif, .box, .tr, font_properties, and unicharset. Every step before shapeclustering completed without errors.
But when I ran: shapeclustering -F font_properties -U unicharset -O [language].unicharset [language].[fontname].exp0.tr, the command prompt stopped responding: the command never finishes and prints no output.
Can anyone tell me why this happens and how to solve it? Thanks in advance.
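
When a training tool sits silently like this, it can help to run it under a time limit and capture whatever it prints, so a hang can be told apart from a slow run. The following is only a minimal sketch: `run_with_timeout` is a hypothetical helper name, not part of Tesseract, and the time limit is an arbitrary choice.

```python
import subprocess


def run_with_timeout(cmd, seconds):
    """Run cmd and return (finished, output).

    finished is False if the process was still running after `seconds`
    and had to be killed. stderr is merged into stdout so warnings are kept.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=seconds,
        )
    except subprocess.TimeoutExpired as exc:
        # exc.output holds whatever was captured before the kill (may be None).
        partial = exc.output or b""
        return False, partial.decode("utf-8", errors="replace")
    return True, proc.stdout.decode("utf-8", errors="replace")
```

It could be called with the command from the post, for example `run_with_timeout(["shapeclustering", "-F", "font_properties", "-U", "unicharset", "-O", "[language].unicharset", "[language].[fontname].exp0.tr"], 120)`, assuming `shapeclustering` is on the PATH and the placeholders are replaced with real names.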

yixinl...@gmail.com

Dec 29, 2018, 3:37:50 AM
to tesseract-ocr
I also encountered this problem. I tried tesseract 3.5 and tesseract 4.0, and the result is the same.

On Tuesday, July 17, 2018 at 5:16:26 PM UTC+8, xyq...@gmail.com wrote:

Zdenko Podobny

Dec 29, 2018, 4:02:12 AM
to tesser...@googlegroups.com
Please provide real information and data, not a "meta" description of your process.

Zdenko


On Sat, Dec 29, 2018 at 9:37, <yixinl...@gmail.com> wrote:

易鑫

Dec 29, 2018, 4:50:34 AM
to tesser...@googlegroups.com
I use tesseract-ocr-w64-setup-v4.0.0.20181030 and jTessBoxEditor-2.2.0 on Windows 10. I used 3 images for the test; you can find them in the attached file sample.zip.

1. Use jTessBoxEditor to merge the 3 images.
    The merged file is named "langyp.fontyp.exp0.tif".
2. Generate the box file:
     tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng --psm 7 --oem 3 batch.nochop makebox
     This generates the langyp.fontyp.exp0.box file.
3. Open jTessBoxEditor -> Box Editor -> open langyp.fontyp.exp0.tif -> correct the mistakes.
4. Generate font_properties:
    echo "fontyp 0 0 0 0 0" > font_properties
5. Generate the training file:
       tesseract langyp.fontyp.exp0.tif langyp.fontyp.exp0 -l eng --psm 7 --oem 3 nobatch box.train
      This generates the langyp.fontyp.exp0.tr file.
6. Generate the character set file:
       unicharset_extractor langyp.fontyp.exp0.box
      This generates the unicharset file.
7. Generate the shape file:
       shapeclustering -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
8. mftraining -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr
9. cntraining langyp.fontyp.exp0.tr
10. rename normproto fontyp.normproto
    rename inttemp fontyp.inttemp
    rename pffmtable fontyp.pffmtable
    rename unicharset fontyp.unicharset
    rename shapetable fontyp.shapetable
11. combine_tessdata fontyp.
12. This produces the fontyp.traineddata file.
But when I follow these steps, at step 7, after running "shapeclustering -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr",
the terminal shows no output, even after waiting more than 20 minutes.

If I skip step 7 and go to step 8, running "mftraining -F font_properties -U unicharset -O langyp.unicharset langyp.fontyp.exp0.tr" prints
only one warning: "No shape table file present: shapetable".

After that, the terminal shows no further output, even after waiting a long time.
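
Each line of font_properties has the form `<fontname> <italic> <bold> <fixed> <serif> <fraktur>`, with the five flags being 0 or 1, and the font name is expected to match the font field of the .tr file name. A rough Python sketch of a sanity check for the file written in step 4 might look like the following. The function names are hypothetical, and the sketch assumes font names contain no dots. It also flags double quotes, since cmd.exe's `echo "..."` writes the quotes into the file.

```python
import re


def font_from_tr_name(tr_name):
    """Extract the font field from a name shaped like '<lang>.<font>.exp<N>.tr'."""
    parts = tr_name.split(".")
    if len(parts) != 4 or parts[3] != "tr" or not re.fullmatch(r"exp\d+", parts[2]):
        raise ValueError(f"unexpected training file name: {tr_name}")
    return parts[1]


def check_font_properties(text, tr_names):
    """Return a list of problems found in font_properties text; empty if none."""
    problems = []
    declared = set()
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split()
        if len(fields) != 6:
            problems.append(f"line {number}: expected 6 fields, got {len(fields)}")
            continue
        if '"' in line:
            problems.append(f"line {number}: contains a double quote")
        if any(flag not in ("0", "1") for flag in fields[1:]):
            problems.append(f"line {number}: flags must be 0 or 1")
        declared.add(fields[0])
    for tr_name in tr_names:
        font = font_from_tr_name(tr_name)
        if font not in declared:
            problems.append(f"{tr_name}: font '{font}' not declared")
    return problems
```

For the files in this post, `check_font_properties(open("font_properties", encoding="utf-8").read(), ["langyp.fontyp.exp0.tr"])` would be the call; a decoding error at that point would itself hint at an encoding problem.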




sample.zip

易鑫

Jan 1, 2019, 11:06:11 PM
to tesser...@googlegroups.com
The issue has been resolved. The cause was the encoding of the "font_properties" file: it must be saved as UTF-8.
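
For anyone hitting the same hang, here is a minimal Python sketch of re-saving font_properties as UTF-8. The thread does not say which encoding the file originally had, so this assumes it either carries a byte-order mark (as UTF-16 files usually do) or is already UTF-8. Dropping the UTF-8 BOM is also an assumption rather than something confirmed here. The function names are hypothetical.

```python
import codecs

# Byte-order marks this sketch recognises, mapped to Python codec names.
BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def to_utf8_without_bom(raw):
    """Decode bytes using their BOM (default UTF-8) and re-encode as plain UTF-8."""
    encoding = "utf-8"
    for bom, name in BOMS:
        if raw.startswith(bom):
            encoding = name
            break
    return raw.decode(encoding).encode("utf-8")


def rewrite_as_utf8(path):
    """Rewrite the file at path as UTF-8; return True if its bytes changed."""
    with open(path, "rb") as handle:
        raw = handle.read()
    converted = to_utf8_without_bom(raw)
    if converted != raw:
        with open(path, "wb") as handle:
            handle.write(converted)
    return converted != raw
```

Calling `rewrite_as_utf8("font_properties")` before running shapeclustering would normalise the file in place. A file in some other legacy encoding without a BOM would raise a decode error or be misread, so this is not a general-purpose detector.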

On Sat, Dec 29, 2018 at 5:50 PM, 易鑫 <yixinl...@gmail.com> wrote: