training tesseract


namv...@gmail.com

Nov 13, 2016, 12:41:30 PM
to tesseract-ocr
Hi,
Sorry for my English; I hope you can help me.
I've trained Tesseract for Persian by running the following commands (from https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract):

training/text2image --text=training_text.txt --outputbase=per.Arial.exp0 --font='Arial' --fonts_dir=/home/bita/TrainingPersian

tesseract per.Arial.exp0.tif per.Arial.exp0 box.train

unicharset_extractor per.Arial.exp0.box

set_unicharset_properties per.Arial.exp0.box
(following this issue: https://github.com/tesseract-ocr/tesseract/issues/318, I put Arabic.unicharset and Arabic.xheights in the script_dir path)
set_unicharset_properties -U unicharset -O new_unicharset -X xheights --script_dir=/home/bita/langdata

mv unicharset unicharset_Old

mv new_unicharset unicharset

shapeclustering -F font_properties -U unicharset per.Arial.exp0.tr

mftraining -F font_properties -U unicharset -X xheights -O per.unicharset per.Arial.exp0.tr

cntraining per.Arial.exp0.tr

wordlist2dawg frequent_words_list per.freq-dawg per.unicharset

wordlist2dawg words_list per.word-dawg per.unicharset

mv shapetable per.shapetable

mv normproto per.normproto

mv inttemp per.inttemp

mv pffmtable per.pffmtable

combine_tessdata per.
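
Before the combine_tessdata step, it may help to confirm that every renamed component is actually present. The following is only a minimal sketch: the suffix list is taken from the commands above and is not claimed to be the full set that combine_tessdata accepts, and the helper name `missing_components` is hypothetical.

```python
from pathlib import Path

# Component suffixes taken from the commands in this post (an assumption:
# this is not necessarily everything combine_tessdata can consume).
EXPECTED_SUFFIXES = [
    "unicharset", "shapetable", "normproto", "inttemp", "pffmtable",
    "freq-dawg", "word-dawg",
]


def missing_components(directory, lang="per"):
    """Return the expected '<lang>.<suffix>' file names absent from directory."""
    base = Path(directory)
    return [
        f"{lang}.{suffix}"
        for suffix in EXPECTED_SUFFIXES
        if not (base / f"{lang}.{suffix}").is_file()
    ]


if __name__ == "__main__":
    # Check the current working directory, where the training files were written.
    missing = missing_components(".")
    print("missing:", ", ".join(missing) if missing else "none")
```

If the script reports a missing file, the corresponding training step or mv command probably did not run as expected.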


To test the result, I took a screenshot of one part of my training text and increased its resolution to 300 dpi in GIMP (I tried to make an image without noise), but the accuracy is not good at all.

How can I increase the accuracy?
Which font size should I choose when I take the screenshot?
The structure of the Persian language is very different from English. For example, the shape of a character changes depending on where it is located in the word (first, middle, last), but in the unicharset only the main character is recognized for all of these forms.
Also, the characters are connected within words (something like handwriting in English).
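
To illustrate the point about positional shapes: in Unicode, Persian/Arabic text is normally stored as one base code point per letter, and the initial, medial and final shapes are chosen by the font when the text is rendered. That would be consistent with the unicharset listing only the main character, although I am assuming here that the training text contains base letters rather than presentation forms. A minimal sketch using only Python's standard `unicodedata` module (the helper name `base_letters` is hypothetical):

```python
import unicodedata

# The four Unicode presentation forms of the letter BEH:
# isolated, final, initial, medial.
BEH_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]


def base_letters(text):
    """Fold Arabic presentation forms to their base letters via NFKC."""
    return unicodedata.normalize("NFKC", text)


for form in BEH_FORMS:
    base = base_letters(form)
    # Print each positional form next to the base letter it folds to.
    print(f"U+{ord(form):04X} {unicodedata.name(form)} -> U+{ord(base):04X}")
```

All four forms fold to the same base letter, U+0628, so a character list built from ordinary text contains only that one entry.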

So does Tesseract work for languages like Persian or Arabic?

