Bad results on a custom Traditional Chinese font

laurent....@gmail.com

May 17, 2016, 11:21:12 AM
to tesseract-ocr
Hi,

Here's the situation: I have to train Tesseract on a new font in Traditional Chinese. So far, none of the results have been good enough.
I've just tried training it with only a small set of characters and a single input image.
Then I took a sample of that image to test it.

The image is:


And the detected text is: 客戶服務 置龍擇語言 設交置 社交
I'm using Tesseract 3.02 on Windows.

My questions are:
 - What kind of machine learning approach does Tesseract use?
 - How can I get better results with Tesseract?
    - Do I have to train it with a lot of different images?
    - Are there parameters I can tune in the training step?


Thanks.

Auto Generated Inline Image 1

laurent....@gmail.com

May 18, 2016, 9:09:32 AM
to tesseract-ocr
I have more questions:
 - How does Tesseract use the unicharambigs file?
 - I get different results depending on whether I recognize the text with PSM_SINGLE_WORD, PSM_SINGLE_BLOCK or PSM_SINGLE_LINE, and no single mode gives the best result for every image. Why?
 - How can I make Tesseract read the third character of the image as one character and not as two (月艮)?
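
To make the page segmentation comparison reproducible, here is a minimal Python sketch that runs the same image through several modes with the command-line tool and collects the output of each. It assumes the `tesseract` 3.x executable is on the PATH (3.x takes `-psm`, later versions take `--psm`). The image name `sample.tif` and the language name `chi_custom` are hypothetical placeholders for the test image and the custom traineddata. The numeric values are those of the `PageSegMode` enum.

```python
import subprocess

# Page segmentation modes named in the post, mapped to the numeric values
# that the tesseract command line expects.
PSM_MODES = {
    "PSM_SINGLE_BLOCK": 6,
    "PSM_SINGLE_LINE": 7,
    "PSM_SINGLE_WORD": 8,
    # Only meaningful when the image is cropped to exactly one character.
    "PSM_SINGLE_CHAR": 10,
}


def build_command(image, outbase, lang, psm):
    """Return the argument list for one tesseract 3.x run."""
    return ["tesseract", image, outbase, "-l", lang, "-psm", str(psm)]


def run_all_modes(image, lang):
    """Run every mode on the image and return {mode_name: recognized_text}.

    Requires the tesseract executable and the traineddata for `lang`.
    """
    results = {}
    for name, psm in PSM_MODES.items():
        outbase = "out_" + name.lower()
        subprocess.run(build_command(image, outbase, lang, psm), check=True)
        # tesseract writes its result to <outbase>.txt as UTF-8.
        with open(outbase + ".txt", encoding="utf-8") as fh:
            results[name] = fh.read().strip()
    return results
```

Calling `run_all_modes("sample.tif", "chi_custom")` and printing the returned dictionary puts the outputs side by side, which makes it easier to see which mode splits or merges a given character on a given image.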
