Issue 870 in tesseract-ocr: Ocr error for chinese language

tesser...@googlecode.com

Mar 13, 2013, 5:01:14 AM

to tesserac...@googlegroups.com

Status: New
Owner: ----

New issue 870 by landisk...@qq.com: Ocr error for chinese language
http://code.google.com/p/tesseract-ocr/issues/detail?id=870

What steps will reproduce the problem?
1. C:\Program Files\Tesseract-OCR>tesseract.exe me.TIF me -l chi_sim

What is the expected output? What do you see instead?
Plain, readable text. Instead, the output is unrecognizable.

What version of the product are you using? On what operating system?
tesseract-ocr v3.02, Simplified Chinese trained data, Windows XP SP3
Simplified Chinese.

Please provide any additional information below.
This is a very simple test, but it failed to produce a usable result.
Can anybody tell me how to do this simple job?
Please email me, thanks.
The attached files are the simple picture and the confusing OCR output.

Windows XP SP3
Tesseract Open Source OCR Engine v3.02
Simplified Chinese Language Traindata

Attachments:
me.TIF 22.7 KB
me.txt 133 bytes

--
You received this message because this project is configured to send all
issue notifications to this address.
You may adjust your notification preferences at:
https://code.google.com/hosting/settings

tesser...@googlecode.com

Mar 13, 2013, 5:03:05 AM

to tesserac...@googlegroups.com

Comment #1 on issue 870 by landisk...@qq.com: Ocr error for chinese language
http://code.google.com/p/tesseract-ocr/issues/detail?id=870

C:\Program Files\Tesseract-OCR>tesseract.exe me.TIF me -l chi_sim

Too many unichars in ambiguity on line 7658512
Too many unichars in ambiguity on line 7658512
Too many unichars in ambiguity on line 7772696
Tesseract Open Source OCR Engine v3.02 with Leptonica

tesser...@googlecode.com

May 25, 2013, 11:20:38 PM

to tesserac...@googlegroups.com

Comment #2 on issue 870 by mip...@gmail.com: Ocr error for chinese language
http://code.google.com/p/tesseract-ocr/issues/detail?id=870

This is hardly an error. You should try tweaking the input and the control
parameters. Tesseract recognized all the characters in the attached image
with the following configuration:
chop_enable T
segment_segcost_rating F
enable_new_segsearch 0
language_model_ngram_on 0
textord_force_make_prop_words F

The configs are borrowed from:
https://code.google.com/p/tesseract-ocr/wiki/ControlParams
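
For anyone who wants to try the settings listed above, here is a rough Python
sketch of one way to apply them. It assumes the parameters go into a plain
text config file (one "name value" pair per line) whose path is passed as the
last command-line argument, and that tesseract is on the PATH. The file name
chi_tweaks.cfg and the helper names are made up for illustration; where
Tesseract looks for config files may differ between versions and installs.

```python
import subprocess
from pathlib import Path

# Parameter values copied from the list in this comment.
PARAMS = {
    "chop_enable": "T",
    "segment_segcost_rating": "F",
    "enable_new_segsearch": "0",
    "language_model_ngram_on": "0",
    "textord_force_make_prop_words": "F",
}


def write_config(path, params=PARAMS):
    """Write one 'name value' pair per line and return the text written."""
    text = "".join(f"{name} {value}\n" for name, value in params.items())
    Path(path).write_text(text, encoding="ascii")
    return text


def build_command(image, output_base, lang, config_path):
    """Build the argument list; the config file goes after the other options."""
    return ["tesseract", str(image), str(output_base), "-l", lang, str(config_path)]


def run_ocr(image, output_base, lang="chi_sim", config_path="chi_tweaks.cfg"):
    """Write the (hypothetical) config file, then invoke tesseract with it."""
    write_config(config_path)
    subprocess.run(build_command(image, output_base, lang, config_path), check=True)
```

Calling run_ocr("me.TIF", "me") would then correspond to the command in the
original report with the config file appended.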

I'm not familiar with Tesseract but I have a feeling that the trained data
were optimized for fonts with even stroke widths.

Attachments:
cnml.png 30.8 KB

tesser...@googlecode.com

Dec 20, 2013, 5:10:52 AM

to tesserac...@googlegroups.com

Updates:
Status: No-longer-an-issue

Comment #3 on issue 870 by zde...@gmail.com: Ocr error for chinese language
http://code.google.com/p/tesseract-ocr/issues/detail?id=870

(No comment was entered for this change.)
