Re: tesseract-ocr does not very well on chinese


Sven Pedersen

Nov 2, 2012, 10:02:49 AM
to tesser...@googlegroups.com
Preprocessing can help. Give us some example images and we may be able to help.
--Sven

On Fri, Nov 2, 2012 at 7:25 AM, Rong Xiao <run...@gmail.com> wrote:
> Hi, I have tried tesseract-ocr on Chinese, but I found that it does well on
> only a few fonts. I want to know which fonts are included in
> chi_sim.traineddata. If I want better accuracy, do I need to train it
> myself?
>
> Thanks




Sven Pedersen

Nov 12, 2012, 4:45:02 PM
to tesser...@googlegroups.com
To get better results you will need to increase the contrast and add a border. That image is very poor quality for text. Generally you'll want a lossless bitmap format such as TIFF or PNG, not JPG (which is meant for photographs). Read the FAQ for more info on preparing images for OCR, especially the part about x-height.
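The two preprocessing steps mentioned above (raising contrast and adding a border) can be sketched in plain Python. This is only an illustration under stated assumptions: the image is taken to be a list of rows of 8-bit grayscale values, the function names `stretch_contrast` and `add_border` are hypothetical, and the 10-pixel border width is an arbitrary choice, not a value from the FAQ. A real pipeline would load and save the image with an imaging library.

```python
def stretch_contrast(pixels):
    """Linearly rescale grayscale values so the darkest pixel maps to 0
    and the lightest maps to 255 (integer arithmetic only)."""
    lo = min(min(row) for row in pixels)
    hi = max(max(row) for row in pixels)
    if hi == lo:
        # Flat image: nothing to stretch, return a copy unchanged.
        return [row[:] for row in pixels]
    return [[(p - lo) * 255 // (hi - lo) for p in row] for row in pixels]


def add_border(pixels, width=10, fill=255):
    """Pad the image on all four sides with `width` pixels of `fill`
    (white by default). The width is an illustrative assumption."""
    new_width = len(pixels[0]) + 2 * width
    top = [[fill] * new_width for _ in range(width)]
    bottom = [[fill] * new_width for _ in range(width)]
    middle = [[fill] * width + row + [fill] * width for row in pixels]
    return top + middle + bottom


# Tiny low-contrast "image": values only span 50..150.
image = [[50, 100], [150, 50]]
prepared = add_border(stretch_contrast(image), width=1)
```

After both steps the pixel values span the full 0 to 255 range and the text no longer touches the image edge.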

As far as I know, Google has not released the full training data; however, you can tell a lot by unpacking the language files.
--Sven


On Sun, Nov 4, 2012 at 8:00 PM, Rong Xiao <run...@gmail.com> wrote:
Such as this image. It's not very complex.

Jay Zahn

Jun 14, 2014, 9:06:23 PM
to tesser...@googlegroups.com
Is there any special treatment for handwritten characters? I tried some characters but got varied results. Usually the simple characters are detected accurately, but compound characters can be totally off. For example

is interpreted as two characters, 青 and 争. But this is actually a relatively good case. For

the result is totally off: Tesseract interprets the character as three parts from top to bottom, and the bottom part is interpreted as the symbol ^. The worst case is

which produces complete garbage output.

In all my use cases, I only need to detect a single Chinese character at a time. My question is: what can I do to improve the recognition accuracy? Thanks.
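One option that is not discussed in this thread but bears on the single-character case: Tesseract has a page segmentation mode 10, "treat the image as a single character", which may stop it from splitting a compound character into parts. The sketch below only builds and runs the command line; the helper names are hypothetical, it assumes the `tesseract` binary and the `chi_sim` language data are installed, and it does nothing to make the model better at handwriting. Older 3.x releases spell the flag `-psm`, newer ones `--psm`.

```python
import subprocess


def single_char_command(image_path, lang="chi_sim", legacy_flag=False):
    """Build a tesseract command that treats the image as one character.

    legacy_flag=True uses the single-dash `-psm` spelling of older 3.x
    releases; the default uses `--psm`.
    """
    psm_flag = "-psm" if legacy_flag else "--psm"
    return ["tesseract", image_path, "stdout", "-l", lang, psm_flag, "10"]


def recognize_single_char(image_path, **kwargs):
    """Run tesseract and return its text output (requires tesseract installed)."""
    result = subprocess.run(
        single_char_command(image_path, **kwargs),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout.strip()
```

Calling `recognize_single_char("char.png")` on a cropped image of one character (the file name is a placeholder) returns whatever Tesseract prints, so results should still be checked by hand.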

Max Heiber

Sep 9, 2015, 1:30:31 PM
to tesseract-ocr, sven.p...@gmail.com
Could you advise on how to get better results for images like the attached? The Chinese characters are very clear, but Tesseract generates the wrong results.



Thanks very much for your help!

Max Heiber

Sep 9, 2015, 1:30:38 PM
to tesseract-ocr, sven.p...@gmail.com
Here's an example where the Chinese characters are very large and clear, but Tesseract gets the wrong result. Could you advise on what image processing could help Tesseract's accuracy?

Thanks for your help!
testout.png

Tom Morris

Sep 10, 2015, 3:04:59 PM
to tesseract-ocr, sven.p...@gmail.com
On Wednesday, September 9, 2015 at 1:30:38 PM UTC-4, Max Heiber wrote:
Here's an example where the Chinese characters are very large and clear, but Tesseract gets the wrong result. Could you advise on what image processing could help Tesseract's accuracy?

What have you tried so far?

I got the following with about 30 seconds of playing with an image editor:

爸爸说我


It looks correct to me, but I don't read Chinese.


Basically I just thresholded the image so that anything that wasn't very white became completely black. I didn't even bother inverting the white on black.
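The thresholding described here can be sketched as follows. This is a minimal illustration, not the exact steps used in the image editor: it assumes the image is a list of rows of 8-bit grayscale values, and both the function name `threshold_near_white` and the cutoff of 230 are hypothetical choices for "very white".

```python
def threshold_near_white(pixels, cutoff=230):
    """Keep pixels at or above `cutoff` as pure white (255) and send
    everything else to pure black (0). The cutoff is an assumed value."""
    return [[255 if p >= cutoff else 0 for p in row] for row in pixels]


# Light-gray background noise (200, 229) is pushed to black;
# only the near-white strokes (231 and up) survive as white.
sample = [[255, 240, 200], [10, 231, 229]]
binary = threshold_near_white(sample)
```

As in the original experiment, the result is left as white text on a black background; no inversion is applied.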


Tom
tesseract-zh-testout-corrected.png

Max Heiber

Sep 13, 2015, 9:28:53 PM
to tesseract-ocr, sven.p...@gmail.com
Hi Tom,

Thanks! Setting the threshold worked for me. 

Much appreciated,

Max

harrison wang

Jun 5, 2020, 2:18:47 AM
to tesseract-ocr
Hello Tom, 

I did the same and binarized the picture correctly, as below:
sample_baidu.jpg



I tried some free online OCR tools and they gave correct results, but with Tesseract I got an incorrect/incomplete result.

Please advise how I can get better output.

Thanks
Harrison