How can I modify the code to remove final Tesseract results that are smaller than a specific blob size?

葉家忠

Aug 28, 2014, 4:04:33 AM
to tesser...@googlegroups.com
I use Tesseract to recognize Simplified Chinese characters.

Some noise in the source image can't be removed, so I decided to modify the source code to remove the incorrect results.

Since every Chinese character has a fixed size, the noise is easy to find because it is much smaller than a normal character.

I've tried setting the parameter "textord_heavy_nr" to true to remove the noise, but that doesn't work, because in some cases it also removes important parts of a Chinese character that are necessary to form the complete character.

Can anyone tell me how to modify the code so that it removes final Tesseract results that are smaller than a specific blob size?

Thank you very much for your help~

PS: the attached file shows 3 characters, but it is recognized as 4 characters because of the noise.
ScreenClip.png
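
The size-based filtering described in this question can be sketched as a post-processing step on the recognized symbols, without touching Tesseract's internals. This is only a minimal sketch under assumptions: SymbolBox, MedianSize and DropSmallSymbols are hypothetical names that are not part of Tesseract, and the boxes are assumed to have been collected beforehand (for instance from the symbol-level bounding boxes that the API's result iterator reports).

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical container for one recognized symbol and its bounding box.
// Not a Tesseract type; fill it from whatever the recognizer reports.
struct SymbolBox {
  std::string text;
  int left, top, right, bottom;
  int width() const { return right - left; }
  int height() const { return bottom - top; }
};

// Median of the larger side of each box, used as the "normal" character size.
int MedianSize(const std::vector<SymbolBox>& boxes) {
  assert(!boxes.empty());
  std::vector<int> sizes;
  for (const SymbolBox& b : boxes) {
    sizes.push_back(std::max(b.width(), b.height()));
  }
  std::nth_element(sizes.begin(), sizes.begin() + sizes.size() / 2,
                   sizes.end());
  return sizes[sizes.size() / 2];
}

// Keeps only the symbols whose larger side is at least min_ratio times the
// median size. Everything smaller is treated as noise and dropped.
std::vector<SymbolBox> DropSmallSymbols(const std::vector<SymbolBox>& boxes,
                                        double min_ratio) {
  std::vector<SymbolBox> kept;
  if (boxes.empty()) return kept;
  const double min_size = min_ratio * MedianSize(boxes);
  for (const SymbolBox& b : boxes) {
    if (std::max(b.width(), b.height()) >= min_size) kept.push_back(b);
  }
  return kept;
}
```

The larger side of each box is compared rather than the height alone, so that wide but flat characters are not discarded. Small punctuation would be dropped as well, so the ratio has to be chosen with the expected text in mind.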

Paul

Aug 30, 2014, 6:16:09 AM
to tesser...@googlegroups.com
I suffered from similar issues and fixed the problem by adding a line to textord/colfind.cpp:

Between

#endif // GRAPHICS_DISABLED

and

SetBlockRuleEdges(input_block);

I added:

input_block->noise_blobs.clear(); // remove noise blobs

This will remove noise blobs during the segmentation of blocks and prevent noise blobs from being added to the text block around them. I think it is a dirty hack, but it will probably give you better results. Maybe we have to tackle this problem in a more in-depth solution in the future.

Changing the constant

const double kMinMediumSizeRatio = 0.25;

to

const double kMinMediumSizeRatio = 0.15;

in blobbox.cpp also helped to improve the results. You can try to adjust that constant to your needs.
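
To see why lowering the constant can help, here is a small standalone sketch. It assumes that kMinMediumSizeRatio acts as a lower bound on blob height relative to the estimated line size, which this thread does not spell out, so please check it against blobbox.cpp. MinMediumHeight and ClassifyByHeight are hypothetical helpers and not Tesseract functions.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical two-way classification, for illustration only.
enum class BlobSize { kNoiseSized, kMedium };

// Smallest blob height (in pixels) that still counts as medium, assuming the
// ratio is applied to the estimated line size.
int MinMediumHeight(double min_medium_size_ratio, int line_size) {
  assert(line_size > 0);
  return static_cast<int>(std::lround(min_medium_size_ratio * line_size));
}

// Blobs shorter than the minimum medium height are treated as noise-sized.
BlobSize ClassifyByHeight(int blob_height, double min_medium_size_ratio,
                          int line_size) {
  return blob_height < MinMediumHeight(min_medium_size_ratio, line_size)
             ? BlobSize::kNoiseSized
             : BlobSize::kMedium;
}
```

Under that assumption, with a line size of 40 pixels, a stroke that is 8 pixels high falls below the threshold at 0.25 (minimum height 10) but stays above it at 0.15 (minimum height 6), which would match the observation that small parts of a character survive with the lower value.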

Paul

葉家忠

Sep 1, 2014, 3:00:29 AM
to tesser...@googlegroups.com
Thank you very much for your kind help~

I tried what you described above, but nothing changed.
When I traced the code in debug mode, I found that the code mentioned above is never executed.
I wonder whether there is a parameter I need to set to true?

Please tell me more.
Thanks again~


On Saturday, August 30, 2014 at 6:16:09 PM UTC+8, Paul wrote:

Paul

Sep 22, 2014, 5:29:41 PM
to tesser...@googlegroups.com
Those sections are definitely run. Which version of Tesseract are you using?

葉家忠

Sep 22, 2014, 8:12:17 PM
to tesser...@googlegroups.com

3.02... I've found the code snippet you mentioned, but I can't get it to execute.

On Sep 23, 2014 at 5:29 AM, "Paul" <pa...@vorb.de> wrote:
--
You received this message because you are subscribed to a topic in the Google Groups "tesseract-ocr" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/tesseract-ocr/M80Et5GOZXA/unsubscribe.
To unsubscribe from this group and all its topics, send an email to tesseract-oc...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/20489d50-3464-427d-b599-896f519d5599%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Paul

Sep 24, 2014, 4:56:17 AM
to tesser...@googlegroups.com
I use 3.04. You may try to upgrade.

zdenko podobny

Sep 24, 2014, 7:10:15 AM
to tesser...@googlegroups.com
You are not using version 3.04 ;-)
There is a development version of Tesseract marked that way, but it is not finished yet (AFAIK Ray is still committing changes).

Zdenko


Paul

Sep 24, 2014, 7:57:05 AM
to tesser...@googlegroups.com
Yes, Zdenko, I know. But people have to call it some way. I use commit number 648e7ca31109 of the current master branch.

Paul

葉家忠

Nov 21, 2014, 3:48:20 AM
to tesser...@googlegroups.com
Excuse me, could you tell me where I can download version 3.04?
BTW, I run Tesseract on Windows and compile it using .NET 2010.

Many thanks~


On Wednesday, September 24, 2014 at 7:57:05 PM UTC+8, Paul wrote: