Re: Remove noise more aggressively


Shree Devi Kumar

May 29, 2013, 3:18:36 AM
to tesser...@googlegroups.com
I have read that using ImageMagick with a script such as textcleaner gives good results. You could try it and see whether it works for you.

http://www.fmwconcepts.com/imagemagick/textcleaner/index.php

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


On Wed, May 29, 2013 at 12:09 PM, Johannes Richter <super...@gmail.com> wrote:
There are several ways to improve this particular case:
  • You could preprocess the image to suppress this kind of noise (look up the morphological opening and closing operators).
  • There is a Tesseract parameter that takes the minimum size of a blob: count the pixels of the "noise", add a few more to be safe, and let Tesseract filter it out.
  • You could do the blob-size filtering yourself.
Note that characters like {, . '} may get deleted too.
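
The third option, filtering blobs by size yourself, could look roughly like the following. This is a minimal sketch in plain C, not Tesseract or Leptonica code: the function name remove_small_blobs, the one-byte-per-pixel 0/1 image layout, and the min_pixels threshold are all assumptions made for illustration. It flood-fills each 8-connected component and erases the ones that fall below the threshold.

```c
#include <stdlib.h>

/* Hypothetical helper (not part of Tesseract or Leptonica).
 * img is a w*h row-major array, 1 = ink, 0 = background; modified in place.
 * Erases every 8-connected component with fewer than min_pixels pixels.
 * Returns the number of components erased, or -1 on allocation failure. */
int remove_small_blobs(unsigned char *img, int w, int h, int min_pixels)
{
    int n = w * h;
    if (n <= 0)
        return 0;

    unsigned char *seen = calloc((size_t)n, 1);
    int *stack = malloc((size_t)n * sizeof *stack);  /* pixels still to visit */
    int *blob = malloc((size_t)n * sizeof *blob);    /* pixels of current component */
    if (!seen || !stack || !blob) {
        free(seen); free(stack); free(blob);
        return -1;
    }

    int removed = 0;
    for (int start = 0; start < n; start++) {
        if (!img[start] || seen[start])
            continue;

        /* Iterative flood fill; each pixel is pushed at most once. */
        int top = 0, size = 0;
        stack[top++] = start;
        seen[start] = 1;
        while (top > 0) {
            int p = stack[--top];
            int px = p % w, py = p / w;
            blob[size++] = p;
            for (int dy = -1; dy <= 1; dy++) {
                for (int dx = -1; dx <= 1; dx++) {
                    int nx = px + dx, ny = py + dy;
                    if (nx < 0 || nx >= w || ny < 0 || ny >= h)
                        continue;
                    int q = ny * w + nx;
                    if (img[q] && !seen[q]) {
                        seen[q] = 1;
                        stack[top++] = q;
                    }
                }
            }
        }

        /* Too small to be a character: treat it as noise and erase it. */
        if (size < min_pixels) {
            for (int i = 0; i < size; i++)
                img[blob[i]] = 0;
            removed++;
        }
    }

    free(seen); free(stack); free(blob);
    return removed;
}
```

With min_pixels set just above the size of the largest noise speck, isolated specks are erased. Genuinely small marks such as dots, commas and apostrophes fall under the same threshold, which is the trade-off mentioned above.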

On Tuesday, May 28, 2013 at 19:38:14 UTC+2, Dmitry Katsubo wrote:
Dear Tesseract community,

I would love to hear somebody's advice on how to reduce noise in the following example (part of the original image):

For this image the library returns the text <13'0>, with the apostrophe triggered by noise. It would be fantastic if this noise could be suppressed by means of Tesseract. Perhaps I should go in the direction of image pre-processing with a tool like unpaper, as suggested here? Following another post, I have set "textord_heavy_nr" to "1", with no visible effect. If anyone can suggest further options to play with, I would appreciate it.

In my case I am ready to sacrifice "real" characters from the set {, . '}, i.e. if they are not recognized it is not a big deal. Completely blacklisting them does not seem right to me, because in general it is a plus when they are recognized correctly.

Thanks in advance.


Dmitry Katsubo

May 30, 2013, 10:57:05 AM
to tesser...@googlegroups.com, Johannes Richter
Many thanks for the information, Johannes!

I have experimented with textord_max_noise_size, and it turned out that the noise in my particular case is not removed even when I set textord_max_noise_size=45. Above that value, almost all other characters are treated as noise as well.

However, textord_heavy_nr=1 worked well for me. It looks like this setting works on its own and does not depend on the values of the other settings mentioned.

On 30.05.2013 9:11, Johannes Richter wrote:
The parameter I meant is "textord_max_noise_size"; it defines the maximum size of noise in pixels. You could also try the one you found in the list, "textord_heavy_nr".

"Opening and closing operators" are morphological operators. I searched Wikipedia for a nice example, but the English version is only a stub.
In your case the opening operation is the way to go. Many image processing frameworks include morphological operations. If your software does not provide an opening operator, look for erosion and dilation (opening is just an erosion followed by a dilation).

I made a quick example in GIMP.
The picture "before.png" shows my object (the circle) with some noise I want to remove. I ran the erosion operation on this picture with a suitable filter mask. The result is in the picture "after erosion.png". The circle has changed in size (and shape). As the last step I ran the dilation operation in GIMP. The resulting image "after dilation.png" shows only the circle.

Depending on your objects and noise, you need to choose a suitable filter mask for these operations. The opening will change the shape of your characters slightly.
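
To make the erosion-then-dilation idea concrete, here is a minimal sketch in plain C. It is not how GIMP or any particular library implements it: the function names, the one-byte-per-pixel 0/1 image layout, and the fixed 3x3 square filter mask are assumptions chosen for illustration, and pixels outside the image are treated as background.

```c
#include <stdlib.h>

/* Erosion with a 3x3 square mask: a pixel stays set only if all nine
 * pixels around it (itself included) are set. Out-of-image pixels count
 * as background, so ink touching the border is eroded. */
static void erode3x3(const unsigned char *src, unsigned char *dst, int w, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            unsigned char keep = 1;
            for (int dy = -1; dy <= 1; dy++) {
                for (int dx = -1; dx <= 1; dx++) {
                    int nx = x + dx, ny = y + dy;
                    if (nx < 0 || nx >= w || ny < 0 || ny >= h || !src[ny * w + nx])
                        keep = 0;
                }
            }
            dst[y * w + x] = keep;
        }
    }
}

/* Dilation with a 3x3 square mask: a pixel becomes set if any of the
 * nine pixels around it (itself included) is set. */
static void dilate3x3(const unsigned char *src, unsigned char *dst, int w, int h)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            unsigned char hit = 0;
            for (int dy = -1; dy <= 1; dy++) {
                for (int dx = -1; dx <= 1; dx++) {
                    int nx = x + dx, ny = y + dy;
                    if (nx >= 0 && nx < w && ny >= 0 && ny < h && src[ny * w + nx])
                        hit = 1;
                }
            }
            dst[y * w + x] = hit;
        }
    }
}

/* Opening = erosion followed by dilation. src and dst must not overlap.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int open3x3(const unsigned char *src, unsigned char *dst, int w, int h)
{
    unsigned char *tmp = malloc((size_t)w * (size_t)h);
    if (!tmp)
        return -1;
    erode3x3(src, tmp, w, h);   /* specks smaller than the mask vanish */
    dilate3x3(tmp, dst, w, h);  /* surviving shapes grow back */
    free(tmp);
    return 0;
}
```

Anything narrower than the mask in either direction disappears in the erosion step and cannot be restored by the dilation, which is why thin strokes and small punctuation are at risk and why the mask size has to be matched to the noise.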

-- 
With best regards,
Dmitry

Dmitry Katsubo

Jun 6, 2013, 11:13:49 AM
to tesser...@googlegroups.com, jm, Johannes Richter
Hi Jozef,

Thanks for the great advice. After playing around, I have found that the advice in this post works for me:
convert image.tiff -write MPR:source -morphology close rectangle:3x4 -clip-mask MPR:source -morphology erode:8 square +clip-mask image-close.tif
It looks like I need to pipe images through ImageMagick, but I cannot yet tell when that is really necessary. Perhaps I can run Tesseract twice: the first time to determine the confidence level, and then clean up and recognize again (if needed).

Many thanks again for pointing the direction!

On 30.05.2013 18:50, jm wrote:
Regarding open and close operators:
 
First, look at
  pixDilate
  pixErode
 
I think that this code snippet says it all (opening is erosion followed by dilation):
 
PIX *
pixOpen(PIX  *pixd,
        PIX  *pixs,
        SEL  *sel)
{
PIX  *pixt;

    PROCNAME("pixOpen");

    if ((pixd = processMorphArgs2(pixd, pixs, sel)) == NULL)
        return (PIX *)ERROR_PTR("pixd not returned", procName, pixd);

    if ((pixt = pixErode(NULL, pixs, sel)) == NULL)
        return (PIX *)ERROR_PTR("pixt not made", procName, pixd);
    pixDilate(pixd, pixt, sel);
    pixDestroy(&pixt);

    return pixd;
}
Cheers,
Jozef