
Preprocessing a scanned image in preparation for OCR


Helmut Jarausch

Jul 28, 2010, 3:03:56 AM
Hi,

I have a "matrix" of numbers where every other row (line) is
"highlighted" by a (black and white) "raster" of tiny dots
to make it easier for human readers to scan a row.

But when such an image is processed by OCR software like cuneiform
or tesseract, there is a high recognition error rate for those digits
which are "highlighted".

Does anybody have an idea what to do?

Thanks for a hint,
Helmut.

(I cannot attach the scanned image since it contains confidential
material.)

rich

Jul 28, 2010, 6:35:43 AM

That's an interesting one. Obviously it depends on where the rows of dots
are positioned and how regular the spacing is.
If they are used as an underline, then "just maybe" set up a grid to mask
them (Filters -> Render -> Pattern -> Grid), as in

http://www.imageno.com/2sk91hg02ryfpic.html

This one is of course very artificial, since I made it in Inkscape, but
the left side shows the grid in place in its own layer, and the right side
shows it after 'nudging' the grid layer up with the cursor keys to cover
the dots. I did not try it, but it should be possible to rotate the grid
if the scan is slightly skewed.

If the dots go through the text? It might be possible to set up a mask
in Inkscape.


--
rich

Wilfried

Jul 28, 2010, 8:59:40 AM
Helmut Jarausch <jara...@igpm.rwth-aachen.de> wrote:

> I have a "matrix" of numbers where every other row (line) is
> "highlighted" by a (black and white) "raster" of tiny dots
> to ease the human readers scanning a row.
>
> But when processing such an image by an OCR software like cuneiform
> or tesseract, there is a high recognition error rate for those digits
> which are "highlighted".
>
> Has any body an idea what to do?

This is indeed an interesting problem.
I created a sample with MS Word, using a textbox with Arial size 11 on a
10% dots fill, then printed this into a bitmap file at 300 dpi resolution
using the free "pdfill" printer.
I loaded this bitmap into Gimp, but the only filter which seems to do
what is needed (despeckle) does not perform very well.
But I was successful applying IrfanView's "median filter" twice. After
that, the text was successfully recognized by Abbyy Fine Reader.

Now I wonder which filter in Gimp does the same as IrfanView's "median
filter".
According to
http://dossy.org/2007/08/what-is-gimps-equivalent-of-photoshops-median-filter/
the median filter is called despeckle in Gimp. But despeckle performs
differently from IrfanView's median filter: it either does not remove
the dots or, in most settings, makes the text unreadable. The best result
is with radius 1, white level 256, and black level anywhere in the
inclusive range [0, 254], but this also makes the strokes very thin.

If anyone is interested in my test file(s), I can send them on request.

--
Wilfried Hennings
The reply address is invalid. Please reply in the newsgroup or use the address in the next line.
whiskey hotel underscore november golf at golf mike xray dot delta echo

Helmut Jarausch

Jul 28, 2010, 10:12:54 AM

Thanks, I'll give it a try.

Here is an example of the scanned image:

http://tinypic.com/r/28tfh2r/3

Many thanks,
Helmut.

rich

Jul 28, 2010, 10:38:43 AM
On Wed, 28 Jul 2010 14:12:54 +0000, Helmut Jarausch wrote:

> On Wed, 28 Jul 2010 14:59:40 +0200, Wilfried wrote:
>
>> Helmut Jarausch <jara...@igpm.rwth-aachen.de> wrote:
>>
>>> I have a "matrix" of numbers where every other row (line) is
>>> "highlighted" by a (black and white) "raster" of tiny dots to ease the
>>> human readers scanning a row.

<snip>


> Here is an example of the scanned image
>
> http://tinypic.com/r/28tfh2r/3
>
> Many thanks,
> Helmut.

I see now. It is roughly equivalent to the listing paper you got for
dot-matrix printers many years ago.

Still a good problem.

--
rich

rich

Jul 28, 2010, 12:53:13 PM
On Wed, 28 Jul 2010 14:12:54 +0000, Helmut Jarausch wrote:

> On Wed, 28 Jul 2010 14:59:40 +0200, Wilfried wrote:
>
>> Helmut Jarausch <jara...@igpm.rwth-aachen.de> wrote:
>>
>>> I have a "matrix" of numbers where every other row (line) is
>>> "highlighted" by a (black and white) "raster" of tiny dots to ease the
>>> human readers scanning a row.
>>>
>>> But when processing such an image by an OCR software like cuneiform or
>>> tesseract, there is a high recognition error rate for those digits
>>> which are "highlighted".
>
> Here is an example of the scanned image
>
> http://tinypic.com/r/28tfh2r/3
>
> Many thanks,
> Helmut.

Again, maybe a solution:

Filters -> Generic -> Convolution Matrix

A screenshot, plus an OCR run using YAGF/cuneiform 0.9, is here:

http://www.imageno.com/xoextgqosk1upic.html


--
rich

Ofnuts

Jul 28, 2010, 12:57:50 PM

Have you tried a plain the-hell-with-it Gaussian blur? It makes the dots
disappear (I tried 5 pixels on your sample) and the OCR should be able to
make up for the fuzzy digits.

--
Bertrand

Ilya Zakharevich

Jul 28, 2010, 7:35:24 PM
On 2010-07-28, Helmut Jarausch <jara...@igpm.rwth-aachen.de> wrote:
> Hi,
>
> I have a "matrix" of numbers where every other row (line) is
> "highlighted" by a (black and white) "raster" of tiny dots
> to ease the human readers scanning a row.
>
> But when processing such an image by an OCR software like cuneiform
> or tesseract, there is a high recognition error rate for those digits
> which are "highlighted".
>
> Has any body an idea what to do?

Assuming the image is http://i31.tinypic.com/28tfh2r.jpg:

There are entries in the Gimp menu named Grow/Shrink. They never worked
for me, but MAYBE they work for b/w images. (If not, I have a script
which does the same for arbitrary images.)

Grow the white area by 1px, then shrink it by 1px. (I do not remember
what "Grow" grows, whites or blacks; experiment.)

Hope this helps,
Ilya

Wilfried

Jul 29, 2010, 4:50:15 AM
Helmut Jarausch <jara...@igpm.rwth-aachen.de> wrote:

I found another possibility:
ImageMagick http://www.imagemagick.org has a median filter which removes
the dots from the test file effectively.
Call (ImageMagick must be called from a console window):
convert 28tfh2r.jpg -median 1 out.tif
or
convert 28tfh2r.jpg -median 2 out.tif

The median filter with radius 2 reduces the dots even more than with
radius 1 but at the cost of more heavily blurred edges of the
characters. In my view it depends on the OCR software which radius
setting is better.
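
To make the radius trade-off concrete, here is a minimal pure-Python sketch of a median filter. It is not ImageMagick's implementation: the toy "page" (a 3-pixel-wide stroke standing in for a digit, plus one isolated raster dot), the helper names, and the border handling are all assumptions chosen for illustration.

```python
from statistics import median_low

WHITE, BLACK = 255, 0

def median_filter(img, radius=1):
    """Replace each pixel by the median of its (2*radius+1)^2 window.

    img is a list of rows of 0-255 ints. Pixels near the border use only
    the neighbours that exist (an assumption of this sketch).
    """
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                img[yy][xx]
                for yy in range(max(0, y - radius), min(h, y + radius + 1))
                for xx in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            row.append(median_low(window))
        out.append(row)
    return out

def make_page(with_dot):
    """Hypothetical 10x7 test page: white background, 3-pixel-wide stroke."""
    page = [[WHITE] * 10 for _ in range(7)]
    for y in range(7):
        for x in (5, 6, 7):
            page[y][x] = BLACK
    if with_dot:
        page[3][1] = BLACK  # one isolated raster dot
    return page

scanned = make_page(with_dot=True)
cleaned = median_filter(scanned, radius=1)
print(cleaned == make_page(with_dot=False))
```

The dot is outvoted by the white pixels in every window that contains it, while the 3-pixel stroke keeps a majority in its own windows. By the same counting, a straight stroke only one pixel wide is outvoted as well and disappears at radius 1, which is the kind of damage to the characters described above.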

--
Wilfried Hennings
Please reply in the newsgroup; the mail address is invalid.

Helmut Jarausch

Aug 5, 2010, 2:56:29 AM

Hi Ilya,
I cannot find the Grow/Shrink menu items (I'm using Gimp-2.7.2/GIT)

Thanks for a pointer,
Helmut.


--
Helmut Jarausch
Lehrstuhl fuer Numerische Mathematik
RWTH - Aachen University
D 52056 Aachen, Germany

Ofnuts

Aug 5, 2010, 3:21:51 AM

The only ones I know of are in the Selection menu (in my 2.4). They
grow/shrink the selection. For instance, if you make a 20x20 selection
and "grow" it by 10 pixels, you get a 40x40 selection with round corners.

For the problem at hand, I think what is suggested is:

- select by color on the background (the dots won't be included in the
selection)
- "grow" the selection by just enough pixels to include them
- "shrink" the selection by the same amount; this won't restore the
selection around the dots (they will remain "plugged"), but it will
otherwise put the selection back on the letters' edges
- erase or color-fill the selection to remove the dots.

This can work if no letter part is thinner than the dots, and if the
closed areas in the characters (0, 6, 8, for instance) are bigger than
twice the "grow" amount.
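
The grow-then-shrink idea can be sketched directly on pixels in pure Python. This is only an illustration under stated assumptions: it works on a strictly black/white image, uses a square neighbourhood (Gimp's selection grow gives round corners instead), and the function names and the tiny test image are made up here.

```python
def _window(img, y, x, n):
    """All pixels within n steps of (y, x), clipped at the image border."""
    h, w = len(img), len(img[0])
    return [
        img[yy][xx]
        for yy in range(max(0, y - n), min(h, y + n + 1))
        for xx in range(max(0, x - n), min(w, x + n + 1))
    ]

def grow_white(img, n=1):
    """A pixel becomes white (True) if any pixel within n steps is white."""
    return [[any(_window(img, y, x, n)) for x in range(len(img[0]))]
            for y in range(len(img))]

def shrink_white(img, n=1):
    """A pixel stays white only if every pixel within n steps is white."""
    return [[all(_window(img, y, x, n)) for x in range(len(img[0]))]
            for y in range(len(img))]

def remove_dots(rows, n=1):
    """rows: strings where '.' is white background and '#' is black."""
    img = [[ch == "." for ch in row] for row in rows]
    img = shrink_white(grow_white(img, n), n)
    return ["".join("." if px else "#" for px in row) for row in img]

# Hypothetical scan: one 1-pixel dot and a 3x3 block standing in for a glyph.
scan = [
    "..........",
    ".#........",
    "....###...",
    "....###...",
    "....###...",
    "..........",
]
cleaned = remove_dots(scan, n=1)
for row in cleaned:
    print(row)
```

Growing the white by one pixel swallows the dot and all of the block except its centre pixel; shrinking the white again regrows the block from that centre, but nothing is left to regrow the dot from. In this square-neighbourhood sketch, any black shape that cannot contain a full (2n+1) x (2n+1) black square is removed, which is the pixel-level version of the "no letter part thinner than the dots" condition.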

--
Bertrand

Mervyn Carter

Aug 5, 2010, 4:03:49 AM
In article of Wed, 28 Jul 2010, Helmut Jarausch writes
May I dare to suggest a crude but simple improvement?

The dots scan into the greyscale image at about 2x2 pixels, of which some
are hard black but some are grey levels.

How about using Threshold to eliminate the grey ones? I've tried it on
your sample, and at between 25 and 35 a lot of the dot pixels are
eliminated without too much damage to the figures.

It's crude, but it may be enough to give your OCR enough of a hand to
improve recognition markedly.

Good luck with it, friend.
--
Yours sincerely

Mervyn Carter
================================================
People can be divided into 10 groups
1 group who understand binary maths
1 group who don't understand binary maths
================================================

Mervyn Carter

Aug 5, 2010, 4:13:01 AM
In article of Thu, 5 Aug 2010, Mervyn Carter writes

>In article of Wed, 28 Jul 2010, Helmut Jarausch writes
>>Hi,
>>
>>I have a "matrix" of numbers where every other row (line) is
>>"highlighted" by a (black and white) "raster" of tiny dots
>>to ease the human readers scanning a row.
>>
>>But when processing such an image by an OCR software like cuneiform
>>or tesseract, there is a high recognition error rate for those digits
>>which are "highlighted".
>>
>>Has any body an idea what to do?
>>
>>Thanks for a hint,
>>Helmut.
>>
>>(I cannot attach the scanned image since it contains confidential
>>material.)
>Can I dare to suggest a crude but simple improvement ?
>
>The dots scan into the greyscale image about 2x2 pixels, of which some
>are hard black but some are grey levels
>
>How about using Threshold to eliminate the grey ones. I've tried it on
>your sample and at between 25 and 35 a lot of the dot pixels are
>eliminated without too much damage to the figures
>
>It's crude, but it may be enough to give your OCR enough of a hand to
>improve recognition markedly
>
>Good luck with it Friend
Further to the above: do a 2px Gaussian blur on it FIRST, then use
Threshold. At about 125 it's near perfect for OCR.
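
A rough pure-Python sketch of blur-then-threshold follows. Several things here are assumptions rather than what Gimp does internally: the "2px" blur is modelled as a Gaussian with sigma = 2 truncated at 3 sigma, edge pixels are repeated at the border, and the test page (a 5-pixel-wide stroke plus one hard-black dot) is invented for the example.

```python
import math

def gaussian_kernel(sigma):
    """Normalised 1-D Gaussian weights, truncated at 3 sigma (assumption)."""
    radius = math.ceil(3 * sigma)
    weights = [math.exp(-(k * k) / (2 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [wt / total for wt in weights]

def blur_rows(img, kernel):
    """Convolve every row with the kernel, repeating the edge pixels."""
    radius = len(kernel) // 2
    w = len(img[0])
    return [
        [sum(kernel[k + radius] * row[min(max(x + k, 0), w - 1)]
             for k in range(-radius, radius + 1))
         for x in range(w)]
        for row in img
    ]

def transpose(img):
    return [list(col) for col in zip(*img)]

def gaussian_blur(img, sigma):
    """Separable blur: along the rows first, then along the columns."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(img, kernel)), kernel))

def threshold(img, level):
    """Values at or above level become white (255), the rest black (0)."""
    return [[255 if px >= level else 0 for px in row] for row in img]

# Hypothetical 20x9 page: white background, 5-pixel-wide stroke, one dot.
page = [[255] * 20 for _ in range(9)]
for y in range(9):
    for x in range(12, 17):
        page[y][x] = 0
page[4][3] = 0  # isolated hard-black raster dot

binary = threshold(gaussian_blur(page, sigma=2), level=125)
print(binary[4][3], binary[4][14])
```

The blur spreads the dot's darkness over its white surroundings, so the dot ends up only slightly grey and falls on the white side of the threshold, while the wide stroke stays dark enough to remain black. On this toy page, Threshold without the blur would leave the dot in place, since a hard-black pixel is below any threshold level.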

Cheers
--
Yours again

Mervyn Carter
================================================
"The brain is a wonderful organ. It starts working the moment you
get up in the morning and does not stop until you get into the
office"
Robert Frost
================================================

Ilya Zakharevich

Aug 5, 2010, 6:36:26 AM
On 2010-08-05, Ofnuts <o.f.n...@la.poste.net> wrote:
>>> There are entries in Gimp menu named Grow/Shrink. They never worked for
>>> me, but MAYBE they work for b/w images. (If not, I have a script which
>>> does the same for arbitrary images.)
>>>
>>> Grow the white area by 1px, then shrink it by 1px. (I do not remember
>>> what "Grow" grows, whites or blacks - experiment.)

>> I cannot find the Grow/Shrink menu items (I'm using Gimp-2.7.2/GIT)

Filters/Generic/ Dilate and Erode.

For me, they do just some bullshit unrelated to their documentation.

I put my script into ilyaz.org/software/gimp. It puts the commands
into Filters/Edge-detect.

Yours,
Ilya
