I have a "matrix" of numbers where every other row (line) is
"highlighted" by a (black and white) "raster" of tiny dots
to make it easier for human readers to scan a row.
But when processing such an image with OCR software like cuneiform
or tesseract, there is a high recognition error rate for those digits
which are "highlighted".
Does anybody have an idea what to do?
Thanks for a hint,
Helmut.
(I cannot attach the scanned image since it contains confidential
material.)
That's an interesting one. Obviously it depends on where the rows of dots are
positioned and how regular the spacing is.
If they are used as an underline, then "just maybe" set up a grid to mask them
(Filters -> Render -> Pattern -> Grid), as in
http://www.imageno.com/2sk91hg02ryfpic.html
This one is of course very artificial, since I made it in Inkscape, but the
left side shows the grid in place in its own layer, and the right side shows
it after 'nudging' the grid layer up with the cursor keys to cover the dots.
I did not try it, but it should be possible to rotate it if the scan is
slightly skewed.
If the dots go through the text..??? It might be possible to set up a mask
in Inkscape.
--
rich
> I have a "matrix" of numbers where every other row (line) is
> "highlighted" by a (black and white) "raster" of tiny dots
> to ease the human readers scanning a row.
>
> But when processing such an image by an OCR software like cuneiform
> or tesseract, there is a high recognition error rate for those digits
> which are "highlighted".
>
> Has any body an idea what to do?
This is indeed an interesting problem.
I created a sample with MS Word, using a textbox with Arial size 11 on a
10% dots fill, then printed this into a bitmap file with 300 dpi res
using the free "pdfill" printer.
I loaded this bitmap into Gimp, but the only filter which seems to do what
is needed (despeckle) does not perform very well.
But I was successful applying IrfanView's "median filter" twice. After
that, the text was successfully recognized by Abbyy Fine Reader.
Now I wonder which filter in Gimp does the same as IrfanView's "median
filter".
According to
http://dossy.org/2007/08/what-is-gimps-equivalent-of-photoshops-median-filter/
the median filter is called despeckle in Gimp. But despeckle performs
differently from IrfanView's median filter and either does not remove
the dots or, in most settings, makes the text unreadable. The best result
is with radius 1; the white level must be 256 and the black level in the
inclusive range [0, 254], but this also makes the lines very thin.
If anyone is interested in my test file(s), I can send them on request.
--
Wilfried Hennings
The reply address is invalid. Please reply in the newsgroup or use the address in the next line.
whiskey hotel underscore november golf at golf mike xray dot delta echo
Thanks, I'll have a try on it.
Here is an example of the scanned image
http://tinypic.com/r/28tfh2r/3
Many thanks,
Helmut.
> On Wed, 28 Jul 2010 14:59:40 +0200, Wilfried wrote:
>
>> Helmut Jarausch <jara...@igpm.rwth-aachen.de> wrote:
>>
>>> I have a "matrix" of numbers where every other row (line) is
>>> "highlighted" by a (black and white) "raster" of tiny dots to ease the
>>> human readers scanning a row.
<snip>
> Here is an example of the scanned image
>
> http://tinypic.com/r/28tfh2r/3
>
> Many thanks,
> Helmut.
I see now, roughly equivalent to the listing paper you got for dot-matrix
printers many years ago.
Still a good problem.
--
rich
Again, maybe a solution:
Filters -> Generic -> Convolution Matrix
A screenshot plus an OCR run using YAGF/cuneiform 0.9 is here:
http://www.imageno.com/xoextgqosk1upic.html
--
rich
Have you tried a plain the-hell-with-it gaussian blur? It makes the dots
disappear (I tried 5 pixels on your sample) and the OCR should be able to
make up for the fuzzy digits.
--
Bertrand
Assuming: http://i31.tinypic.com/28tfh2r.jpg.
There are entries in the Gimp menu named Grow/Shrink. They never worked
for me, but MAYBE they work for b/w images. (If not, I have a script
which does the same for arbitrary images.)
Grow the white area by 1px, then shrink it by 1px. (I do not remember
what "Grow" grows, whites or blacks - experiment.)
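The grow-then-shrink idea can be sketched outside Gimp as well. What follows is only a toy illustration in plain Python, not my script and not what Gimp does internally: the helper names, the square neighbourhood and the 7x7 test patch are all assumptions made for the example. Growing white is a maximum filter, shrinking white is a minimum filter, and applying one after the other plugs black specks smaller than the neighbourhood.

```python
def _neighbourhood_filter(img, radius, pick):
    """Apply pick (max or min) over a square window clipped to the image."""
    h, w = len(img), len(img[0])
    return [
        [
            pick(
                img[yy][xx]
                for yy in range(max(0, y - radius), min(h, y + radius + 1))
                for xx in range(max(0, x - radius), min(w, x + radius + 1))
            )
            for x in range(w)
        ]
        for y in range(h)
    ]


def grow_white(img, radius=1):
    # A pixel turns white if any pixel in its window is white.
    return _neighbourhood_filter(img, radius, max)


def shrink_white(img, radius=1):
    # A pixel stays white only if every pixel in its window is white.
    return _neighbourhood_filter(img, radius, min)


def remove_small_dots(img, radius=1):
    # Grow white first (dots get plugged), then shrink white again
    # (stroke edges move back to where they were).
    return shrink_white(grow_white(img, radius), radius)


# Hypothetical 7x7 patch: 255 = white paper, 0 = black ink.
# Columns 0-2 form a 3-pixel-wide stroke; (3, 5) is a single raster dot.
patch = [[0, 0, 0, 255, 255, 255, 255] for _ in range(7)]
patch[3][5] = 0

cleaned = remove_small_dots(patch, radius=1)
```

In this patch the dot disappears and the stroke keeps its width. A stroke only 1 or 2 pixels wide in the middle of the image would be wiped out by radius 1, so thin character parts are at risk.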
Hope this helps,
Ilya
I found another possibility:
ImageMagick http://www.imagemagick.org has a median filter which removes
the dots from the test file effectively.
Call (ImageMagick must be called from a console window):
convert 28tfh2r.jpg -median 1 out.tif
or
convert 28tfh2r.jpg -median 2 out.tif
The median filter with radius 2 reduces the dots even more than with
radius 1 but at the cost of more heavily blurred edges of the
characters. In my view it depends on the OCR software which radius
setting is better.
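To show why a median filter removes isolated dots but leaves thicker strokes alone, here is a minimal sketch in plain Python. It is not ImageMagick's implementation: the square window of side 2*radius+1, the clipping at the image border and the 7x7 test patch are assumptions made for the example.

```python
from statistics import median_low


def median_filter(img, radius=1):
    """Replace each pixel by the median of its square neighbourhood.

    The window has side 2*radius+1 and is clipped at the image border.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[yy][xx]
                for yy in range(max(0, y - radius), min(h, y + radius + 1))
                for xx in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            # median_low always returns a value that occurs in the window.
            out[y][x] = median_low(window)
    return out


# Hypothetical 7x7 patch: 255 = white paper, 0 = black ink.
# Columns 0-2 form a 3-pixel-wide stroke; (3, 5) is a single raster dot.
patch = [[0, 0, 0, 255, 255, 255, 255] for _ in range(7)]
patch[3][5] = 0

cleaned = median_filter(patch, radius=1)
```

The lone dot is outvoted by the white pixels around it and vanishes, while each stroke pixel still has a black majority in its window. A larger radius outvotes bigger dots but also starts to eat thin strokes and corners, which matches the heavier blurring seen with radius 2.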
--
Wilfried Hennings
Please reply in the newsgroup; the mail address is invalid.
Hi Ilya,
I cannot find the Grow/Shrink menu items (I'm using Gimp-2.7.2/GIT)
Thanks for a pointer,
Helmut.
--
Helmut Jarausch
Lehrstuhl fuer Numerische Mathematik
RWTH - Aachen University
D 52056 Aachen, Germany
The only ones I know of are in the Selection menu (in my 2.4). They
grow/shrink the selection. For instance, if you make a 20x20 selection
and "grow" it by 10 pixels, you get a 40x40 selection with round corners.
For the problem at hand, I think that what is suggested is:
- select by color on the background (the dots won't be included in the
selection)
- "grow" the selection by just enough pixels to include them
- "shrink" the selection by the same amount; this won't restore the
selection around the dots (they will remain "plugged"), but it will
otherwise put the selection back on the letters' edges
- erase or color-fill the selection to remove the dots.
This can work if no letter part is thinner than the dots, and if the
closed areas in the characters (0, 6, 8, for instance) are bigger than
twice the "grow" factor.
--
Bertrand
The dots scan into the greyscale image at about 2x2 pixels, of which some
are hard black but some are grey levels.
How about using Threshold to eliminate the grey ones? I've tried it on
your sample, and at between 25 and 35 a lot of the dot pixels are
eliminated without too much damage to the figures.
It's crude, but it may be enough to give your OCR enough of a hand to
improve recognition markedly.
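As a rough sketch of what such a threshold does to a single dot (plain Python, with made-up grey values; the level 30 is simply taken from the middle of the 25 to 35 range, and the function name is only for illustration):

```python
def threshold(img, level=30):
    """Map every pixel to pure white (255) or pure black (0).

    Pixels at or above level become white, darker ones become black.
    """
    return [[255 if value >= level else 0 for value in row] for row in img]


# Hypothetical 2x2 raster dot: one hard black pixel, three grey ones.
dot = [[0, 90],
       [110, 60]]

print(threshold(dot))  # [[0, 255], [255, 255]]
```

Only the hard black pixel of the dot survives, so the dot shrinks from 2x2 to a single pixel, while the solid black cores of the digits are untouched. Grey anti-aliased edges of the digits are lost too, which is the "not too much damage" part.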
Good luck with it, friend.
--
Yours sincerely
Mervyn Carter
================================================
People can be divided into 10 groups
1 group who understand binary maths
1 group who don't understand binary maths
================================================
Cheers
--
Yours again
Mervyn Carter
================================================
"The brain is a wonderful organ. It starts working the moment you
get up in the morning and does not stop until you get into the
office"
Robert Frost
================================================
>> I cannot find the Grow/Shrink menu items (I'm using Gimp-2.7.2/GIT)
Filters/Generic/ Dilate and Erode.
For me, they do just some bullshit unrelated to their documentation.
I put my script into ilyaz.org/software/gimp. It puts the commands
into Filters/Edge-detect.
Yours,
Ilya