fax -> tiff -> ocr -> txt, bad quality of text


Lukasz Szybalski

Jan 28, 2008, 1:12:57 PM
to tesseract-ocr
Hello,
I have thousands of faxes stored on my Linux machine. I would
like to convert them to txt files using tesseract, but for some reason
the resulting text is of really bad quality.

Does tesseract need to be trained first before any OCR can be done,
even for English (eng)?

The text looks like:

YES I] NO [I
1081 CASCADE on
CITY
STATE ZIP |
AGENCY D DIRECT EIILI.
^uR0R^ `L 60506 eanzmjesrnrntj
PHONE NUMBER(S) | 2 | 4 I] 6 pay In 2 I]
4 D 6 CI11 pay
H= (eso) s4o.3o1s w= (ass)
999.9999 . . , ,


What can I do to improve this process?

Lucas

Hussein Al-Hussein

Jan 28, 2008, 1:31:14 PM
to tesser...@googlegroups.com
With faxes, the problem is usually in the preprocessing, i.e. the image enhancement.  The noise may be high and the resolution may be low.  If tesseract cannot handle the images as they are, passing them through a preprocessor that cleans them up first would improve the recognition accuracy a lot.

Hussein

Lukasz Szybalski

Jan 28, 2008, 4:09:48 PM
to tesser...@googlegroups.com
On Jan 28, 2008 2:24 PM, Hussein Al-Hussein <al_o...@hotmail.com> wrote:
>
> Hello,
>
> Your fax images look fine and cleaner than most fax images I have seen.
> The resolution is 96 DPI and the background is clean and the characters are
> not damaged or touching.


> Do you know if tesseract reads images of 1 bit/pixel (binarized and
> compacted)? Your images are 1 bit/pixel. If tesseract expects 8 or 24
> bits/pixel then open one image using paint on Windows and save it (SAVE AS)
> BMP or JPG with 256 colors (8 bits/pixel) and run tesseract on the result to
> see if it will work.


> > I used identify on one of the faxes.
> > identify fax000130161.tif
> > fax000130161.tif[0] TIFF 1728x2148 DirectClass 137kb
> > fax000130161.tif[1] TIFF 1728x2148 DirectClass 137kb
> > fax000130161.tif[2] TIFF 1728x2148 DirectClass 137kb

> >
> > If you could tell me what to look for, what is best resolution or some
> > specific property of that file I should be passing to tesseract.
> > Are they too big, small?
> >
> > I assume I would be able to convert them using convert from
> > ImageMagick, so if you could also provide me with the command line
> > arguments that would be great.
> >
> > convert --resolution --some other arguments filename.tiff

I just ran OCR on my whole directory and the results vary a lot.
for i in *.tif; do tesseract "$i" "$i"; done

Some pages are near perfect, some are really bad.
What is the optimal resolution for the preprocessed files?
How does tesseract deal with 3-page TIFF files? Maybe 60% of my images
are multi-page, but only the first page gets converted.

What would be the ImageMagick command to convert a 3-page image into
single-page images? Or is there a tesseract option that would tell it
to scan all pages?

Lucas

Scan...@gmail.com

Jan 28, 2008, 7:46:19 PM
to tesseract-ocr
You want to convert your faxes, which may be 100 x 200 dpi, to 300 x 300 dpi.
The low resolution may be the reason for the poor results. Also, you would
want to split each of your multi-page files into single pages and then merge
the text results back together. Check out libtiff for functions.

Obviously you need to be a programmer to do this stuff.
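
The split, OCR, and merge steps described above can be sketched in a few lines of shell. This is only a sketch under assumptions: ImageMagick's convert and the tesseract command are both on the PATH, tesseract is invoked as "tesseract image outputbase", and the function name ocr_multipage and the page_NNN.tif names are made up for illustration.

```shell
# Split a multi-page TIFF into single pages, OCR each page, and merge
# the text back together in page order.
# Usage: ocr_multipage fax000130161.tif   (writes fax000130161.txt)
ocr_multipage() (
    src=$1
    out=${src%.*}.txt
    tmp=$(mktemp -d) || exit 1
    status=0

    # +adjoin plus %03d in the output name: one numbered file per page.
    convert "$src" +adjoin "$tmp/page_%03d.tif" || status=1

    : > "$out"
    if [ "$status" -eq 0 ]; then
        # Zero-padded names keep the glob in page order.
        for page in "$tmp"/page_*.tif; do
            # tesseract adds the .txt extension to the output base name.
            tesseract "$page" "${page%.tif}" || { status=1; break; }
            cat "${page%.tif}.txt" >> "$out"
        done
    fi

    rm -rf "$tmp"
    exit "$status"
)
```

To process a whole directory, something like `for f in *.tif; do ocr_multipage "$f"; done` should do. libtiff's tiffsplit tool could probably replace the convert step, at the cost of different page file names.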


Lukasz Szybalski

Jan 29, 2008, 2:54:37 PM
to tesser...@googlegroups.com
On Jan 28, 2008 6:46 PM, gl...@jetsoftdev.com <Scan...@gmail.com> wrote:
>
> You want to convert your faxes, which may be 100 x 200 dpi, to 300 x 300 dpi.
These are common resolutions I am using:
Resolution: 204x196
Resolution: 204x196
Resolution: 204x196
Resolution: 204x98
Resolution: 204x196
Resolution: 204x196
Resolution: 204x196
Resolution: 204x98
Resolution: 204x98
Resolution: 204x98
Resolution: 204x98
Resolution: 204x98
Resolution: 204x98
Resolution: 300x300
Resolution: 300x300
Resolution: 204x98
Resolution: 204x98
Resolution: 204x98
Resolution: 204x196
Resolution: 204x98
Resolution: 204x98
Resolution: 204x196
Resolution: 204x196
Resolution: 204x98
Resolution: 204x98
Resolution: 204x98
Resolution: 204x98
Resolution: 204x98
Resolution: 204x98
Resolution: 204x98
Resolution: 204x98
Resolution: 204x98
Resolution: 204x98
Resolution: 204x196
Resolution: 204x98
Resolution: 204x196
Resolution: 204x196
Resolution: 204x196
Resolution: 204x196
Resolution: 408x391
Resolution: 408x391
Resolution: 408x391
Resolution: 204x98
Resolution: 204x98
Resolution: 204x98
Resolution: 408x391
Resolution: 408x391
Resolution: 204x98
Resolution: 408x391

Is 204 not enough?
Also, I tried resizing the 204x98 files to 300x300, but the result wasn't
really readable. Is aspect ratio tied to resolution, or is that a
separate thing? I am thinking that maybe the 204x98 files should be
resampled to 300x150 or something like that?

Since I didn't get an answer, I assume tesseract doesn't need training
for the eng language since it comes prebuilt with it. Correct?

Lucas

> This may be the reason for poor results. Also, you would want to split
> each of your multiple page files to single pages and then merge the
> text results back together. Check out lib tiff for functions.

I will see if ImageMagick is able to do that. I think there should be
a tool available to do that.

--
Paper Less?
http://lucasmanual.com/mywiki/ImageManagement

Jeffrey Ratcliffe

Jan 29, 2008, 3:14:14 PM
to tesser...@googlegroups.com
On 29/01/2008, Lukasz Szybalski <szyb...@gmail.com> wrote:
> These are common resolutions I am using:

To get reasonable results, the resolution should be 300dpi in both
directions. I use 400dpi. With 204x98, there is not much you can do,
as the information is already lost. Resampling to 300dpi will just
make the image larger without adding back the lost information.

Regards

Jeff

Hussein Al-Hussein

Jan 29, 2008, 3:31:05 PM
to tesser...@googlegroups.com

> Date: Tue, 29 Jan 2008 13:54:37 -0600
> From: szyb...@gmail.com

> Is 204 not enough?
> Also, I tried resizing the 204x98 files to 300x300, but the result wasn't
> really readable. Is aspect ratio tied to resolution, or is that a
> separate thing? I am thinking that maybe the 204x98 files should be
> resampled to 300x150 or something like that?
>
> Since I didn't get an answer, I assume tesseract doesn't need training
> for the eng language since it comes prebuilt with it. Correct?
>
> Lucas

I have written an OCR for Canon before, and below 300 DPI in both directions characters suffer damage: small, non-bold text becomes too thin.  At 300 DPI, a character stroke is around 3 pixels thick at size 12. At 200 DPI, it is around 2 pixels thick.  So low resolution gives very thin character strokes.

Do not change the aspect ratio; many recognition engines rely on it as a factor in identifying the character.  The segmentation code also uses it as one of the many factors for telling text from non-text.

From the three images you sent me, I was surprised that the quality of the characters looks very good and clean compared to other fax pages I see.  So my belief is that some of the preprocessor code may be assuming high resolution etc.  If I were to read your images myself, I would read them fine.

For instance, if the code samples the page every three or four pixels in both directions for skew detection etc., then most of the characters in your faxes would be skipped, even though they are clear to the human eye.

Hussein
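
As a rough illustration of the stroke-thickness figures above, the small helper below extrapolates them to the resolutions listed earlier in the thread. It assumes thickness scales linearly with DPI from the quoted 3 pixels at 300 DPI, which is a simplification; the function name stroke_px is made up for this sketch.

```shell
# Rough stroke thickness in pixels for 12-point text at a given DPI,
# assuming linear scaling from the quoted 3 px at 300 DPI
# (rounded to the nearest pixel).
# Usage: stroke_px DPI
stroke_px() {
    echo $(( ($1 * 3 + 150) / 300 ))
}
```

Under that assumption, a 204x98 fax has vertical strokes about 2 pixels wide (`stroke_px 204`) but horizontal strokes only about 1 pixel tall (`stroke_px 98`).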

Ray Smith

Jan 29, 2008, 6:39:50 PM
to tesser...@googlegroups.com
Aspect ratio is part of the reason for the poor results. Apart from the fact that a lot of the information is already lost, Tesseract does not correctly interpret 200x100 as requiring line doubling to restore the correct aspect ratio. This exacerbates the loss of information. Resampling to 200x200 would improve the situation.
Ray.
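
The aspect-ratio correction described above comes down to stretching each page vertically by the ratio of horizontal to vertical resolution. A minimal sketch, assuming ImageMagick's convert is available and that the resolutions are passed in by hand (for example, as reported by identify); the function names are hypothetical.

```shell
# Print the vertical stretch, in percent, that brings the vertical
# resolution up to the horizontal one (rounded to the nearest percent).
# Usage: vertical_scale_percent XRES YRES
vertical_scale_percent() {
    echo $(( ($1 * 100 + $2 / 2) / $2 ))
}

# Stretch a page vertically so that both resolutions match.
# Usage: fix_aspect IN.tif XRES YRES OUT.tif
fix_aspect() {
    pct=$(vertical_scale_percent "$2" "$3")
    # -scale resizes by pixel replication/averaging; for 200x100 input
    # this is plain line doubling (100%x200%).
    convert "$1" -scale "100%x${pct}%" "$4"
}
```

For a 204x98 fax, `vertical_scale_percent 204 98` gives 208, so `fix_aspect in.tif 204 98 out.tif` stretches the page to roughly twice its height. The width is left alone, so no horizontal detail is resampled.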

Scan...@gmail.com

Jan 31, 2008, 8:41:12 PM
to tesseract-ocr
You can convert to grayscale and then use one of various interpolation
methods to resample to equal x and y resolution.

Or you could train a language for each aspect ratio, treating each one
like a font.
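
A sketch of the grayscale-plus-interpolation route, assuming ImageMagick's convert. The Triangle (bilinear) filter is just one possible interpolation choice, and the -depth 8 and -compress none settings are an assumption meant to keep a fax compression setting inherited from the input from forcing the output back to 1 bit. The function name gray_resample is made up for this example.

```shell
# Resample a fax page to equal x/y resolution via grayscale interpolation.
# Usage: gray_resample IN.tif XRES YRES OUT.tif
gray_resample() {
    # Vertical stretch, in percent, that brings YRES up to XRES.
    pct=$(( ($2 * 100 + $3 / 2) / $3 ))
    # Promote to 8-bit gray first so the filter can produce
    # intermediate gray levels instead of pure black and white.
    convert "$1" -colorspace Gray -depth 8 -compress none \
        -filter Triangle -resize "100%x${pct}%" "$4"
}
```

For example, `gray_resample fax.tif 204 98 fax_square.tif` would stretch a 204x98 page vertically by 208%. Whether the smoother strokes actually help recognition would need to be checked on real pages.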

On Jan 29, 6:39 pm, "Ray Smith" <theraysm...@gmail.com> wrote:
> Aspect ratio is part of the reason for the poor results. Apart from the fact
> that a lot of the information is already lost, Tesseract does not correctly
> interpret 200x100 as requiring line doubling to restore the correct aspect
> ratio. This exacerbates the loss of information. Resampling to 200x200 would
> improve the situation.
> Ray.