Using OCR on fax-to-email


Jennifer Bell

Jul 10, 2008, 6:09:45 PM
to get.theinfo
Has anyone had any experience using optical character recognition
software like gocr on documents that were originally faxes? I was
wondering if there are any statistics on how much reliability
drops.

The project is to find ways to process responses to Access to
Information Requests from the Canadian government that come in via
fax. There's a planning group for the tool here:

http://groups.google.com/group/visiblegovernment_atoi

Jennifer Bell

Lukasz Szybalski

Jul 10, 2008, 6:47:57 PM
to get-t...@googlegroups.com
On Thu, Jul 10, 2008 at 5:09 PM, Jennifer Bell
<jenn...@visiblegovernment.ca> wrote:
>
> Has anyone had any experience using optical character recognition
> software like gocr on documents that were originally faxes? I was
> wondering if there were any statistics on how much reliability
> dropped.


Well, no statistics from me, but I have used "tesseract" (one of the
top 3 engines in the 1995 UNLV Accuracy test):
http://code.google.com/p/tesseract-ocr/


I receive about 500 faxes a day and wanted to process them. My problem
was that the files needed to be in a certain resolution for OCR to work
correctly. Here is the discussion:
http://groups.google.com/group/tesseract-ocr/browse_thread/thread/24bb53ae551eb38c/c25481d22fb10fe3?lnk=gst&q=szybalski#c25481d22fb10fe3
From my initial research I thought they were able to achieve 99%+ accuracy.

I didn't have time to look into how to pre-process fax images to
200x200, but if you will be looking into this please let me know, as I
would like to get the processing rolling.
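
For anyone picking this up, here is a rough sketch of just the resampling arithmetic. It assumes "200x200" means 200 dpi on both axes and that the incoming pages use the usual Group 3 fax resolutions (204x98 dpi in standard mode, 204x196 dpi in fine mode), where pixels are not square. The function name and the sample page size are made up for illustration; the actual resize would still have to be done with an image library.

```python
def square_pixel_size(width, height, x_dpi, y_dpi, target_dpi=200):
    """Return (new_width, new_height) in pixels for a page resampled
    to target_dpi on both axes, keeping the physical page size."""
    new_width = round(width * target_dpi / x_dpi)
    new_height = round(height * target_dpi / y_dpi)
    return new_width, new_height


# Hypothetical standard-mode page: 1728 pixels wide at 204 dpi,
# 1078 rows at 98 dpi (about 11 inches tall).
print(square_pixel_size(1728, 1078, 204, 98))
```

The resulting size could then be passed to something like Pillow's Image.resize before handing the page to the OCR engine. Whether that alone gets the accuracy up is something to test on real faxes.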

Thanks,
Lucas

Jennifer Bell

Jul 13, 2008, 2:56:16 PM
to get.theinfo
Thank you for the information. I've received word that the Government
of Canada does not plan to support receiving Access to Information
requests by email, so it looks like using fax for both sending
requests and receiving responses may be the way to go.

For more background, the intent of the project is to duplicate the UK
site WhatDoTheyKnow.com in Canada. WhatDoTheyKnow.com allows users to
track their outstanding Access to Information requests and share
responses. In doing so, the site collects overall statistics on
response times, as well as number of requests by department or by
topic. The goal is to highlight the current issues in the Access to
Information system, and target which areas of government information
ought to be exposed by default.

Any Canadians on the list, or any people with experience in fax
processing, are welcome to join the discussion group to contribute to
the planning process or show support for the project in general:

http://groups.google.com/group/visiblegovernment_atoi

Jennifer Bell
visiblegovernment.ca

Lukasz Szybalski

Jul 13, 2008, 11:56:42 PM
to get-t...@googlegroups.com
On Sun, Jul 13, 2008 at 1:56 PM, Jennifer Bell
<jenn...@visiblegovernment.ca> wrote:
>
> Thank you for the information. I've received word that the Government
> of Canada does not plan to support receiving Access to Information
> requests by email, so it looks like using fax for both sending
> requests and receiving responses may be the way to go.
>
> For more background, the intent of the project is to duplicate the UK
> site WhatDoTheyKnow.com in Canada. WhatDoTheyKnow.com allows users to
> track their outstanding Access to Information requests and share
> responses. In doing so, the site collects overall statistics on
> response times, as well as number of requests by department or by
> topic. The goal is to highlight the current issues in the Access to
> Information system, and target which areas of government information
> ought to be exposed by default.
>

Just out of curiosity, what type of images are these? Are these fax
images received by a fax server (such as hylafax), or are these actual
paper faxes that are somehow scanned into images?

Lucas

Jennifer Bell

Jul 14, 2008, 12:02:46 PM
to get.theinfo

> Just out of curiosity, what type of images are these? Are these fax
> images received by a fax server (such as hylafax), or are these actual
> paper faxes that are somehow scanned into images?

Well... they're received directly as fax. The current thought, pending
further investigation, is to avoid managing the fax lines by using a
small business service such as this one:

http://www.soho.ca/benefits/srFax.htm

Let me know if this is a bad idea.

Really, only the cover page has to be accurately text-decoded so that
the response can automatically be associated with an account, and the
requester notified. Perhaps this can be done using a keyword
followed by a tracking number. We can provide the return cover page
with the request, or ask that they include the keyword and tracking
number in a printed cover page. Responses where the keyword isn't
recognized can be routed by hand, by volunteers, but it's hoped that
this is the minority case.
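
A minimal sketch of what the keyword-plus-tracking-number lookup could look like on the OCR output of the cover page. Everything specific here is an assumption: the keyword "ATIREF", the 6-digit number format, and the function name are invented for illustration, and the digit clean-up only covers a few letter/digit mix-ups that OCR is known to make.

```python
import re

# "ATIREF" is a made-up keyword; the thread has not picked one.
KEYWORD = "ATIREF"

# Keyword (any case), optional separator, then 6 characters that are
# digits or letters OCR often confuses with digits (assumed format).
TRACKING_RE = re.compile(rf"(?i:{KEYWORD})\s*[:#-]?\s*([0-9OoIl]{{6}})")

# Map the confusable letters back to the digits they probably were.
OCR_DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1"})


def find_tracking_number(ocr_text):
    """Return the tracking number as a digit string, or None if the
    keyword was not found (such a fax would be routed by hand)."""
    match = TRACKING_RE.search(ocr_text)
    if match is None:
        return None
    return match.group(1).translate(OCR_DIGIT_FIXES)


cover_page = "FAX COVER SHEET\nATIREF: 00I234\nPages: 3"
print(find_tracking_number(cover_page))
```

A return value of None corresponds to the "route by hand" case described above.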

If the rest of the documents can also be OCR'd when the user decides
to share them on the site, that's a bonus.

Jennifer

Lukasz Szybalski

Jul 14, 2008, 1:01:21 PM
to get-t...@googlegroups.com
On Mon, Jul 14, 2008 at 11:02 AM, Jennifer Bell
<jenn...@visiblegovernment.ca> wrote:
>
>
>> Just out of curiosity, what type of images are these? Are these fax
>> images received by a fax server (such as hylafax), or are these actual
>> paper faxes that are somehow scanned into images?
>
> Well... they're received directly as fax. The current thought is to
> use a small business service such as this one:
>
> http://www.soho.ca/benefits/srFax.htm

Questions to answer here are:

What is the cost per 1000 faxes?
How many faxes can they receive at a time? (Just 1 line, with everybody
else waiting in line?)
What file format will they email you? (I think the default is a
non-searchable PDF; a TIFF file might be an option.)
How will you transfer files from the email account to some kind of
database/management system? (I guess you could point it to some mailbox
and write a little program to extract the attachment and add it to the
database.)
Will they provide the incoming fax number, date, and time? Is that in
the email and available to you, or is it in the fax image?
Are they using the Class 1 faxing standard that most fax machines
understand, or higher?
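
On the "little program to extract attachment" point, here is a minimal sketch using only Python's standard email package. It assumes the provider sends each fax as a PDF or TIFF attachment, which is one of the open questions above. The function name and the demo message are made up, and fetching mail from the mailbox and storing it in a database are left out.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# MIME types a fax-to-email service might send; this list is an assumption.
FAX_TYPES = ("application/pdf", "image/tiff")


def fax_attachments(raw_message):
    """Return a list of (filename, payload_bytes) for PDF/TIFF attachments."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    return [
        (part.get_filename(), part.get_payload(decode=True))
        for part in msg.iter_attachments()
        if part.get_content_type() in FAX_TYPES
    ]


# Build a fake provider email to try it out (contents are made up).
demo = EmailMessage()
demo["Subject"] = "Incoming fax"
demo.set_content("You have received a fax.")
demo.add_attachment(b"II*\x00fake-tiff-bytes", maintype="image",
                    subtype="tiff", filename="fax.tif")

found = fax_attachments(demo.as_bytes())
print([name for name, _ in found])
```

The returned bytes could then be written to disk or to the database and passed on to the OCR step.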

An alternative, if you are already managing fax lines, is to use the
open source hylafax fax server on a PC with a few cheap $30 modems. I
guess it really depends on how much control you want to have over
incoming faxes.


>
> To avoid managing the fax lines, pending further investigation. Let
> me know if this is a bad idea.
>
> Really, only the cover page has to be accurately text-decoded so that
> the response can automatically be associated with an account, and the
> requester notified.

What happens if there is no cover page?

> Perhaps this can be done using a keyword
> followed by a tracking number. We can provide the return cover page
> with the request, or ask that they include the keyword and tracking
> number in a printed cover page.

If you are sending a page and want to receive it back, I guess you
could use bar codes or a tracking number, as you mentioned.


> Responses where the keyword isn't
> recognized can be routed by hand, by volunteers, but it's hoped that
> this is the minority case.
>
> If the rest of the sent documents can be OCR'd if the user decides to
> share them on the site, bonus.


Lucas

Jennifer Bell

Jul 15, 2008, 10:22:06 PM
to get.theinfo


> Questions to answer here are:
>
> What is the cost per 1000 faxes?
> How many faxes can they receive at a time? (Just 1 line, with everybody
> else waiting in line?)
> What file format will they email you? (I think the default is a
> non-searchable PDF; a TIFF file might be an option.)

Yes. These are good questions for selecting the fax provider.

> How will you transfer files from the email account to some kind of
> database/management system? (I guess you could point it to some mailbox
> and write a little program to extract the attachment and add it to the
> database.)

Yes.

> Will they provide the incoming fax number, date, and time? Is that in
> the email and available to you, or is it in the fax image?

Date ought to be in the SMTP header, or whatever, of the email. I
think that's the best reflection of when it was received.
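
A small sketch of reading that date with Python's standard library and turning it into a response time. It assumes the provider's email carries a normal Date header with a time zone. Note that the Date header is written by whoever composed the message (here, presumably the fax service), so it is worth confirming with the provider that it matches the fax arrival time. The function names and sample dates are illustrative only.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def received_at(raw_message):
    """Return the email's Date header as a datetime, or None if missing."""
    date_header = message_from_string(raw_message)["Date"]
    if date_header is None:
        return None
    return parsedate_to_datetime(date_header)


def response_days(request_sent, raw_response):
    """Whole days between sending the request and the response's Date."""
    return (received_at(raw_response) - request_sent).days


# Sample values, not real request data.
sent = parsedate_to_datetime("Thu, 10 Jul 2008 18:09:45 -0400")
response = "Date: Tue, 15 Jul 2008 22:22:06 -0400\nSubject: Incoming fax\n\nbody"
print(response_days(sent, response))
```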

> Are they using the Class 1 faxing standard that most fax machines
> understand, or higher?

Will add to the list of things to check, thanks.

> An alternative, if you are already managing fax lines, is to use the
> open source hylafax fax server on a PC with a few cheap $30 modems. I
> guess it really depends on how much control you want to have over
> incoming faxes.
>

That's true, and is an option. However, there may be some benefit to
using a third party if they keep accessible records of incoming/outgoing
fax times. It may help in resolving disputes with regard to time
tracking accuracy. The system is, by nature, somewhat adversarial.

>
>
> > To avoid managing the fax lines, pending further investigation. Let
> > me know if this is a bad idea.
>
> > Really, only the cover page has to be accurately text-decoded so that
> > the response can automatically be associated with an account, and the
> > requester notified.
>
> What happens if there is no cover page?

Then the keyword won't be recognized, and the email will be routed to
the account by a volunteer. If it comes in with no info whatsoever
about what it relates to or where it ought to go, we'll flag an error
and send it back.

> > Perhaps this can be done using a keyword
> > followed by a tracking number. We can provide the return cover page
> > with the request, or ask that they include the keyword and tracking
> > number in a printed cover page.
>
> If you are sending a page and want to receive it back, I guess you
> could use bar codes or a tracking number, as you mentioned.

Right now, there's a slight preference for keyword / number
recognition over bar code (interesting suggestion, thanks) because
then we're not locking them in to having to print out our cover sheet
and send it. They can always do their own, as long as the keyword / #
is typed on the cover sheet. Even if they handwrite it and there's a
lag of a day or so because of the volunteer routing, there'll be an
accurate record of the time it took for the agency to respond b/c of
the timestamp of the email.

I appreciate the suggestions. Please keep them coming if there are
other things we should be thinking of.

Jennifer