automatic evaluation ideas

ian

Jun 17, 2009, 6:21:29 PM
to moz-ho...@googlegroups.com
Another answer to the various layout concerns in the RC1 release is to
consider what, if any, plans are in place for automatic evaluation of
OCR quality, using either simple spell checking or OCR software-reported
confidence levels.

As things are currently arranged, the main focus of attention when using
the extension is the line segment images and their matching text
snippets. Full page images and full page formatted text previews are
available, but they aren't where the real work gets done. As such it
makes sense that the segments are arranged in the center of the screen.

Comparing each line of text to its matching image is the most thorough
way to proofread with the information we have, but there are
disadvantages as well. Loading every line is resource intensive and
reading every line twice for accuracy is slower and more boring than the
task needs to be, given the high accuracy rate of the OCRing. Also,
since you only have access to a page's worth of editable text at a time,
you can't run spell check or perform global word replacements against
the whole document. For hard-to-OCR words and characters, like accented
characters and special punctuation, these global tools would allow the
correction of a large number of errors.

As an alternative approach, we could use some automatic tools for
highlighting portions of the document for human OCR review. If it were
possible to only load image segments for words/characters that OCR
reported trouble deciphering, or words that were spelled incorrectly,
that would alleviate a lot of the difficulty with loading images for
each line.
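As a rough sketch of what that filter might look like (the word list, confidence threshold, and input format below are stand-ins for illustration, not anything the extension or OCRopus actually provides):

```python
# Flag words for human review if the OCR engine reported low confidence
# or the word fails a simple dictionary check. All names and data here
# are hypothetical.

ENGLISH_WORDS = {"the", "quick", "brown", "fox", "jumps"}  # stand-in word list
CONFIDENCE_THRESHOLD = 0.85

def words_needing_review(ocr_words):
    """ocr_words: list of (word, confidence) pairs from the OCR engine."""
    flagged = []
    for word, confidence in ocr_words:
        low_confidence = confidence < CONFIDENCE_THRESHOLD
        misspelled = word.lower().strip(".,;:!?") not in ENGLISH_WORDS
        if low_confidence or misspelled:
            flagged.append(word)
    return flagged

sample = [("the", 0.98), ("qnick", 0.61), ("brown", 0.95), ("f0x", 0.88)]
print(words_needing_review(sample))  # ['qnick', 'f0x']
```

Only the flagged words would need their image segments loaded; everything else could be presented as plain editable text.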

Then we could load the entire document into the browser at once, as we
do now for the preview, but in editable form and with line images
displayed only in those portions of the document where review is really
needed. Any global errors could meanwhile be taken care of using
standard browser search and replace functions. In this arrangement of
things the traditional preview and full page image panes might not be
necessary, or only needed in an on-demand fashion.

What do you think?

-Ian

Jim Garrison

Jun 18, 2009, 6:01:56 AM
to ian, moz-ho...@googlegroups.com
ian wrote:
> Comparing each line of text to its matching image is the most thorough
> way to proofread with the information we have, but there are
> disadvantages as well. Loading every line is resource intensive and

Don't worry too much about computational resources. There's no reason
you have to load a large number of lines at once...

> As an alternative approach, we could use some automatic tools for
> highlighting portions of the document for human OCR review. If it were
> possible to only load image segments for words/characters that OCR
> reported trouble deciphering, or words that were spelled incorrectly,
> that would alleviate a lot of the difficulty with loading images for
> each line.

If OCRopus 0.4 reports such information, this could be a great idea. I
haven't compiled the new OCRopus myself, but if you have, please attach
a sample document so I can see if anything has changed (even if said
feature is not available).

> Then we could load the entire document into the browser at once, as we
> do now for the preview, but in editable form and with line images
> displayed only in those portions of the document where review is really
> needed. Any global errors could meanwhile be taken care of using
> standard browser search and replace functions. In this arrangement of
> things the traditional preview and full page image panes might not be
> necessary, or only needed in an on-demand fashion.

I think you're thinking in the right direction.

- Jim

ian

Jun 18, 2009, 8:54:15 AM
to Jim Garrison

Jim Garrison wrote:
> ian wrote:
>> Comparing each line of text to its matching image is the most thorough
>> way to proofread with the information we have, but there are
>> disadvantages as well. Loading every line is resource intensive and
>
> Don't worry too much about computational resources. There's no reason
> you have to load a large number of lines at once...
>
>> As an alternative approach, we could use some automatic tools for
>> highlighting portions of the document for human OCR review. If it were
>> possible to only load image segments for words/characters that OCR
>> reported trouble deciphering, or words that were spelled incorrectly,
>> that would alleviate a lot of the difficulty with loading images for
>> each line.
>
> If OCRopus 0.4 reports such information, this could be a great idea. I
> haven't compiled the new OCRopus myself, but if you have please attach a
> sample document so I can see if anything has changed (even if said
> feature is not available).

I'm still trying to get it to compile as well; I'll give it another try
today. But the documentation seems to indicate that it has the
capability. See: https://docs.google.com/View?id=dfxcv4vc_67g844kf
specifically the "OCR Engine-Specific Markup" section, where it lists
this as one of the markup components:

    x_confs c1 c2 c3 …
        OCR-engine specific character confidences
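If OCRopus does emit that property in its hOCR output, reading the confidences back out should be straightforward. A sketch (this just scans the markup for x_confs entries; the actual element classes and attribute layout in 0.4's output may differ):

```python
# Pull per-entry confidence lists out of hOCR-style markup by scanning
# for x_confs properties. The sample input below is hypothetical.
import re

def extract_confidences(hocr_text):
    """Return a list of confidence-value lists, one per x_confs entry."""
    results = []
    for match in re.finditer(r"x_confs((?:\s+[\d.]+)+)", hocr_text):
        results.append([float(c) for c in match.group(1).split()])
    return results

sample = "<span class='ocr_line' title='x_confs 97 95 42 99'>word</span>"
print(extract_confidences(sample))  # [[97.0, 95.0, 42.0, 99.0]]
```

Low values in those lists would mark the characters whose image segments actually need to be shown for review.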

Tesseract appears to be able to output confidence levels for words or
individual characters:
http://groups.google.com/group/tesseract-ocr/browse_thread/thread/d8841792145f327f/14aa64858be0e993?lnk=gst&q=confidence#14aa64858be0e993
but at the moment the relevant API calls appear to be implemented only
in a Windows DLL:
http://groups.google.com/group/tesseract-ocr/browse_thread/thread/e2a361adc378e086/5e70052f2ff8d154?lnk=gst&q=confidence#5e70052f2ff8d154


I pinged the tesseract list to see if things have changed since the 2007
discussion. Thread at:
http://groups.google.com/group/tesseract-ocr/browse_thread/thread/894b87d0bfabe1fe#

-Ian

Jim Garrison

Jun 29, 2009, 7:59:56 PM
to ian, moz-ho...@googlegroups.com
ian wrote:
> I pinged the tesseract list to see if things have changed since the 2007
> discussion. Thread at:
> http://groups.google.com/group/tesseract-ocr/browse_thread/thread/894b87d0bfabe1fe#

Sounds like things haven't changed... ? :-)
