As things are currently arranged, the main focus of attention when using
the extension is the line segment images and their matching text snippets.
Full page images and full page formatted text previews are available but
they aren't where the real work gets done. As such it makes sense that
the segments are arranged in the center of the screen.
Comparing each line of text to its matching image is the most thorough
way to proofread with the information we have, but there are
disadvantages as well. Loading every line is resource intensive, and
reading every line twice for accuracy is slower and more boring than the
task needs to be, given the high accuracy rate of the OCR. Also,
since you only have access to a page's worth of editable text at a time,
you can't run spell check or perform global word replacements against
the whole document. For hard-to-OCR words/characters, like accented
characters and special punctuation, these global tools would allow a
large number of errors to be corrected at once.
As an alternative approach, we could use some automatic tools for
highlighting portions of the document for human OCR review. If it were
possible to only load image segments for words/characters that OCR
reported trouble deciphering, or words that were spelled incorrectly,
that would alleviate a lot of the difficulty with loading images for
each line.
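A minimal sketch of that filtering step, written in Python for brevity
(the extension itself would do this in the browser). Everything here is
an assumption for illustration: the OcrWord/OcrLine types, the
per-word confidence score, and the 0.9 threshold are hypothetical, and
whether OCRopus reports a confidence value at all is not confirmed.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class OcrWord:
    text: str
    confidence: float  # hypothetical 0.0-1.0 score; the OCR engine may not provide one


@dataclass
class OcrLine:
    line_id: int
    words: List[OcrWord]


def lines_needing_review(lines: List[OcrLine], dictionary: Set[str],
                         min_confidence: float = 0.9) -> List[int]:
    """Return ids of lines with at least one low-confidence or unknown word."""
    flagged = []
    for line in lines:
        for word in line.words:
            # Strip surrounding punctuation before the dictionary lookup.
            bare = word.text.strip(".,;:!?\"'()").lower()
            low_conf = word.confidence < min_confidence
            unknown = bool(bare) and bare not in dictionary
            if low_conf or unknown:
                flagged.append(line.line_id)
                break  # one suspicious word is enough to flag the line
    return flagged


dictionary = {"the", "quick", "brown", "fox", "jumps"}
lines = [
    OcrLine(0, [OcrWord("The", 0.99), OcrWord("quick", 0.98)]),
    OcrLine(1, [OcrWord("brovvn", 0.97), OcrWord("fox", 0.99)]),
    OcrLine(2, [OcrWord("jumps", 0.55)]),
]
flagged = lines_needing_review(lines, dictionary)
```

Only the flagged lines (1 and 2 in this toy input) would need their
image segments loaded; line 0 could be shown as plain editable text.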
Then we could load the entire document into the browser at once, as we
do now for the preview, but in editable form and with line images
displayed only in those portions of the document where review is really
needed. Any global errors could meanwhile be taken care of using
standard browser search and replace functions. In this arrangement of
things the traditional preview and full page image panes might not be
necessary, or only needed in an on-demand fashion.
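To make the global-replacement idea concrete, here is a rough sketch in
Python of applying one table of corrections to every page at once. The
function name, the list-of-page-strings layout, and the sample
corrections are all hypothetical; in the extension this would more
likely be the browser's own search and replace.

```python
import re
from typing import Dict, List, Tuple


def replace_everywhere(pages: List[str],
                       corrections: Dict[str, str]) -> Tuple[List[str], int]:
    """Apply whole-word corrections to every page; return new pages and hit count."""
    if not corrections:
        return list(pages), 0
    # One alternation of all misread words, matched only on word boundaries.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in corrections) + r")\b"
    )
    total = 0
    fixed = []
    for text in pages:
        new_text, hits = pattern.subn(lambda m: corrections[m.group(1)], text)
        fixed.append(new_text)
        total += hits
    return fixed, total


pages = ["The cafe was quiet.", "A naive cafe owner ran the cafeteria."]
corrections = {"cafe": "café", "naive": "naïve"}
fixed_pages, total_hits = replace_everywhere(pages, corrections)
```

Matching on word boundaries keeps "cafeteria" untouched while both
occurrences of "cafe" are corrected, which is the kind of document-wide
fix a page-at-a-time editor can't offer.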
What do you think?
-Ian
Don't worry too much about computational resources. There's no reason
you have to load a large number of lines at once...
> As an alternative approach, we could use some automatic tools for
> highlighting portions of the document for human OCR review. If it were
> possible to only load image segments for words/characters that OCR
> reported trouble deciphering, or words that were spelled incorrectly,
> that would alleviate a lot of the difficulty with loading images for
> each line.
If OCRopus 0.4 reports such information, this could be a great idea. I
haven't compiled the new OCRopus myself, but if you have, please attach a
sample document so I can see whether anything has changed (even if that
feature is not available).
> Then we could load the entire document into the browser at once, as we
> do now for the preview, but in editable form and with line images
> displayed only in those portions of the document where review is really
> needed. Any global errors could meanwhile be taken care of using
> standard browser search and replace functions. In this arrangement of
> things the traditional preview and full page image panes might not be
> necessary, or only needed in an on-demand fashion.
I think you're thinking in the right direction.
- Jim
Sounds like things haven't changed... ? :-)