Using corrected text in second pass


Graham Seaman

Feb 18, 2021, 3:07:52 PM
to tesseract-ocr
I'm newish to tesseract so this may be a FAQ (though I've looked and it's not in the actual FAQs!) - please point me to the right place if it is.

My use case:

There are lots of PDFs of scanned books around which include moderately good OCRed text (e.g. on archive.org). There are also lots of epub, text or html books which have been created from this OCR output text, manually corrected (e.g. gutenberg.org). There is no feedback loop between the two: the manually corrected text is never used to improve the text embedded in the PDF. This also applies if I scan books myself and manually correct the extracted OCR text - there is no way I know of to generate a PDF with fully correct embedded text using my manual corrections.

One way to fix this might be for tesseract to take a manually corrected text as a kind of 'hint' file along with the original scanned pages, and then do a second pass to generate the final PDF version with fully correct embedded text. Obviously there could be problems around keeping the scan processing and the hint text in sync, but generally this sounds like it should be doable. Would it be? Or is there an existing way to solve the same problem (preferably not by trying to edit hOCR files)?

Graham

Tom Morris

Feb 19, 2021, 10:44:24 AM
to tesseract-ocr
On Thursday, February 18, 2021 at 3:07:52 PM UTC-5 gra...@theseamans.net wrote:

> There are lots of PDFs of scanned books around which include moderately good OCRed text (e.g. on archive.org).

OCR quality varies widely (even wildly) across scans and vintages of OCR, so it's worth checking your "moderately good" assumption for any edition/scan that you want to work with. Poor-quality OCR will make the task impossible.
 
> There are also lots of epub, text or html books which have been created from this OCR output text, manually corrected (e.g. gutenberg.org).

Gutenberg (and pgdp) texts aren't just "manually corrected" (or at least they didn't use to be), thanks to Gutenberg's "editionless" policy and the specific editorial decisions made by individual pgdp project coordinators. Just as OCR noise increases the difficulty of the task, the further the pgdp draft drifts from a 1-to-1 transcription, the harder the alignment task becomes.
 
> There is no feedback loop between the two: the manually corrected text is never used to improve the text embedded in the PDF. This also applies if I scan books myself and manually correct the extracted OCR text - there is no way I know of to generate a PDF with fully correct embedded text using my manual corrections.
>
> One way to fix this might be for tesseract to take a manually corrected text as a kind of 'hint' file along with the original scanned pages, and then do a second pass to generate the final PDF version with fully correct embedded text. Obviously there could be problems around keeping the scan processing and the hint text in sync, but generally this sounds like it should be doable. Would it be?

Alignment/synchronization is exactly the crux of the problem. The OCR output is text plus bounding-box information. In the simple case - good page segmentation, low OCR error rates, predictable pgdp editorial decisions (hyphenated words split across line endings closed up, etc.) - it's simply a matter of replacing "the quick brown fox jumped over the lazy dag" with "the quick brown fox jumped over the lazy dog". But what if the ground truth says "the quick brown fox jumped over the lazy cat" or "the quick fox jumped over the dog"? Is that because we're working with a different edition (PG never used to record editions - does it now?), or something else? The easy solution would be to fix only isolated errors with high-confidence replacements, but it's unclear how much that would leave unfixed; that would be an interesting analysis. There are also a number of ancillary issues lurking under the covers, like dealing with running headers/footers, signature numbers/marks, etc.
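
To make that concrete, here is a minimal Python sketch of the "only fix isolated errors with high-confidence replacements" idea, using the standard library's difflib. The function name and the similarity threshold are illustrative choices, not an existing tool; the character-level similarity test is what rejects a dag/cat-style substitution (possibly a different edition) while accepting a dag/dog one (a likely misrecognition):

    from difflib import SequenceMatcher

    def safe_corrections(ocr_words, truth_words, min_similarity=0.5):
        """Yield (ocr_index, corrected_word) for isolated, plausible fixes."""
        sm = SequenceMatcher(a=ocr_words, b=truth_words, autojunk=False)
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            # keep only one-word-for-one-word substitutions; skip
            # insertions, deletions and multi-word rewrites entirely
            if tag != 'replace' or i2 - i1 != 1 or j2 - j1 != 1:
                continue
            ocr_tok, truth_tok = ocr_words[i1], truth_words[j1]
            # crude misrecognition test: the correction should still
            # resemble the OCR token at the character level
            if SequenceMatcher(None, ocr_tok, truth_tok).ratio() >= min_similarity:
                yield i1, truth_tok

    ocr    = "the quick brown fox jumped over the lazy dag".split()
    truth  = "the quick brown fox jumped over the lazy dog".split()
    truth2 = "the quick brown fox jumped over the lazy cat".split()
    print(list(safe_corrections(ocr, truth)))   # [(8, 'dog')]
    print(list(safe_corrections(ocr, truth2)))  # [] - too dissimilar to trust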

I think it would be an interesting project, but it wouldn't be trivial. I don't think it needs to involve Tesseract since you could do it entirely as a post-processing step using the hOCR output and your ground truth text.

Tom

Graham Seaman

Feb 19, 2021, 1:57:28 PM
to tesser...@googlegroups.com
Thanks Tom - I probably shouldn't have given the Gutenberg example, since it introduces extra problems. In my actual process at the moment I have the source scans, the OCR output texts, and corrected text files produced by myself, so there are fewer variables to worry about. In particular, page divisions, running headers etc. can still be there in my corrected text file. Also, since the text comes from the actual PDF, there are no problems with variant editions: if the OCR says 'dog' and my text says 'cat', then the OCR is wrong and needs correcting.

So taking Tom's conclusion:

> I don't think it needs to involve Tesseract since you could do it entirely as a post-processing step using the hOCR output and your ground truth text.

Trying to think this through:

I can try to keep track of the current page in both files just by counting, and so assume I'm always working within a page.

For a simple one-column page, I guess the process starts as a text-alignment/best-match problem. I have dim memories of there being standard algorithms for this, and with all the gene-sequencing work presumably there are lots more now; Python would be a likely bet for cookbook-style examples.
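
The classic dynamic-programming algorithm from that field is Needleman-Wunsch global alignment. A cookbook-style, character-level Python sketch (fine for aligning a page at a time, though the O(n*m) table is far too big for a whole book in one go; the scoring values are arbitrary illustrative choices):

    def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
        """Globally align two strings; '-' marks a gap in the output."""
        n, m = len(a), len(b)
        # fill the dynamic-programming score table
        score = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            score[i][0] = i * gap
        for j in range(1, m + 1):
            score[0][j] = j * gap
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
                score[i][j] = max(d, score[i-1][j] + gap, score[i][j-1] + gap)
        # trace back from the bottom-right corner to recover the alignment
        out_a, out_b, i, j = [], [], n, m
        while i > 0 or j > 0:
            if i > 0 and j > 0 and score[i][j] == score[i-1][j-1] + (
                    match if a[i-1] == b[j-1] else mismatch):
                out_a.append(a[i-1]); out_b.append(b[j-1]); i -= 1; j -= 1
            elif i > 0 and score[i][j] == score[i-1][j] + gap:
                out_a.append(a[i-1]); out_b.append('-'); i -= 1
            else:
                out_a.append('-'); out_b.append(b[j-1]); j -= 1
        return ''.join(reversed(out_a)), ''.join(reversed(out_b))

    print(needleman_wunsch("the lazy dag", "the 1azy dog"))
    # ('the lazy dag', 'the 1azy dog') - mismatched characters line up 1:1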

Once I have the best-fit alignment of the two texts, I can both replace incorrect letters in the hOCR and delete unwanted letters (and their locations) from the hOCR. But the third possibility seems harder: in my experience it is quite common for OCR output to miss out whole words. How would I generate the location information for a word which is missing from the hOCR? Similarly, if there is a poorly scanned bit at the edge of a page where the OCR output is just gibberish, how do I know the locations of the characters to replace it with? Try to interpolate from the positions of the surrounding text, I guess, so you would get locations which are actually slightly off (this would not matter at all for searching within the PDF, and maybe not much for copy-and-paste?).
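
That interpolation could be as simple as dividing the horizontal gap between the two known neighbouring boxes in proportion to word lengths. A sketch, assuming a single left-to-right line and hOCR-style pixel boxes (x0, y0, x1, y1); a real page would also need line breaks and baselines handled:

    def interpolate_boxes(left_box, right_box, missing_words):
        """Invent approximate (x0, y0, x1, y1) boxes for words missing
        between two known word boxes on the same line."""
        x_start, x_end = left_box[2], right_box[0]   # horizontal gap
        y0 = min(left_box[1], right_box[1])
        y1 = max(left_box[3], right_box[3])
        # weight each word, plus one inter-word space, by character count
        units = sum(len(w) + 1 for w in missing_words) + 1
        px = (x_end - x_start) / units
        boxes, x = [], x_start + px                  # leading space
        for w in missing_words:
            width = len(w) * px
            boxes.append((round(x), y0, round(x + width), y1))
            x += width + px                          # trailing space
        return boxes

    left  = (100, 50, 160, 70)   # box of the word before the gap
    right = (400, 50, 460, 70)   # box of the word after the gap
    print(interpolate_boxes(left, right, ["quick", "brown"]))
    # [(178, 50, 271, 70), (289, 50, 382, 70)] - plausible, slightly off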

Then what happens with multi-column layout, or text that flows round image boxes? Can I still use my hypothetical text-alignment algorithm? I have no experience with hOCR and don't know how the tesseract hOCR output linearizes these things. Are there fixed rules for how the hOCR data is ordered in the file? Are there any helpful texts about hOCR? I found a formal grammar, which was no help (to me) at all, but nothing else so far.

Graham



Tom Morris

Feb 21, 2021, 11:31:33 PM
to tesseract-ocr
For alignment you're probably thinking of Burrows-Wheeler (the transform behind gene-sequence aligners like BWA and Bowtie): https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform
There's a more fully worked, and more topical, example in ReTAS, Yalniz and Manmatha's recursive text alignment scheme, which was developed to align book OCR output with Project Gutenberg texts.

All of that deals with linear texts, though. Once you venture into two-dimensional space and fixing/redoing page segmentation, you're operating in a much more complex domain.
You can see some experimentation that I did with the OCR output of the Oxford English Dictionary here: https://github.com/tfmorris/oed/blob/master/oedabby.py
It was years ago, but as I remember it I started down the path of merging/splitting existing bounding boxes, and basically ended up deciding that I was going to have to punt and re-segment/lay out the entire page from scratch using character positions (I never tried that, so it might not have worked).
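
For illustration, the two primitives that approach starts from are simple enough: the union of two boxes when the ground truth joins two OCR tokens into one word, and a proportional horizontal split for the reverse. A sketch of both, not code from oedabby.py:

    def merge_boxes(a, b):
        """Union of two (x0, y0, x1, y1) boxes."""
        return (min(a[0], b[0]), min(a[1], b[1]),
                max(a[2], b[2]), max(a[3], b[3]))

    def split_box(box, parts):
        """Split one box horizontally, in proportion to the parts' lengths."""
        x0, y0, x1, y1 = box
        total = sum(len(p) for p in parts)
        boxes, x = [], x0
        for p in parts:
            w = (x1 - x0) * len(p) / total
            boxes.append((round(x), y0, round(x + w), y1))
            x += w
        return boxes

    print(merge_boxes((100, 50, 160, 70), (165, 50, 230, 70)))  # (100, 50, 230, 70)
    print(split_box((100, 50, 220, 70), ["to", "day"]))
    # [(100, 50, 148, 70), (148, 50, 220, 70)] - widths in 2:3 ratio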

The hOCR output should (although I haven't looked recently) mirror the page segmentation output, i.e. text blocks in reading order with interspersed graphics blocks for images, etc. That's all fine in the normal case, but if you get lines merged across columns/blocks, or widows/orphans from drop caps or antiquated typesetting conventions, sorting things out is much more difficult. In the easy case, the hOCR output should be trivial to follow/match.
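
For a concrete feel, a Tesseract hOCR word looks roughly like <span class='ocrx_word' title='bbox 100 50 160 70; x_wconf 96'>the</span>, nested inside ocr_line, ocr_par, ocr_carea and ocr_page elements, in the segmenter's reading order. A minimal stdlib sketch that walks a file in that order and extracts (word, bbox, confidence) tuples; the filename is hypothetical:

    import re
    from html.parser import HTMLParser

    class WordExtractor(HTMLParser):
        """Collect (text, bbox, confidence) for every ocrx_word span."""
        def __init__(self):
            super().__init__()
            self.words, self._pending = [], None

        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if 'ocrx_word' in a.get('class', ''):
                title = a.get('title', '')
                bbox = re.search(r'bbox (\d+) (\d+) (\d+) (\d+)', title)
                conf = re.search(r'x_wconf (\d+)', title)
                self._pending = (tuple(map(int, bbox.groups())) if bbox else None,
                                 int(conf.group(1)) if conf else None)

        def handle_data(self, data):
            if self._pending is not None and data.strip():
                bbox, conf = self._pending
                self.words.append((data.strip(), bbox, conf))
                self._pending = None

    parser = WordExtractor()
    parser.feed(open('page_001.hocr', encoding='utf-8').read())  # hypothetical file
    for text, bbox, conf in parser.words[:5]:
        print(text, bbox, conf)

The format itself is documented in the hocr-spec repository on GitHub, and the hocr-tools project includes utilities such as hocr-pdf for recombining an edited hOCR file with page images into a searchable PDF.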

Good luck!

Tom