Strange OCR results from table of contents

Lars Aronsson

Jan 19, 2024, 1:44:13 PM

to tesser...@googlegroups.com

I'm running a standard Ubuntu Linux with Tesseract 5.3.0 and
it gives very good results in almost every situation, with
one strange exception: Tables of contents.

Here is a typical page from a Danish-language book, printed in 1897,
https://runeberg.org/voroldtid/0344.html

Below the image is the raw OCR text from tesseract -l dan,
with the full-resolution JPEG image (2464 x 3610 pixels) as input.

The OCR text has some initial garbage, but then the text
follows in near perfect quality.

Here is the table of contents from the same book,
https://runeberg.org/voroldtid/0011.html

Below the image is the OCR text after manual proofreading,
but the original raw OCR output from Tesseract is seen here:

https://runeberg.org/rc.pl?action=show&version=1&src=voroldtid/0011

A typical line there reads:

URMMDEnFs]dretStenaldersåBopladserikereen ER 5 ANSE URE

Instead of the desired:

I. Den ældre Stenalders Bopladser ......... 7.

How come? Is it the unusual line spacing that makes Tesseract
confused? Or the dotted line? Why does it fill in letters
where there should be word-separating spaces?

--
Lars Aronsson (la...@aronsson.se)
Project Runeberg - free Nordic literature - http://runeberg.org/

Tom Morris

Jan 19, 2024, 3:44:04 PM

to tesseract-ocr

On Friday, January 19, 2024 at 8:44:13 AM UTC-5 Lars Aronsson wrote:

How come? Is it the unusual line spacing that makes Tesseract
confused? Or the dotted line? Why does it fill in letters
where there should be word-separating spaces?

I think the simplest and most likely explanation is that there wasn't any text like that in the model's training set. You might be able to improve the situation by creating ground truth text and images for table-of-contents lines and fine-tuning the Danish model. The result could be either an enhanced general model, if it doesn't degrade performance on normal text too much, or a separate dan_toc model to be used only on pages that are identified as tables of contents.
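To make the ground truth idea a bit more concrete, here is a minimal sketch of a sanity check for a folder of line images and transcriptions. It assumes the layout used by the tesstrain tooling, where each line image (here only .png or .tif) sits next to a single-line transcription with the same base name and a .gt.txt extension. The function name check_ground_truth and the toc_0001 file names are made up for illustration.

```python
from pathlib import Path

# Assumption: line images are .png or .tif, transcriptions end in .gt.txt.
IMAGE_SUFFIXES = (".png", ".tif")


def check_ground_truth(directory):
    """Return a list of problems found in a folder of line image / text pairs."""
    problems = []
    for image in sorted(Path(directory).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # toc_0001.png -> toc_0001.gt.txt
        transcript = image.with_suffix(".gt.txt")
        if not transcript.exists():
            problems.append(f"{image.name}: missing {transcript.name}")
            continue
        text = transcript.read_text(encoding="utf-8").strip()
        if not text:
            problems.append(f"{transcript.name}: empty transcription")
        elif "\n" in text:
            problems.append(f"{transcript.name}: more than one line")
    return problems
```

Each transcription would hold exactly what is printed on the line, dot leaders and page number included, for example "I. Den ældre Stenalders Bopladser ......... 7." An empty result from the check only means the pairs are complete, not that the transcriptions are correct.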

As an aside, it also looks like you've got some page segmentation issues, since the last line on the page ("Literatur-Fortegnelse til Bronzealderen") is being output at the top. You might be able to clean this up by post-processing the hOCR output or by doing the page segmentation yourself.
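As a rough sketch of the hOCR post-processing idea, the following reads the ocr_line elements from an hOCR file and re-sorts them by the top edge of their bounding boxes. It assumes a single-column page where top-to-bottom order is the reading order, and it only looks at elements with class ocr_line (lines tagged with another class, such as ocr_caption, would be skipped). The names HocrLineParser and lines_in_reading_order are made up for this example.

```python
import re
from html.parser import HTMLParser

# hOCR stores geometry in the title attribute, e.g. "bbox 300 400 2100 460; ..."
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrLineParser(HTMLParser):
    """Collect (bbox, text) for every ocr_line span in an hOCR document."""

    def __init__(self):
        super().__init__()
        self.lines = []     # list of ((x0, y0, x1, y1), text)
        self._bbox = None   # bbox of the line currently being read
        self._words = []
        self._depth = 0     # span nesting depth inside the current line

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        if self._bbox is not None:
            self._depth += 1  # a word span nested inside the line
            return
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if "ocr_line" in classes and match:
            self._bbox = tuple(int(v) for v in match.groups())
            self._words = []
            self._depth = 1

    def handle_endtag(self, tag):
        if tag != "span" or self._bbox is None:
            return
        self._depth -= 1
        if self._depth == 0:  # the line span itself was closed
            self.lines.append((self._bbox, " ".join(self._words)))
            self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None and data.strip():
            self._words.append(data.strip())


def lines_in_reading_order(hocr_text):
    """Return line texts sorted top to bottom, then left to right."""
    parser = HocrLineParser()
    parser.feed(hocr_text)
    parser.close()
    ordered = sorted(parser.lines, key=lambda item: (item[0][1], item[0][0]))
    return [text for _bbox, text in ordered]
```

The hOCR input can be produced by adding the hocr config to the command line, e.g. tesseract 0011.jpg 0011 -l dan hocr, which writes 0011.hocr. Sorting by the top coordinate is the simplest possible rule; a multi-column page would need real layout analysis instead.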