Greetings


Paul Fogel

Sep 24, 2012, 7:37:00 PM
to htrc-ocr-c...@googlegroups.com
Greetings to all of you who have joined the group.  Please invite anyone you think would be interested in taking part in these discussions, particularly those actively working on or experimenting with OCR corrections.

We didn't have any clear ideas about what this group would do or discuss, but at a minimum I'm hoping that a good discussion gets started and that the conversations develop lives of their own.

If you would like, introduce yourself and let us know about your OCR interests.

Some feedback about the group would be nice:
  • are there any immediate work items that this group should take on?
  • should this be a discussion group or a community development forum? 
  • do we need a proper discussion forum? a wiki? a Google Site?
  • what are the topics that people would like to discuss? Please suggest topics and feel free to get as specific as you'd like.  Here are some very general ideas to get us started:
    • correction candidate selection
    • crowdsourcing
    • applying dictionaries
    • OCR engines - comparisons, features, etc.
    • dumb & clever cleanup tricks & tips
    • dumb & clever indexing tricks & tips
    • recommender/spelling tools
    • correction interfaces
    • interesting stories from OCR hell
  • should we reach beyond the HathiTrust/HTRC community to include others (IMPACT, etc.)?

Here is one thing to get us started.  At UNCAMP, several people mentioned to me some work currently being done at Texas A&M.  Can anyone provide more information, links or email addresses?

Cheers,

Paul

Tom Burton-West

Sep 25, 2012, 12:13:25 PM
to htrc-ocr-c...@googlegroups.com
Hello all,
 
Here is a bit of background on my interests in OCR.
 
I work on HathiTrust Full-text search (http://www.hathitrust.org/).  Dirty OCR affects indexing and retrieval in a number of ways.
 
OCR engines will produce "garbage" words when they accidentally interpret graphics as text, when they fail to correctly interpret tables, when they interpret musical scores as text, and in many other circumstances.  Among other things, this causes the number of unique words in the index to grow to the point where significant issues arise:
 
When we were scaling up HathiTrust Full-text search to deal with millions of books, we discovered that because of dirty OCR we had over 2.5 billion unique terms in our index.  We discovered this because at the time the search software we were using, Lucene/Solr, had a limit equal to the maximum value of an integer (about 2.1 billion terms), and our index blew up.  Thanks to the open-source community, Lucene/Solr was modified so that it can now handle about 284 billion unique terms (see http://www.hathitrust.org/blogs/large-scale-search/too-many-words).  Once we got past that, we found that the in-memory term index was taking up too much memory, and we had to do some reconfiguration to solve that problem (see http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again).
 
One other implication of the large number of unique terms caused by dirty OCR is that we can't allow wildcards or truncation in Full-text search, because the number of terms makes those queries take too long.  (An early test with three letters followed by a wildcard took 15 minutes to return results, and during that time all other searches went from a response time of less than a second to over a minute.)
 
Given the above issues, we would like to investigate methods to remove "garbage" words from the index.  What makes this a hard problem is that we have material in over 400 languages, and we have use cases where our users may be looking for a term or name that occurs only once.  We need to be able to remove the "garbage" without removing too many real words.
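
To make that concrete, here is a minimal, hypothetical token-level filter in the spirit of simple rule-based garbage detectors.  The rules and thresholds below are illustrative assumptions only; they are certainly too crude for a corpus spanning 400+ languages, where legitimate vocabulary would violate them:

    import re

    def looks_like_garbage(token, max_len=25):
        """Rough, illustrative heuristics for OCR garbage tokens (not a vetted policy)."""
        if len(token) > max_len:                  # implausibly long string of characters
            return True
        if re.search(r'(.)\1{3,}', token):        # same character repeated 4+ times in a row
            return True
        alnum = sum(c.isalnum() for c in token)
        if alnum / max(len(token), 1) < 0.5:      # mostly punctuation/symbols
            return True
        return False

    # Example: keep real words (and numbers), drop obvious noise before indexing.
    tokens = ["retrieval", "HathiTrust", "l1|*;~^", "mmmmmmmm", "1984"]
    print([t for t in tokens if not looks_like_garbage(t)])
    # ['retrieval', 'HathiTrust', '1984']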
 
Other impacts of dirty OCR on search include:
 
1) If the OCR engine doesn't produce any OCR for a volume, or if it incorrectly identifies the language of the volume and produces nonsense, the book cannot be retrieved by a full-text search.
2) The relevance ranking algorithms are based on statistics about how often a term occurs in a document, how many documents contain the term, and the total number of terms in the document.   All these statistics are affected by OCR errors and we believe that the relevance ranking is seriously impaired for older materials and materials in less popular languages.
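
To spell out how those statistics feed a ranking score, here is a toy tf-idf computation (BM25 and similar rankers use the same inputs); the token lists and counts are hypothetical.  Garbage tokens lengthen the document and add spurious vocabulary, which lowers the normalized frequency of every real term in that document:

    import math
    from collections import Counter

    def tf_idf(term, doc_tokens, doc_freq, num_docs):
        """Toy scorer: length-normalized term frequency times inverse document frequency."""
        tf = Counter(doc_tokens)[term] / len(doc_tokens)
        idf = math.log(num_docs / (1 + doc_freq.get(term, 0)))
        return tf * idf

    clean = ["whale", "ship", "sea", "whale"]
    dirty = clean + ["vvhale", "sh1p", "|||", "mmmm"]   # the same page with OCR noise
    doc_freq = {"whale": 1200}                          # hypothetical collection counts
    print(tf_idf("whale", clean, doc_freq, num_docs=1000000))   # higher score
    print(tf_idf("whale", dirty, doc_freq, num_docs=1000000))   # lower score, same page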
 
There are a number of information retrieval techniques that are suggested in the research literature to improve search of dirty OCR.  However, many of these approaches, such as using character n-grams, do not scale to a corpus of 10 million books.
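
For readers unfamiliar with the n-gram approach: character n-gram indexing replaces each word with many overlapping fragments, each of which becomes its own index term with its own postings and positions, which is where the scaling problem comes from.  A trivial sketch (assumes tokens of at least n characters):

    def char_ngrams(token, n=3):
        """All overlapping character n-grams of a token."""
        return [token[i:i + n] for i in range(len(token) - n + 1)]

    print(char_ngrams("retrieval"))
    # ['ret', 'etr', 'tri', 'rie', 'iev', 'eva', 'val']  -- seven index terms for one word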
 
The above issues motivate our interest in approaches to removing "garbage" OCR and in correcting OCR errors.  In spelling and OCR correction there is always the possibility of correcting an erroneous word to the wrong word.  For information retrieval purposes we want to err on the side of under-correcting rather than risk adding words to the index that don't actually occur in the text.
 
I look forward to hearing from the rest of the group.
 
Tom
 
Tom Burton-West
Information Retrieval Programmer
Digital Library Production Service
University of Michigan Library

Andy Mardesich

Sep 25, 2012, 9:09:39 PM
to htrc-ocr-c...@googlegroups.com

Hi folks,

I'm attaching our notes from the OCR discussion at HTRC UnCamp 2012 to this email.

 

You may also be interested in viewing notes from the entire conference, which I discovered attached to a page on the HTRC Wiki here:

http://wiki.htrc.illinois.edu/pages/viewpageattachments.action?pageId=5210128&metadataLink=true

(HTRC_uncamp_notes_full.docx)

 

Andy

----------------------------------------

Andrew Mardesich

California Digital Library

 

HTRCUncamp12_Disc_OCRCorrection_20120911.pdf