Assistance with OCR on frames from screen capture

Titus Barik

Jun 24, 2016, 2:19:10 AM

to tesseract-ocr

Hi all,

We need to process the title bars from a set of screen recordings for a programming IDE. An example of a title is:

Java - commons-collections4/src/main/java/org/apache/commons/collections4/list/LazyList.java - Eclipse

The videos have already been recorded, so we are stuck with the quality of the frames as they are (I have attached an example frame).

When running it under tesseract with stock settings, the output is instead:

> tesseract title_lazylist.png stdout

lava , (ammon5chIIemansA/src/msun/Java/arg/apame/wmmans/calIemansA/nst/Lazyust Java , Eclipse

I expected recognition to be poor with default settings, but I'm unclear on how to proceed in this particular case: whether to apply a filter to the image as a pre-processing step, to use custom config settings (such as "load_system_dawg 0"), or some combination of both.

I'm not an expert in OCR so any suggestions are appreciated.

The version of tesseract is:

tesseract 3.05.00dev
leptonica-1.73
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.6.20 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0

Thanks,

Titus

title_lazylist.png

Stef

Jun 24, 2016, 4:01:42 AM

to tesseract-ocr

You could try scaling up the image before OCR. See the section "Scale text up" here.

Stef

Tom Morris

Jun 24, 2016, 1:00:26 PM

to tesseract-ocr

I'd also get rid of the blue and make the text black on white. Presumably the colors are constant, so this should be an easy pre-processing step.
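One way this could look, as a rough sketch only: it assumes NumPy, an image array in RGB channel order, and that the background covers more pixels than the text. The function name `to_black_on_white` and the midpoint threshold are illustrative choices, not anything Tesseract requires:

```python
import numpy as np


def to_black_on_white(rgb: np.ndarray) -> np.ndarray:
    """Return a uint8 image with text pixels set to 0 and background to 255."""
    # Approximate luminance from the first three channels (BT.601 weights).
    gray = rgb[..., :3].astype(np.float64) @ np.array([0.299, 0.587, 0.114])
    # Global threshold halfway between the darkest and brightest pixel.
    threshold = (gray.min() + gray.max()) / 2.0
    bright = gray > threshold
    # Assumption: the minority class of pixels is the text.
    text_mask = bright if bright.mean() < 0.5 else ~bright
    return np.where(text_mask, 0, 255).astype(np.uint8)
```

If the title bar colors really are constant across frames, a fixed threshold chosen once by hand would likely be more robust than the per-image midpoint used here.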

Tom

Titus Barik

Jun 25, 2016, 10:37:45 AM

to tesseract-ocr

Thanks! Applying the simple 3x linear scaling to the image improved the results of recognition dramatically. The output is now:

Java - commons-collections4/src/main/java/org/apache/commons/collections4/Iist/LazyListjava - Eclipse

In the word "/list/", the first letter is being recognized as a capital I (eye) rather than a lowercase l (el). This is not a huge problem, but I'm wondering whether there's a simple way to correct these sorts of issues within tesseract, perhaps with a user dictionary or user patterns. Otherwise, I can just fix them on a case-by-case basis using an external Python script.
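For the external-script route, a sketch along these lines might be enough. It uses only the standard library; the set of confusable characters and the list of known words are assumptions that would have to come from the actual project (for example, the directory and file names in the source tree), and `correct_tokens` is a hypothetical helper:

```python
import re
from typing import Iterable

# Characters assumed to be confused with lowercase "l" in the OCR output.
CONFUSABLE = str.maketrans({"I": "l", "1": "l", "|": "l"})


def _key(token: str) -> str:
    """Collapse confusable characters so look-alike tokens compare equal."""
    return token.translate(CONFUSABLE)


def correct_tokens(text: str, known_words: Iterable[str]) -> str:
    """Replace tokens that match a known word once confusables are collapsed."""
    lookup = {_key(word): word for word in known_words}

    def fix(match: "re.Match[str]") -> str:
        token = match.group(0)
        return lookup.get(_key(token), token)

    # Tokens are runs of letters, digits, or "|"; separators are left untouched.
    return re.sub(r"[A-Za-z0-9|]+", fix, text)


line = "commons/collections4/Iist/LazyListjava - Eclipse"
print(correct_tokens(line, ["list"]))
```

Only tokens that match a known word are changed, so unrelated words containing a capital I are left alone.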

On Friday, June 24, 2016 at 4:01:42 AM UTC-4, Stef wrote:

title_lazylist_3x_opencv.png

Titus Barik

Jun 25, 2016, 10:55:13 AM

to tesseract-ocr

Replying to myself, but looking at the FAQ, perhaps the unicharambigs file is the way to go for simple replacements?
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract#the-unicharambigs-file

Stef

Jun 25, 2016, 4:17:12 PM

to tesseract-ocr

It would be interesting to hear whether you have any success with user-words/user-patterns. I have never seen any change in the recognition result from applying user dictionaries. This seems to be confirmed by others too; see for instance here, here or here. So I'm still looking for a reproducible working example.

Stef
