Image pre-processing for good OCR results


Jon Andersen

Feb 20, 2011, 9:02:24 PM
to tesser...@googlegroups.com
Hi,

My project at http://RecordAGrave.com is about recording headstones from graves and posting the text and images on the Net so that people can research their family history.  I would appreciate some advice on how to pre-process these headstone images to get the best results from Tesseract OCR.  I have thousands of 1-2 MB jpg images of headstones to process.

Example images:
I am a software developer so I can script up pre-processing steps to prepare the input for Tesseract.

Any advice on improving OCR accuracy through pre-processing steps?

Thanks so much,

-Jon

Vicky Budhiraja

Feb 20, 2011, 11:14:05 PM
to tesser...@googlegroups.com

Hi Jon,

 

As I do each morning, I was checking my email and saw those images of headstones from graves. I am a God-fearing person, so I was not able to ignore your email.

 

Regarding the preprocessing step, I suggest applying a local-minima method for background removal. However, you may need to adjust the window size to achieve the best results. I ran some experiments with MATLAB code and got good results. Testing on a larger sample set may improve this step further.
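
Vicky's MATLAB code is not posted in the thread, so purely as an illustration, here is a minimal Python/OpenCV sketch of one common "local minima" style of background removal: take a sliding-window minimum (grayscale erosion) as the background estimate, smooth it, subtract it, and binarize. The window size and the final Otsu threshold are assumptions, not Vicky's actual parameters.

# Hypothetical sketch of local-minima background removal (not Vicky's MATLAB code).
# Assumes: grayscale input, a sliding-window minimum as the background estimate,
# and an Otsu threshold at the end. Tune `win` per image set.
import cv2
import numpy as np

def remove_background_local_minima(gray, win=51):
    # Local minimum over a win x win window == grayscale erosion with a box kernel.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (win, win))
    background = cv2.erode(gray, kernel)
    # Smooth the background estimate so block artifacts don't show up as edges.
    background = cv2.blur(background, (win, win))
    # Subtract the estimated background so the lettering stands out from the stone.
    flattened = cv2.subtract(gray, background)
    # Binarize; Otsu is just a convenient default here.
    _, binary = cv2.threshold(flattened, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    return binary

if __name__ == "__main__":
    img = cv2.imread("headstone.jpg", cv2.IMREAD_GRAYSCALE)
    cv2.imwrite("headstone_bin.png", remove_background_local_minima(img))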

 

Please tell me more about the project you are working on; maybe I can contribute more effectively. Just let me know if you need any help!

 

Best Regards,

Vicky


Dmitry Silaev

Feb 20, 2011, 11:54:15 PM
to tesser...@googlegroups.com
Jon,

I don't know if it's intended, but all your image links report
"We're sorry. The page you tried to access is not available". As things
stand, no advice can be given on your issue...

Warm regards,
Dmitry Silaev

Jon Andersen

Feb 21, 2011, 10:45:51 AM
to tesser...@googlegroups.com
Whoops, sorry - the links were broken for a bit.  I just fixed the image links; they should work now.

Thanks!!

-Jon

Kip Hughes

Feb 20, 2011, 11:29:51 PM
to tesser...@googlegroups.com
Hi Vicky,

I have an interest in theology and just wanted to know which of the god(s) you are "god fearing" of. In my experience, the phrase "god fearing" has been used predominantly by Christians. I checked your LinkedIn profile and confirmed you are from India.

Less than 3% of Indians are Christians -- so, based on this statistic, I would guess you are not a Christian. Over 80% of Indians are Hindus -- and if I had to make a guess about any Indian's religion, I would go with that one. Are you a Hindu? Hinduism is a polytheistic religion, isn't it? Why would you only be a "God fearing person" rather than a "gods fearing person"?

Finally, is there some significance that headstones have in your religion (whatever it may be) that made you unable to ignore Jon's email?

Hope you don't mind the questions. They are really just due to my interest in world religions and world views.

Thanks,
KIP

Cong Nguyen

Feb 21, 2011, 11:32:26 PM
to tesser...@googlegroups.com

Dear Jon,

 

Try analyzing with the following preprocessing steps:

 

Step 1: Detect the ROI

https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576366756516993234

 

Step 2: Apply a low-pass FFT filter, with parameters:

    - intensity threshold is 130

    - fft cutoff: 15%

https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576366759922523650

 

Step 3: Scale the image by a suitable scale factor

https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576366756371708834

 

Step 4: Try to recognize the text using Tesseract or another engine

https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576366764338605922

 

Step 5: Post-processing, if required
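
Cong's filtering was done with his company's software, so the following Python/NumPy sketch is only a guess at what "low-pass FFT filter, intensity threshold 130, cutoff 15%" could look like; the circular frequency mask and the order of the operations are assumptions.

# Rough Python/NumPy guess at step 2 (the actual tool used above is proprietary):
# low-pass filter in the frequency domain, then a fixed intensity threshold.
import cv2
import numpy as np

def lowpass_then_threshold(gray, cutoff_fraction=0.15, intensity_threshold=130):
    h, w = gray.shape
    f = np.fft.fftshift(np.fft.fft2(gray.astype(np.float32)))
    # Circular low-pass mask keeping roughly the lowest `cutoff_fraction` of frequencies.
    cy, cx = h // 2, w // 2
    radius = cutoff_fraction * min(h, w) / 2.0
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
    smooth = np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))
    smooth = np.clip(smooth, 0, 255).astype(np.uint8)
    # Fixed threshold from the post (130); adjust per image set.
    _, binary = cv2.threshold(smooth, intensity_threshold, 255, cv2.THRESH_BINARY)
    return binary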

 

Good luck,

Cong.


Kip Hughes

Feb 21, 2011, 11:54:39 PM
to tesser...@googlegroups.com
Hi Vicky, thanks for the reply. Let's definitely take this offline.

Tesseract-OCR Newsgroup Subscribers: sorry for this. I hit "reply" thinking my response was going to Vicky alone.


Vicky Budhiraja

Feb 21, 2011, 11:32:15 PM
to tesser...@googlegroups.com

Hi KIP,

 

I am a Hindu, but not devoted to one particular God!

 

The things that made me so sensitive about the images in that email were as follows:

1. In India, when a person has died and is being taken to the cremation ground, whoever sees that 'yatra' along the way remembers the God he believes in.

2. When you look at headstone images of someone who has died, you have to pay respect; I pay it by remembering my God and by simply offering Jon all the help I can.

3. Lastly, I am currently working on background removal from medical scans (DICOM images), so it was technically relevant!

I hope that answers your questions. If not, let us discuss this offline; otherwise we could be marked as off-topic. :)

 

Best Regards,

Vicky

 

 

Vicky Budhiraja

Feb 21, 2011, 11:47:24 PM
to Jon Andersen, tesser...@googlegroups.com
Hi Jon,

The code I have written is in MATLAB. Will you be able to convert it to
OpenCV code? Let me know.

In OpenCV, simple thresholding should work. My local-minima method is a
little more complicated (and more accurate) than simple thresholding, and
is therefore harder to implement in C++ because of the interpolation step.
I think OpenCV can do this, but we need to take a closer look at that
step.

Best Regards,
Vicky


-----Original Message-----
From: Jon Andersen [mailto:jand...@gmail.com]
Sent: Monday, February 21, 2011 23:42
To: Vicky Budhiraja
Subject: Re: Image pre-processing for good OCR results

Vicky,

Thank you so much for responding! I appreciate your help with this
project.

I have taken thousands of photos of headstones, and am trying to use
Tesseract on them. I will make the results available through
findagrave.com, so that people can search for their relatives.

Here is a whole directory of sample images:
http://freepages.genealogy.rootsweb.ancestry.com/~janderse/cemeteries/Star%20of%20David%20Memorial%20Gardens/Garden%20of%20Haifa/

Could you send me the code or results that you found? I am trying to
use OpenCV to do the image pre-processing.

Thanks!!!

-Jon

Dmitry Silaev

Feb 22, 2011, 2:12:23 AM
to tesser...@googlegroups.com
Jon,

You will certainly need to implement most of the steps that Cong Nguyen
suggests. However, complications arise if you wish to do the pre-processing
in a purely automatic way. You are going to process real photographic
images, so fonts, backgrounds, lighting conditions, etc. differ a lot.
That's why a "one size fits all" method (particularly for ROI detection
and background removal) won't work. You will find that your fixed pipeline
works fine with the first and second images but fails with the third one.

There are two possible ways to solve this. If you still want to do it
automatically, you'll need to choose several algorithms for every pipeline
stage and implement logic that decides automatically, based on some metric,
which algorithm works (or would have worked) best for each image. Or you
can give up on the automatic approach and switch to manually selecting a
pre-processing scenario for each image according to your experience.
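
As an illustration of the automatic route (several candidate algorithms plus a metric to choose between them), the sketch below tries a few binarization variants and keeps the one with the highest mean Tesseract word confidence. It assumes the pytesseract wrapper and an illustrative candidate list, neither of which comes from this thread.

# Sketch only: try several binarizations and keep the one with the highest
# mean Tesseract word confidence. Assumes the pytesseract wrapper is installed;
# the candidate list is illustrative, not a recommendation.
import cv2
import pytesseract
from pytesseract import Output

def candidate_binarizations(gray):
    yield "otsu", cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
    yield "otsu_inv", cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]
    yield "adaptive", cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                            cv2.THRESH_BINARY, 31, 10)

def mean_confidence(binary):
    # Mean of the per-word confidences Tesseract reports (-1 entries are non-words).
    data = pytesseract.image_to_data(binary, output_type=Output.DICT)
    confs = [float(c) for c in data["conf"] if float(c) >= 0]
    return sum(confs) / len(confs) if confs else 0.0

def best_binarization(gray):
    scored = [(mean_confidence(b), name, b) for name, b in candidate_binarizations(gray)]
    score, name, binary = max(scored, key=lambda t: t[0])
    return name, score, binary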

The next complication is getting results from Tesseract. Since the quality
of the text in photographic images is really low, you usually can't rely on
Tesseract's top-choice recognition results representing the actual text.
IMHO the best approach here is to get all of Tesseract's choices for every
character and then resolve the uncertainty with a language model (bigram
and trigram statistics). This is the best you can do, because a dictionary
won't help you much, at least for last names.
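
A toy sketch of that idea: given Tesseract's alternative choices per character position, pick the sequence a character-bigram model scores highest. How the alternatives are pulled out of Tesseract is omitted, and the counts below are fabricated purely for illustration.

# Toy sketch: rescore per-character alternatives with a character-bigram model.
# The alternatives and the bigram/unigram counts below are made up for illustration.
import math
from itertools import product

def bigram_logprob(text, bigram_counts, unigram_counts, alpha=1.0, vocab_size=27):
    # Simple add-alpha smoothed character bigram model.
    lp = 0.0
    for a, b in zip(text, text[1:]):
        num = bigram_counts.get((a, b), 0) + alpha
        den = unigram_counts.get(a, 0) + alpha * vocab_size
        lp += math.log(num / den)
    return lp

def best_reading(alternatives, bigram_counts, unigram_counts):
    # alternatives: list of per-position candidate lists, e.g. [['C', 'G'], ['O', '0'], ...]
    candidates = ("".join(chars) for chars in product(*alternatives))
    return max(candidates,
               key=lambda s: bigram_logprob(s, bigram_counts, unigram_counts))

# Example with fabricated counts: "COHEN" should beat "C0HEN" and "G0HEN".
unigrams = {"C": 50, "O": 120, "H": 60, "E": 150, "N": 90, "G": 40, "0": 5}
bigrams = {("C", "O"): 20, ("O", "H"): 10, ("H", "E"): 30, ("E", "N"): 40}
print(best_reading([["C", "G"], ["O", "0"], ["H"], ["E"], ["N"]], bigrams, unigrams))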

And then you'll have to locate names within the recognition results. The
first problem here is that there can be a few of them per headstone. The
second is that Tesseract will try to recognize as text everything it sees
in the image, including noise left over from pre-processing. So this task
can also pose some difficulties, but it seems to be mainly a question of
engineering, not of research...

To conclude, it all depends on how serious you are about investing your
time and effort in this project ))

HTH

Warm regards,
Dmitry Silaev

Tom Morris

Feb 22, 2011, 1:30:24 PM
to tesseract-ocr
On Feb 20, 9:02 pm, Jon Andersen <jande...@gmail.com> wrote:

> My project at http://RecordAGrave.com is about recording headstones from
> graves and posting the text and images on the Net so that people can
> research their family history.  I would appreciate some advice on how to
> pre-process these headstone images to get the best results from Tesseract
> OCR.  I have thousands of 1-2 MB jpg images of headstones to process.

Post-image capture is too late for one of the most important
enhancements, namely high contrast lighting. It's not really an issue
with stones that have the carving painted or are otherwise naturally
high contrast, but for many stones sharp oblique lighting is important
to get an image that's readable by humans, let alone OCR software.

Once you've got the best quality image capture you can manage, you'll
probably find that you need to use different image processing
pipelines for different types of stones and carving, so the first step
will be to categorize the stone and figure out which pipeline to run
it through (or run it through them all and compare the results).

In addition to image processing, you may also be able to improve
results by making use of the fact that the vocabulary and layout of
the text is much more constrained than free text.
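
As one small example of such a constraint, a birth-death year pair is easy to pull out of even noisy OCR output with a plain regular expression; the pattern below is illustrative only.

# Illustration of exploiting headstone layout constraints: birth/death years
# form a very regular pattern even in noisy OCR output. Pattern is an example only.
import re

YEAR_RANGE = re.compile(r"\b(1[6-9]\d{2}|20[0-2]\d)\s*[-–—]\s*(1[6-9]\d{2}|20[0-2]\d)\b")

def find_year_ranges(ocr_text):
    """Return plausible (birth, death) year pairs found in the OCR output."""
    pairs = []
    for m in YEAR_RANGE.finditer(ocr_text):
        birth, death = int(m.group(1)), int(m.group(2))
        if birth <= death:          # sanity check on the pair
            pairs.append((birth, death))
    return pairs

print(find_year_ranges("JOHN SM1TH  1923 - 1987\nBELOVED FATHER"))  # [(1923, 1987)]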

It'll be interesting to see what kind of results you get. I suspect
it's going to be a fairly challenging project for the general case,
but you may be able to pick off the low-hanging fruit and gradually
expand the types of stones you can handle.

Tom

Andres

Feb 22, 2011, 4:01:54 PM
to tesser...@googlegroups.com
Hello,

A few comments from my side; sorry for being disorganized, but I don't have much time right now.

In OpenCV you can use thresholding with the Otsu algorithm; it's not mentioned in the documentation of the threshold function, but the parameter is CV_THRESH_OTSU.

Otsu thresholding calibrates the threshold value automatically from the image histogram:
http://en.wikipedia.org/wiki/Otsu%27s_method

I tried it in my project (a licence plate recognition system) and visually I got much better results, but surprisingly the results for Tesseract were worse. It changed the stroke thickness of the letters, and when I trained Tesseract the letters were bolder than the output of the Otsu threshold, so perhaps that explains my problem. Still, it might be a good solution for you.
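
For reference, a minimal Otsu call with the OpenCV Python bindings (the C/C++ constant mentioned above is CV_THRESH_OTSU; the cv2 module spells it THRESH_OTSU):

# Minimal Otsu thresholding example with the OpenCV Python bindings.
import cv2

gray = cv2.imread("headstone.jpg", cv2.IMREAD_GRAYSCALE)
# The source threshold value (0 here) is ignored when THRESH_OTSU is set;
# Otsu picks the threshold from the histogram automatically.
otsu_value, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print("Otsu chose threshold:", otsu_value)
cv2.imwrite("headstone_otsu.png", binary)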

If you want to make some rapid tests with OpenCV for preprocessing you can use this:

http://code.google.com/p/cvpreprocessor/

It’s not a complete tool but it helps.

I think your system is similar to mine in certain respects. I was thinking of doing some skeletonization, or something like that, on the fonts and then training Tesseract with those modified letters, then applying the same process to the acquired images and running Tesseract. I haven't tried that yet.

Skeletonization:
http://homepages.inf.ed.ac.uk/rbf/HIPR2/skeleton.htm

In line with what Tom Morris said, you have some constraints on the text layout. Tesseract gives you the coordinates of each character, and you can work with that. Perhaps you will need a grouping algorithm like k-means to compute some statistics: http://en.wikipedia.org/wiki/Kmeans
OpenCV has an implementation of k-means; ask me for a snippet if you need it.
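
A hedged sketch of that grouping idea: cluster the vertical centres of Tesseract's character boxes into text lines with cv2.kmeans. The number of lines k is assumed to be known here, which in practice it usually is not.

# Sketch of grouping character box centres into text lines with cv2.kmeans.
# Assumes you already have character bounding boxes from Tesseract and that
# the number of lines `k` is known (in practice you would have to estimate it).
import cv2
import numpy as np

def group_into_lines(char_boxes, k):
    # char_boxes: list of (x, y, w, h); cluster on the vertical centre only.
    centres_y = np.float32([[y + h / 2.0] for (x, y, w, h) in char_boxes])
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 100, 0.1)
    _, labels, _ = cv2.kmeans(centres_y, k, None, criteria, 10,
                              cv2.KMEANS_RANDOM_CENTERS)
    lines = [[] for _ in range(k)]
    for box, label in zip(char_boxes, labels.ravel()):
        lines[label].append(box)
    return lines

boxes = [(10, 12, 8, 14), (22, 11, 8, 15), (11, 60, 9, 14), (24, 61, 8, 13)]
for i, line in enumerate(group_into_lines(boxes, k=2)):
    print("line", i, line)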

Question for Cong Nguyen: is the program you used here something that is available on the web, or something you have for your own projects? :
https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576366764338605922

Cheers,

Andres
www.visiondepatentes.com.ar



2011/2/22 Tom Morris <tfmo...@gmail.com>

Jon Andersen

Feb 22, 2011, 5:11:43 PM
to Vicky Budhiraja, tesser...@googlegroups.com
Vicky,

I may be able to convert your local-minima code to OpenCV code; can you send me the result files as well as the filter?

I wrote some Python code that uses OpenCV to crop the headstone images to show just the stone.  It's not perfect, but it works OK.  The Hough algorithm and the other corner-detection algorithms weren't working at all for me, so I just thresholded based on the average saturation value, row by row and column by column, to find a rectangle that was saturated enough, then cropped to that rectangle.  Overly simple and dumb; however, it does somewhat work, whereas the other algorithms just gave me nonsensical corners and didn't detect the headstone at all.
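
Jon's script is not posted, so the following is only a guess at the row/column saturation thresholding he describes: average the HSV saturation per row and per column and crop to the span that exceeds a threshold. The 1.1 x global-mean threshold is an assumption.

# Rough guess at the approach described above (the actual script is not posted):
# threshold the per-row / per-column mean HSV saturation and crop to the
# spanning rectangle. The 1.1 * global-mean threshold is an assumption.
import cv2
import numpy as np

def crop_by_saturation(bgr, factor=1.1):
    sat = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)[:, :, 1].astype(np.float32)
    thresh = factor * sat.mean()
    rows = np.where(sat.mean(axis=1) > thresh)[0]
    cols = np.where(sat.mean(axis=0) > thresh)[0]
    if rows.size == 0 or cols.size == 0:
        return bgr  # nothing exceeded the threshold; return the whole image
    return bgr[rows.min():rows.max() + 1, cols.min():cols.max() + 1]

img = cv2.imread("headstone.jpg")
cv2.imwrite("headstone_cropped.jpg", crop_by_saturation(img))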

Reference images:

Thanks!!

-Jon Andersen
Software engineer
Citrix Systems, Inc
954-973-4908 (home)

Cong Nguyen

Feb 22, 2011, 8:57:03 PM
to tesser...@googlegroups.com

Dear Andres,

 

The recognition results I showed were achieved with my simple Tesseract 3.01 engine .NET wrapper (link here: http://code.google.com/p/tesseractdotnet/).

 

The ROI detection was done by cropping the ROI manually; after that I used my company's software to do the filtering.

 

Regarding the filtering, you can analyze a control set to work out a feasible way of estimating the parameters.

 

Thanks,

Cong.

 

 


Cong Nguyen

Feb 22, 2011, 9:36:26 PM
to tesser...@googlegroups.com

Dear Jon,

 

To begin the analysis, I also tried to detect lines and corners, but the results were not good; I think this is because the images have low contrast.

 

Please try analyzing some line profiles of the data:

 

ROI-left-profile:

https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576706091073985362

 

ROI-top-profile:

https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576706094761082706

 

ROI-right-profile:

https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576706102033630978

 

ROI-bottom-profile:

https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576706106389606898

 

After ROI detection, you may need to align the image.

My solution for this step is:

- Detect all lines (Hough transform approach), then keep only the lines whose slopes are close to horizontal.

- Estimate the base slope from the mean slope.

- Rotate the image to align it.

Here are detected lines:

https://picasaweb.google.com/congnguyenba/TesseractBasedOCR#5576709473940745778
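
A minimal Python/OpenCV sketch of that alignment step (Hough lines, keep the near-horizontal ones, rotate by their mean angle); the Canny and Hough parameters are placeholders, not values from this thread.

# Sketch of the alignment idea above: Hough lines -> keep near-horizontal ones ->
# rotate by their mean angle. Canny/Hough parameters below are placeholders.
import cv2
import numpy as np

def deskew_by_hough(gray, max_tilt_deg=15):
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=80,
                            minLineLength=gray.shape[1] // 4, maxLineGap=10)
    if lines is None:
        return gray
    angles = []
    for x1, y1, x2, y2 in lines[:, 0]:
        angle = np.degrees(np.arctan2(y2 - y1, x2 - x1))
        if abs(angle) <= max_tilt_deg:          # keep roughly horizontal lines
            angles.append(angle)
    if not angles:
        return gray
    mean_angle = float(np.mean(angles))         # flip the sign if it rotates the wrong way
    h, w = gray.shape
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), mean_angle, 1.0)
    return cv2.warpAffine(gray, rot, (w, h), flags=cv2.INTER_LINEAR,
                          borderMode=cv2.BORDER_REPLICATE)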

 

Hope it’s helpful to you!

 

Good luck,

Cong.

 

TP

Feb 23, 2011, 9:50:42 AM
to tesser...@googlegroups.com

I guess I'm a bit surprised that no one has yet mentioned the fact
that the Leptonica C Image Processing Library
(http://www.leptonica.com) is now required to build tesseract-ocr --
or soon will be... the current state of tesseract-ocr is a bit hazy.
My understanding is that eventually (not in the near future though)
tesseract-ocr will only use Leptonica PIXs as its in-memory image
representation.

A still unofficial, easier to read, Sphinx generated version of the
Leptonica documentation is at
http://tpgit.github.com/UnOfficialLeptDocs/. Dan is currently
hammering away at v1.68 and it should be out soon (this week?). At
which point I'll also update my unofficial version of the
documentation.

My admittedly quick/biased opinion was that OpenCV focused on Computer
Vision and that Leptonica has more "pure" Image Processing routines. I
also find Leptonica's source code fairly easy to read because one of
the purposes of the library is to try to teach image processing
concepts.

In any case, if you're planning on using tesseract-ocr 3.x, then you
already must have liblept, so you might as well try it out.

-- TP

Giuseppe Menga

Feb 28, 2011, 10:17:25 AM
to tesser...@googlegroups.com
At Politecnico di Torino we are using release 3.0.0 of Tesseract with the standard English training.
Obviously the software doesn't recognize pages of text rotated upside down, and we would not expect it to; however, to our surprise, it recognizes text rotated 90° counter-clockwise with only slightly worse performance, but not text rotated clockwise.
How is that possible?
We have to recognize text whose orientation we don't know in advance, and I know that Leptonica should be used for page layout analysis.
However, does Tesseract offer internal facilities for recognizing text orientation?
And if so, how do we activate these facilities, or at least return tentative baselines?
Giuseppe

Jimmy O'Regan

Feb 28, 2011, 1:35:21 PM
to tesser...@googlegroups.com
On 28 February 2011 15:17, Giuseppe Menga <me...@polito.it> wrote:
> at Politecnico di Torino we are using the release 3.0.0 of tesseract, with
> the standard english training.
> Obviously the software doesn’t recognize pages of text rotated upside down
> and we would not expect it does, however with surprise, it recognizes with a
> little worse performance text rotated of 90° counter clockwise, but not
> clockwise.
> How that is possible?

It's a side-effect of support for Japanese, Chinese, etc.

> We have to recognize text we don’t know in advance the orientation, and I
> know that Leptonica should be used for page layout analysis.
> However, does tesseract offers internal facilities to recognize text
> orientation?
> And if so, how to activate these facilities or at least to return tentative
> baselines?

There's an orientation/script detection module in the 3.01 code, but I
haven't even tried to use it, so I couldn't say.

--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

patrickq

Feb 28, 2011, 1:44:40 PM
to tesseract-ocr
ScanBizCards (iPhone version) uses the Tesseract 3.0 orientation
detection and it works quite well - accurate in 95%+ of cases, and the
~5% of failure cases are often business cards where there isn't a lot
of text to go on plus a lot of non-text confusing the detection.

Patrick

Giuseppe Menga

Feb 28, 2011, 1:50:31 PM
to tesser...@googlegroups.com
Patrick,
could you give just a hint of how to use the orientation functionality of Tesseract?
Giuseppe


patrickq

Feb 28, 2011, 10:19:24 PM
to tesseract-ocr
Sure. After you call SetImage to provide the bitmap, just do this:

OSResults *orientationStruct = new OSResults();

bool gotOrientation = myTess->DetectOS(orientationStruct);

int bestOrientation = -1;
float bestOrientationScore = 0;
if ((gotOrientation) && (orientationStruct->orientations != NULL)) {
    for (int i = 0; i < 4; i++) {
        if (orientationStruct->orientations[i] > bestOrientationScore) {
            bestOrientation = i;
            bestOrientationScore = orientationStruct->orientations[i];
        }
    }
}

// This is the result we were asked for
results.textOrientation = bestOrientation;

"bestOrientation" will be the index of the entry in the length 4 array
of orientation tests which got the highest score and indicates the
text orientation (I'll leave it as an exercise to the reader to figure
out how to map 0,1,2,3 to pointing up, right, down and left ...). You
can get a sense of Tess confidence in the result by examining the
value of the score that won.

Patrick


Cong Nguyen

Feb 28, 2011, 10:27:21 PM
to tesser...@googlegroups.com
Dear Giuseppe,

Could you post some samples to analyze?

If you are afraid that Tesseract's page layout analysis doesn't work on a rotated image,
you can run it step by step as follows:

1. First, call Tesseract's FindLinesCreateBlockList (have a look at the
TessBaseAPI class); you should get back a BLOCK_LIST.

2. Now, please check the BLOCK_LIST. Only the relevant member fields are shown here:
...
ROW_LIST rows; //< rows in block
...
FCOORD skew_; //< Direction of true horizontal.
ICOORD median_size_; //< Median size of blobs.

And here is the ROW class:
....
inT32 kerning; //inter char gap
inT32 spacing; //inter word gap
TBOX bound_box; //bounding box
float xheight; //height of line
float ascrise; //size of ascenders
float descdrop; //-size of descenders
WERD_LIST words; //words
QSPLINE baseline; //baseline spline
...

A page contains block(s), and a block contains row(s)...

3. Try to visualize anything you need in order to get an overview of how the
segmentation/detection steps worked...

Also, if you want to understand how Tesseract works, please read the
papers in the doc folder; they were published by Ray.

Hope it's helpful to you!

Cong.

Cameron Christiansen

Jan 9, 2015, 2:38:58 PM
to tesser...@googlegroups.com
I know I'm a bit late to the party, but I came across this and thought I should post my approach to this problem. I published a paper on it and it can be found at: 
Christiansen, Cameron S., and William A. Barrett. "Data acquisition from cemetery headstones." IS&T/SPIE Electronic Imaging. International Society for Optics and Photonics, 2013.

Or for a demo which may or may not be up long term: http://cameronchristiansen.com/headstone-cleaner/