Shirorekha Clipping code in Tesseract-3.0

Debayan Banerjee

Mar 31, 2011, 1:45:47 AM3/31/11
to indi...@googlegroups.com
Hi,

Indu S just pointed out to me that Tesseract-3.0 implements shirorekha
clipping code (somewhat like
http://tesseractindic.googlecode.com/files/clipmatra_pseudocode.pdf).
The file is http://code.google.com/p/tesseract-ocr/source/browse/trunk/textord/devanagari_processing.cpp
. Apparently it incorrectly clips the maatraa sometimes. See the
attachment.

--
Debayan Banerjee

pageseg_split_debug.png

pranay prateek

Mar 31, 2011, 3:55:17 AM3/31/11
to indi...@googlegroups.com, Debayan Banerjee
Hi
I am a newbie to this. Can anyone please tell me how I can run this shirorekha code on an image and get the output image with
clips in the maatraa as Debayan has done? I have already checked out the whole tesseract-ocr source and it has the textord folder.

Regards
Pranay
--
"You aren't remembered for doing what is expected of you."

Indu

Mar 31, 2011, 4:38:25 AM3/31/11
to indi...@googlegroups.com
Just change the values of these debug variables in textord/devanagari_processing.cpp:

INT_VAR(devanagari_split_debuglevel, 1,
        "Debug level for split shiro-rekha process.");

BOOL_VAR(devanagari_split_debugimage, 1,
         "Whether to create a debug image for split shiro-rekha process.");
You will then get the segmented image.
--
Regards

Indu

pranay prateek

Mar 31, 2011, 7:24:17 AM3/31/11
to indi...@googlegroups.com, Indu
Okay, so I was able to install Tesseract r582, and when I ran it on a test image for Hindi, it gave a segmentation fault.


Splitting shiro-rekha ...
Split strategy = Minimal
Initial pageseg available = no
Skipping splitting CC at (11, 12): stroke width too huge..
Skipping splitting CC at (127, 14): stroke width too huge..
Cube ERROR (ConvNetCharClassifier::RunNets): NeuralNet is NULL
Cube ERROR (ConvNetCharClassifier::RunNets): NeuralNet is NULL

and it just crashes after that.

Also, as pointed out by Debayan, the shirorekha splitting isn't accurate, and the darkened area of the text is totally ignored.
Lots of work is needed on it, I guess.
Does tesseractindic also employ a similar maatraa-splitting algorithm, or is it different?

Regards
Pranay
hin_tess.png
pageseg_split_debug.png

Debayan Banerjee

Mar 31, 2011, 7:31:59 AM3/31/11
to indi...@googlegroups.com
On 31 March 2011 16:54, pranay prateek <prana...@gmail.com> wrote:
> Okay.. so I was able to install tesseract R582 and when I ran it on a test
> image for hindi, it gave segmentation fault.
>
>
> Splitting shiro-rekha ...
> Split strategy = Minimal
> Initial pageseg available = no
> Skipping splitting CC at (11, 12): stroke width too huge..
> Skipping splitting CC at (127, 14): stroke width too huge..
> Cube ERROR (ConvNetCharClassifier::RunNets): NeuralNet is NULL
> Cube ERROR (ConvNetCharClassifier::RunNets): NeuralNet is NULL
>
> and it just crashes after that.
>
> Also, as pointed by Debayan the Shirorekha splitting isn't accurate. And the
> darkened area of the text is totally ignored.
> Lots of work needed on it I guess.
> Does tesseractindic also employ similar matraa- splitting algo or is it
> different?

After an initial reading of the code, it seems they have used
morphological operations like erosion, dilation, etc. for maatraa
clipping. I have used projection operations.
When I wrote the code I had no idea about morphological operations.
They are better because they are not affected by local noise or random
dark pixels.
I suggest you spend some time reading about morphological operations
in Digital Image Processing. Google is your friend. Also check out
Ocropus' maatraa clipping operations
(http://sites.google.com/site/ocropus/old-documentation/morphological-operations).
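[Editor's note: the morphological operations mentioned above can be illustrated with a toy sketch. This is not Tesseract's or Ocropus' code, just a minimal pure-Python demonstration of erosion, dilation, and closing on a small 0/1 grid with a 3x3 structuring element.]

```python
# Toy sketch of binary morphology (not Tesseract's implementation).
# Out-of-bounds neighbours are treated as 0.

def erode(img):
    """A pixel stays 1 only if all pixels in its 3x3 neighbourhood are 1."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = int(all(
                0 <= y + dy < h and 0 <= x + dx < w and img[y + dy][x + dx]
                for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
    return out

def dilate(img):
    """A pixel becomes 1 if any pixel in its 3x3 neighbourhood is 1."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = int(any(
                0 <= y + dy < h and 0 <= x + dx < w and img[y + dy][x + dx]
                for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
    return out

def close_(img):
    """Morphological close = dilation followed by erosion;
    fills small holes and gaps without growing the shape overall."""
    return erode(dilate(img))

# A thick bar with a one-pixel hole; closing fills the hole,
# which is why it is robust to local noise and random dark pixels.
bar = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
closed = close_(bar)
```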

--
Debayan Banerjee

Debayan Banerjee

Mar 31, 2011, 3:57:16 PM3/31/11
to indi...@googlegroups.com
On 31 March 2011 17:01, Debayan Banerjee <deba...@gmail.com> wrote:
> On 31 March 2011 16:54, pranay prateek <prana...@gmail.com> wrote:
>> Okay.. so I was able to install tesseract R582 and when I ran it on a test
>> image for hindi, it gave segmentation fault.
>>
>>
>> Splitting shiro-rekha ...
>> Split strategy = Minimal
>> Initial pageseg available = no
>> Skipping splitting CC at (11, 12): stroke width too huge..
>> Skipping splitting CC at (127, 14): stroke width too huge..
>> Cube ERROR (ConvNetCharClassifier::RunNets): NeuralNet is NULL
>> Cube ERROR (ConvNetCharClassifier::RunNets): NeuralNet is NULL
>>
>> and it just crashes after that.
>>
>> Also, as pointed by Debayan the Shirorekha splitting isn't accurate. And the
>> darkened area of the text is totally ignored.
>> Lots of work needed on it I guess.
>> Does tesseractindic also employ similar matraa- splitting algo or is it
>> different?
>
> After an initial reading of the code it seems they have used
> morphological operations like erosion, dilation etc for maatraa
> clipping. I have used projection operations.

I read the code carefully after I came home from work. They first try
to do a morphological close operation (dilation followed by erosion),
which gets rid of some noise, like stray marks. Then they pretty much
do the same thing that the clip-maatraa algorithm does. They use
projection operations as well: they compute a histogram of the word
image for white pixels (also called 'on' pixels), then look for local
maxima; these points are good candidates for clipping.
This is good. The code is far cleaner than what I could have written.
The few wrong clippings do not seem to cause problems since they can
merge blobs (connected components) for recognition.
The next step (which is yet to be implemented) is to physically
separate consonant and vowel signs in the image itself,
e.g. কি into ক and ি (ki into ka and the i-kaar).
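[Editor's note: the projection idea described above can be sketched in a few lines. This is an illustrative toy, not the actual Tesseract code: in a binarized word image (1 = ink), the shiro-rekha shows up as the row with the most on-pixels, since the head-line runs across the whole word.]

```python
# Toy sketch of a horizontal projection profile (not Tesseract's code).

def row_profile(img):
    """Horizontal projection: number of on-pixels in each row."""
    return [sum(row) for row in img]

def shirorekha_row(img):
    """Index of the row with the maximum projection value —
    a good candidate for the shiro-rekha / clipping line."""
    profile = row_profile(img)
    return profile.index(max(profile))

# Tiny synthetic "word": a full-width head-line on row 1 with
# character strokes hanging below it.
word = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],  # shiro-rekha
    [0, 1, 0, 0, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 1, 1, 0, 1],
]
```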

> Debayan Banerjee
>

--
Debayan Banerjee

pranay prateek

Mar 31, 2011, 4:46:32 PM3/31/11
to indi...@googlegroups.com, Debayan Banerjee
Since the vowel signs in Hindi and Bengali are fairly standard (I don't know about the other Indic languages),
and we already know the blocks of akshars, i.e. की, कि, कु, कू, etc., from maatraa clipping, I think template matching
should work well. We could have templates of all vowel signs and run them over the blocks to tell which vowel signs are actually
present there. Though I do have a feeling that this might be computationally intensive.
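[Editor's note: the template-matching idea above can be sketched as a naive sliding-window comparison on binary images. The "template" here is a made-up shape, not a real vowel sign; real systems use normalized correlation and multi-scale search.]

```python
# Toy sketch of binary template matching (illustrative only).

def match_score(block, tmpl, top, left):
    """Fraction of template pixels that agree with the block
    at offset (top, left)."""
    th, tw = len(tmpl), len(tmpl[0])
    agree = sum(
        block[top + y][left + x] == tmpl[y][x]
        for y in range(th) for x in range(tw))
    return agree / (th * tw)

def best_match(block, tmpl):
    """Best score of the template anywhere inside the block."""
    bh, bw = len(block), len(block[0])
    th, tw = len(tmpl), len(tmpl[0])
    return max(
        match_score(block, tmpl, top, left)
        for top in range(bh - th + 1)
        for left in range(bw - tw + 1))

# A block containing the template shape exactly at offset (1, 1).
block = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
tmpl = [
    [1, 1],
    [1, 0],
]
```

The cost concern is real: this brute-force search is O(block area x template area) per template, which is why the later messages in this thread discuss pruning the candidate set first.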
Also, on a side note, what is the final format in which we want to give the output? Is it UTF-8, the standard followed
in most web-based applications nowadays?
--
Pranay

Anish Patankar

Apr 1, 2011, 12:31:02 AM4/1/11
to indi...@googlegroups.com
In Devanagari, the combination of vowel and consonant is rule-bound, with few exceptions,
e.g. ब/क/म + ऊ = बू/कू/मू (exception: र + ऊ = रू).
So, instead of running all vowels on all consonants, it is better to detect the consonant and the vowel separately, i.e. split ब and ऊ (in the form it takes when written in conjunction with a consonant).
Also, for samyuktakshars (e.g. क्र), making all combinations of all consonants might be required anyway.
So, running vowels on samyuktakshars (e.g. क्रि) will produce another set of symbols.
Detecting all these symbols (only consonant, only vowel, consonant + vowel, consonant + consonant, consonant + consonant + vowel) as separate symbols will be computationally too expensive, I feel.
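[Editor's note: some back-of-the-envelope arithmetic makes the explosion above concrete. The counts are illustrative assumptions only — roughly 33 consonants and 10 dependent vowel signs in Devanagari.]

```python
# Illustrative class counts (assumed, not exact inventories).
consonants = 33
vowel_signs = 10

# Modelling each consonant+vowel combination as its own class:
combined = consonants * vowel_signs            # one class per CV pair

# Adding two-consonant conjuncts, each again combinable with vowels:
conjuncts = consonants * consonants            # possible CC pairs
combined_with_conjuncts = combined + conjuncts * vowel_signs

# Modelling consonants and vowel signs as separate classes instead:
separate = consonants + vowel_signs
```

Even with these conservative numbers, the combined-class model needs thousands of classes (and training samples for each), while the separate model needs only a few dozen.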
Actually, म = म् (consonant) + अ (vowel). I have ignored consonants in pure form (म्) completely. For Sanskrit, consonants appear in pure form in many places.
Also, as far as I understand, the output of Tesseract is in UTF-8 encoding. I could read the resulting buffer from Java using JNI as a UTF-8 encoded String.

+Anish

Abhaya Agarwal

Apr 1, 2011, 12:55:24 AM4/1/11
to indi...@googlegroups.com
Just wanted to pitch in about the problem of a large number of classes. The main problem with having a large number of classes is not computation; it is the fact that your training data gets divided into so many classes. So you have far fewer samples per class, and hence the statistics are less robust. To put it the other way round, the more classes you have, the more training data you need.

On the flip side, by splitting the consonants and vowels into two different classes, we lose the context information. The model also implicitly allows all the combinations, thus spending effort on impossible outputs. Moreover, not all the combinations are equally likely. If we model each combination as a separate class, that information is captured, but here it is lost. So it becomes important to somehow control the allowed combinations or put some feedback loop in. For example, if the consonant has been identified as 'sh', then it is not possible to combine it with a half r, so the vowel matching can ignore that.
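[Editor's note: the feedback idea above can be sketched as a lookup table of allowed combinations used to prune candidates before the expensive matching step. The table below is a made-up fragment with made-up labels, not a real rule set for any script.]

```python
# Sketch of constraint-based pruning (illustrative rule table only).
ALLOWED = {
    "sha": {"aa", "i", "ii", "u", "uu", "e", "o"},    # e.g. no half-r here
    "ra":  {"aa", "i", "ii", "e", "o", "special-u"},  # ru/ruu are irregular
}

def prune_candidates(consonant, vowel_candidates):
    """Keep only candidates that may combine with this consonant;
    unknown consonants are left unpruned."""
    allowed = ALLOWED.get(consonant)
    if allowed is None:
        return list(vowel_candidates)
    return [v for v in vowel_candidates if v in allowed]
```

Pruning this way both saves matching effort and removes impossible outputs from consideration, which is exactly the feedback loop described above.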

Regards,
Abhaya
--
-------------------------------------------------
blog: http://abhaga.blogspot.com
Twitter: http://twitter.com/abhaga
-------------------------------------------------

pranay prateek

Apr 1, 2011, 2:21:35 AM4/1/11
to indi...@googlegroups.com, Abhaya Agarwal


On Fri, Apr 1, 2011 at 10:25 AM, Abhaya Agarwal <abhaya....@gmail.com> wrote:

> Just wanted to pitch in about the problem of large number of classes. The main problem with having large number of classes is not computation, it is the fact that your training data gets divided into so many classes. So you have much lesser number of samples per class and hence the statistics are less robust. To put it other way round, the more the number of classes, the more training data you need.
>
> On the flip side, by splitting the consonants and vowels in two different classes, we lose the context information. Also the model implicitly allows all the combinations thus spending effort on impossible outputs. Moreover not all the combinations are equally likely. If we model each combination as a separate class, that information is captured but here it is lost. So it becomes important to somehow control the allowed combinations or put some feedback loop in. For example, if the consonant has been identified as being 'sh', then it is not possible to combine it with a half r. So the vowel matching can ignore that.
>
> Regards,
> Abhaya


I agree with Abhaya. We should try to do the matching in a more intelligent fashion: maybe first do a rough match based on some initial estimates of which blocks might be possible matches, and then do a more detailed match in a second stage. I don't know what the standard way of matching the letter blocks is, but template matching is the simplest technique that comes to my mind right now.
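[Editor's note: a minimal sketch of the two-stage idea, under the assumption that the cheap first pass compares bounding-box sizes and the expensive second pass is a pixel-by-pixel score. The templates are made-up stand-ins, not real glyphs.]

```python
# Toy coarse-to-fine classifier sketch (illustrative only).

def pixel_score(block, tmpl):
    """Fraction of agreeing pixels; assumes equal dimensions."""
    h, w = len(block), len(block[0])
    agree = sum(block[y][x] == tmpl[y][x]
                for y in range(h) for x in range(w))
    return agree / (h * w)

def classify(block, templates):
    h, w = len(block), len(block[0])
    # Stage 1 (cheap): keep only templates whose size matches the block.
    survivors = {name: t for name, t in templates.items()
                 if len(t) == h and len(t[0]) == w}
    if not survivors:
        return None
    # Stage 2 (expensive): detailed pixel comparison on the survivors.
    return max(survivors,
               key=lambda name: pixel_score(block, survivors[name]))

templates = {
    "dot": [[0, 0], [0, 1]],
    "bar": [[1, 1], [0, 0]],
    "big": [[1, 1, 1], [1, 1, 1], [1, 1, 1]],  # eliminated in stage 1
}
```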

Also, I request people on the list to append their responses at the bottom of the mail, so that the flow of the chain is maintained. Otherwise, it becomes difficult to track the order of the ideas. Thanks.

Regards
Pranay

Anish Patankar

Apr 1, 2011, 5:29:49 AM4/1/11
to indi...@googlegroups.com
Agreed with Abhaya about training data.
As far as I know, Tesseract uses some neural-network-based
algorithm for matching characters. After detecting characters, it
forms words. Then, in a second step, these words are looked up in a
dictionary. Incorrectly detected characters can be corrected based
on the matching word. This word matching is the only source of
contextual information.
It would be great if someone could throw more light on Tesseract's
internals and code.
I wonder how much we should deviate from basic Tesseract to support
Indic scripts.

+Anish

Debayan Banerjee

Apr 1, 2011, 6:00:05 AM4/1/11
to indi...@googlegroups.com
On 1 April 2011 14:59, Anish Patankar <anis...@gmail.com> wrote:
> Agreed with Abhaya about training data.
> As far as I know about Tesseract, it uses some neural network based
> algorithm for matching characters. After detecting characters, it
> forms words. Then in second step these words are searched for in
> dictionary. Some incorrectly detected character can be corrected based
> on the matching word. This word matching is the only source of
> contextual information.
> It would be great, if someone can throw more light on Tessrecat
> internals and code.
> I wonder how much we should deviate from basic Tesseract to support
> Indic scripts in Tesseract.


There is no neural network in Tesseract.

"Tesseract used to employ a neural network called Aspirin/Migraines
(which has been removed due to licensing issues) which allowed
training. How did this removal affect tesseract's accuracy? Ray Smith
said, "When I took the NN code out, I measured the increase in error
rate to be slightly less than 1% (relative). The benefit was a 10-15%
improvement in speed. I have some accuracy/bug fixes coming in the
next release (i.e., v1.03) that more than compensate by reducing the
error rate by more than 3%. Of course any error rate changes are
dependent on the test set and are not necessarily reproducible on a
different test set..."

Taken from http://tesseract-ocr.repairfaq.org/tess_glossary.html.

I agree we need not deviate much from Tesseract's way of doing things.
What we need to do in the pre-processing and post-processing stage
depends on the script as well. For example, this whole shirorekha
clipping business is not required for Dravidian scripts.

--
Debayan Banerjee
