Indu S just pointed out to me that Tesseract-3.0 implements shirorekha
clipping code (somewhat like
http://tesseractindic.googlecode.com/files/clipmatra_pseudocode.pdf).
The file is:
http://code.google.com/p/tesseract-ocr/source/browse/trunk/textord/devanagari_processing.cpp
Apparently it sometimes clips the maatraa incorrectly. See the
attachment.
--
Debayan Banerjee
After an initial reading of the code it seems they have used
morphological operations like erosion, dilation etc for maatraa
clipping. I have used projection operations.
When I wrote the code I had no idea about morphological operations.
They are better because they are less affected by local noise or
stray dark pixels.
I suggest you spend some time reading about morphological operations
in Digital Image Processing. Google is your friend. Also check out
Ocropus' maatraa clipping operations
(http://sites.google.com/site/ocropus/old-documentation/morphological-operations).
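To make the difference concrete, here is a minimal pure-Python sketch of
the basic morphological operations on a tiny binary image. It is only an
illustration: it is not the Tesseract or Ocropus code, the function names
are made up for this mail, and it assumes 1 = ink and a square
(2k+1)x(2k+1) structuring element.

```python
def _window_test(img, k, combine):
    """Apply all()/any() over each pixel's (2k+1)x(2k+1) window.

    Pixels outside the image are treated as off (0).
    """
    h, w = len(img), len(img[0])
    return [[int(combine(
                0 <= y + dy < h and 0 <= x + dx < w and img[y + dy][x + dx] == 1
                for dy in range(-k, k + 1) for dx in range(-k, k + 1)))
             for x in range(w)] for y in range(h)]


def erode(img, k=1):
    # A pixel survives only if its whole window is ink.
    return _window_test(img, k, all)


def dilate(img, k=1):
    # A pixel is set if any pixel in its window is ink.
    return _window_test(img, k, any)


def opening(img, k=1):
    # Erosion followed by dilation: removes specks smaller than the window.
    return dilate(erode(img, k), k)


def closing(img, k=1):
    # Dilation followed by erosion: fills gaps smaller than the window.
    return erode(dilate(img, k), k)


def column_projection(img):
    # Number of ink pixels in each column.
    return [sum(col) for col in zip(*img)]


# A solid 3x5 stroke plus one stray dark pixel at row 3, column 7.
noisy = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
opened = opening(noisy)
print(column_projection(noisy))   # [0, 3, 3, 3, 3, 3, 0, 1, 0]
print(column_projection(opened))  # [0, 3, 3, 3, 3, 3, 0, 0, 0]

# A horizontal stroke with a one-pixel break in the middle.
broken = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
closed = closing(broken)
print(closed[2])                  # [0, 0, 1, 1, 1, 1, 1, 0, 0]
```

The stray pixel shows up in the raw projection but disappears after the
opening, and the closing repairs the break in the stroke.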
--
Debayan Banerjee
I read the code carefully after I came home from work. They first do
a morphological close operation (dilation followed by erosion), which
cleans up some noise such as small gaps and holes in the strokes. Then
they pretty much do the same
thing that the clip maatraa algorithm does. They use projection
operations as well. They compute a histogram of the word image for
white pixels (also called 'on' pixels most of the time). They then
look for the local maxima and these points are good candidates for
clipping.
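A rough sketch of the histogram step, in plain Python. This is my own
illustration and not the code from devanagari_processing.cpp: the
function names are hypothetical, the morphological clean-up step is left
out, and it assumes the headline has already been removed and that
1 = white and 0 = ink.

```python
def column_histogram(img, on=1):
    """Count the 'on' pixels in every column of a binary image."""
    return [sum(1 for row in img if row[x] == on) for x in range(len(img[0]))]


def local_maxima(hist):
    """Return the centre index of every peak (or flat plateau) in hist.

    A peak must be strictly higher than the values on both sides, so
    runs that touch the left or right image border are ignored.
    """
    peaks = []
    i, n = 0, len(hist)
    while i < n:
        j = i
        # Extend j to the end of a plateau of equal values.
        while j + 1 < n and hist[j + 1] == hist[i]:
            j += 1
        left_lower = i > 0 and hist[i - 1] < hist[i]
        right_lower = j < n - 1 and hist[j + 1] < hist[i]
        if left_lower and right_lower:
            peaks.append((i + j) // 2)
        i = j + 1
    return peaks


# Two glyph bodies (0 = ink) with a mostly white gap between them.
word = [
    [1, 0, 0, 1, 1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1, 0, 0, 0, 1],
    [1, 0, 0, 1, 1, 1, 0, 0, 1],
]
hist = column_histogram(word)
print(hist)                # [3, 0, 0, 2, 3, 2, 0, 0, 3]
print(local_maxima(hist))  # [4]
```

Column 4 has the most white pixels between the two bodies, so it comes
out as the only clipping candidate. The white margins at columns 0 and 8
are skipped because they touch the border.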
This is good. The code is far cleaner than what I could have written.
The few wrong clippings do not seem to cause problems since they can
merge blobs (connected components) for recognition.
The next step (which is yet to be implemented) is to physically
separate consonants from their vowel signs in the image itself,
e.g. কি into ক and ি (ki into ka and eekaar).
--
Debayan Banerjee
+Anish
There is no neural network in Tesseract.
"Tesseract used to employ a neural network called Aspirin/Migraines
(which has been removed due to licensing issues) which allowed
training. How did this removal affect tesseract's accuracy? Ray Smith
said, 'When I took the NN code out, I measured the increase in error
rate to be slightly less than 1% (relative). The benefit was a 10-15%
improvement in speed. I have some accuracy/bug fixes coming in the
next release (i.e., v1.03) that more than compensate by reducing the
error rate by more than 3%. Of course any error rate changes are
dependent on the test set and are not necessarily reproducible on a
different test set...'"
Taken from http://tesseract-ocr.repairfaq.org/tess_glossary.html.
I agree we need not deviate much from Tesseract's way of doing things.
What we need to do in the pre-processing and post-processing stages
depends on the script as well. For example, this whole shirorekha
clipping business is not required for Dravidian scripts.
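As a small illustration of making that decision per script, the sketch
below checks a sample of text against Unicode block ranges. The function
and table names are hypothetical, the table only lists three scripts that
are written with a headline, and it assumes some sample text for the
target language (e.g. training ground truth) is available.

```python
# Unicode blocks of some scripts written with a headline (shirorekha/maatraa).
HEADLINE_BLOCKS = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Gurmukhi": (0x0A00, 0x0A7F),
}


def needs_shirorekha_clipping(sample_text):
    """Return True if any character falls in a headline-script block."""
    return any(
        lo <= ord(ch) <= hi
        for ch in sample_text
        for lo, hi in HEADLINE_BLOCKS.values()
    )


print(needs_shirorekha_clipping("কি"))     # Bengali -> True
print(needs_shirorekha_clipping("தமிழ்"))  # Tamil   -> False
```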
--
Debayan Banerjee