I have been looking into the Tesseract source code lately. I have been
writing some small programs that call the API to do simple things, such
as getting bounding boxes and baselines for glyphs. I have also been
trying to modify the way Tesseract combines two connected components
when they are not separated horizontally. If we can change this, and if
we can simply 'clip' at the point between the consonant and the
descending vowel, Tesseract will do the rest. Example image:
<http://1.bp.blogspot.com/-Y7CaiQH_iZ4/TZ4UVtTeJzI/AAAAAAAAH0k/7c6DMj-zlhY/s1600/46.png_.jpg>

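
To make the 'clip' idea concrete, here is a minimal sketch in Python. It is
not Tesseract code and it is not what I have implemented. The function names
(row_ink, find_clip_row, clip_glyph), the baseline_row parameter, and the
"thinnest inked row below the baseline" heuristic are all assumptions made
for illustration. The glyph is just a small grid of 0/1 pixels.

```python
from typing import List, Tuple

Glyph = List[List[int]]  # rows of 0/1 pixels, row 0 at the top


def row_ink(glyph: Glyph) -> List[int]:
    """Count the ink pixels in each row (a horizontal projection profile)."""
    return [sum(row) for row in glyph]


def find_clip_row(glyph: Glyph, baseline_row: int) -> int:
    """Return the thinnest inked row strictly below baseline_row.

    Hypothetical heuristic: the join between the consonant and the
    descending vowel is assumed to be the row with the least ink below
    the baseline. Ties go to the row closest to the baseline.
    Returns -1 if there is no inked row below the baseline.
    """
    ink = row_ink(glyph)
    best = -1
    for r in range(baseline_row + 1, len(glyph)):
        if ink[r] > 0 and (best == -1 or ink[r] < ink[best]):
            best = r
    return best


def clip_glyph(glyph: Glyph, clip_row: int) -> Tuple[Glyph, Glyph]:
    """Split the glyph around clip_row; the clip row itself is dropped."""
    upper = [row[:] for row in glyph[:clip_row]]
    lower = [row[:] for row in glyph[clip_row + 1:]]
    return upper, lower


# Toy glyph: a consonant on top, joined to a descending vowel by one pixel.
glyph = [
    [1, 1, 1, 1, 1],  # row 0: consonant
    [1, 0, 0, 0, 1],  # row 1: consonant
    [1, 1, 1, 1, 1],  # row 2: bottom of the consonant, taken as the baseline
    [0, 0, 1, 0, 0],  # row 3: thin join
    [0, 1, 1, 1, 0],  # row 4: descending vowel
    [0, 1, 1, 0, 0],  # row 5: descending vowel
]
clip_row = find_clip_row(glyph, baseline_row=2)
upper, lower = clip_glyph(glyph, clip_row)
```

Once the join row is removed, the consonant and the vowel are two separate
connected components. Real glyphs will need something more robust: for
example, the tail of a vowel sign can be thinner than the join, which would
fool this heuristic.
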
I had committed to drawing a high-level schematic diagram of the OCR
system we are trying to build, but right now I am not sure what
architecture we will follow, because that depends on how our algorithms
work out.
--
Debayan Banerjee