SahanaPy Workout - OCR experiments from foss.in

Praneeth Bodduluri

Dec 6, 2009, 7:55:38 AM

to talkin...@googlegroups.com

Hello all,

I am Praneeth, and I work on SahanaPy ( http://trac.sahanapy.org ). Suryajith and I took part in the SahanaPy workout at FOSS.in ( http://foss.in ), where we analysed some popular Open Source OCR systems. We have also written some code that converts SahanaPy models directly into OCR-capable forms using ReportLab. The code can be found at ( https://code.launchpad.net/~lifeeth/sahana/sahanapy-trunk ) and can quite easily be modified to build custom forms. I have attached a form generated this way: it is auto-generated from the database model and has no form-specific tweaks.
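To give a rough idea of what generating a form from a model involves, here is a minimal sketch. It is not the code from the branch above: the model description, the field names and the layout constants are all made up for illustration, and only the ReportLab calls (`canvas.Canvas`, `drawString`, `rect`, `showPage`, `save`) are real library functions. Each field gets a label and a row of one-character boxes, which is the usual way to keep handwriting separable for OCR.

```python
# Hypothetical model description: (field name, label, max characters).
# This is a stand-in for illustration, not SahanaPy's real model API.
PERSON_MODEL = [
    ("first_name", "First name", 20),
    ("last_name", "Last name", 20),
    ("age", "Age", 3),
]

BOX = 18          # side of one character box, in points
LEFT = 50         # left margin, in points
TOP = 780         # baseline of the first label on an A4 page (842 pt tall)
ROW_HEIGHT = 50   # vertical distance between two fields


def layout_form(model):
    """Return one entry per field: label position plus one box per character."""
    rows = []
    for index, (name, label, length) in enumerate(model):
        y = TOP - index * ROW_HEIGHT
        # Boxes sit just below the label; each tuple is (x, y, width, height).
        boxes = [(LEFT + i * BOX, y - 6 - BOX, BOX, BOX) for i in range(length)]
        rows.append({"name": name, "label": label,
                     "label_pos": (LEFT, y), "boxes": boxes})
    return rows


def render_pdf(rows, path):
    """Draw the layout with ReportLab (needs the reportlab package installed)."""
    from reportlab.lib.pagesizes import A4
    from reportlab.pdfgen import canvas

    page = canvas.Canvas(path, pagesize=A4)
    for row in rows:
        label_x, label_y = row["label_pos"]
        page.drawString(label_x, label_y, row["label"])
        for x, y, width, height in row["boxes"]:
            page.rect(x, y, width, height)
    page.showPage()
    page.save()
```

Calling `render_pdf(layout_form(PERSON_MODEL), "person_form.pdf")` would write a one-page form. Keeping the layout separate from the drawing means the same box coordinates can be reused later to cut the scanned image into fields.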

For testing we used Tesseract-OCR ( http://code.google.com/p/tesseract-ocr/ ). We found that for printed text and numbers the results are close to 100% accurate, but handwritten text still needs work. We used the Tesseract-OCR training data from the repository and did no additional training. We think that training the OCR engine on the handwriting of the person collecting the data will considerably increase its accuracy; hopefully we will get a chance to test this soon.
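For anyone who wants to repeat this kind of test, a minimal sketch of driving the Tesseract command-line tool from Python is shown below. It assumes the `tesseract` binary is installed and on the PATH and uses its basic `tesseract <image> <outputbase> -l <lang>` form, which writes the recognized text to `<outputbase>.txt`. The function names and file names are hypothetical, and this is not necessarily how we ran our own tests.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, language="eng"):
    """Build the argument list for: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", language]


def run_ocr(image_path, output_base, language="eng"):
    """Run Tesseract on one image and return the recognized text.

    Raises subprocess.CalledProcessError if Tesseract exits with an error.
    """
    subprocess.run(build_tesseract_command(image_path, output_base, language),
                   check=True)
    # Tesseract appends ".txt" to the output base name.
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

For example, `run_ocr("scanned_form.tif", "scanned_form")` would return the text Tesseract found in the scan, using the stock English training data.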

We have yet to try out CellWriter ( http://risujin.org/cellwriter/ ). It supports right-to-left languages, which Tesseract does not.

We discussed some ways in which such a system could be implemented. A possible outline for the workflow could be:

1) Define a model in SahanaPy
2) Use the inbuilt OCR controller to generate the PDF
3) Train the OCR engine with the handwriting of the collector
4) Collect the data with the forms
5) Scan and upload the data to a central server (the server hosts the OCR engine)
6) Pre-populate a web form with the data obtained from the OCR. This lets the user verify the data if needed and hit submit. (Note that the current SahanaPy framework enables this without anything model-specific having to be done.) If some text cannot be recognized, the image is shown above the text field, and the person scanning the forms can type it in and hit submit to commit the data to the SahanaPy database.
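Step 6 could look roughly like the sketch below. It assumes the OCR stage hands back, for every field, the recognized text together with a confidence score between 0 and 1. That score, the 0.8 threshold and all the names are assumptions made for illustration; nothing here is existing SahanaPy code.

```python
def prepopulate(ocr_fields, threshold=0.8):
    """Decide, per field, whether to pre-fill the value or show the scan.

    ocr_fields maps a field name to (recognized_text, confidence).
    Fields below the threshold are left empty and flagged so that the
    web form can display the cropped image above the text box.
    """
    form = {}
    for name, (text, confidence) in ocr_fields.items():
        if confidence >= threshold:
            form[name] = {"value": text, "show_image": False}
        else:
            form[name] = {"value": "", "show_image": True}
    return form


# Hypothetical OCR output for one scanned form.
scanned = {
    "first_name": ("Praneeth", 0.95),
    "age": ("2?", 0.40),
}
form = prepopulate(scanned)
```

Here `first_name` would arrive pre-filled for verification, while `age` would be left blank with its image shown so that the operator can type the value in by hand.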

Note that the only thing an end user needs to do is set up the model; everything else is automated. Also, all the code is cross-platform, as SahanaPy and most of the Open Source OCR engines are cross-platform.

We would like to know what you think about this. I am relatively new to this field, so please excuse any obvious mistakes.

--
Praneeth