Hello all,
I am Praneeth, and I work on SahanaPy ( http://trac.sahanapy.org ).
Suryajith and I took part in the SahanaPy workout at FOSS.in
( http://foss.in ). We have been analysing some popular Open Source OCR
systems, and we have also written some code that converts SahanaPy
models directly into OCR-capable forms using ReportLab. The code can be
found at ( https://code.launchpad.net/~lifeeth/sahana/sahanapy-trunk )
and can quite easily be modified to build custom forms. I have attached
a form generated from the model. The form is auto-generated from the
database model and has no form-specific tweaks.
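To give an idea of how a model can be turned into an OCR-friendly form, here is a minimal sketch. It is not the code from the branch above: the field list, box size, and margins are assumptions made up for illustration, and a real SahanaPy model would be read from the database definition. It lays out one row of character boxes per field and, optionally, draws them with ReportLab's canvas.

```python
# Hypothetical field description: (label, number of character boxes).
# A real SahanaPy model would be read from the database definition instead.
FIELDS = [("First name", 20), ("Last name", 20), ("Age", 3)]

BOX = 18       # side of one character box in points (assumed)
LEFT = 50      # left margin in points (assumed)
TOP = 780      # y of the first row, near the top of an A4 page (assumed)
ROW_GAP = 45   # vertical distance between rows in points (assumed)


def layout(fields):
    """Return one row per field: label position plus one (x, y) per box."""
    rows = []
    for i, (label, length) in enumerate(fields):
        y = TOP - i * ROW_GAP
        # (x, y) is the lower-left corner of each box.
        boxes = [(LEFT + n * BOX, y - BOX) for n in range(length)]
        rows.append({"label": label, "label_pos": (LEFT, y + 4), "boxes": boxes})
    return rows


def render(rows, filename="form.pdf"):
    """Draw the layout with ReportLab (imported here so layout() works without it)."""
    from reportlab.lib.pagesizes import A4
    from reportlab.pdfgen import canvas

    c = canvas.Canvas(filename, pagesize=A4)
    for row in rows:
        c.drawString(row["label_pos"][0], row["label_pos"][1], row["label"])
        for x, y in row["boxes"]:
            c.rect(x, y, BOX, BOX)
    c.showPage()
    c.save()


rows = layout(FIELDS)
```

Calling render(rows) would write form.pdf if ReportLab is installed. Keeping the layout separate from the drawing means the same box coordinates could later be reused to crop each field out of a scanned page.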
For testing we used Tesseract-OCR
( http://code.google.com/p/tesseract-ocr/ ). We found that for printed
text and numbers the results are nearly 100% accurate, but handwritten
text still needs work. We used the Tesseract-OCR training data from the
repository and did no additional training. We think that training the
OCR engine on the handwriting of the person collecting the data will
considerably increase its accuracy. Hopefully we will get a chance to
test this soon.
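For anyone who wants to repeat the test, a minimal sketch of calling Tesseract from Python follows. It assumes the tesseract binary is on the PATH; the function names and file names are hypothetical. The language code passed with -l selects which training data is used, so data trained on a collector's handwriting would presumably be selected the same way.

```python
import subprocess


def tesseract_command(image_path, output_base, lang="eng"):
    """Build the command line: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run Tesseract and return the recognised text from <output_base>.txt."""
    subprocess.check_call(tesseract_command(image_path, output_base, lang))
    with open(output_base + ".txt") as handle:
        return handle.read()
```

For example, run_ocr("scan.tif", "scan") would write scan.txt next to the working directory and return its contents.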
We are yet to try out CellWriter ( http://risujin.org/cellwriter/ ).
It supports right-to-left languages, which Tesseract does not.
We discussed some ways in which such a system could be implemented. A possible outline of the workflow could be:
1) Define a Model in SahanaPy
2) Use the built-in OCR controller to generate the PDF
3) Train the OCR engine with the handwriting of the collector.
4) Collect the data with the forms
5) Scan the forms and upload the scans to a central server (the server has the OCR engine)
6) Pre-populate a web form with the data obtained from the OCR. This
lets the user verify the data if needed and hit submit. (Note that the
current SahanaPy framework enables this without anything specific
having to be done for a model.) If some text cannot be recognized, the
image is shown above the text field, and the person scanning the forms
can type it in and hit submit to commit the data to the SahanaPy
database.
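Step 6 could be sketched roughly as follows. This is only an illustration, not existing SahanaPy code: the function name, the field names, and the image paths are all hypothetical. It splits the OCR output into values that can be pre-filled and fields whose cropped image should be shown so the text can be typed in by hand.

```python
def build_prefill(ocr_fields):
    """Split OCR output into pre-filled values and fields needing manual entry.

    ocr_fields maps a field name to (text, image_ref); text is None or
    empty when the engine could not recognise the field.
    """
    values, needs_review = {}, {}
    for name, (text, image_ref) in ocr_fields.items():
        if text and text.strip():
            values[name] = text.strip()
        else:
            # Leave the field blank and remember which image to display.
            values[name] = ""
            needs_review[name] = image_ref
    return values, needs_review


# Hypothetical OCR result for one scanned form.
scanned = {
    "first_name": ("Praneeth ", "crops/first_name.png"),
    "age": (None, "crops/age.png"),
}
values, needs_review = build_prefill(scanned)
```

Here values would feed the web form's defaults, and needs_review tells the form which fields should display the scanned image above the text box.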
Note that the only thing an end user needs to do is set up the model;
everything else is automated. Also, all the code is cross-platform, as
SahanaPy and most of the Open Source OCR engines are cross-platform.
We would like to know what you think about this. I am relatively new
to this field, so please excuse any obvious mistakes.
--
Praneeth