Well, we have the first numbered release of OCRopus up on ocropus.org, the promised Alpha release. As scheduled, it includes a lot of new functionality:
- text/image segmentation
- MLP-based character recognition
- OpenFST-based statistical language modeling
- more detailed layout information in the hOCR output
- better testing and evaluation tools
- some image cleanup, deskewing
- Lua-based configuration and scripting
- fast binary morphology
- better code organization through namespaces, include file simplifications
- code for alignment and training data generation from transcribed ground truth
This branch will be maintained as the 0.1 branch, and main development is moving to 0.2, eventually resulting in the 0.5 release (the beta release), planned for the end of Q1 2008. Most new functionality will go only into 0.2, but we will be back-porting smaller, useful pieces of functionality to 0.1.
Note that while the MLP-based recognizer and the OpenFST language modeling work, they do not perform very well yet; we have just focussed on getting the functionality in there for now.
For the beta release, we will be focusing less on new functionality and more on getting higher quality output, better command line tools for training and testing, and bug fixing.
I'd like to thank everybody for their feedback, suggestions, and contributions, and in particular Daniel, Hagen, Faisal, Ilya, and Christian for the large amount of pre-release work. We all hope that OCRopus will become increasingly useful over the next year.
Cheers,
Thomas
for the OCRopus developers