Hi,
sorry, there is no tutorial yet, and there are actually a number of
different possibilities.
The following works on the development branch if you have text line
images with transcriptions and you already have a character
recognition model that works somewhat, but not well. The process is
roughly as follows:
- put the text line images into *.png files and the corresponding
  ground truth into *.gt.txt files
- run ocropus-calign -x .gt.txt -m my.cmodel *.png
- run ocropus-extract-csegs *.png -o chars.db
- optionally, correct the character labels with
  ocropus-cedit chars.db -t chars
- optionally, cluster the character shapes with
  ocropus-cluster chars.db clusters.db
- optionally, correct the cluster labels with
  ocropus-cedit clusters.db
- train a new character recognition model with
  ocropus-ctrain -b clusters.db new.cmodel
You can now recognize with "ocropus-calign -m new.cmodel ..."
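
Before running the alignment step, it can help to check that every
line image has a transcription. The following is only a small sketch
and not part of OCRopus. It assumes that foo.png is paired with
foo.gt.txt in the same directory (which is how I read the
"-x .gt.txt" option above), and the function name find_unpaired is
made up for this example.

```python
from pathlib import Path


def find_unpaired(directory):
    """Return (images_without_gt, gt_without_image) as sorted base names.

    Assumption: "foo.png" belongs together with "foo.gt.txt".
    """
    directory = Path(directory)
    # "line01.png" -> "line01"
    images = {p.name[:-len(".png")] for p in directory.glob("*.png")}
    # "line01.gt.txt" -> "line01"
    truths = {p.name[:-len(".gt.txt")] for p in directory.glob("*.gt.txt")}
    # Set differences give the base names that lack a partner file.
    return sorted(images - truths), sorted(truths - images)
```

Calling find_unpaired(".") in the directory with the line images
returns two lists: images that have no ground truth, and ground truth
files that have no image. Both should be empty before you start.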
There are other recipes for completely new scripts (i.e., if you don't
already have any model), for new scripts that differ from old scripts
by only a few characters, etc.
Also, there are two kinds of recognizers, the old C++ recognizer
(ocropus-linerec) and the new Python recognizer (ocropus-calign); they
work similarly but have some differences. For the official release,
we're moving completely to the Python recognizer.
Tom