How To Train On Book Directory Against Ground Truth Text Files?

Dennis Rardin

May 5, 2010, 5:53:39 PM

to ocropus

All/Anyone,

I have 2 large books broken into pages and then to lines. I'm ready to train. For both books, I have text files to compare against the images.

How do I train OCROpus by using the text files to correct the results of the character recognition?

Thank You Very Much,
Dennis


Ted

May 9, 2010, 4:28:02 PM

to ocropus

Did you ever get a reply to the documentation question?
I'm a new OCROpus user and have to use gocr. This creates problems
because it needs a lot of corrections. I'd like to know how to use
the trainer.

I'd even be willing to write an introductory manual.

Tom

May 13, 2010, 3:45:46 PM

to ocropus

Hi,

sorry, there is no tutorial yet, and there are actually a number of
different possibilities.

The following is what works for the development branch if you have
text lines and transcriptions and you already have a character
recognition model that sort of works but not well. The process then
is roughly as follows:

- put the text line images into *.png files and the corresponding
  ground truth into *.gt.txt files
- run ocropus-calign -x .gt.txt -m my.cmodel *.png
- run ocropus-extract-csegs *.png -o chars.db
- optionally, correct the character labels with
  ocropus-cedit chars.db -t chars
- optionally, cluster the character shapes with
  ocropus-cluster chars.db clusters.db
- optionally, correct the cluster labels with
  ocropus-cedit clusters.db
- train a new character recognition model with
  ocropus-ctrain -b clusters.db new.cmodel

You can now recognize with "ocropus-calign -m new.cmodel ..."
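
For anyone who wants to script the recipe above, here is a rough Python sketch. It only checks that every line image has a transcription next to it and then builds the command lines in order; it does not run anything. The command names and flags are copied from the steps above and apply to the development branch, so they may differ in other versions. The function names (missing_ground_truth, training_commands), the review flag, and the default file names are made up for this example and are not part of OCRopus.

```python
from pathlib import Path


def missing_ground_truth(line_dir, gt_suffix=".gt.txt"):
    """Return the line images in line_dir that have no transcription file."""
    missing = []
    for png in sorted(Path(line_dir).glob("*.png")):
        # 0001.png is expected to sit next to 0001.gt.txt
        gt = png.with_name(png.stem + gt_suffix)
        if not gt.exists():
            missing.append(png.name)
    return missing


def training_commands(pngs, old_model="my.cmodel", new_model="new.cmodel",
                      review=False):
    """Build the recipe's command lines in order, without running them.

    review=True (a hypothetical switch) adds the two interactive
    ocropus-cedit steps that the recipe lists as optional.
    """
    pngs = list(pngs)
    cmds = [
        # align the ground truth against the existing model
        ["ocropus-calign", "-x", ".gt.txt", "-m", old_model, *pngs],
        # extract the character segments into a database
        ["ocropus-extract-csegs", *pngs, "-o", "chars.db"],
    ]
    if review:
        cmds.append(["ocropus-cedit", "chars.db", "-t", "chars"])
    # the recipe marks clustering as optional, but the training step
    # reads clusters.db, so this sketch always includes it
    cmds.append(["ocropus-cluster", "chars.db", "clusters.db"])
    if review:
        cmds.append(["ocropus-cedit", "clusters.db"])
    cmds.append(["ocropus-ctrain", "-b", "clusters.db", new_model])
    return cmds


if __name__ == "__main__":
    lines = sorted(str(p) for p in Path(".").glob("*.png"))
    print("missing transcriptions:", missing_ground_truth("."))
    for cmd in training_commands(lines):
        print(" ".join(cmd))
```

Printing the commands first makes it easy to compare them with the recipe before running anything; each list could then be passed to subprocess.run(cmd, check=True) once the output looks right.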

There are other recipes for completely new scripts (i.e., if you don't
already have any model), for new scripts that differ from old scripts
by only a few characters, etc.

Also, there are two kinds of recognizers, the old C++ recognizer
(ocropus-linerec) and the new Python recognizer (ocropus-calign); they
work similarly but have some differences. For the official release,
we're moving completely to the Python recognizer.

Tom

Dennis Rardin

May 15, 2010, 12:21:52 AM

to ocr...@googlegroups.com

No, I haven't gotten an answer on this yet.

Brad Hards

May 16, 2010, 6:52:36 AM

to ocr...@googlegroups.com, Dennis Rardin

On Saturday 15 May 2010 02:21:52 pm Dennis Rardin wrote:
> No, I haven't gotten an answer on this yet.

There was an answer from Tom on Friday:
http://groups.google.com/group/ocropus/browse_thread/thread/4f3a2ee1a94d419b?hl=en.

Brad

Dennis Rardin

May 20, 2010, 2:58:28 PM

to ocr...@googlegroups.com

Thanks for the reply, Tom. I didn't see it right away for some reason. I'll report back when I have some results from the process you describe.

Dennis
