How to manually correct cseg? critique training procedure (spaces question)?

Keefe

Oct 19, 2009, 7:22:44 AM

to ocropus

Hi,

Here are my specific questions:

What is the recommended procedure for manually correcting cseg.gt.png
files? Is there a utility that I am overlooking?

When generating text for training images, should this include spaces?

My overall procedure: I have spent some time training ocropus on a
custom font, using images from JPGs. I am using the following steps:

1) Generate a variety of single-line training images programmatically
2) Manually type the text contained in each training image
3) Place these in a directory training/0000 or training/0001 etc.
4) Run ocropus lines2fsts training
5) Replace the generated txt files with my txt files and run ocropus
align training to generate cseg.png
6) Run ocropus trainseg on training to generate a new model file
7) Go to 1 using the new training model
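
Step 5 (swapping the generated txt files for hand-typed ones) could be scripted roughly as follows. This is only a sketch under assumptions not stated in the post: that each line image training/NNNN/MMMM.png has a generated transcript training/NNNN/MMMM.txt next to it, and that the hand-typed transcripts sit in a separate directory with the same relative paths. The function name install_transcripts is hypothetical.

```python
import shutil
from pathlib import Path


def install_transcripts(training_dir, transcript_dir):
    """Overwrite generated .txt files under training_dir with hand-typed ones.

    Assumed layout (not confirmed by the post): training/NNNN/MMMM.png with a
    generated training/NNNN/MMMM.txt, and hand-typed transcripts under
    transcript_dir at the same relative paths.
    Returns the line images that have no hand-typed transcript.
    """
    training_dir = Path(training_dir)
    transcript_dir = Path(transcript_dir)
    missing = []
    for image in sorted(training_dir.glob("*/*.png")):
        # Skip derived images such as MMMM.cseg.png left by earlier runs.
        if image.name.count(".") > 1:
            continue
        rel = image.relative_to(training_dir).with_suffix(".txt")
        source = transcript_dir / rel
        if source.is_file():
            shutil.copyfile(source, training_dir / rel)
        else:
            missing.append(image)
    return missing
```

Calling install_transcripts("training", "typed") before running the align step would then report any line image that still lacks a transcription.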

I've read the wiki and a variety of docs like http://docs.google.com/Doc?id=dfxcv4vc_92c8xxp7
and http://www.slideshare.net/tmbdev/ocropus-status

Tom Breuel

Oct 19, 2009, 6:26:23 PM

to ocropus

> What is the recommended procedure for manually correcting cseg.gt.png
> files? Is there a utility that I am overlooking?

There isn't one yet; we've been working on it.

> When generating text for training images, should this include spaces?

Yes; however, space handling in OCRopus is currently inconsistent,
so in practice the spaces are ignored.

> My overall procedure: I have spent some time training ocropus on a
> custom font, using images from JPGs. I am using the following steps:
>
> 1) Generate a variety of single-line training images programmatically
> 2) Manually type the text contained in each training image

If you generate it, why not save the text?

> 3) Place these in a directory training/0000 or training/0001 etc.
> 4) Run ocropus lines2fsts training
> 5) Replace the generated txt files with my txt files and run ocropus
> align training to generate cseg.png
> 6) Run ocropus trainseg on training to generate a new model file
> 7) Go to 1 using the new training model

If you can write a script that takes a text file and font and
generates a book directory full of binary line images, corresponding
csegs, and corresponding Unicode strings, that would be useful.
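
A rough sketch of what such a script might look like, using only the Python standard library. Everything specific here is an assumption made for illustration: TOY_FONT is a hypothetical 3x5 bitmap font standing in for a real rasterized font, the .gt.txt file name and the page directory name are guesses, and the cseg is written as a plain 8-bit label image (0 for background, k for the k-th non-space character), which is not necessarily the encoding OCRopus expects. The point it shows is that when lines are rendered from known text, the per-character segmentation comes for free.

```python
import struct
import zlib
from pathlib import Path

# Hypothetical 3x5 toy glyphs standing in for a real rasterized font.
TOY_FONT = {
    "a": ["###", "# #", "###", "# #", "# #"],
    "b": ["## ", "# #", "## ", "# #", "## "],
    " ": ["   ", "   ", "   ", "   ", "   "],
}


def render_line(text, font, gap=1):
    """Render text; return (image, cseg) as lists of pixel rows.

    image: 0 = ink, 255 = background.
    cseg: 0 = background, k = pixels of the k-th non-space character
    (placeholder encoding, limited to 255 characters per line).
    """
    height = len(next(iter(font.values())))
    image = [[] for _ in range(height)]
    cseg = [[] for _ in range(height)]
    label = 0
    for ch in text:
        glyph = font[ch]
        if ch != " ":
            label += 1
        for y in range(height):
            for cell in glyph[y]:
                ink = cell == "#"
                image[y].append(0 if ink else 255)
                cseg[y].append(label if ink else 0)
            # Blank columns between neighbouring glyphs.
            image[y].extend([255] * gap)
            cseg[y].extend([0] * gap)
    return image, cseg


def write_png(path, rows):
    """Write rows of 0..255 integers as an 8-bit grayscale PNG."""
    height, width = len(rows), len(rows[0])
    # Each scanline is prefixed with filter type 0 (no filtering).
    raw = b"".join(b"\x00" + bytes(row) for row in rows)

    def chunk(tag, data):
        body = tag + data
        crc = zlib.crc32(body) & 0xFFFFFFFF
        return struct.pack(">I", len(data)) + body + struct.pack(">I", crc)

    header = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    with open(path, "wb") as f:
        f.write(b"\x89PNG\r\n\x1a\n")
        f.write(chunk(b"IHDR", header))
        f.write(chunk(b"IDAT", zlib.compress(raw)))
        f.write(chunk(b"IEND", b""))


def write_book(lines, font, book_dir, page="0001"):
    """Write line image, cseg, and text for every non-empty line."""
    page_dir = Path(book_dir) / page
    page_dir.mkdir(parents=True, exist_ok=True)
    texts = [t for t in lines if t.strip()]
    for i, text in enumerate(texts, start=1):
        image, cseg = render_line(text, font)
        stem = page_dir / f"{i:04d}"
        write_png(f"{stem}.png", image)
        write_png(f"{stem}.cseg.gt.png", cseg)
        Path(f"{stem}.gt.txt").write_text(text, encoding="utf-8")
    return page_dir
```

For example, write_book(["ab a", "ba"], TOY_FONT, "book") would produce book/0001/0001.png, 0001.cseg.gt.png and 0001.gt.txt, and the same for line 0002. A usable version would need to rasterize an actual font and write the csegs in whatever format the training tools read.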

Tom

Keefe

Oct 19, 2009, 9:30:26 PM

to ocropus

I'm generating the lines from a set of training images, not from a set
of strings. I just know where in this particular data set the text is,
so I can grab the text from those regions and dump it to a file. I
still need to correctly transcribe it afterwards. Thanks for the
answers, I will keep my eyes peeled for the next release!
