Chinese

Tom H

Feb 1, 2010, 8:26:17 PM
to ocropus
Hello. I'm trying to train ocropus on Chinese using the code in the
repository as of last week. I'm using the training code in
extras/train-unicode (very cool, btw). After producing the training
files, I ran:

ocropus trainseg my.model out

which took all day but finally produced a model. From the log:

[info] updateModel 236200 samples, 6600 features, 127 classes
[info] updateModel memory status 1755 Mbytes, 1558 Mvalues
[info] training content classifier
[info] [mapped 123 to 53 classes]
[info] mlp training n 47020 nc 53
[info] mlp round 0 err 0.0198 nhidden 80
...
[info] mlp round 7 err 0.0112 nhidden 159
[info] training junk classifier
[info] mlp training n 231200 nc 2
[info] mlp round 0 err 0.0042 nhidden 50
...
[info] mlp round 7 err 0.001 nhidden 23
[info] trained 53140 characters, 2430 lines
[warn] 35120 old csegs
[info] saving my.model

The log also contained a ton of messages of the form "transcript
doesn't agree with cseg (transcript 4, cseg 25)".
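
To gauge how widespread these mismatches are, a small script along the following lines could tally them. This is only a sketch: it assumes the trainseg output was saved to a text file, and that every message has the same shape as the single example quoted above. The function name tally_mismatches is made up for illustration and is not part of ocropus.

```python
import re
import sys
from collections import Counter

# Assumed message shape, based only on the one example quoted in this post.
# \s* tolerates a line wrap between "cseg" and the parenthesis.
MISMATCH = re.compile(
    r"transcript doesn't agree with cseg\s*\(transcript (\d+), cseg (\d+)\)"
)


def tally_mismatches(log_text):
    """Count each (transcript length, cseg count) pair found in the log text."""
    return Counter((int(t), int(c)) for t, c in MISMATCH.findall(log_text))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python tally.py trainseg.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        counts = tally_mismatches(f.read())
    print(sum(counts.values()), "mismatch messages")
    for (t, c), n in counts.most_common(10):
        print(f"transcript {t}, cseg {c}: {n}")
```

Comparing the total against the 2430 lines reported by trainseg would show roughly what share of the training lines is affected.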

But since I had a model, I thought things were ok. Then I ran:

debug=info,transcript cmodel=my.model ocropus lines2fsts out

but every single line in the log read like:

[warn] skipping out/train/0001/0001 (CHECK ocr-line/glclass.cc:1743 Training incomplete for all classes)

I checked out that source location and it's in the LatinClassifier
class!

Three questions:

1. What do those error messages from trainseg mean? How can I get
training to complete?

2. Is lines2fsts correct in using LatinClassifier? I expected
MlpClassifier.

3. Am I doing this right?

Thank you.
