Appending new ground truth to the default language model


Marcin

Aug 19, 2010, 3:06:10 AM
to ocropus
I'm trying to build my own language model by extending the default one
at /usr/local/share/ocropus/models/default.fst. Following the example
of ocropus-linefst and fstutils, I'm doing the following:

import glob
import openfst
import fstutils

fst = openfst.StdVectorFst.Read("/usr/local/share/ocropus/models/default.fst")
filenames = glob.glob("training/*.gt.txt")
for filename in filenames:
    with open(filename) as f:
        for line in f:
            l = line.strip()
            if not l:
                continue  # skip blank lines
            fstutils.add_line(fst, l)

det = openfst.StdVectorFst()
openfst.Determinize(fst, det)  # aborts here
(...)

The rest is truncated because I never get there. The Determinize
function aborts the program with the message:

FATAL: StringWeight::Plus: unequal arguments (non-functional FST?)

Is this even supposed to work? The same crash happens when I run
Determinize on the original model, i.e. without running the for loop
above. I suppose I should load the default model into an Ocropus
container created with ocropy.make_OcroFST(), but then I can't use the
functions in fstutils, which expect StdVectorFst. Does anyone have any
advice here?

Ilya Mezhirov

Aug 19, 2010, 12:45:15 PM
to ocropus
Yes, default.fst can't be determinized. An FST has to satisfy some
conditions (which I don't remember) to be determinizable, but an acyclic
FST should always work. So you can make a word model first,
determinize/minimize it, and then create cycles to get a line model.
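
A minimal sketch of the approach described above, written in plain Python rather than against OpenFST so that it runs on its own. It builds an acyclic word model as a trie and then adds cycles to turn it into a line model. The function names (build_word_trie, add_line_cycles, accepts_line) and the list-of-dicts state representation are illustrative assumptions, not part of OCRopus or pyopenfst. A trie is acyclic and deterministic by construction, but unlike a real determinize/minimize pass it does not merge shared suffixes, and this sketch carries no weights.

```python
def build_word_trie(words):
    """Build an acyclic, deterministic word acceptor as a trie.

    arcs[state] maps a character to the next state; state 0 is the start.
    Returns the arc table and the set of states where a word ends.
    """
    arcs = [{}]
    final = set()
    for word in words:
        state = 0
        for ch in word:
            if ch not in arcs[state]:
                arcs.append({})
                arcs[state][ch] = len(arcs) - 1
            state = arcs[state][ch]
        final.add(state)
    return arcs, final


def add_line_cycles(arcs, final, separators=" ,.!?"):
    """Turn the word model into a line model (modifies arcs in place).

    Every word-final state gets one arc per separator leading back to
    the start state, which becomes the only accepting state.
    """
    for state in final:
        for sep in separators:
            arcs[state][sep] = 0
    return arcs


def accepts_line(arcs, line):
    """Return True if the line is a sequence of (word, separator) pairs."""
    state = 0
    for ch in line:
        if ch not in arcs[state]:
            return False
        state = arcs[state][ch]
    return state == 0


arcs, final = build_word_trie(["a", "aaron", "abacus"])
add_line_cycles(arcs, final)
print(accepts_line(arcs, "a aaron.abacus!"))
print(accepts_line(arcs, "zygote "))
```

The ordering is the point of the sketch: the word model is built while the automaton is still acyclic, and the cycles are added only afterwards.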

Marcin

Aug 20, 2010, 2:37:47 AM
to ocropus
Thanks for your reply Ilya, but I'm afraid I'm still none the wiser
here. I know I can create a deterministic and minimal model from raw
text files, but how do I add it to the default model that comes with
Ocropus? I don't want to have to create a new comprehensive one from
scratch because I don't have enough training data. Are there any other
tools you know of?

Ilya Mezhirov

Aug 20, 2010, 5:33:30 AM
to ocropus
The language model isn't exactly trained, at least AFAIK, more like
constructed.
It's similar to a regexp like ((a | aaron | abacus | ... | zygote)
( |,|.|!|?))* except more complicated and with probabilities on arcs.
One can't just add stuff to it, it has to be recreated from scratch. I
don't know how this is done currently.
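
To make the regexp analogy concrete, here is a small sketch using Python's standard re module. It covers only the unweighted structure (no probabilities on arcs), and build_line_pattern is a hypothetical helper written for this illustration, not an OCRopus function. It also shows why extending the vocabulary means rebuilding: the new word has to go into the alternation and the whole pattern is compiled again.

```python
import re


def build_line_pattern(words, separators=" ,.!?"):
    """Compile a pattern of the form ((w1|w2|...)(sep1|sep2|...))*."""
    word_alt = "|".join(re.escape(w) for w in words)
    sep_alt = "|".join(re.escape(s) for s in separators)
    return re.compile("(?:(?:%s)(?:%s))*" % (word_alt, sep_alt))


words = ["a", "aaron", "abacus", "zygote"]
pattern = build_line_pattern(words)

# A word that is not in the list is rejected ...
print(pattern.fullmatch("ocropus ") is not None)

# ... until the pattern is rebuilt from the extended word list.
extended = build_line_pattern(words + ["ocropus"])
print(extended.fullmatch("ocropus ") is not None)
```

Note that the separators "." and "?" have to be escaped in a real regular expression, which re.escape takes care of here.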

Marcin

Aug 21, 2010, 12:53:10 AM
to ocropus
That's what I feared. It's not the end of the world, though. I can
live with small models created from scratch for now. Thanks again for
your time Ilya.

Tom

Sep 4, 2010, 2:09:02 AM
to ocropus
Well, it's trained, it's just not discriminatively trained. What's
shipping is just a model using word frequencies; we have tried n-gram
with back-off, but that's not ready for prime time yet.

Tom
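
As a rough illustration of what "a model using word frequencies" can mean, the sketch below turns word counts from ground-truth lines into per-word costs. It assumes the cost of a word is the negative log of its relative frequency, the usual convention for weights in OpenFST's tropical semiring; whether the shipped default model was built exactly this way is not stated in this thread. unigram_costs is a hypothetical helper name.

```python
import math
from collections import Counter


def unigram_costs(lines):
    """Map each word to -log(relative frequency) over all given lines."""
    counts = Counter(word for line in lines for word in line.split())
    total = sum(counts.values())
    return {word: -math.log(count / total) for word, count in counts.items()}


costs = unigram_costs(["the cat", "the dog"])
# Frequent words get lower costs than rare ones.
print(costs["the"] < costs["cat"])
```

Such costs could then be attached to the arc or final state that ends each word in a word model.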

Tom

Sep 4, 2010, 2:15:15 AM
to ocropus
For specific questions of what you can and cannot do with OpenFST, you
might also want to try the OpenFST mailing list; people there have
much more experience with creating complex language models.

Tom