Scala FST

61 views
Skip to first unread message

Robert V.

unread,
Oct 28, 2010, 11:20:10 AM10/28/10
to ScalaNLP
Hi,

I'm interested in trying to use your FST library for morphological
analysis (specifically, lemmatization) of English. I'm not really
seeing any documentation - any suggestions on how to get started?

Thanks,
Robert

David Hall

unread,
Oct 28, 2010, 2:50:53 PM10/28/10
to scal...@googlegroups.com
Hey,

It's still very much work in progress, and so I haven't written any
non-javadoc documentation, and even that's pretty sparse. The unit
tests are pretty complete though, and might show you how things are
used in general.

At a high level, we have automata that are objects with type
parameters [W, State, T], where W is a a weight object (say, a double
or a boolean), State is what you're using to represent states, and T
is the label along the arcs. These all require certain evidence
parameters. W must have a Semiring, which gives it plus and times and
such. W must also have a ClassManifest, which isn't too hard to come
by. T needs an Alphabet, which basically provides an in-band Sigma and
Epsilon, which represent "all characters" and "the null character",
respectively. To a first approximation, if you just use Double/Boolean
for W and Char for T, things work out nicely.

Automata are composed of states and arcs, in the usual way. To create
an automata, you'll need to provide a set of arc. See the Unit Tests
for some simple examples of where I do that. FSTs are automata that
have input and output labels. They're just a type-alias for
Automaton[W,State,(Input,Output)].

To install:
1) clone https://dl...@github.com/dlwh/scalanlp-core.git
2) checkout the "graphs" branch. // basically you need the graphs
library, which is also WIP.
3) mvn install
4) clone git://github.com/dlwh/scalanlp-fst.git
5) mvn install that.

Useful operations:
a and b are automata, f is an fst.

a & b = weighted intersection
f >> f = fst composition
f.outputProjection = automaton with only output labels.
f.inputProjection = similar
a.cost = sum of all paths through the automaton. If a represents
something like a probability over words, returns \sum_word p(word|a)
a.relabel = relabel states with integers. this typically makes
automata more efficient.

There's also minimization, determinization, epsilon removal, and the
like. Look at the unit tests for how those work. There are a few
automata/transducers that are provided, notably edit distance.

The algorithms are based largely on Mohri's work:
www.cs.nyu.edu/~mohri/pub/hwa.pdf , with the addition of expectation
semirings, which are based on some of Jason Eisner's work. These last
can be used to compute things like "expected number of transitions
that output b." Let me know if that interests you.

Hope that helps, and let me know if you have any questions.

-- David

> --
> You received this message because you are subscribed to the Google Groups "ScalaNLP" group.
> To post to this group, send email to scal...@googlegroups.com.
> To unsubscribe from this group, send email to scalanlp+u...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/scalanlp?hl=en.
>
>

Reply all
Reply to author
Forward
0 new messages