Java Universal Levenshtein


Courtney Falk

Mar 8, 2010, 10:04:27 PM
to mo...@googlegroups.com
All:

Within the last week or so I've put together a working Java version of a
universal Levenshtein automaton. This implementation requires the use
of a dictionary automaton for processing. Both of these classes and
some other tools are packaged together into a single library. The JAR
itself can be found at:

http://www.infiauto.com/projects/datastr/

There's also some minimal JavaDoc documentation. I'll improve it as I
polish the code.

http://www.infiauto.com/projects/datastr/javadoc/

The JAR also contains pre-built Levenshtein automata of distance 1, 2,
and 3. Additional distances can be generated at the command line.

I'll include a quick guide here on using it. Let's say your dictionary
is kept in a plaintext file called dictionary.txt, where each line is a
word in the dictionary. First, we need to convert that into an instance
of DictionaryAutomaton:

java -cp infiauto-datastr-0.1.0.jar
com.infiauto.datastr.auto.DictionaryAutomaton -g dictionary.txt working.dict

Now we'll search it for all words that match the string "garbled" within
an edit distance of 2:

java -cp infiauto-datastr-0.1.0.jar
com.infiauto.datastr.auto.LevenshteinAutomaton -l 2 working.dict garbled
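
To make concrete what that query means, here is a brute-force version of the
same lookup in plain Java: the textbook dynamic-programming edit distance
applied to every word in dictionary.txt. It doesn't touch the library at all
(all of the names are my own); it's exactly the per-word work the automaton
is meant to avoid.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class BruteForceFuzzyLookup {

    // Textbook dynamic-programming Levenshtein distance
    // (insertions, deletions, and substitutions all cost 1).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) throws IOException {
        // dictionary.txt: one word per line, as in the example above.
        List<String> words =
                Files.readAllLines(Paths.get("dictionary.txt"), StandardCharsets.UTF_8);
        for (String word : words) {
            if (levenshtein("garbled", word) <= 2) {
                System.out.println(word);
            }
        }
    }
}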

The code is still rough. I'll be cleaning and improving it as I go.
Unit testing is not something I've worked with before, so that may need to
be added as well. I look forward to any feedback.


Courtney Falk
courtn...@gmail.com

Robert Muir

Mar 8, 2010, 11:31:03 PM
to mo...@googlegroups.com
Courtney, this is interesting. Is this the same as the Schulz & Mihov
algorithm? Or is this the Mitankin one, or both?

I couldn't see the source code, but I looked at the jar, and in case
it is the same, I wanted to let you know what we did with Schulz and
Mihov. We had the same problem of these offline-calculated tables
being very large.

Instead, we packed these tables into long[] arrays directly in the
source code (we use the moman Python implementation to do the actual
offline calculations and generate Java source code from it).

This gave us substantially smaller tables: below are the latest sizes.

n=1 1,951 bytes (1,099 bytes compressed in the jar)
n=2 5,316 bytes (3,377 bytes compressed in the jar)
n=3 103,065 bytes (59,252 bytes compressed in the jar)

You could potentially use a similar approach if you are concerned
about the space of these tables.
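
Purely to illustrate the packing idea (this is not the actual moman/Lucene
table layout; the field widths and values below are invented for the example),
each table entry can be squeezed into a long and decoded with shifts and
masks, so the whole table lives as one compile-time constant in the generated
source:

public class PackedTableSketch {

    // Invented layout for illustration: low 16 bits = next state,
    // bits 16-23 = position offset, bit 24 = accepting flag.
    // A real generator picks field widths to fit its own data.
    static final long[] TRANSITIONS = {
        0x00010002L,   // next state 2, offset 1, not accepting
        0x00010003L,   // next state 3, offset 1, not accepting
        0x01020005L,   // next state 5, offset 2, accepting
    };

    static int nextState(long entry)      { return (int) (entry & 0xFFFFL); }
    static int positionOffset(long entry) { return (int) ((entry >>> 16) & 0xFFL); }
    static boolean isAccepting(long entry) { return ((entry >>> 24) & 1L) != 0; }

    public static void main(String[] args) {
        for (long entry : TRANSITIONS) {
            System.out.println("next=" + nextState(entry)
                    + " offset=" + positionOffset(entry)
                    + " accepting=" + isAccepting(entry));
        }
    }
}

Since the array is an ordinary constant in a .java file, it compresses well
inside the JAR and needs no deserialization step at load time.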

You can see our latest code/table generator here if you are curious:
https://issues.apache.org/jira/secure/attachment/12437765/createLevAutomata.py

--
Robert Muir
rcm...@gmail.com

Courtney Falk

Mar 9, 2010, 4:40:39 PM
to mo...@googlegroups.com
Robert,

The implementation I've posted is based on the later Mitankin work.

So my offline-calculated data are just instances of the
LevenshteinAutomaton class that I write out via ObjectOutputStream
(LevenshteinAutomaton implements java.io.Serializable). My sizes for
uncompressed results are larger than yours:

n=1 9,425 bytes
n=2 213,768 bytes
n=3 6,050,980 bytes
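
The ObjectOutputStream round-trip mentioned above is nothing special; a
minimal sketch (leaving out how the automaton itself is constructed) works
for any Serializable object:

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SerializationHelper {

    // Write any Serializable object (e.g. a pre-built LevenshteinAutomaton) to disk.
    static void save(Serializable object, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(object);
        }
    }

    // Read it back; the caller casts to the expected type.
    static Object load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Round-trip a stand-in object; a real automaton instance is used the same way.
        save("placeholder", Paths.get("automaton.ser"));
        String restored = (String) load(Paths.get("automaton.ser"));
        System.out.println(restored);
    }
}

Wrapping the streams in GZIPOutputStream/GZIPInputStream would be one cheap
way to cut the on-disk sizes, similar to what the JAR compression already does.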

However, I haven't optimized the LevenshteinAutomaton yet. This 0.1.0
version is the first version that works reliably. My next step will be
to remove unnecessary code, etc. I will say, though, that I couldn't
generate an n=4 automaton with the default VM memory settings. I might
try that in the future, but the result will probably be too big to
practically contain in a JAR, plus it might kill my poor little Eee PC.
But realistically, you shouldn't see more than 3 edit corrections
required for anything written in a Latin script. I don't know how that
will compare to source code though.


Court
