
[historical] Invoking a Lisp compiler


Matthew Economou

Jan 21, 1999
Can anyone tell me (or point me to the relevant history) how the Lisp
compiler has traditionally been invoked? I've found some
documentation on the internals of a Multics Lisp compiler called
"lcp", but no manual page. The modern Lisps I've run across
(e.g. Allegro CL, GCL, Chez Scheme) don't have a "batch" mode, per se
(e.g., a non-interactive program like "cc").

--
"When I was a kid, I used to think that Dammit was God's last name, just like
Christ is Jesus' last name." - Kimberly Chapman in rec.humor.oracle.d

Erik Naggum

Jan 22, 1999
* Matthew Economou <xeno...@irtnog.org>

| Can anyone tell me (or point me to the relevant history) how the Lisp
| compiler has traditionally been invoked? I've found some documentation
| on the internals of a Multics Lisp compiler called "lcp", but no manual
| page. The modern Lisps I've run across (e.g. Allegro CL, GCL, Chez
| Scheme) don't have a "batch" mode, per se (e.g., a non-interactive
| program like "cc").

well, I'm not too sure about this part of Lisp lore, but since the
function is COMPILE-FILE, I'd assume that it's called much like any other
function in the listener that effectively replaces the shell command line.

other than that, there is indeed a batch mode in Allegro CL. e.g., I
build a new Allegro CL image with this Makefile:

allegro.dxl: allegro libacl503.so customization.fasl make-allegro.cl update code
	./allegro -batch -e '(load "make-allegro.cl")' -kill
	mv --force --backup --version-control=t new-lisp.dxl allegro.dxl

allegro:
	ln -s lisp allegro
	cp -p lisp.dxl allegro.dxl

customization.fasl: allegro ../custom/*.cl
	./allegro -batch -e '(load "../custom/customization.cl")' -kill

and customization.cl goes like this:

(defsystem :customization
    (:pretty-name "Customizations for Naggum Software"
     :default-pathname #p"/home/franz/custom/")
  (:serial "readtable"
           "custom"
           "naggum-software"
           "defsys-fixes"
           "excl-fixes"
           "tpl-fixes"
           "format-fixes"
           "filesys-fixes"
           "socket-fixes"))

(load-system :customization :silent t :compile t :no-warn t)
(concatenate-system :customization "customization.fasl")

incidentally, I forget why I did LOAD-SYSTEM instead of COMPILE-SYSTEM.
there is a reason to it.

#:Erik
--
SIGTHTBABW: a signal sent from Unix to its programmers at random
intervals to make them remember that There Has To Be A Better Way.

Matthew X. Economou

Jan 22, 1999
>>>>> "EN" == Erik Naggum <er...@naggum.no> writes:

EN> well, I'm not too sure about this part of Lisp lore, but
EN> since the function is COMPILE-FILE, I'd assume that it's
EN> called much like any other function in the listener that
EN> effectively replaces the shell command line.

Would it be sensible to have a Lisp compiler that could be invoked
similarly to e.g. CC under UNIX? I would suppose that running the
command "lcp somefile.lsp" would be semantically equivalent to the
(COMPILE-FILE "./somefile.lsp"), in this case.

EN> other than that, there is indeed a batch mode in Allegro CL.
EN> e.g., I build a new Allegro CL image with this Makefile:

This prompts more questions related to compilers (and the external
environment).

EMACS uses the term "Lisp image" when describing the last stage in its
build process. The build process, in this final step, starts up the
EMACS Lisp engine (src/temacs) and loads all of the various editing
functions into memory. It then forces a core dump of this running
system, and the core file is then "undumped" (their terminology) into
a runnable executable (src/emacs).

Is this (effectively) what "build a new Allegro CL image" means, or
does Allegro (and, I'm assuming, other modern Lisp implementations) do
something more like the typical C or FORTRAN compiler?

(I looked up "Lisp image" in the HyperSpec and it doesn't seem to
specify this kind of build model. This is probably a good thing.)

The related question is, can one generate DLLs and whatnot with
Allegro (and/or other modern Lisp implementations)? Can one call DLLs
from within Allegro, and does the spec say anything about this? I'm
curious, because every use I've put Lisp to (AI, a Scheme
meta-interpreter for a class on programming language concepts, and
EMACS) invokes this "self-containedness" property (i.e. one doesn't
really call upon external libraries, just uses already available
facilities), whereas languages like Ada, Perl, C, FORTRAN, and Pascal
seem to be very good at mixing code (especially on platforms like VMS
and UNIX).

(The reason I'm asking is that I noticed a Java front end in the EGCS
code repository. My assumption, possibly incorrect, is that the Java
front end generates the usual output that can be fed to standard tools
like as, ar, and ld, just like GCC and G++. If Java -- a (in my mind)
primarily interpreted language, but with features like just-in-time
compilation and byte compilation -- if Java can be compiled into
object files, then linked into programs or libraries just like any C
or C++ program, why couldn't the same be done with Lisp or Scheme?

I seem to be having a difficult time picturing such a Lisp compiler
though; all my experience with Lisp has been with interpreted
environments like EMACS, GCL, or Chez Scheme. _Lisp in Small Pieces_
arrived in the mail today, so I hope to clear up some of my confusion
about Lisp compilers.)

--
'Fed up with excitement? Bored with fun? Too many stimuli? Let waves of
boredom wash over you with "The Incredibly Dull Cam".'
-- (pointer at http://sg1.pcy.kcl.ac.uk/~foop/dullat.html)

Erik Naggum

Jan 23, 1999
* "Matthew X. Economou" <xeno...@irtnog.org>

| Would it be sensible to have a Lisp compiler that could be invoked
| similarly to e.g. CC under UNIX?

no, because the results aren't "linked" and "executed" from the
command-line, except, of course, in a similar "batch" mode.

note, however, that the command-line is really an interpreter and calls
out to functions that are loaded from disk every time. usually, this
interpreter has some strange syntax for sub-evaluation and has only a few
simple control structures, cannot use the value from the functions it has
called except through arcane syntax, and usually you have to deal with
their "output" through I/O, not actually any values per se. re-using the
output of a function may also need a lot of work because it has been
optimized for people, not for re-use. I prefer the REPL in Lisp...

| EMACS uses the term "Lisp image" when describing the last stage in its
| build process. The build process, in this final step, starts up the
| EMACS Lisp engine (src/temacs) and loads all of the various editing
| functions into memory. It then forces a core dump of this running
| system, and the core file is then "undumped" (their terminology) into a
| runnable executable (src/emacs).

that last part is wrong. the executable is not built from a core dump.
by that time, too much is actually lost to be useful in building an
executable on most operating systems. instead, the information in memory
is written out in a way and an order that causes it to be loaded into
memory the same way. basically, the process is that of preparing to make
the operating system re-create the conditions that exist in the image at
dump time.

| Is this (effectively) what "build a new Allegro CL image" means, or does
| Allegro (and, I'm assuming, other modern Lisp implementations) do
| something more like the typical C or FORTRAN compiler?

no, they do basically the same thing: load it into memory, then write
memory to disk in a way that causes it to be loaded in the same way.
note that a C program doesn't necessarily know what it will look like in
memory until actually loaded. there are lots of "loose ends" that the
operating system and the dynamic object loader take care of.

| The related question is, can one generate DLLs and whatnot with Allegro
| (and/or other modern Lisp implementations)?

yes. however, "DLL" is a Windows thing. it is as yet not possible for
end users to build shared objects or shared libraries, the Unix way.

| Can one call DLLs from within Allegro, and does the spec say anything
| this?

which spec? the documentation for Allegro CL details a foreign interface
that I believe is fully capable of using any DLL the system provides.

| I'm curious, because every use I've put Lisp to (...) invokes this "self-
| containedness" property (...), whereas languages like ADA, Perl, C,
| FORTRAN, and Pascal seem to be very good at mixing code (...).

Common Lisp implementations are better at this than most other languages
-- they have to be, by virtue of being in a minority. however, Lisp has a
much different concept of what constitutes "interface" than C does, and
the C calling conventions are sorely lacking in the needs that Lisp has.
also, of course, statically typed languages discard type information by
execution time, and Lisp keeps it around in the pointers and such.

calling out to foreign functions is no big deal, you only need to declare
them properly, but that is no news to people from the languages you need
to talk to. if the language is any good (and most aren't :), you should
be able to generate the interface code in Lisp automatically. however,
there is often insufficient information on how the run-time system calls
its own functions to make it possible to do this without actually
compiling some interface functions and going through an expensive
interface that has known properties.

| I seem to be having a difficult time picturing such a Lisp compiler
| though; all my experience with Lisp has been with interpreted
| environments like EMACS, GCL, or Chez Scheme.

Emacs and GCL have compilers and you really need them. don't know about
Chez Scheme, but from what I have read, it does indeed compile files and
all that stuff.

how do you picture a C compiler? would it be harder if you could type in
C code to an interactive environment and see results immediately?

BTW, I think you confuse "interpreted" with "interactive". you really
don't know what happens when you have typed in an expression. for all
you know or care, the system doesn't do (loop (print (eval (read)))), but
(loop (print (funcall (compile nil `(lambda () ,(read))))))

Rainer Joswig

Jan 23, 1999
In article <w4og192...@nemesis.irtnog.org>, "Matthew X. Economou" <xeno...@irtnog.org> wrote:

> The related question is, can one generate DLLs and whatnot with
> Allegro (and/or other modern Lisp implementations)?

LispWorks for Windows supports compiling to DLLs.

> I seem to be having a difficult time picturing such a Lisp compiler
> though; all my experience with Lisp has been with interpreted
> environments like EMACS, GCL, or Chez Scheme.

Hmm, doesn't GCL have a compiler (it compiles to C)? And doesn't
Chez Scheme have a compiler, too?

--
http://www.lavielle.com/~joswig

Martin Rodgers

Jan 23, 1999
In article <w4og192...@nemesis.irtnog.org>, xeno...@irtnog.org
says...

> Would it be sensible to have a Lisp compiler that could be invoked

> similarly to e.g. CC under UNIX? I would suppose that running the
> command "lcp somefile.lsp" would be semantically equivalent to the
> (COMPILE-FILE "./somefile.lsp"), in this case.

You can do this with CLISP. Here's an example:

all : nodes.fas

nodes.fas : nodes.lisp
	clisp -c nodes.lisp

The clisp script invokes CLISP for me.
--
Remove insect from address to email me | You can never browse enough
will write code that writes code that writes code for food

Erik Naggum

Jan 23, 1999
* "Matthew X. Economou" <xeno...@irtnog.org>
| Would it be sensible to have a Lisp compiler that could be invoked
| similarly to e.g. CC under UNIX?

just to update my previous response. Allegro CL 5.0 has more command-line
"punch" than I related in the previous article. I wrote the build stuff
for 4.3 and never really took the time to update it. Allegro CL 5.0
takes some new command-line options, listed here with the old equivalent
option, minus the necessary quoting the shell needs.

-C file -e (compile-file file)
-L file -e (load file)

some probably find this a lot easier to deal with, especially from
scripts.

Raffael Cavallaro

Jan 24, 1999
In article <31260882...@naggum.no>, Erik Naggum <er...@naggum.no> wrote:

> BTW, I think you confuse "interpreted" with "interactive". you really
> don't know what happens when ypu have typed in an expression. for all
> you know or care, the system doesn't do (loop (print (eval (read)))), but
> (loop (print (funcall (compile nil `(lambda () ,(read))))))

Exactly. For example, MCL does this and only this (i.e., there is no
interpreter, just an incremental compiler fast enough to be
indistinguishable from an interpreter). Everything the user enters
interactively (at the top-level prompt) is compiled to native PPC code
immediately.

raf

--
Raffael Cavallaro

Rob Warnock

Jan 25, 1999
Erik Naggum <er...@naggum.no> wrote:
+---------------

| | The related question is, can one generate DLLs and whatnot with Allegro
| | (and/or other modern Lisp implementations)?
|
| yes. however, "DLL" is a Windows thing. it is as yet not possible for
| end users to build shared objects or shared libraries, the Unix way.
+---------------

Actually, the latest version of MzScheme provides a compiler-to-C ["mzc"]
that allows compiling Scheme code into a MzScheme "extension" which in Unix
is a ".so" file (a.k.a. DSO). Such files can later be linked into a running
MzScheme session with the [non-standard] call (load-extension "foo.so").


-Rob

-----
Rob Warnock, 8L-855 rp...@sgi.com
Applied Networking http://reality.sgi.com/rpw3/
Silicon Graphics, Inc. Phone: 650-933-1673
2011 N. Shoreline Blvd. FAX: 650-964-0811
Mountain View, CA 94043 PP-ASEL-IA

Erik Naggum

Jan 25, 1999
* rp...@rigden.engr.sgi.com (Rob Warnock)

| Actually, the latest version of MzScheme provides a compiler-to-C ["mzc"]
| that allows compiling Scheme code into a MzScheme "extension" which in Unix
| is a ".so" file (a.k.a. DSO). Such files can later be linked into a running
| MzScheme session with the [non-standard] call (load-extension "foo.so").

the question was about Allegro CL, not about MzScheme.

personally, I don't think "Scheme" is the answer to _any_ Common Lisp
question to begin with, but that may just be me.

Matthew X. Economou

Jan 25, 1999
>>>>> "RW" == Rob Warnock <rp...@rigden.engr.sgi.com> writes:

RW> Actually, the latest version of MzScheme provides a
RW> compiler-to-C ["mzc"] that allows compiling Scheme code into a
RW> MzScheme "extension" which in Unix is a ".so" file
RW> (a.k.a. DSO). Such files can later be linked into a running
RW> MzScheme session with the [non-standard] call (load-extension
RW> "foo.so").

Can you point me to some sources? I'd like to take a look at how it
handles linking and the runtime.

Matthew X. Economou

Jan 25, 1999
>>>>> "EN" == Erik Naggum <er...@naggum.no> writes:

EN> the question was about Allegro CL, not about MzScheme.

Actually, my original question included Chez Scheme. :)

EN> personally, I don't think "Scheme" is the answer to _any_
EN> Common Lisp question to begin with, but may just be me.

Note to self: Never, ever, get on Erik's bad side.

Erik Naggum

Jan 26, 1999
* "Matthew X. Economou" <xeno...@irtnog.org>
| Note to self: Never, ever, get on Erik's bad side.

is it not just great, then, that I tend to forget _who_ did something
that annoyed me? :)

Rob Warnock

Jan 26, 1999
Matthew X. Economou <xeno...@irtnog.org> wrote:
+---------------

| RW> Actually, the latest version of MzScheme provides a
| RW> compiler-to-C ["mzc"] that allows compiling Scheme code into a
| RW> MzScheme "extension" which in Unix is a ".so" file ...

|
| Can you point me to some sources? I'd like to take a look at how it
| handles linking and the runtime.
+---------------

Start here:

http://www.cs.rice.edu/CS/PLT/packages/mzscheme/

Barry Margolin

Jan 26, 1999
In article <31263050...@naggum.no>, Erik Naggum <er...@naggum.no> wrote:
>* "Matthew X. Economou" <xeno...@irtnog.org>
>| Note to self: Never, ever, get on Erik's bad side.
>
> is it not just great, then, that I tend to forget _who_ did something
> that annoyed me? :)

I'll bet you still remember me. :)

--
Barry Margolin, bar...@bbnplanet.com
GTE Internetworking, Powered by BBN, Burlington, MA

Erik Naggum

Jan 26, 1999
* Barry Margolin <bar...@bbnplanet.com>

| I'll bet you still remember me. :)

*laugh* yeah, I do. :)

Dieter Menszner

Jan 28, 1999

I am looking for a Lisp function which does a fuzzy string match
e.g.

(fuzzy-match "document" "documet")
=> t

Its purpose is to be typing error tolerant.

Does anybody know of some downloadable source code ?

Dieter Menszner


Sunil Mishra

Jan 28, 1999
mens...@t-online.de (Dieter Menszner) writes:

I've written a partial set of routines for doing string search that you
could potentially adapt to do string matching. In your case, you seem to
want to handle substitution, deletion and insertion errors. This is what I
had done for string search allowing for these errors:

(defun smvs-match-first-item (pattern text evaluator key shifts errors)
  (loop with pattern-item = (funcall key (aref pattern 0))
        for text-index from 0 below (length text)
        for text-item = (funcall key (aref text text-index))
        do (setf (aref shifts text-index) text-index)
           (setf (aref errors text-index)
                 (funcall evaluator pattern-item text-item))))

(defun smvs-match-next-item (pattern text evaluator key error-limit
                             shifts last-shifts errors last-errors pattern-index)
  (flet ((eval-delete-both (pre-delete-error pattern-item text-item)
           (if (> pre-delete-error error-limit)
               pre-delete-error
               (+ pre-delete-error (funcall evaluator pattern-item text-item))))
         (save-match-result (shift error text-index)
           (setf (aref shifts text-index) shift)
           (setf (aref errors text-index) error)))
    (declare (inline eval-delete-both save-match-result))
    (loop with pattern-item = (funcall key (aref pattern pattern-index))
          for text-index from 1 below (length text)
          for text-item = (funcall key (aref text text-index))
          for delete-text-shift = (aref shifts (1- text-index))
          and delete-text-error = (1+ (aref errors (1- text-index)))
          and delete-pattern-shift = (aref last-shifts text-index)
          and delete-pattern-error = (1+ (aref last-errors text-index))
          and delete-both-shift = (aref last-shifts (1- text-index))
          and delete-both-error = (eval-delete-both
                                   (aref last-errors (1- text-index))
                                   pattern-item text-item)
          do (if (> delete-text-error delete-pattern-error)
                 (if (> delete-pattern-error delete-both-error)
                     (save-match-result delete-both-shift delete-both-error text-index)
                     (save-match-result delete-pattern-shift delete-pattern-error text-index))
                 (if (> delete-text-error delete-both-error)
                     (save-match-result delete-both-shift delete-both-error text-index)
                     (save-match-result delete-text-shift delete-text-error text-index))))))

(defun string-match-variable-size (pattern text handler evaluator key error-limit)
  ;; Using dynamic programming.
  ;; I don't have to keep around the entire substring match matrix, since
  ;; the match only depends on the current and the last row. This allows
  ;; for an enormous reduction in memory use for large patterns.
  (let* ((pattern-length (length pattern))
         (text-length (length text))
         (shifts (make-array text-length :element-type 'fixnum))
         (last-shifts (make-array text-length :element-type 'fixnum))
         (errors (make-array text-length))
         (last-errors (make-array text-length)))
    ;; Ignore first column
    (setf (aref last-errors 0) (1+ error-limit))
    ;; Initialization - match first pattern item
    (smvs-match-first-item pattern text evaluator key shifts errors)
    ;; Expansion - build on first match
    (loop for pattern-index from 1 below pattern-length
          do (psetq shifts last-shifts
                    last-shifts shifts
                    errors last-errors
                    last-errors errors)
             (setf (aref errors 0) (1+ error-limit))
             (smvs-match-next-item pattern text evaluator key error-limit
                                   shifts last-shifts errors last-errors
                                   pattern-index))
    ;; Collection - call handler
    (loop with last-shift = -1
          for text-index from (if (> pattern-length 1) 1 0) below text-length
          for shift = (aref shifts text-index)
          when (and (<= (aref errors text-index) error-limit)
                    (> shift last-shift))
            do (setq last-shift shift)
               (funcall handler pattern text shift (1+ text-index)))))

Kent M Pitman

Jan 28, 1999
mens...@t-online.de (Dieter Menszner) writes:

> I am looking for a Lisp function which does a fuzzy string match
> e.g.
>
> (fuzzy-match "document" "documet")
> => t
>
> Its purpose is to be typing error tolerant.

Excellent goal. I wish more people cared about such things.



> Does anybody know of some downloadable source code ?

I don't. However...

A long time ago (around 1979) I had this problem to solve. I wrote a
trivial program that scored matches by counting 1 point for every char
in the right position and 0.5 for every char that was off by one and
that scaled the result by the differences in the lengths. so, for
example,

d o c u m e n t
d o c u m e t

1 1 1 1 1 1 0.5 = 6.5 out of possible 8 match chars


So (6.5/8 match chars) * (7/8 match length) = 71.1% match total

or

d o c u m e n t
d o c u m n e t

1 1 1 1 1 0.5 0.5 1 = 7/8 match chars

* 8/8 match length = 87.5% match

I found that you rarely get more than one match scoring higher than 0.5
in a small finite set that is relatively varied in nature. That is, if
you're trying to match one of "document" or "frog" or "apple".
If you have a lot of words that are much the same, you may need to
sort by bestness when you get scores for each.
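[Editorial aside: the metric Kent describes is simple enough to sketch
directly in Common Lisp. This is only a guess at the details he doesn't
state (tie handling, which string anchors the comparison), and the name
FUZZY-SCORE is made up; it is not his original code.]

```lisp
;; A sketch of the ad hoc metric described above, NOT Kent's original
;; code: score 1 point per character of the shorter string matching the
;; longer string at the same position, 0.5 per character matching one
;; position away, then scale by the ratio of the lengths.
(defun fuzzy-score (s1 s2)
  (let* ((long  (if (>= (length s1) (length s2)) s1 s2))
         (short (if (>= (length s1) (length s2)) s2 s1))
         (ln (length long))
         (points 0))
    (dotimes (i (length short))
      (let ((c (char short i)))
        (cond ((char= c (char long i)) (incf points 1))
              ((or (and (< (1+ i) ln) (char= c (char long (1+ i))))
                   (and (plusp i) (char= c (char long (1- i)))))
               (incf points 1/2)))))
    ;; (points / longer-length) * (shorter-length / longer-length)
    (float (* (/ points ln) (/ (length short) ln)))))

;; (fuzzy-score "document" "documet")  => 0.7109375  (Kent's 71.1%)
;; (fuzzy-score "document" "documnet") => 0.875      (Kent's 87.5%)
```

Both worked examples from the post come out as stated under this reading
of the scoring rules.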

In case anyone is curious, I used the algorithm for allowing a user
interface to a game I'd written to do things like:


PROGRAM: Is your animal a frgo?

USER: You misspelled frog.

PROGRAM: Ok. Is your animal a frog?

The program would match "frog" against each of the words in the
previous query and guess that "frog" was replacing "frgo" because
it had a good "closeness" to the given word according to the
metric that I've described above.

I later also generalized it to work with PDP10 Macsyma input so
you could get an interaction like:

USER MIS-TYPES: intergate(sin(x),x);

SYSTEM DOESN'T UNDERSTAND "intergate" (should have been "integrate")

USER PRESSES CONTROL-BACKSLASH

SYSTEM RESPONDS: What did you misspell?

USER TYPES: integrate;

SYSTEM RETRIES "integrate(sin(x),x);"

I note as an aside that Zork also had a command interface that
allowed you to say "oops <word>" when you misspelled a word.
I don't know what algorithm they used to match which word to
replace, but I strongly approve of the interface for not having
required you to mention both the old and new word.
A technique of this kind is nice because it minimizes
gratuitous syntax that the user has to type. No reasonable
person would ask "which word should be spelled `frog'"
and so, I claim, one of the problems with the programs most
people write is that they think it's reasonable for a program
to either ask that question or to ask "what word should I
replace?"

I don't have code. But if the problem is that you just
didn't have a theory of how to do the match at all, then
maybe the above gives you enough clue of how you might
do it. Then again, my approach was entirely ad hoc
and you miht be able to think of another way of doing it
better. Or someone might have some formal theory of
what it means to do an "exact" fuzzy match (if that's
not an oxymoron)... in which case, the above might
be useless to you. But it's the best I had to offer
since I'm too lazy to actually write the extremely
straightforward code to implement it today...
Good luck...
--Kent

David B. Lamkins

Jan 28, 1999
In article <79hoa...@0796153227-1000.dialin.t-online.de> ,
mens...@t-online.de (Dieter Menszner) wrote:

>
> I am looking for a Lisp function which does a fuzzy string match
> e.g.
>
> (fuzzy-match "document" "documet")
> => t
>
> Its purpose is to be typing error tolerant.
>

> Does anybody know of some downloadable source code ?

There was an interesting article about computing the "Levenshtein distance"
between two strings presented by Ray Valdés in Dr. Dobb's Journal almost
seven years ago.

I had written a Pascal implementation to try it out, and I still have that
code in my archives. Here's the header comment, which has all the source
references:

{Compute the Levenshtein distance between a pair of strings.}
{Adapted from C code presented in "Finding String Distances",}
{Ray Valdés, Dr. Dobb's Journal, April 1992, pp. 56-62, 107.}
{Algorithm due to V.I. Levenshtein, as presented in "Time Warps,}
{String Edits, and MacroMolecules: The Theory and Practice of}
{Sequence Comparison", Sankoff and Kruskal, eds., Addison-Wesley,}
{1983. Macintosh implementation for THINK Pascal by D.B.Lamkins.}

The API has a call to initialize the algorithm using four values to specify
the cost of matching, insertion, deletion, and transposition. Then you call
the comparison function with a pair of strings, and the algorithm returns a
distance based upon the total cost of the edits required to transform one
string into another (and, optionally, a list of the edit operations needed.)

You could get your fuzzy string match by computing the Levenshtein distance
and comparing the result to some threshold.
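[Editorial aside: the core of the algorithm fits in a few lines of
Common Lisp. This is a minimal sketch from the description above, with
unit costs for insertion, deletion, and substitution only; it is not a
translation of the Pascal code, and LEVENSHTEIN/FUZZY-MATCH are names
chosen here for illustration.]

```lisp
;; Two-row dynamic-programming Levenshtein distance: PREV holds edit
;; distances for the previous character of A, CURR for the current one.
(defun levenshtein (a b)
  (let* ((n (length a))
         (m (length b))
         (prev (make-array (1+ m)))
         (curr (make-array (1+ m))))
    (dotimes (j (1+ m)) (setf (aref prev j) j))
    (dotimes (i n)
      (setf (aref curr 0) (1+ i))
      (dotimes (j m)
        (setf (aref curr (1+ j))
              (min (1+ (aref curr j))         ; insertion
                   (1+ (aref prev (1+ j)))    ; deletion
                   (+ (aref prev j)           ; substitution (0 if equal)
                      (if (char= (char a i) (char b j)) 0 1)))))
      (rotatef prev curr))
    (aref prev m)))

;; (levenshtein "document" "documet") => 1, so thresholding gives the
;; FUZZY-MATCH behavior asked for at the top of the thread.
(defun fuzzy-match (a b &optional (threshold 2))
  (<= (levenshtein a b) threshold))
```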

I thought I had translated the algorithm into Common Lisp at one point, but
I can't find it now. I'd be happy to send or post the source if someone
wants to take a shot at the translation, or you could check out the original
sources referenced above.


--
David B. Lamkins <http://www.teleport.com/~dlamkins/>

There are many ways to abbreviate something, but only one way not to.

Will Hartung

Jan 29, 1999

>and you >miht< be able to think of another way of doing it


m i g h t
m i h t
1 1 0.5 0.5
(3/5) * (4/5) = .48

*FAIL*

"I'm sorry, I don't understand 'miht'" :-)


Not super relevant, and I may be mistaken, but you may be able to find
suitable code in the Squeak Smalltalk source code.

I haven't played with it recently, but the original Smalltalk had a
"spelling corrector" for when you misspelled variables and whatnot. When it
detected a misspelling, it popped up a list of possible matches.

Assuming that Squeak still has the functionality, then the source would be
in the distribution. Yeah, it's in Smalltalk, but it's better than nothing.

You can find Squeak at http://www.create.ucsb.edu/squeak/

It could be a wild goose chase, but may be worth looking into. (Or for the
lazy, get on their mailing list or over on c.l.smalltalk and see if someone
can strip out the pertinent code.)

> But it's the best I had to offer
>since I'm too lazy to actually write the extremely
>straightforward code to implement it today...
>Good luck...
> --Kent

What he said.


paul_...@scientia.com

Jan 29, 1999
In article <79hoa...@0796153227-1000.dialin.t-online.de>,
mens...@t-online.de (Dieter Menszner) wrote:
>
> I am looking for a Lisp function which does a fuzzy string match
> e.g.
>
> (fuzzy-match "document" "documet")
> => t
>
> Its purpose is to be typing error tolerant.
>
> Does anybody know of some downloadable source code ?


Gnus, the emacs newsreader, does some fuzzy matching of this kind for the
purposes of analysing news headers. This is written in elisp and is available
almost anywhere (and presumably subject to the GPL). It may be of some help to
you.

-----------== Posted via Deja News, The Discussion Network ==----------
http://www.dejanews.com/ Search, Read, Discuss, or Start Your Own

Robert Bechtel

Jan 29, 1999
Interlisp. Spelling correction as part of the DWIM (Do What I Mean)
package. Doubt that source is available, but if you can track down a
manual, there is a description of the algorithm used, as well as
descriptions of the functions that make up the spelling corrector.

While there's a lot more to it, the first paragraph of section 15.7.4 of
the October 1983 _Interlisp_Reference_Manual_ by Xerox says:

"The basic philosophy of DWIM spelling correction is to count the number
of disagreements between two words, and use this number divided by the
length of the longer of the two words as a measure of their relative
disagreement. One minus this number is then the relative agreement or
closeness. For example, CONS and CONX differ only in their last
character. Such substitution errors count as one disagreement, so that
the two words are in 75% agreement. Most calls to the spelling
corrector specify a relative agreement of 70, so that a single
substitution error is permitted in words of four characters or longer.
However, spelling correction on shorter words is possible since certain
types of differences such as single transpositions are not counted as
disagreements. For example, AND and NAD have a relative agreement of
100. Calls to the spelling corrector from DWIM use the value of
FIXSPELLREL, which is initially 70. Note that by setting FIXSPELLREL to
100, only spelling corrections with "zero" mistakes, will be considered,
e.g., transpositions, double characters, etc."

There are several other well-thought-out variations (considering shifts,
keyboard locality, etc.), at least if you're typing "English Lisp"
("American Lisp"?).
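[Editorial aside: the arithmetic in the quoted paragraph can be sketched
in a few lines of Common Lisp. This toy version implements only what the
manual states -- substitutions count as one disagreement, adjacent
transpositions as zero, agreement = 1 minus disagreements over the longer
length -- and guesses everything else; the real DWIM corrector handled
many more cases, and RELATIVE-AGREEMENT is a name invented here.]

```lisp
;; Toy "relative agreement" measure in the style described above.
;; Length differences and position-wise substitutions each count as one
;; disagreement; a transposed adjacent pair counts as zero (the AND/NAD
;; example). Result is a percentage of the longer word's length.
(defun relative-agreement (a b)
  (let* ((longer (max (length a) (length b)))
         (disagreements (abs (- (length a) (length b))))
         (i 0))
    (loop while (< i (min (length a) (length b)))
          do (cond ((char= (char a i) (char b i))
                    (incf i))
                   ;; adjacent transposition: zero mistakes
                   ((and (< (1+ i) (min (length a) (length b)))
                         (char= (char a i) (char b (1+ i)))
                         (char= (char a (1+ i)) (char b i)))
                    (incf i 2))
                   (t (incf disagreements)
                      (incf i))))
    (* 100 (- 1 (/ disagreements longer)))))

;; (relative-agreement "CONS" "CONX") => 75   (manual's example)
;; (relative-agreement "AND" "NAD")   => 100  (manual's example)
```

With the manual's default FIXSPELLREL of 70, CONS/CONX at 75 would be
accepted as a correction.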

Grumble. Makes me remember all the other neat stuff that didn't make
the Common Lisp merge.

bob bechtel

Kent M Pitman

Jan 29, 1999
"Will Hartung" <Will.H...@msoft.com> writes:

> >and you >miht< be able to think of another way of doing it
> m i g h t
> m i h t
> 1 1 0.5 0.5
> (3/5) * (4/5) = .48
> *FAIL*
> "I'm sorry, I don't understand 'miht'" :-)

heh. yeah, it does degrade a bit for short words. in fairness,
though, a typo makes more difference in a short word than in a
long one. but your point is well-taken.

btw, i didn't just suddenly become a worse speller a couple months
ago. rather, i switched to using a PC with a really annoying keypress
sensitivity that loses a lot of chars that other keyboards don't. i
often end up finding i've lost letters on this keyboard... sometimes i
fix them, but sometimes it's too much trouble. it's amazing to me
that with all the world relying so heavily on keyboards there isn't
more variety available in order to assure better quality.

Philip Lijnzaad

Jan 29, 1999
On Fri, 29 Jan 1999 03:56:49 GMT, "Kent" == Kent M Pitman <pit...@world.std.com> writes:

Kent> ago. rather, i switched to using a PC with a really annoying keypress
Kent> sensitivity that loses a lot of chars that other keyboards don't.

Seeing that you use Gnus as a news reader, you could consider using Emacs
flyspell-mode, which does on-the-fly spell checking (highlighting typos and
repeated words), and works like a dream. It's built into Emacs 20.3 and
later, but I suspect it will also work on Emacs 19.34.

Philip
--
erh ... the laziness you wield is the patience you yield, sort of thing %-/
-----------------------------------------------------------------------------
Philip Lijnzaad, lijn...@ebi.ac.uk | European Bioinformatics Institute
+44 (0)1223 49 4639 | Wellcome Trust Genome Campus, Hinxton
+44 (0)1223 49 4468 (fax) | Cambridgeshire CB10 1SD, GREAT BRITAIN
PGP fingerprint: E1 03 BF 80 94 61 B6 FC 50 3D 1F 64 40 75 FB 53

Nick Levine

Jan 29, 1999

Robert Bechtel wrote:

> Interlisp. Spelling correction as part of the DWIM (Do What I Mean)
> package.

I remember that. Used to drive me crazy until I disabled it permanently.
Every time I typed a new symbol into the listener DWIM corrected it for me.
Made prototyping and debugging a real pain. :-)

- n


Paolo Amoroso
Jan 29, 1999

On Fri, 29 Jan 1999 00:23:17 GMT, "Will Hartung" <Will.H...@msoft.com>
wrote:

> Not too relevant, perhaps, and I may be mistaken, but you may be able to find
> suitable code in the Squeak Smalltalk source code.
>
> I haven't played with it recently, but the original Smalltalk had a
> "spelling corrector" for when you misspelled variables and whatnot. When it
> detected a misspelling, it popped up a list of possible matches.
>
> Assuming that Squeak still has the functionality, the source would be
> in the distribution. Yeah, it's in Smalltalk, but it's better than nothing.

Yes, Squeak still has this functionality. The original one in the
Smalltalk-80 system was influenced by DWIM.


Paolo
--
Paolo Amoroso <amo...@mclink.it>

David B. Lamkins
Jan 29, 1999

In article <XXbs2.10606$202.5...@news1.teleport.com> , "David B. Lamkins"
<dlam...@teleport.com> wrote:

> In article <79hoa...@0796153227-1000.dialin.t-online.de> ,
> mens...@t-online.de (Dieter Menszner) wrote:
>
>>
>> I am looking for a Lisp function which does a fuzzy string match
>> e.g.
>>
>> (fuzzy-match "document" "documet")
>> => t
>>
>> Its purpose is to be typing error tolerant.
>>
>> Does anybody know of some downloadable source code ?
>

> There was an interesting article about computing the "Levenshtein distance"
> between two strings, presented by Ray Valdés in Dr. Dobb's Journal almost
> seven years ago.
>
> I had written a Pascal implementation to try it out, and I still have that
> code in my archives. Here's the header comment, which has all the source
> references:
>
> {Compute the Levenshtein distance between a pair of strings.}
> {Adapted from C code presented in "Finding String Distances",}
> {Ray Valdés, Dr. Dobb's Journal, April 1992, pp. 56-62, 107.}
> {Algorithm due to V.I. Levenshtein, as presented in "Time Warps,}
> {String Edits, and Macromolecules: The Theory and Practice of}
> {Sequence Comparison", Sankoff and Kruskal, eds., Addison-Wesley,}
> {1983. Macintosh implementation for THINK Pascal by D.B.Lamkins.}
>
> The API has a call to initialize the algorithm using four values to specify
> the cost of matching, insertion, deletion, and transposition. Then you call
> the comparison function with a pair of strings, and the algorithm returns a
> distance based upon the total cost of the edits required to transform one
> string into another (and, optionally, a list of the edit operations needed.)
>
> You could get your fuzzy string match by computing the Levenshtein distance
> and comparing the result to some threshold.
>
> I thought I had translated the algorithm into Common Lisp at one point, but
> I can't find it now. I'd be happy to send or post the source if someone
> wants to take a shot at the translation, or you could check out the original
> sources referenced above.
>

OK, I looked at the Pascal sources again and discovered that the Levenshtein
metric does not presume that one of its arguments is authoritative. Rather,
it computes the minimal cost of a series of edits to _both_ strings to bring
them into agreement with each other.

Still, ignoring the fact that you can't use the edit operations as a
spelling corrector, the distance metric itself is surprisingly useful. I
did a reasonably straightforward translation of the Pascal source to Lisp,
and experimented with finding fuzzy matches to symbols provided by the
package iterators. The performance definitely needs work, but I found the
quality of the matches to be (subjectively, over a small sample) quite good.

There are some examples at the end that suggest how this could be used to
support a DWIM interface.

Here's the code as it stands today. I'm not happy with the presentation of
this code, so please bear in mind the following caveats: (1) this code is a
fourth-generation translation [book -> article -> C -> Pascal -> Lisp], (2)
I have never seen the original explanation of the algorithm, and (3) I
haven't seen the article -- upon which this code is ultimately based -- in
seven years. OTOH, it does work...

;;;-*- Mode: Lisp; Package: COMMON-LISP-USER -*-
;;;
;;; Copyright © David B. Lamkins 1999
;;;
;;; Permission is granted for use, provided that author's name and copyright
;;; are retained in this file. The author makes no warranties with respect to
;;; this work.
;;;
;;; File: levenshtein.lisp
;;; Author: David B. Lamkins <dlam...@teleport.com>
;;; Organization: NIL
;;; Created: 1999/01/29 00:31:20 (UTC-8)
;;; Implementation: Macintosh Common Lisp Version 4.2
;;; Target: same
;;; Purpose: Calculate Levenshtein distance between two strings.
;;; Keywords: string similarity
;;; Requires:
;;; Resources:
;;; Dependencies:
;;; References: See Description.
;;; Standards:
;;; Unit-Test: See examples at end of file.
;;; Limitations:
;;; To-Do:
;;; Bugs: Code is ugly and conses too much.
;;;
;;; Description:
;;;
;;; Compute the Levenshtein distance between a pair of strings.
;;;
;;; I no longer have the article in which I found this algorithm,
;;; but it appears to find the minimal-cost sequence of edits that
;;; can be applied to both strings to make them identical. In
;;; other words, this algorithm does not assume that one string
;;; is "correct". Edits can be any of match (no change), insert,
;;; delete, or substitute. Each kind of edit has an associated
;;; cost.
;;;
;;; Despite the fact that neither string is presumed "authoritative",
;;; the Levenshtein distance metric is surprisingly useful for
;;; qualifying fuzzy matches from a dictionary; see the DWIM lookup
;;; examples at the end of the file.
;;;
;;; This code was translated from a THINK Pascal adaptation of a
;;; C algorithm I found in the article "Finding String Distances",
;;; Ray Valdés, Dr. Dobb's Journal, April 1992, pp. 56-62, 107.
;;;
;;; This was a "reasonably" straightforward translation from the
;;; THINK Pascal source, which was itself somewhat obfuscated by
;;; a lot of idioms for managing dynamic storage on the Mac heap.
;;; I could probably make the code more readable if I built a Lisp
;;; version from scratch, but that would mean that I'd have to
;;; dig up the original references, and I don't have the inclination
;;; to do that just now.
;;;
;;; Algorithm due to V.I. Levenshtein, as presented in "Time Warps,
;;; String Edits, and Macromolecules: The Theory and Practice of
;;; Sequence Comparison", Sankoff and Kruskal, eds., Addison-Wesley,
;;; 1983.

(defstruct (lev-op
             (:print-function lev-op-printer))
  index1
  index2
  op)

(defun lev-op-printer (object stream depth)
  (declare (ignore depth))
  (print-unreadable-object (object stream :type t)
    (format stream "~D ~D ~A"
            (lev-op-index1 object)
            (lev-op-index2 object)
            (lev-op-op object))))

(defstruct matrix-cell
  distance
  op)

(defstruct move
  row-dist
  col-dist)

(let ((moves (list
              (cons 'match (make-move :row-dist -1
                                      :col-dist -1))
              (cons 'insert (make-move :row-dist 0
                                       :col-dist -1))
              (cons 'delete (make-move :row-dist -1
                                       :col-dist 0))
              (cons 'substitute (make-move :row-dist -1
                                           :col-dist -1)))))
  (defun levenshtein-string-distance (string-1 string-2
                                      &key
                                      (match-cost 0)
                                      (insert-cost 1)
                                      (delete-cost 1)
                                      (substitute-cost 2))
    "Return the Levenshtein distance, number of edits, and the
edit sequence between STRING-1 and STRING-2. Edit costs can be
specified via :MATCH-COST, :INSERT-COST, :DELETE-COST and
:SUBSTITUTE-COST keyword arguments."
    (declare (optimize (speed 3)))
    (let* ((num-rows (1+ (length string-1)))
           (num-cols (1+ (length string-2)))
           (matrix (make-array (list num-rows num-cols))))
      ;; initialize matrix
      (do ((i 1 (1+ i)))
          ((= i num-cols))
        (setf (aref matrix 0 i)
              (make-matrix-cell :distance i
                                :op 'insert)))
      (do ((i 0 (1+ i)))
          ((= i num-rows))
        (setf (aref matrix i 0)
              (make-matrix-cell :distance i
                                :op 'delete)))
      ;; calculate matrix
      (let ((linearized-matrix (make-array (* num-rows num-cols)
                                           :displaced-to matrix))
            (c (1+ num-cols))
            (n 1)
            (w num-cols)
            (nw 0)
            (op-costs `((match . ,match-cost)
                        (insert . ,insert-cost)
                        (delete . ,delete-cost)
                        (substitute . ,substitute-cost))))
        (labels ((advance-indices ()
                   (incf c)
                   (incf n)
                   (incf w)
                   (incf nw))
                 (op-cost (op)
                   (cdr (assoc op op-costs)))
                 (distance (index)
                   (matrix-cell-distance (aref linearized-matrix index)))
                 (set-c-entry (ref-index op)
                   (setf (aref linearized-matrix c)
                         (make-matrix-cell :distance (+ (distance ref-index)
                                                        (op-cost op))
                                           :op op))))
          (do ((row 0 (1+ row)))
              ((= row (1- num-rows)))
            (do ((col 0 (1+ col)))
                ((= col (1- num-cols)))
              ;; calculate cell
              (cond ((< (distance w) (distance n))
                     (cond ((< (distance w) (distance nw))
                            (set-c-entry w 'insert))
                           ((char= (char string-1 row) (char string-2 col))
                            (set-c-entry nw 'match))
                           (t
                            (set-c-entry nw 'substitute))))
                    ((< (distance n) (distance nw))
                     (set-c-entry n 'delete))
                    ((char= (char string-1 row) (char string-2 col))
                     (set-c-entry nw 'match))
                    (t
                     (set-c-entry nw 'substitute)))
              ;; advance indices - inner
              (advance-indices))
            ;; advance indices - outer
            (advance-indices))))
      ;; backtrack matrix
      (let ((distance (matrix-cell-distance
                       (aref matrix (1- num-rows) (1- num-cols))))
            (row (1- num-rows))
            (col (1- num-cols))
            (edit-count 0)
            (ops ()))
        (loop
          (unless (and (not (zerop row))
                       (not (zerop col)))
            (return-from levenshtein-string-distance
              (values distance edit-count (nreverse ops))))
          (let* ((which-op (matrix-cell-op (aref matrix row col)))
                 (move (cdr (assoc which-op moves))))
            (unless (eq which-op 'match)
              (incf edit-count)
              (push (make-lev-op :index1 (1- row)
                                 :index2 (1- col)
                                 :op which-op)
                    ops))
            (incf row (move-row-dist move))
            (incf col (move-col-dist move))))))))

#|
;; Examples

? (levenshtein-string-distance "which" "witch")
2
2
(#<LEV-OP 2 2 INSERT> #<LEV-OP 1 0 DELETE>)
? (levenshtein-string-distance "wchih" "which")
2
2
(#<LEV-OP 3 3 INSERT> #<LEV-OP 1 0 DELETE>)
? (levenshtein-string-distance "foobar" "fubar")
3
2
(#<LEV-OP 2 1 SUBSTITUTE> #<LEV-OP 1 0 DELETE>)
? (levenshtein-string-distance "document" "documet")
1
1
(#<LEV-OP 6 5 DELETE>)
? (levenshtein-string-distance "caliber" "calibre")
2
2
(#<LEV-OP 6 6 DELETE> #<LEV-OP 4 5 INSERT>)
|#
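For readers who just want the distance (without the edit trace), the same dynamic program is easy to restate compactly. Here is a minimal sketch in Python, not part of the original post, using the same default costs as the Lisp version above (match 0, insert 1, delete 1, substitute 2):

```python
def levenshtein(s1, s2, insert_cost=1, delete_cost=1, substitute_cost=2):
    """Minimal total edit cost to bring s1 and s2 into agreement."""
    rows, cols = len(s1) + 1, len(s2) + 1
    # dist[i][j] = cheapest way to edit s1[:i] and s2[:j] into agreement
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * delete_cost
    for j in range(1, cols):
        dist[0][j] = j * insert_cost
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0 if s1[i - 1] == s2[j - 1] else substitute_cost
            dist[i][j] = min(dist[i - 1][j] + delete_cost,
                             dist[i][j - 1] + insert_cost,
                             dist[i - 1][j - 1] + sub)
    return dist[-1][-1]

# Agrees with the transcript above, e.g.
# levenshtein("document", "documet") == 1 and levenshtein("foobar", "fubar") == 3
```

This omits the backtracking that produces the LEV-OP list, which is the part that makes the Lisp version bulky.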

#|
;;; Example code for DWIM lookup

;;; Here are a couple of functions to lookup fuzzy matches for a
;;; given symbol or string, printing a list of candidates from
;;; either the internal and external symbols of *PACKAGE*, or from
;;; just the external symbols of the COMMON-LISP package. With
;;; suitable (nonportable) hooks and minor modifications to return
;;; lists instead of printing results, these functions could form
;;; the basis of a DWIM facility.

;;; An iMac running MCL 4.2 will compute distances at rates ranging
;;; from approximately 2,000 symbols per second (for short symbols)
;;; to 500 symbols per second (for longer symbols). These rates
;;; are fast enough for a DWIM lookup if you restrict lookup to
;;; about 1,000 symbols, as in DWIM-LOOKUP-EXTERNAL-CL-SYMBOLS.
;;; But my CL-USER package contains about 5,000 internal and external
;;; symbols, so more finesse is required to get acceptable performance
;;; in DWIM-LOOKUP-SYMBOLS; the total time is reduced by requiring that
;;; the difference in lengths between the two symbols is below some
;;; threshold before calling LEVENSHTEIN-STRING-DISTANCE. A greater
;;; mismatch in string lengths would result in large distances anyway,
;;; so it makes sense to avoid the expensive test when the cheap test
;;; can rule out a potential candidate.

(defparameter *dwim-lookup-cutoff* 5
  "All DWIM matches will have a Levenshtein distance less than this value.")

(defparameter *dwim-length-cutoff* 3
  "Max difference in symbol lengths when searching *PACKAGE* for DWIM
matches. Not used when searching COMMON-LISP package.")

(defun dwim-lookup-symbols (symbol-or-string)
  (let ((symbol-string (string-upcase (string symbol-or-string)))
        (count 0)
        (candidates 0))
    (do-symbols (sym)
      (incf count)
      (let ((candidate-string (string sym)))
        (when (and (<= (abs (- (length candidate-string)
                               (length symbol-string)))
                       *dwim-length-cutoff*)
                   (incf candidates)
                   (< (levenshtein-string-distance symbol-string (string sym))
                      *dwim-lookup-cutoff*))
          (print sym))))
    (format t "~&In ~:D symbols, examined ~:D candidates having lengths +/- ~D.~%"
            count candidates *dwim-length-cutoff*))
  (values))

(defun dwim-lookup-external-cl-symbols (symbol-or-string)
  (let ((symbol-string (string-upcase (string symbol-or-string)))
        (count 0))
    (do-external-symbols (sym :cl)
      (incf count)
      (when (< (levenshtein-string-distance symbol-string (string sym))
               *dwim-lookup-cutoff*)
        (print sym)))
    (format t "~&Examined ~:D symbols.~%" count))
  (values))
|#

#|
;;; Examples for DWIM lookup

? (dwim-lookup-symbols "levelstein-stng-distance")

LEVENSHTEIN-STRING-DISTANCE
In 5,241 symbols, examined 534 candidates with lengths +/- 3.
? (dwim-lookup-external-cl-symbols "do-internal-symbols")

DO-EXTERNAL-SYMBOLS
Examined 980 symbols.
? (dwim-lookup-external-cl-symbols "package-used-list")

PACKAGE-USED-BY-LIST
PACKAGE-USE-LIST
Examined 980 symbols.
|#

((lambda (x) (list x (list 'quote x)))
'(lambda (x) (list x (list 'quote x))))


Christopher Browne
Jan 30, 1999

On Thu, 28 Jan 1999 23:21:58 GMT, Kent M Pitman <pit...@world.std.com>
wrote:
>In case anyone is curious, I used the algorithm for allowing a user
>interface to a game I'd written to do things like:
>
> PROGRAM: Is your animal a frgo?
>
> USER: You misspelled frog.
>
> PROGRAM: Ok. Is your animal a frog?
>
>The program would match "frog" against each of the words in the
>previous query and guess that "frog" was replacing "frgo" because
>it had a good "closeness" to the given word according to the
>metric that I've described above.

Decidedly a cool thing. I suppose this parallels, to a certain extent,
the Emacs "M-/" functionality which tries to find the stem of the
present word based on a search of the available buffers.

It raises, to an extent, the question of how you construct an appropriate
dictionary/"database" of words to compare against.

For the above, the game would find "frog" as one of the alternatives in
that "frog" occurred in the recent past in the input stream. Thus, the
dictionary is collected from the input stream.

The Emacs "M-/" thing does similar by treating the set of 'buffers in
memory' as the dictionary.

In both cases, the data structures are small enough that an exhaustive
list search isn't going to run outrageously long.

If you want to start looking at larger dictionaries (e.g. - "All LISP
functions," or "Our 40,000 word dictionary of the English language"),
there probably needs to be some better lookup method so as to be more
selective. I know there are cryptogram dictionaries that count
letters, and would arrange, for instance, the word "dictionaries" as:
a-1, c-1, d-1, e-1, i-3, n-1, o-1, r-1, s-1, t-1
and thereby provide a way of finding other words containing the same
letters. Thinking out loud, there "oughta' be a way" of using
something like this to throw away *distant* words.
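One way the letter-count idea could be made concrete is as a cheap pre-filter. The following Python sketch is hypothetical (the function names and cutoff are made up for illustration): words whose letter-count signatures differ by too much cannot be close in edit distance, so they can be thrown away before an expensive comparison is run.

```python
from collections import Counter

def signature(word):
    # letter-count signature, e.g. "dictionaries" ->
    # {a:1, c:1, d:1, e:1, i:3, n:1, o:1, r:1, s:1, t:1}
    return Counter(word.lower())

def signature_distance(sig1, sig2):
    # number of letters that must be added or removed to equalize the
    # two multisets; each insert/delete changes this by 1 and each
    # substitution by at most 2, so a generous cutoff here will not
    # discard a word that is actually close in edit distance
    return sum(abs(sig1[k] - sig2[k]) for k in set(sig1) | set(sig2))

def rough_candidates(word, dictionary, cutoff=3):
    # keep only the words worth handing to a real distance metric
    sig = signature(word)
    return [w for w in dictionary
            if signature_distance(sig, signature(w)) <= cutoff]
```

Transpositions like "frgo"/"frog" have signature distance 0, so they always survive the filter.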
--
Would-be National Mottos:
USA: "There oughta' be a law!"
cbbr...@ntlug.org- <http://www.hex.net/~cbbrowne/langlisp.html>

Christopher Browne
Jan 30, 1999

On Fri, 29 Jan 1999 16:05:07 +0000, Nick Levine <ni...@harlequin.co.uk>
wrote:

That *would* be a problem...

--
Would-be National Mottos:
USA: "There oughta' be a law!"

cbbr...@ntlug.org- <http://www.hex.net/~cbbrowne/languages.html>

cdm
Jan 30, 1999

We had one on the LMI Lambda, but it did a couple of things
differently. The DWIM would only show up after an unbound
variable or function error; then it would look in the package for
similar symbols. If there were a unique one,
it would offer to 'link' the symbols (bash one with a forwarding
pointer to the other) as one of the proceed options. This was
very unobtrusive and quite helpful at times.

Christopher Browne <cbbr...@news.hex.net> wrote in message
news:18ws2.1793$e65...@news1.giganews.com...

Rob Warnock
Jan 30, 1999

Dieter Menszner <mens...@t-online.de> wrote:
+---------------

| I am looking for a Lisp function which does a fuzzy string match
...

| Does anybody know of some downloadable source code ?
+---------------

It's in C, not Lisp, but looking at it might get you some algorithms
you could use:

ftp://ftp.cs.arizona.edu/agrep/README
ftp://ftp.cs.arizona.edu/agrep/agrep-2.04.tar.Z

From the README:

...
The three most significant features of agrep that are not supported
by the grep family are
1) the ability to search for approximate patterns;
for example, "agrep -2 homogenos foo" will find homogeneous as well
as any other word that can be obtained from homogenos with at most
2 substitutions, insertions, or deletions.
"agrep -B homogenos foo" will generate a message of the form
best match has 2 errors, there are 5 matches, output them? (y/n)
...
The tar file contains the source code (in C), man pages (agrep.1),
and two additional files, agrep.algorithms and agrep.chronicle,
giving more information.
The agrep directory also includes two postscript files:
agrep.ps.1 is a technical report from June 1991
describing the design and implementation of agrep;
agrep.ps.2 is a copy of the paper as appeared in the 1992
Winter USENIX conference.

I have a script "word", which just runs "agrep" on "/usr/lib/dict/words",
that I find *very* handy when I can't remember how to spell something.
To use the example word given above:

% agrep -2 homogenos /usr/lib/dict/words
homogenate
homogeneity
homogeneous
inhomogeneity
inhomogeneous
%

Hope that helps,

Larry Hunter
Feb 1, 1999

Dieter Menszner said:

I am looking for a Lisp function which does a fuzzy string match e.g.
(fuzzy-match "document" "documet") => t. Its purpose is to be typing error
tolerant. Does anybody know of some downloadable source code?

Strangely enough, the rapidly growing world of genetic sequence analysis has
reinvigorated the area of string algorithms. One of the most basic problems
in analyzing DNA sequences is determining the most likely path of
substitutions, insertions and deletions. The bottom line is that a dynamic
programming algorithm (along with a substitution cost function and a gap
insertion penalty) can find the optimal "alignment" between two strings in
time proportional to the product of the lengths.
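That O(mn) dynamic program is the classic global-alignment recurrence. As a sketch only (this is not Hunter's code, whose DP-MATCH below additionally records traceback pointers and optional end-gap handling), the score computation can be written in Python with his default costs of 0 for a match, -10 for a mismatch, and -2 per gap position:

```python
def alignment_score(s1, s2, score=lambda a, b: 0 if a == b else -10, gap=-2):
    # best[i][j] = best total of per-character scores plus gap penalties
    # for aligning s1[:i] with s2[:j]; O(len(s1) * len(s2)) time and space
    rows, cols = len(s1) + 1, len(s2) + 1
    best = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        best[i][0] = i * gap          # s1 prefix aligned against gaps
    for j in range(1, cols):
        best[0][j] = j * gap          # s2 prefix aligned against gaps
    for i in range(1, rows):
        for j in range(1, cols):
            best[i][j] = max(best[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]),
                             best[i - 1][j] + gap,   # gap in s2
                             best[i][j - 1] + gap)   # gap in s1
    return best[-1][-1]
```

The optimal score ends up in the bottom-right cell; recovering the alignment itself requires tracing back through the choices, as the full code below does.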

I include some code I wrote for solving this problem in CL below, although
it may not be right for your application. First, you don't really care much
about the path of transformations, just some distance measure. Second, you
probably want faster approximate algorithms, since you don't want your
searches to be proportional to the size of the dictionary.

I also include below my fast local alignment code, which seems to work well
for longer strings. It looks for exact matches of a certain minimum size,
and then tries to hook together the local matches into a global match that
covers as much of the first string as possible. This is more likely to be
appropriate for your problem, but will require some adaptation to work as a
spelling corrector. Perhaps better would be to look at the ideas in the
implementation and try to use them in your application.

This code is provided with no guarantees, but it works for me. It is a US
Government work, and is therefore uncopyrightable. You may use it for
whatever purpose you wish.

Sorry about the >80 char line lengths, and I hope you find this handy.

Regards,

Larry Hunter
hun...@nlm.nih.gov


;; Fast local alignment

;; Implemented by Larry Hunter (hun...@Nlm.nih.gov) in June 1998.

;; THIS IS A US GOVERNMENT WORK AND UNCOPYRIGHTABLE.

;; Start by looking for exact matches of length min-size (either set by
;; user or to 5). Then iteratively try to extend these initial matches at
;; the beginning or the end until they can't be extended any further.


;; Note: the original post uses MATCH structure accessors (MAKE-MATCH,
;; MATCH-LENGTH, MATCH-START1, MATCH-START2) without including their
;; definition; a defstruct along these lines is presumably what was meant.
(defstruct match
  length
  start1
  start2)

(defun match (string1 string2 &key (match-fn #'char-equal) (min-size 5))
(let ((length1 (length string1))
(length2 (length string2))
(extended-matches t)
(initial-matches nil)
(final-matches nil))

; Find initial matches: all exact matches of length min-size.

(dotimes (start1 (- length1 min-size))
(dotimes (start2 (- length2 min-size))
(when (every match-fn
(subseq string1 start1 (+ start1 min-size))
(subseq string2 start2 (+ start2 min-size)))
(push (make-match :length min-size :start1 start1 :start2 start2) initial-matches))))

; Iterate through the matches, trying to extend them, either at the
; beginning or at the end. (Uses pushnew to avoid saving duplicates,
; i.e. matches with the same starts and lengths.) Any match that cannot
; be extended at either end is a final match. At the end, eliminate any
; final match that is a subset of another, i.e. which has the same
; starts, but a shorter length.

(do ((matches initial-matches extended-matches))
((null extended-matches)
(delete-if #'(lambda (match)
(some #'(lambda (match2)
(and (= (match-start1 match) (match-start1 match2))
(= (match-start2 match) (match-start2 match2))
(< (match-length match) (match-length match2))))
final-matches))
final-matches))
(setq extended-matches nil)
(dolist (match matches)
(let ((next-pos1 (+ (match-start1 match) (match-length match)))
(next-pos2 (+ (match-start2 match) (match-length match)))
(prev-pos1 (1- (match-start1 match)))
(prev-pos2 (1- (match-start2 match))))
(cond ((and (< next-pos1 length1)
(< next-pos2 length2)
(funcall match-fn (char string1 next-pos1) (char string2 next-pos2)))
(incf (match-length match))
(pushnew match extended-matches :test #'equalp))
((and (>= prev-pos1 0)
(>= prev-pos2 0)
(funcall match-fn (char string1 prev-pos1) (char string2 prev-pos2)))
(incf (match-length match))
(decf (match-start1 match))
(decf (match-start2 match))
(pushnew match extended-matches :test #'equalp))
(t (pushnew match final-matches :test #'equalp))))))))


;; FLA-ASSEMBLED-MATCH tries to hook together local alignments into a global
;; alignment that covers as much of string1 as possible. The tricky part is
;; that matches may overlap with each other, so assembling them gives less
;; new coverage than the sum of the lengths. To calculate the amount of new
;; coverage, subtract off the amount of overlap with the most overlapping
;; match already used. [This is actually a heuristic -- if multiple matches
;; overlap in different areas, the worst single overlap may not be as bad as
;; the combination; we ignore this for now.]

(defun fla-assembled-match (string1 string2)
(labels ((match-overlap (match1 match2)
(let ((startm1s1 (match-start1 match1))
(startm1s2 (match-start2 match1))
(startm2s1 (match-start1 match2))
(startm2s2 (match-start2 match2))
(lengthm1 (match-length match1))
(lengthm2 (match-length match2)))
(let ((endm1s1 (+ startm1s1 lengthm1))
(endm1s2 (+ startm1s2 lengthm1))
(endm2s1 (+ startm2s1 lengthm2))
(endm2s2 (+ startm2s2 lengthm2))
(shortest (min lengthm1 lengthm2)))
(cond ((and (>= startm2s1 startm1s1) (>= endm1s1 startm2s1))
(min (- endm1s1 startm2s1) shortest))
((and (>= startm1s1 startm2s1) (>= endm2s1 startm1s1))
(min (- endm2s1 startm1s1) shortest))
((and (>= startm2s2 startm1s2) (>= endm1s2 startm2s2))
(min (- endm1s2 startm2s2) shortest))
((and (>= startm1s2 startm2s2) (>= endm2s2 startm1s2))
(min (- endm2s2 startm1s2) shortest))
(t 0)))))
(new-coverage (new mlist)
(- (match-length new)
(reduce #'max mlist :key #'(lambda (match) (match-overlap new match))))))
(let* ((matches (match string1 string2))
(longest-match (when matches (find (reduce #'max matches :key #'match-length) matches
:key #'match-length))))
(if matches
(let ((matches-used (list longest-match))
(total-length (match-length longest-match))
(remaining (delete longest-match matches)))
(do ()
((null remaining) (values total-length (nreverse matches-used)))
(let* ((best-new-coverage
(reduce #'max remaining
:key #'(lambda (match) (new-coverage match matches-used))))
(next-best (find best-new-coverage remaining
:key #'(lambda (match) (new-coverage match matches-used)))))
(incf total-length best-new-coverage)
(push next-best matches-used)
(setq remaining (delete-if #'(lambda (match)
(or (eq match next-best)
(= 0 (new-coverage match matches-used))))
remaining)))))
(values 0 nil)))))


;; Dynamic programming for approximate string matching. (See, e.g. Cormen,
;; Leiserson & Rivest, pp. 301-328, or Gribskov & Devereux, pp. 124-137).

;; Implemented by Larry Hunter (hun...@nlm.nih.gov) in June, 1998.

;; THIS IS A US GOVERNMENT WORK AND UNCOPYRIGHTABLE.

;; Construct a matrix of alignment scores. Let string1 be indexed by i, and
;; string2 by j. The alignment score Si,j is the score si,j of aligning the
;; characters i and j, plus the max of:

;; Si-1,j-1 OR
;; Max Si-x,j-1 + gap cost, where 2<x<i OR
;; Max Si-1,j-y + gap cost, where 2<y<j
;;
;; The global alignment score is in the lower right corner of the matrix.
;;
;; If endgap? is t, then ignore final gaps in string2, accepting the best
;; score that occurs in the last line of the matrix (that is long enough to
;; include string1).
;;
;; To report the optimal alignments, trace back through the pointers after
;; the matrix is constructed. If you just want the score, then this isn't
;; necessary.

(defun dp-match (string1 string2 &key (gap-cost -2) (score-fn #'dp-default-score-fn)
(endgap? nil) (report-alignment? nil))
(declare (optimize (speed 3) (safety 1) (debug 0)))
(let* ((length1 (length string1))
(length2 (length string2))
(match-array (make-array (list length1 length2 2))))
(dotimes (pos1 length1)
(dotimes (pos2 length2)
(let ((score (funcall score-fn string1 pos1 string2 pos2))
(previous (if (and (> pos1 0) (> pos2 0))
(cons (aref match-array (1- pos1) (1- pos2) 0)
(list (1- pos1) (1- pos2)))
'(0)))
(skip1 (list (list most-negative-fixnum)))
(skip2 (list (list most-negative-fixnum))))
(when (> pos2 0)
(dotimes (skip-back (- pos1 1))
(let* ((previous-best (caar skip1))
(loc1 (- pos1 (+ 2 skip-back)))
(loc2 (1- pos2))
(score (+ (* gap-cost (1+ skip-back))
(aref match-array loc1 loc2 0))))
(cond ((> score previous-best) (setq skip1 (list (list score loc1 loc2))))
((= score previous-best) (push (list score loc1 loc2) skip1))))))
(when (> pos1 0)
(dotimes (skip-back (- pos2 1))
(let* ((previous-best (caar skip2))
(loc1 (1- pos1))
(loc2 (- pos2 (+ 2 skip-back)))
(score (+ (* gap-cost (1+ skip-back))
(aref match-array loc1 loc2 0))))
(cond ((> score previous-best) (setq skip2 (list (list score loc1 loc2))))
((= score previous-best) (push (list score loc1 loc2) skip2))))))
(let ((best-previous (max (car previous) (caar skip1) (caar skip2))))
(setf (aref match-array pos1 pos2 0) (+ score best-previous))
(setf (aref match-array pos1 pos2 1) nil)
(when (= best-previous (car previous))
(push (cdr previous) (aref match-array pos1 pos2 1)))
(when (= best-previous (caar skip1))
(setf (aref match-array pos1 pos2 1)
(append (aref match-array pos1 pos2 1) (delete nil (mapcar #'cdr skip1)))))
(when (= best-previous (caar skip2))
(setf (aref match-array pos1 pos2 1)
(append (aref match-array pos1 pos2 1) (delete nil (mapcar #'cdr skip2)))))
))))
(let ((endpoints (list (list (aref match-array (1- length1) (1- length2) 0)
(1- length1) (1- length2))))
alignments)
(when (and endgap? (> length2 length1))
(dotimes (i (- length2 length1))
(let ((loc (- length2 i 2)))
(cond ((> (aref match-array (1- length1) loc 0) (first (car endpoints)))
(setq endpoints (list (list (aref match-array (1- length1) loc 0)
(1- length1) loc))))
((= (aref match-array (1- length1) loc 0) (first (car endpoints)))
(push (list (aref match-array (1- length1) loc 0) (1- length1) loc) endpoints))))))
(when report-alignment?
(do ((paths (mapcar #'list (mapcar #'cdr endpoints)) expanded)
(expanded nil nil))
((every #'(lambda (path)
(or (= 0 (first (car path)))
(= 0 (second (car path)))))
paths)
(setq alignments paths))
(format t "~%Paths: ~a" paths)
(dolist (path paths)
(setq expanded
(append
(if (or (= 0 (first (car path)))
(= 0 (second (car path))))
(list path)
(mapcar #'(lambda (node)
(append (list node) path))
(aref match-array (first (car path)) (second (car path)) 1)))
expanded)))))
(if report-alignment? (values (first (car endpoints)) alignments) (first (car endpoints))))))

;; Score functions take a pair of strings and positions and return a value (more positive is better).
;; This is a bit of a hack (why not just the chars?) but it allows us to get the confidence values into
;; the match score.

;; The default score function is 0 for a match, -10 for a mismatch.

(defun dp-default-score-fn (string1 position1 string2 position2)
(if (char= (char string1 position1) (char string2 position2))
0
-10))

;; A more complicated score function:

(defun case-sensitive-score-fn (string1 position1 string2 position2)
(let ((char1 (char string1 position1))
(char2 (char string2 position2)))
(cond ((char= char1 char2) 0)
((char-equal char1 char2) -2)
(t -10))))

;; A score function that assumes a special variable *confidence-values* is bound to a simple vector of
;; confidence values for the characters in string2. Discount mismatches based on their confidence
;; value (high confidence => worse mismatch score).

(defun confidence-and-case-score-fn (string1 position1 string2 position2)
(declare (special *confidence-values*))
(let ((char1 (char string1 position1))
(char2 (char string2 position2))
(confidence (svref *confidence-values* position2)))
(cond ((char= char1 char2) 0)
((char-equal char1 char2) -2)
(t (- (1+ confidence))))))

--
Lawrence Hunter, PhD.
National Library of Medicine phone: +1 (301) 496-9303
Bldg. 38A, 9th fl, MS-54 fax: +1 (301) 496-0673
Bethesda, MD 20894 USA email: hun...@nlm.nih.gov

Gareth McCaughan
Feb 2, 1999

Larry Hunter wrote:

> Strangely enough, the rapidly growing world of genetic sequence analysis has
> reinvigorated the area of string algorithms. One of the most basic problems
> in analyzing DNA sequences is determining the most likely path of
> substitutions, insertions and deletions. The bottom line is that a dynamic
> programming algorithm (along with a substition cost function and a gap
> insertion penalty) can find the optimal "alignment" between two strings in
> time proportional to the product of the lengths.

Is that really the "bottom line"? The algorithm used in the GNU
"diff" program (and probably elsewhere too) takes maximum time
of order ND, where N is the length of the sequence and D is
(IIRC) the number of differences. (Obviously it still takes time
of order N if there are *no* differences, so probably that should
say ND+N.)

--
Gareth McCaughan Dept. of Pure Mathematics & Mathematical Statistics,
gj...@dpmms.cam.ac.uk Cambridge University, England.

Philip Lijnzaad
Feb 2, 1999

On 02 Feb 1999 10:24:12 +0000,
"Gareth" == Gareth McCaughan <gj...@dpmms.cam.ac.uk> writes:

Gareth> Larry Hunter wrote:
>> Strangely enough, the rapidly growing world of genetic sequence analysis has
>> reinvigorated the area of string algorithms. One of the most basic problems
>> in analyzing DNA sequences is determining the most likely path of
>> substitutions, insertions and deletions. The bottom line is that a dynamic
>> programming algorithm (along with a substition cost function and a gap
>> insertion penalty) can find the optimal "alignment" between two strings in
>> time proportional to the product of the lengths.

Gareth> Is that really the "bottom line"? The algorithm used in the GNU
Gareth> "diff" program (and probably elsewhere too) takes maximum time
Gareth> of order ND, where N is the length of the sequence and D is
Gareth> (IIRC) the number of differences. (Obviously it still takes time
Gareth> of order N if there are *no* differences, so probably that should
Gareth> say ND+N.)


The 'best' techniques for doing biological sequence alignments
(Smith-Waterman type) are guaranteed to produce an optimal alignment, with
well-defined optimality criteria. This includes similarity scores for the
things (amino acids) being aligned and penalty functions for the introduction
and/or extending of 'gaps' in one sequence. All this would seem to be more
demanding than a text diff. Obviously, in sequence alignments too, if you're
willing to lose some sensitivity, you can gain a lot in speed. I don't know
what the order becomes, though; I haven't followed the field from close by. I
would not be surprised if it's of order N. The task is simply different:
with a text diff, one can be satisfied with a very rough matching up of
things that are identical, and forget about the rest. In biological sequence
alignment, you need every clue you can get, and this includes very weak
signals of similarity, which are hints of common ancestry of sequences that
have been diverging for literally billions of years.

A great amount of work has been done on trying to optimize both the speed and
sensitivity of biological sequence alignment methods, as this is currently
the major tool in molecular biology. Typically, you find a 'new' sequence
and search through databases containing millions of known sequences to infer
something about it. It may be, though, that more work can be
done. But if you spent 3 months finding your new sequence, it's perhaps
debatable whether it matters if you can do the sequence similarity search in
1 min, 1 hour or 1 day.

Gareth McCaughan

unread,
Feb 2, 1999, 3:00:00 AM2/2/99
to
Philip Lijnzaad wrote:

[I referred to the performance of "diff"]


> The 'best' techniques for doing biological sequence alignments
> (Smith-Waterman type) are guaranteed to produce an optimal alignment, with
> well-defined optimality criteria. These include similarity scores for the
> things (amino acids) being aligned and penalty functions for introducing
> and/or extending 'gaps' in one sequence. All this would seem to be more
> demanding than a text diff. Obviously, in sequence alignments too, if you're
> willing to lose some sensitivity, you can gain a lot in speed. I don't know
> what the order becomes, though; I haven't followed the field closely. I
> would not be surprised if it's of order N. The task is simply different:
> with a text diff, one can be satisfied with a very rough matching up of
> things that are identical, and forget about the rest.

The algorithm I mentioned guarantees to find an optimal matching.
I think it can be used when the notion of optimality includes the
sort of things you mention, although I haven't looked in detail.

Larry Hunter

unread,
Feb 2, 1999, 3:00:00 AM2/2/99
to

In replying to my comments and code for dynamic programming sequence
alignment, noting that optimal alignment requires time proportional to the
product of the lengths of the sequences, Gareth McCaughan said:

Is that really the "bottom line"? The algorithm used in the GNU "diff"
program (and probably elsewhere too) takes maximum time of order ND, where
N is the length of the sequence and D is (IIRC) the number of
differences. (Obviously it still takes time of order N if there are *no*
differences, so probably that should say ND+N.)

But diff is solving a different problem. It is looking for maximum-length
exact matches, and not giving partial match scores in mismatching
areas. Once it is possible to assign varying penalties for mismatches
(e.g. "a" vs. "e" is not as bad a mismatch as "a" vs. "l"), then I believe
the dynamic programming approach is provably optimal. See Smith & Waterman,
"Identification of common molecular subsequences." J Mol Biol 1981 Mar
25;147(1):195-7 for details.
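
To make the recurrence concrete, here is a minimal score-only sketch of the
Smith-Waterman local alignment in Python. The scoring parameters (match,
mismatch, linear gap penalty) are illustrative defaults, not values from the
paper, and real implementations typically use a substitution matrix and
affine gap penalties instead:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Score of the best local alignment of a and b, O(len(a)*len(b))."""
    rows, cols = len(a) + 1, len(b) + 1
    # h[i][j] = best score of an alignment ending at a[i-1], b[j-1]
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0,                  # start a fresh local alignment
                          h[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                          h[i - 1][j] + gap,    # gap in b
                          h[i][j - 1] + gap)    # gap in a
            best = max(best, h[i][j])
    return best
```

The double loop over the table is exactly where the time proportional to the
product of the lengths comes from; recovering the alignment itself takes a
traceback from the best cell.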

By the way, the fast approximate matching code I also provided in that
message is similar in design (and computational complexity) to diff.

Larry

Reini Urban

unread,
Feb 2, 1999, 3:00:00 AM2/2/99
to
Larry Hunter <hun...@nlm.nih.gov> wrote:
>;; THIS IS A US GOVERNMENT WORK AND UNCOPYRIGHTABLE.

i'm confused. is this really true?
looks fantastic! where can i read more about such laws?
couldn't find anything via altavista

BTW: can't you just use the unix tool agrep and call it via FFI?
---
Reini Urban
http://xarch.tu-graz.ac.at/autocad/news/faq/autolisp.html

Roger Hayes

unread,
Feb 3, 1999, 3:00:00 AM2/3/99
to
Folks--

Are you aware of Gene Myers' work on gene matching & sequencing?

See http://www.cs.arizona.edu/people/gene/

Roger Hayes
