[Haskell-cafe] Haskell version of Norvig's Python Spelling Corrector

Pete Kazmier

unread,

Apr 21, 2007, 5:43:50 PM4/21/07

to haskel...@haskell.org

Recently I read an interesting article by Peter Norvig[1] on how to
write a spelling corrector in 21-lines of Python. I wanted to try and
implement it in Haskell. My implementation is terribly slow and I was
hoping for some tips on how to improve it and make it idiomatic.

I'd love to see other Haskell implementations as well if anyone has a
few moments to spare. Admittedly, it took me several hours to get my
version working, but I'm a Haskell newbie. Unfortunately, I think it
runs as slow as it took me to write it! There is definitely something
wrong with it, a memory leak, because I can't correct more than a few
words without a great deal of memory being consumed.

Thanks,
Pete

[1] http://norvig.com/spell-correct.html

spell.hs

Arie Peterson

unread,

Apr 21, 2007, 10:06:58 PM4/21/07

to haskel...@haskell.org

Hi Pete,

> Recently I read an interesting article by Peter Norvig[1] on how to
> write a spelling corrector in 21-lines of Python. I wanted to try and
> implement it in Haskell. My implementation is terribly slow and I was
> hoping for some tips on how to improve it and make it idiomatic.

I had a quick look at this. One way to speed up your program is by
replacing the sets of altered words by lists. Filtering out doubles is a
noble cause, but in this case it takes more time than keeping them around
and doing some extra lookups in your frequency map. (I tried this, it gave
a speedup factor of ~3.)

> I'd love to see other Haskell implementations as well if anyone has a
> few moments to spare. Admittedly, it took me several hours to get my
> version working, but I'm a Haskell newbie. Unfortunately, I think it
> runs as slow as it took me to write it! There is definitely something
> wrong with it, a memory leak, because I can't correct more than a few
> words without a great deal of memory being consumed.

Be careful when you apply the |train| function to more than one word; in
this form it may compute the frequency map from start for each invocation.
(It is better to let |train| take a frequency map, instead of a list of
words.)

Also be sure to compile your program with full optimisation ('-O2' for ghc).

_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Pete Kazmier

unread,

Apr 22, 2007, 12:26:08 AM4/22/07

to haskel...@haskell.org

Pete Kazmier <pete-expir...@kazmier.com> writes:

> I'd love to see other Haskell implementations as well if anyone has a
> few moments to spare. Admittedly, it took me several hours to get my
> version working, but I'm a Haskell newbie. Unfortunately, I think it
> runs as slow as it took me to write it! There is definitely something
> wrong with it, a memory leak, because I can't correct more than a few
> words without a great deal of memory being consumed.

As monochrom pointed out on #haskell, I am using 'interact'
incorrectly. For some reason I thought 'interact' applied its
argument to each line of the input. I've replaced it as follows:

interact $ unlines . map (show . (id &&& correct)) . lines

The program is still terribly slow due to my use of lists. Is there a
better way to write 'edits1'?

Thanks,
Pete

Albert Y. C. Lai

unread,

Apr 22, 2007, 1:36:22 AM4/22/07

to haskel...@haskell.org

I try using WordSet = [String] (plus corresponding change in code) and
get great speedup, actually way more than 3x. There was also a memory
growth phenomenon using Set String, and replacement by [String] stops
that too, now it's constant space (constant = 20M). It is possible to
attribute part of the speedup to excellent rewrite rules in GHC
regarding lists; however, I cannot explain the memory growth when using Set.

Regarding the local WordFreq map under "train", I am shocked that ghc -O
is smart enough to notice it and perform proper sharing, and only one
copy is ever created. Nonetheless, I still decide to factor "train" into
two, one builds the WordFreq and the other queries it, which eases blame
analysis when necessary.

On the interact line, I use "tokens" to break up the input, since it's
already written (for the trainer), may as well reuse it.

When reading holmes.txt, be aware that it is in UTF-8, while GHC still
assumes ISO-8859-1. This will affect results.

I have not checked the correctness of edits1.

I am monochrom.

My modification is attached.

spell2.hs

Ketil Malde

unread,

Apr 22, 2007, 9:12:41 AM4/22/07

to Pete Kazmier

On Sun, 2007-04-22 at 00:25 -0400, Pete Kazmier wrote:
> Pete Kazmier <pete-expir...@kazmier.com> writes:
>
> > I'd love to see other Haskell implementations as well if anyone has a
> > few moments to spare. Admittedly, it took me several hours to get my
> > version working, but I'm a Haskell newbie. Unfortunately, I think it
> > runs as slow as it took me to write it!

Hm - nobody suggested using ByteStrings yet? String is notoriously
wasteful, and replacing it with ByteString normally gives a quite
significant speedup.

Worse - and this is true for ByteStrings, too - String comparisons are
O(n), which means lookups in Sets and Maps are expensive. A trie (i.e,
a search tree where each internal node corresponds to a word prefix, and
has one branch per letter in the alphabet) will give you lookup that
scales with word size (and possibly alphabet size).

Instead of generating the (huge) list of misspelled words, you could
calculate edit distance to each correctly spelled word? With a bound on
allowable mistakes, this is a linear time operation using the standard
dynamic programming algorithm.

-k

Pete Kazmier

unread,

Apr 22, 2007, 11:51:44 AM4/22/07

to haskel...@haskell.org

Ketil Malde <ke...@ii.uib.no> writes:

> On Sun, 2007-04-22 at 00:25 -0400, Pete Kazmier wrote:
>> Pete Kazmier <pete-expir...@kazmier.com> writes:
>>
>> > I'd love to see other Haskell implementations as well if anyone has a
>> > few moments to spare. Admittedly, it took me several hours to get my
>> > version working, but I'm a Haskell newbie. Unfortunately, I think it
>> > runs as slow as it took me to write it!
>
> Hm - nobody suggested using ByteStrings yet? String is notoriously
> wasteful, and replacing it with ByteString normally gives a quite
> significant speedup.

I actually have a ByteString version but it runs much slower. This
part of the code is where all of the time is spent in the ByteString
version:

type WordFreq = M.Map B.ByteString Int

train:: [B.ByteString] -> WordFreq
train words = frequencyMap
where
frequencyMap = foldr incWordCount M.empty words
incWordCount w m = M.insertWith (+) w 1 m

> Worse - and this is true for ByteStrings, too - String comparisons are
> O(n), which means lookups in Sets and Maps are expensive. A trie (i.e,
> a search tree where each internal node corresponds to a word prefix, and
> has one branch per letter in the alphabet) will give you lookup that
> scales with word size (and possibly alphabet size).

Right. My first version was just a direct translation of Norvig's
code with an emphasis on trying to keep the complexity and size of
code to a minimum.

> Instead of generating the (huge) list of misspelled words, you could
> calculate edit distance to each correctly spelled word? With a bound on
> allowable mistakes, this is a linear time operation using the standard
> dynamic programming algorithm.

Could you provide additional information on this "standard dynamic
programming algorithm?" I'm not familiar with dynamic programming.

Thanks!
Pete

Bryan O'Sullivan

unread,

Apr 22, 2007, 12:40:24 PM4/22/07

to Ketil Malde

Ketil Malde wrote:

> Hm - nobody suggested using ByteStrings yet?

I wrote an independent port of Norvig's spellchecker, because I figured
it would make the basis of an interesting tutorial. For such a small
program, it's been quite a challenge!

I started out using lazy ByteStrings, Data.Map, and Data.Set. As Albert
observed, using Data.Set is poison for heap size and performance. The
result of switching from sets to lists was a > 90% reduction in memory
usage, and a big (but unmeasured) speedup.

After this switch, I found that spellchecking one word still took 4x as
long in Haskell as Norvig's Python program. Since I was checking only
one word in each case, essentially all of the runtime was taken up by
building the word frequency map.

In my profile results, I find that simply converting words to lower case
accounts for a whopping 40% of time and allocation (see the attachment
for my definition of the train function).

COST CENTRE MODULE %time %alloc

lower Spell 40.5 41.2
train Spell 26.3 14.3
mkWords Spell 21.9 24.1

I was interested in building a profile-enabled version of fps to see
what might be going on inside the library, but was stymied by not
knowing how to get GHC 6.6's base to coexist peacefully with fps (hiding
base isn't very practical).

My experiments are available here:

darcs get http://darcs.serpentine.com/spell

Norvig's training data is available from http://norvig.com/big.txt

<b

train.hs

Pete Kazmier

unread,

Apr 22, 2007, 12:55:54 PM4/22/07

to haskel...@haskell.org

Bryan O'Sullivan <b...@serpentine.com> writes:

> After this switch, I found that spellchecking one word still took 4x
> as long in Haskell as Norvig's Python program. Since I was checking
> only one word in each case, essentially all of the runtime was taken
> up by building the word frequency map.
>

> train = foldl' updateMap M.empty . map lower . mkWords
> where updateMap model word = M.insertWith' (+) word 1 model
> mkWords = filter (not . B.null) . X.splitWith isNotAlpha
> lower !s = X.map toLower s
> isNotAlpha !c = c < 0x41 || (c > 0x5a && c < 0x61) || c > 0x7a
> toLower !c | c >= 0x41 && c <= 0x5a = c + 0x20
> | otherwise = c

After reading Bryan's post, I switched my right fold to a strict left
fold and almost tripled my original speed. Could someone provide some
guidelines on when to use each? I thought foldr should be used when
building some large data structure such as the map I build.

Bryan, out of curiosity, is a non bytestring version of your code
faster?

Thanks,
Pete

Derek Elkins

unread,

Apr 22, 2007, 1:03:36 PM4/22/07

to Pete Kazmier

Pete Kazmier wrote:
> Bryan O'Sullivan <b...@serpentine.com> writes:
>
>> After this switch, I found that spellchecking one word still took 4x
>> as long in Haskell as Norvig's Python program. Since I was checking
>> only one word in each case, essentially all of the runtime was taken
>> up by building the word frequency map.
>>
>> train = foldl' updateMap M.empty . map lower . mkWords
>> where updateMap model word = M.insertWith' (+) word 1 model
>> mkWords = filter (not . B.null) . X.splitWith isNotAlpha
>> lower !s = X.map toLower s
>> isNotAlpha !c = c < 0x41 || (c > 0x5a && c < 0x61) || c > 0x7a
>> toLower !c | c >= 0x41 && c <= 0x5a = c + 0x20
>> | otherwise = c
>
> After reading Bryan's post, I switched my right fold to a strict left
> fold and almost tripled my original speed. Could someone provide some
> guidelines on when to use each? I thought foldr should be used when
> building some large data structure such as the map I build.
>
> Bryan, out of curiosity, is a non bytestring version of your code
> faster?

http://www.haskell.org/hawiki/StackOverflow

Bryan O'Sullivan

unread,

Apr 22, 2007, 2:23:53 PM4/22/07

to Pete Kazmier

Pete Kazmier wrote:

> Bryan, out of curiosity, is a non bytestring version of your code
> faster?

No, but the difference is smaller than I expected: the lazy ByteString
version is about 1.8x faster than using String.

<b

Pete Kazmier

unread,

Apr 22, 2007, 4:01:30 PM4/22/07

to haskel...@haskell.org

Derek Elkins <derek.a...@gmail.com> writes:

>> After reading Bryan's post, I switched my right fold to a strict left
>> fold and almost tripled my original speed. Could someone provide some
>> guidelines on when to use each? I thought foldr should be used when
>> building some large data structure such as the map I build.
>> Bryan, out of curiosity, is a non bytestring version of your code
>> faster?
>
> http://www.haskell.org/hawiki/StackOverflow

I read the article and understand it, but I am having a hard time
applying that specifically to my use of foldr. Here is how I was
using foldr:

type WordFreq = M.Map B.ByteString Int

train:: [B.ByteString] -> WordFreq
train words = frequencyMap
where
frequencyMap = foldr incWordCount M.empty words
incWordCount w m = M.insertWith (+) w 1 m

So is 'incWordCount' strict in its second argument? I'm still not
sure exactly what that means. According to the wiki page, if it is
strict in the second argument, I should have used foldl' instead of
foldr.

Thanks,
Pete

Derek Elkins

unread,

Apr 22, 2007, 4:08:12 PM4/22/07

to Pete Kazmier, haskel...@haskell.org

Pete Kazmier wrote:
> Derek Elkins <derek.a...@gmail.com> writes:
>
>>> After reading Bryan's post, I switched my right fold to a strict left
>>> fold and almost tripled my original speed. Could someone provide some
>>> guidelines on when to use each? I thought foldr should be used when
>>> building some large data structure such as the map I build.
>>> Bryan, out of curiosity, is a non bytestring version of your code
>>> faster?
>> http://www.haskell.org/hawiki/StackOverflow
>
> I read the article and understand it, but I am having a hard time
> applying that specifically to my use of foldr. Here is how I was
> using foldr:
>
> type WordFreq = M.Map B.ByteString Int
>
> train:: [B.ByteString] -> WordFreq
> train words = frequencyMap
> where
> frequencyMap = foldr incWordCount M.empty words
> incWordCount w m = M.insertWith (+) w 1 m
>
> So is 'incWordCount' strict in its second argument? I'm still not
> sure exactly what that means. According to the wiki page, if it is
> strict in the second argument, I should have used foldl' instead of
> foldr.

insertWith will definitely need to examine 'm' to perform it's action, therefore
it is strict.

Ketil Malde

unread,

Apr 22, 2007, 4:11:04 PM4/22/07

to Haskel...@haskell.org

On Sun, 2007-04-22 at 11:51 -0400, Pete Kazmier wrote:

> type WordFreq = M.Map B.ByteString Int
>
> train:: [B.ByteString] -> WordFreq
> train words = frequencyMap
> where
> frequencyMap = foldr incWordCount M.empty words
> incWordCount w m = M.insertWith (+) w 1 m

Isn't this going to build unevaluate thunks for the counts? This is
what frequency counting looks like in some of my old code:

-- IM is Data.IntMap
index ss = foldl' myupdate IM.empty ss
where myupdate m k = let x = 1 + IM.findWithDefault 0 k m
in x `seq` IM.insert k x m

(edited a bit, but hopefully correct)

I.e. use foldl', and look up previous value, succ and seq it, and put it
back.

Generally, I only seem to want strict Maps. It seems so easy to build a
lazy one if you need it (data Lazy a = Lazy a), so I wonder if strict
should be the default?

> > Instead of generating the (huge) list of misspelled words, you could
> > calculate edit distance to each correctly spelled word? With a bound on
> > allowable mistakes, this is a linear time operation using the standard
> > dynamic programming algorithm.
>
> Could you provide additional information on this "standard dynamic
> programming algorithm?" I'm not familiar with dynamic programming.

Sorry! Dynamic programming just means using memoizing on a recursive
function to reduce its run time complexity (although I don't think I've
seen that definition anywhere?). A simple example is finding the n'th
Fibonacci number by calculating (and referring to) the whole list.

For edit distance, you can do something like:

data Edit a = Repl a a | Delete a | Insert a

align (x:xs) (y:ys) = best [ Repl x y : align xs ys,
Insert y : align (x:xs) ys,
Delete x : align xs (y:ys)]

Where 'best' is some fitness function, selecting the best alignment.

Applying this directly gives an exponential algorithm, but since there
are only n*m (n = lenght (x:xs), m = length (y:ys) possible ways to pick
the remainders in the recursive calls throughout the execution, you
might as well store them all in an n*m matrix - which only takes O(nm)
time.

This is probably the worst explanation of DP so far this century, but if
you Google for it, you'll find tons of information. Wikipedia has an
entry as well.

Anyway: since the number of edits in this case is bounded (to four,
since transposition can be modelled by an insertion and a deletion), you
know the optimal solution will never stray further than two off the
diagonal of the DP matrix, so you only need to caclulate a diagonal band
(linear space/time), not the full matrix (quadratic).

-k

Stefan O'Rear

unread,

Apr 22, 2007, 4:15:26 PM4/22/07

to Ketil Malde

You can almost always do DP with less space, and this is no exception:
linear space for full Levenschtein, constant for the diagonal band.

http://www.haskell.org/haskellwiki/Dynamic_programming_example#Optimization

Stefan

Felipe Almeida Lessa

unread,

Apr 22, 2007, 5:45:14 PM4/22/07

to Pete Kazmier

On 4/22/07, Pete Kazmier <pete-expir...@kazmier.com> wrote:
> > Worse - and this is true for ByteStrings, too - String comparisons are
> > O(n), which means lookups in Sets and Maps are expensive. A trie (i.e,
> > a search tree where each internal node corresponds to a word prefix, and
> > has one branch per letter in the alphabet) will give you lookup that
> > scales with word size (and possibly alphabet size).
>
> Right. My first version was just a direct translation of Norvig's
> code with an emphasis on trying to keep the complexity and size of
> code to a minimum.

AFAIK, Python's dictionaries are implemented using hashtables, so
lookups with them are much faster than with M.Map String Int. So
should a direct translation use a Data.HashTable (despite its unpure
interface)? But probably a trie should be better than both (just
gambling ;-).

Cheers,

--
Felipe.

Duncan Coutts

unread,

Apr 22, 2007, 11:54:21 PM4/22/07

to Pete Kazmier

On Sun, 2007-04-22 at 12:55 -0400, Pete Kazmier wrote:

> After reading Bryan's post, I switched my right fold to a strict left
> fold and almost tripled my original speed. Could someone provide some
> guidelines on when to use each? I thought foldr should be used when
> building some large data structure such as the map I build.

You are building a strict data structure, not a lazy one. You cannot use
that map until every element has been inserted into it, there are no
partial intermediate results you can use like say when you build a lazy
list. Hence it's better to use a strict left fold.

Rule of thumb: use a lazy fold to build lazy structures, use a strict
fold to build strict ones.

Duncan

Bryan O'Sullivan

unread,

Apr 23, 2007, 1:36:56 AM4/23/07

to Donald Bruce Stewart

Bryan O'Sullivan wrote:

> In my profile results, I find that simply converting words to lower case
> accounts for a whopping 40% of time and allocation (see the attachment
> for my definition of the train function).
>
> COST CENTRE MODULE %time %alloc
>
> lower Spell 40.5 41.2
> train Spell 26.3 14.3
> mkWords Spell 21.9 24.1

A little more instrumentation says this (using the darcs head of fps
built with -auto-all):

loopU NewData.ByteString.Fusion 25.4 28.8
splitWith NewData.ByteString 15.4 17.2
train Spell 10.2 6.1
isNotAlpha Spell 9.4 12.2
compareBytes NewData.ByteString 8.8 9.6
compareBytes NewData.ByteString.Lazy 7.4 0.4
inlinePerformIO NewData.ByteString.Base 6.6 0.0

(At Stefan's suggestion, I renamed the modules in fps to NewData.*, in
order to get around the name clashes when I try to concurrently use fps
and base.)

<b

Ketil Malde

unread,

Apr 23, 2007, 3:08:38 AM4/23/07

to Haskel...@haskell.org

On Sun, 2007-04-22 at 13:15 -0700, Stefan O'Rear wrote:

> You can almost always do DP with less space, and this is no exception:
> linear space for full Levenschtein, constant for the diagonal band.

Right. Especially as we only want the final score and not the exact
edits, we only need to keep one column of the matrix. If you want the
actual edits, you must backtrack from the optimal score -- or, since
this is Haskell, keep the edits around as lazy lists, which, I think,
amounts to the same thing.

Even if we do want the exact edits we can get linear space with a divide
and conquer approach by finding the position in the middle column with
the best score (summing the alignment scores of the prefix of xs vs ys
and the suffix of xs vs ys), and then doing a separate alignment in the
two quadrants (and recurse, of course). This means an extra log n time
factor, though.

This is all very theoretical for a problem where we align words of about
ten characters, of course. :-)

-k

Albert Y. C. Lai

unread,

Apr 23, 2007, 8:51:44 PM4/23/07

to haskel...@haskell.org

Albert Y. C. Lai wrote:
> I try using WordSet = [String] (plus corresponding change in code) and
> get great speedup, actually way more than 3x. There was also a memory
> growth phenomenon using Set String, and replacement by [String] stops
> that too, now it's constant space (constant = 20M). It is possible to
> attribute part of the speedup to excellent rewrite rules in GHC
> regarding lists; however, I cannot explain the memory growth when using
> Set.

Now I see. The higher memory usage of the Set version is not growth or
leak, just brutal. For each input word, a ton of derived words are
generated, and an optimal one is chosen (by a criterion that's
essentially a mapping from words to scores). There are two ways to do this.

You could put the derived words into a lazy list and then traverse it,
and therefore they're generated, examined, and thrown away on demand. At
most two words are ever going to be stored (the best one so far and the
next candidate), maybe plus a couple of cons cells if the compiler does
not optimize.

Or you could put all derived words into a set first (to avoid
duplicates, but note that you still have to generate a ton, you only
save by storing just half a ton), then traverse it. Storing a ton, or
even just half a ton, of these words is going to clog up memory.
The perceived growth is only because some input words cause fewer
derived words and some other input words encountered later cause more
derived words. So, plotting memory over time, once in a while we have
more derived words than before, so the heap grows for that; they are
still thrown away correctly in due course. If you input the same input
word over and over gain, space usage is still constant.

Udo Stenzel

unread,

Apr 24, 2007, 8:33:02 AM4/24/07

to haskel...@haskell.org

Pete Kazmier wrote:
> train:: [B.ByteString] -> WordFreq
> train words = frequencyMap
> where
> frequencyMap = foldr incWordCount M.empty words
> incWordCount w m = M.insertWith (+) w 1 m
>
> So is 'incWordCount' strict in its second argument? I'm still not
> sure exactly what that means.

Yes. incWordCount is strict in its second argument since

incWordCount x undefined == undefined

Of course you cannot see that from the definition of incWordCount alone,
this depends on the behavior of M.insertWith.

> According to the wiki page, if it is
> strict in the second argument, I should have used foldl' instead of
> foldr.

Remember that the difference between foldr and foldl is not one between
left and right; both have to recurse from the left. But foldr is normal
recursion, while foldl is accumulator recursion. You obviously wanted
an accumulator, and it should usually be strictly evaluated.

There is another bug of this sort in your code. Consider

> incWordCount w m = M.insertWith (+) w 1 m

There is no reason to evaluate the sum inside the map, instead an
unevaluated thunk is put in there. Unfortunately, you need to take the
long way of using M.lookup and M.insert to build a strict replacement
for M.insertWith. (A strict variant of Data.Map would be useful here,
unfortunately, there is none.)

-Udo
--
Streitigkeiten dauerten nie lange, wenn nur eine Seite Unrecht hätte.
-- de la Rochefoucauld

signature.asc

Bryan O'Sullivan

unread,

Apr 24, 2007, 11:34:09 AM4/24/07

to Udo Stenzel

Udo Stenzel wrote:

> There is another bug of this sort in your code. Consider
>
>> incWordCount w m = M.insertWith (+) w 1 m
>
> There is no reason to evaluate the sum inside the map, instead an
> unevaluated thunk is put in there.

Would not Data.Map.insertWith' do the trick?

<b

Udo Stenzel

unread,

Apr 24, 2007, 4:31:31 PM4/24/07

to haskel...@haskell.org

Bryan O'Sullivan wrote:
> Udo Stenzel wrote:
>
> >There is another bug of this sort in your code. Consider
> >
> >> incWordCount w m = M.insertWith (+) w 1 m
> >
> >There is no reason to evaluate the sum inside the map, instead an
> >unevaluated thunk is put in there.
>
> Would not Data.Map.insertWith' do the trick?

Oops, you're right, this is a fairly recent addition.

-Udo
--
Walk softly and carry a BFG-9000.

signature.asc