Recently I read an interesting article by Peter Norvig[1] on how to write a spelling corrector in 21-lines of Python. I wanted to try and implement it in Haskell. My implementation is terribly slow and I was hoping for some tips on how to improve it and make it idiomatic.
I'd love to see other Haskell implementations as well if anyone has a few moments to spare. Admittedly, it took me several hours to get my version working, but I'm a Haskell newbie. Unfortunately, I think it runs as slow as it took me to write it! There is definitely something wrong with it, a memory leak, because I can't correct more than a few words without a great deal of memory being consumed.
> Recently I read an interesting article by Peter Norvig[1] on how to > write a spelling corrector in 21-lines of Python. I wanted to try and > implement it in Haskell. My implementation is terribly slow and I was > hoping for some tips on how to improve it and make it idiomatic.
I had a quick look at this. One way to speed up your program is by replacing the sets of altered words by lists. Filtering out doubles is a noble cause, but in this case it takes more time than keeping them around and doing some extra lookups in your frequency map. (I tried this, it gave a speedup factor of ~3.)
> I'd love to see other Haskell implementations as well if anyone has a > few moments to spare. Admittedly, it took me several hours to get my > version working, but I'm a Haskell newbie. Unfortunately, I think it > runs as slow as it took me to write it! There is definitely something > wrong with it, a memory leak, because I can't correct more than a few > words without a great deal of memory being consumed.
Be careful when you apply the |train| function to more than one word; in this form it may compute the frequency map from start for each invocation. (It is better to let |train| take a frequency map, instead of a list of words.)
Also be sure to compile your program with full optimisation ('-O2' for ghc).
Pete Kazmier <pete-expires-20070...@kazmier.com> writes: > I'd love to see other Haskell implementations as well if anyone has a > few moments to spare. Admittedly, it took me several hours to get my > version working, but I'm a Haskell newbie. Unfortunately, I think it > runs as slow as it took me to write it! There is definitely something > wrong with it, a memory leak, because I can't correct more than a few > words without a great deal of memory being consumed.
As monochrom pointed out on #haskell, I am using 'interact' incorrectly. For some reason I thought 'interact' applied its argument to each line of the input. I've replaced it as follows:
I try using WordSet = [String] (plus corresponding change in code) and get great speedup, actually way more than 3x. There was also a memory growth phenomenon using Set String, and replacement by [String] stops that too, now it's constant space (constant = 20M). It is possible to attribute part of the speedup to excellent rewrite rules in GHC regarding lists; however, I cannot explain the memory growth when using Set.
Regarding the local WordFreq map under "train", I am shocked that ghc -O is smart enough to notice it and perform proper sharing, and only one copy is ever created. Nonetheless, I still decide to factor "train" into two, one builds the WordFreq and the other queries it, which eases blame analysis when necessary.
On the interact line, I use "tokens" to break up the input, since it's already written (for the trainer), may as well reuse it.
When reading holmes.txt, be aware that it is in UTF-8, while GHC still assumes ISO-8859-1. This will affect results.
On Sun, 2007-04-22 at 00:25 -0400, Pete Kazmier wrote: > Pete Kazmier <pete-expires-20070...@kazmier.com> writes:
> > I'd love to see other Haskell implementations as well if anyone has a > > few moments to spare. Admittedly, it took me several hours to get my > > version working, but I'm a Haskell newbie. Unfortunately, I think it > > runs as slow as it took me to write it!
Hm - nobody suggested using ByteStrings yet? String is notoriously wasteful, and replacing it with ByteString normally gives a quite significant speedup.
Worse - and this is true for ByteStrings, too - String comparisons are O(n), which means lookups in Sets and Maps are expensive. A trie (i.e, a search tree where each internal node corresponds to a word prefix, and has one branch per letter in the alphabet) will give you lookup that scales with word size (and possibly alphabet size).
Instead of generating the (huge) list of misspelled words, you could calculate edit distance to each correctly spelled word? With a bound on allowable mistakes, this is a linear time operation using the standard dynamic programming algorithm.
Ketil Malde <ke...@ii.uib.no> writes: > On Sun, 2007-04-22 at 00:25 -0400, Pete Kazmier wrote: >> Pete Kazmier <pete-expires-20070...@kazmier.com> writes:
>> > I'd love to see other Haskell implementations as well if anyone has a >> > few moments to spare. Admittedly, it took me several hours to get my >> > version working, but I'm a Haskell newbie. Unfortunately, I think it >> > runs as slow as it took me to write it!
> Hm - nobody suggested using ByteStrings yet? String is notoriously > wasteful, and replacing it with ByteString normally gives a quite > significant speedup.
I actually have a ByteString version but it runs much slower. This part of the code is where all of the time is spent in the ByteString version:
type WordFreq = M.Map B.ByteString Int
train:: [B.ByteString] -> WordFreq train words = frequencyMap where frequencyMap = foldr incWordCount M.empty words incWordCount w m = M.insertWith (+) w 1 m
> Worse - and this is true for ByteStrings, too - String comparisons are > O(n), which means lookups in Sets and Maps are expensive. A trie (i.e, > a search tree where each internal node corresponds to a word prefix, and > has one branch per letter in the alphabet) will give you lookup that > scales with word size (and possibly alphabet size).
Right. My first version was just a direct translation of Norvig's code with an emphasis on trying to keep the complexity and size of code to a minimum.
> Instead of generating the (huge) list of misspelled words, you could > calculate edit distance to each correctly spelled word? With a bound on > allowable mistakes, this is a linear time operation using the standard > dynamic programming algorithm.
Could you provide additional information on this "standard dynamic programming algorithm?" I'm not familiar with dynamic programming.
I wrote an independent port of Norvig's spellchecker, because I figured it would make the basis of an interesting tutorial. For such a small program, it's been quite a challenge!
I started out using lazy ByteStrings, Data.Map, and Data.Set. As Albert observed, using Data.Set is poison for heap size and performance. The result of switching from sets to lists was a > 90% reduction in memory usage, and a big (but unmeasured) speedup.
After this switch, I found that spellchecking one word still took 4x as long in Haskell as Norvig's Python program. Since I was checking only one word in each case, essentially all of the runtime was taken up by building the word frequency map.
In my profile results, I find that simply converting words to lower case accounts for a whopping 40% of time and allocation (see the attachment for my definition of the train function).
I was interested in building a profile-enabled version of fps to see what might be going on inside the library, but was stymied by not knowing how to get GHC 6.6's base to coexist peacefully with fps (hiding base isn't very practical).
Bryan O'Sullivan <b...@serpentine.com> writes: > After this switch, I found that spellchecking one word still took 4x > as long in Haskell as Norvig's Python program. Since I was checking > only one word in each case, essentially all of the runtime was taken > up by building the word frequency map.
> train = foldl' updateMap M.empty . map lower . mkWords > where updateMap model word = M.insertWith' (+) word 1 model > mkWords = filter (not . B.null) . X.splitWith isNotAlpha > lower !s = X.map toLower s > isNotAlpha !c = c < 0x41 || (c > 0x5a && c < 0x61) || c > 0x7a > toLower !c | c >= 0x41 && c <= 0x5a = c + 0x20 > | otherwise = c
After reading Bryan's post, I switched my right fold to a strict left fold and almost tripled my original speed. Could someone provide some guidelines on when to use each? I thought foldr should be used when building some large data structure such as the map I build.
Bryan, out of curiosity, is a non bytestring version of your code faster?
Pete Kazmier wrote: > Bryan O'Sullivan <b...@serpentine.com> writes:
>> After this switch, I found that spellchecking one word still took 4x >> as long in Haskell as Norvig's Python program. Since I was checking >> only one word in each case, essentially all of the runtime was taken >> up by building the word frequency map.
>> train = foldl' updateMap M.empty . map lower . mkWords >> where updateMap model word = M.insertWith' (+) word 1 model >> mkWords = filter (not . B.null) . X.splitWith isNotAlpha >> lower !s = X.map toLower s >> isNotAlpha !c = c < 0x41 || (c > 0x5a && c < 0x61) || c > 0x7a >> toLower !c | c >= 0x41 && c <= 0x5a = c + 0x20 >> | otherwise = c
> After reading Bryan's post, I switched my right fold to a strict left > fold and almost tripled my original speed. Could someone provide some > guidelines on when to use each? I thought foldr should be used when > building some large data structure such as the map I build.
> Bryan, out of curiosity, is a non bytestring version of your code > faster?
Derek Elkins <derek.a.elk...@gmail.com> writes: >> After reading Bryan's post, I switched my right fold to a strict left >> fold and almost tripled my original speed. Could someone provide some >> guidelines on when to use each? I thought foldr should be used when >> building some large data structure such as the map I build. >> Bryan, out of curiosity, is a non bytestring version of your code >> faster?
I read the article and understand it, but I am having a hard time applying that specifically to my use of foldr. Here is how I was using foldr:
type WordFreq = M.Map B.ByteString Int
train:: [B.ByteString] -> WordFreq train words = frequencyMap where frequencyMap = foldr incWordCount M.empty words incWordCount w m = M.insertWith (+) w 1 m
So is 'incWordCount' strict in its second argument? I'm still not sure exactly what that means. According to the wiki page, if it is strict in the second argument, I should have used foldl' instead of foldr.
Pete Kazmier wrote: > Derek Elkins <derek.a.elk...@gmail.com> writes:
>>> After reading Bryan's post, I switched my right fold to a strict left >>> fold and almost tripled my original speed. Could someone provide some >>> guidelines on when to use each? I thought foldr should be used when >>> building some large data structure such as the map I build. >>> Bryan, out of curiosity, is a non bytestring version of your code >>> faster? >> http://www.haskell.org/hawiki/StackOverflow
> I read the article and understand it, but I am having a hard time > applying that specifically to my use of foldr. Here is how I was > using foldr:
> type WordFreq = M.Map B.ByteString Int
> train:: [B.ByteString] -> WordFreq > train words = frequencyMap > where > frequencyMap = foldr incWordCount M.empty words > incWordCount w m = M.insertWith (+) w 1 m
> So is 'incWordCount' strict in its second argument? I'm still not > sure exactly what that means. According to the wiki page, if it is > strict in the second argument, I should have used foldl' instead of > foldr.
insertWith will definitely need to examine 'm' to perform it's action, therefore it is strict. _______________________________________________ Haskell-Cafe mailing list Haskell-C...@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
On Sun, 2007-04-22 at 11:51 -0400, Pete Kazmier wrote: > type WordFreq = M.Map B.ByteString Int
> train:: [B.ByteString] -> WordFreq > train words = frequencyMap > where > frequencyMap = foldr incWordCount M.empty words > incWordCount w m = M.insertWith (+) w 1 m
Isn't this going to build unevaluate thunks for the counts? This is what frequency counting looks like in some of my old code:
-- IM is Data.IntMap index ss = foldl' myupdate IM.empty ss where myupdate m k = let x = 1 + IM.findWithDefault 0 k m in x `seq` IM.insert k x m
(edited a bit, but hopefully correct)
I.e. use foldl', and look up previous value, succ and seq it, and put it back.
Generally, I only seem to want strict Maps. It seems so easy to build a lazy one if you need it (data Lazy a = Lazy a), so I wonder if strict should be the default?
> > Instead of generating the (huge) list of misspelled words, you could > > calculate edit distance to each correctly spelled word? With a bound on > > allowable mistakes, this is a linear time operation using the standard > > dynamic programming algorithm.
> Could you provide additional information on this "standard dynamic > programming algorithm?" I'm not familiar with dynamic programming.
Sorry! Dynamic programming just means using memoizing on a recursive function to reduce its run time complexity (although I don't think I've seen that definition anywhere?). A simple example is finding the n'th Fibonacci number by calculating (and referring to) the whole list.
For edit distance, you can do something like:
data Edit a = Repl a a | Delete a | Insert a
align (x:xs) (y:ys) = best [ Repl x y : align xs ys, Insert y : align (x:xs) ys, Delete x : align xs (y:ys)]
Where 'best' is some fitness function, selecting the best alignment.
Applying this directly gives an exponential algorithm, but since there are only n*m (n = lenght (x:xs), m = length (y:ys) possible ways to pick the remainders in the recursive calls throughout the execution, you might as well store them all in an n*m matrix - which only takes O(nm) time.
This is probably the worst explanation of DP so far this century, but if you Google for it, you'll find tons of information. Wikipedia has an entry as well.
Anyway: since the number of edits in this case is bounded (to four, since transposition can be modelled by an insertion and a deletion), you know the optimal solution will never stray further than two off the diagonal of the DP matrix, so you only need to caclulate a diagonal band (linear space/time), not the full matrix (quadratic).
On Sun, Apr 22, 2007 at 10:10:44PM +0200, Ketil Malde wrote: > On Sun, 2007-04-22 at 11:51 -0400, Pete Kazmier wrote:
> > type WordFreq = M.Map B.ByteString Int
> > train:: [B.ByteString] -> WordFreq > > train words = frequencyMap > > where > > frequencyMap = foldr incWordCount M.empty words > > incWordCount w m = M.insertWith (+) w 1 m
> Isn't this going to build unevaluate thunks for the counts? This is > what frequency counting looks like in some of my old code:
> -- IM is Data.IntMap > index ss = foldl' myupdate IM.empty ss > where myupdate m k = let x = 1 + IM.findWithDefault 0 k m > in x `seq` IM.insert k x m
> (edited a bit, but hopefully correct)
> I.e. use foldl', and look up previous value, succ and seq it, and put it > back.
> Generally, I only seem to want strict Maps. It seems so easy to build a > lazy one if you need it (data Lazy a = Lazy a), so I wonder if strict > should be the default?
> > > Instead of generating the (huge) list of misspelled words, you could > > > calculate edit distance to each correctly spelled word? With a bound on > > > allowable mistakes, this is a linear time operation using the standard > > > dynamic programming algorithm.
> > Could you provide additional information on this "standard dynamic > > programming algorithm?" I'm not familiar with dynamic programming.
> Sorry! Dynamic programming just means using memoizing on a recursive > function to reduce its run time complexity (although I don't think I've > seen that definition anywhere?). A simple example is finding the n'th > Fibonacci number by calculating (and referring to) the whole list.
> For edit distance, you can do something like:
> data Edit a = Repl a a | Delete a | Insert a
> align (x:xs) (y:ys) = best [ Repl x y : align xs ys, > Insert y : align (x:xs) ys, > Delete x : align xs (y:ys)]
> Where 'best' is some fitness function, selecting the best alignment.
> Applying this directly gives an exponential algorithm, but since there > are only n*m (n = lenght (x:xs), m = length (y:ys) possible ways to pick > the remainders in the recursive calls throughout the execution, you > might as well store them all in an n*m matrix - which only takes O(nm) > time.
> This is probably the worst explanation of DP so far this century, but if > you Google for it, you'll find tons of information. Wikipedia has an > entry as well.
> Anyway: since the number of edits in this case is bounded (to four, > since transposition can be modelled by an insertion and a deletion), you > know the optimal solution will never stray further than two off the > diagonal of the DP matrix, so you only need to caclulate a diagonal band > (linear space/time), not the full matrix (quadratic).
You can almost always do DP with less space, and this is no exception: linear space for full Levenschtein, constant for the diagonal band.
On 4/22/07, Pete Kazmier <pete-expires-20070...@kazmier.com> wrote:
> > Worse - and this is true for ByteStrings, too - String comparisons are > > O(n), which means lookups in Sets and Maps are expensive. A trie (i.e, > > a search tree where each internal node corresponds to a word prefix, and > > has one branch per letter in the alphabet) will give you lookup that > > scales with word size (and possibly alphabet size).
> Right. My first version was just a direct translation of Norvig's > code with an emphasis on trying to keep the complexity and size of > code to a minimum.
AFAIK, Python's dictionaries are implemented using hashtables, so lookups with them are much faster than with M.Map String Int. So should a direct translation use a Data.HashTable (despite its unpure interface)? But probably a trie should be better than both (just gambling ;-).
On Sun, 2007-04-22 at 12:55 -0400, Pete Kazmier wrote: > After reading Bryan's post, I switched my right fold to a strict left > fold and almost tripled my original speed. Could someone provide some > guidelines on when to use each? I thought foldr should be used when > building some large data structure such as the map I build.
You are building a strict data structure, not a lazy one. You cannot use that map until every element has been inserted into it, there are no partial intermediate results you can use like say when you build a lazy list. Hence it's better to use a strict left fold.
Rule of thumb: use a lazy fold to build lazy structures, use a strict fold to build strict ones.
Bryan O'Sullivan wrote: > In my profile results, I find that simply converting words to lower case > accounts for a whopping 40% of time and allocation (see the attachment > for my definition of the train function).
(At Stefan's suggestion, I renamed the modules in fps to NewData.*, in order to get around the name clashes when I try to concurrently use fps and base.)
On Sun, 2007-04-22 at 13:15 -0700, Stefan O'Rear wrote: > You can almost always do DP with less space, and this is no exception: > linear space for full Levenschtein, constant for the diagonal band.
Right. Especially as we only want the final score and not the exact edits, we only need to keep one column of the matrix. If you want the actual edits, you must backtrack from the optimal score -- or, since this is Haskell, keep the edits around as lazy lists, which, I think, amounts to the same thing.
Even if we do want the exact edits we can get linear space with a divide and conquer approach by finding the position in the middle column with the best score (summing the alignment scores of the prefix of xs vs ys and the suffix of xs vs ys), and then doing a separate alignment in the two quadrants (and recurse, of course). This means an extra log n time factor, though.
This is all very theoretical for a problem where we align words of about ten characters, of course. :-)
> I try using WordSet = [String] (plus corresponding change in code) and > get great speedup, actually way more than 3x. There was also a memory > growth phenomenon using Set String, and replacement by [String] stops > that too, now it's constant space (constant = 20M). It is possible to > attribute part of the speedup to excellent rewrite rules in GHC > regarding lists; however, I cannot explain the memory growth when using > Set.
Now I see. The higher memory usage of the Set version is not growth or leak, just brutal. For each input word, a ton of derived words are generated, and an optimal one is chosen (by a criterion that's essentially a mapping from words to scores). There are two ways to do this.
You could put the derived words into a lazy list and then traverse it, and therefore they're generated, examined, and thrown away on demand. At most two words are ever going to be stored (the best one so far and the next candidate), maybe plus a couple of cons cells if the compiler does not optimize.
Or you could put all derived words into a set first (to avoid duplicates, but note that you still have to generate a ton, you only save by storing just half a ton), then traverse it. Storing a ton, or even just half a ton, of these words is going to clog up memory. The perceived growth is only because some input words cause fewer derived words and some other input words encountered later cause more derived words. So, plotting memory over time, once in a while we have more derived words than before, so the heap grows for that; they are still thrown away correctly in due course. If you input the same input word over and over gain, space usage is still constant. _______________________________________________ Haskell-Cafe mailing list Haskell-C...@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Pete Kazmier wrote: > train:: [B.ByteString] -> WordFreq > train words = frequencyMap > where > frequencyMap = foldr incWordCount M.empty words > incWordCount w m = M.insertWith (+) w 1 m
> So is 'incWordCount' strict in its second argument? I'm still not > sure exactly what that means.
Yes. incWordCount is strict in its second argument since
incWordCount x undefined == undefined
Of course you cannot see that from the definition of incWordCount alone, this depends on the behavior of M.insertWith.
> According to the wiki page, if it is > strict in the second argument, I should have used foldl' instead of > foldr.
Remember that the difference between foldr and foldl is not one between left and right; both have to recurse from the left. But foldr is normal recursion, while foldl is accumulator recursion. You obviously wanted an accumulator, and it should usually be strictly evaluated.
There is another bug of this sort in your code. Consider
> incWordCount w m = M.insertWith (+) w 1 m
There is no reason to evaluate the sum inside the map, instead an unevaluated thunk is put in there. Unfortunately, you need to take the long way of using M.lookup and M.insert to build a strict replacement for M.insertWith. (A strict variant of Data.Map would be useful here, unfortunately, there is none.)
-Udo -- Streitigkeiten dauerten nie lange, wenn nur eine Seite Unrecht hätte. -- de la Rochefoucauld