
[Haskell-cafe] Spelling checker exercise


Matthew Phillips

Jan 22, 2010, 1:51:54 AM
to haskel...@haskell.org
Hello all,

sorry to bring up an old chestnut, but I'm trying to improve my Haskell-fu by writing a small program, and chose Peter Norvig's spelling checker as my exercise material (http://norvig.com/spell-correct.html).

While I've gotten it working, and found it quite illuminating, I also found it to be very slow. And then I discovered that, of course, others have been here well before me ([1] and [2]). Those discussions were very interesting, but the code they refer to is mostly not available, so the most I've gotten out of it so far is that:

(a) I should be using strict left folds and strict Map insertions for the word frequency map (this shaved off about a second: ~5s -> ~4s for a single word on my 2.4GHz MacBook Pro, GHC 6.10.4)
(b) I should probably be using ByteStrings
(c) Using Sets for the edit permutations probably isn't worth it (although I found using plain lists made it about a second slower)

(b) is difficult because I've used matching patterns plus list comprehensions to generate the potential edits, and I really like how elegantly it pans out that way. Because ByteStrings are not lists, I can't see a way to keep the current structure and use them.

The code is at [3] (link to version at time of post). Profiling [4] shows:

$ ./spelling becuase +RTS -p
becuase -> because
$ cat spelling.prof
total time = 4.02 secs (201 ticks @ 20 ms)
total alloc = 1,544,257,792 bytes (excludes profiling overheads)

COST CENTRE MODULE %time %alloc

train Main 52.7 19.7
readFile Main 28.9 8.6
wordsBy Main 10.9 49.5
toLower Main 7.0 21.8
...

So it appears that "train" (building the freq map) and "readFile" in "nwords" are the places to hone. I will look at using Bloom Filters or Tries instead of Data.Map, but I wonder if readFile should be taking nearly 30% of the run time, even for a 6MB file?

Sorry to dump such a long post on the list -- I'll understand if no one can be bothered rehashing this. But, in summary I'd like to know:

(a) how could I use ByteStrings for this to speed up I/O and reduce memory usage without losing the nice readability?

(b) should readFile be so slow?

(c) any other tips

Possibly all my questions could be answered if someone has the code from the old posts.

Cheers,

Matthew.

[1]: http://haskell.markmail.org/search/?q=norvig%20spelling#query:norvig%20spelling+page:1+mid:jrvpyjpms5dumd3k+state:results

[2]: http://thread.gmane.org/gmane.comp.lang.haskell.cafe/21780

[3]: http://github.com/scramjet/spelling/blob/44fd336ef4f62d49c7087dbc1ffb9c009b78cec0/spelling.hs

[4]: http://github.com/scramjet/spelling/blob/f5146b24b77443c22975ff69fb91aa922cc1d4a3/spelling.prof

_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Daniel Fischer

Jan 22, 2010, 8:43:27 AM
to haskel...@haskell.org
On Friday 22 January 2010 07:51:27, Matthew Phillips wrote:
> Hello all,
>
> sorry to bring up an old chestnut, but I'm trying to improve my
> Haskell-fu by writing a small program, and chose Peter Norvig's spelling
> checker as my exercise material (http://norvig.com/spell-correct.html).
>
> While I've gotten it working, and found it quite illuminating, I also
> found it to be very slow. And then I discovered that, of course,
> others have been here well before me ([1] and [2]). Those discussions
> were very interesting, but the code they refer to is mostly not
> available, so the most I've gotten out of it so far is that:
>
> (a) I should be using strict left folds and strict Map insertions for
> the word frequency map (this shaved off about a second: ~5s -> ~4s for a
> single word on my 2.4GHz MacBook Pro, GHC 6.10.4) (b) I should probably
> be using ByteStrings

That does help, but the worst part is building the map. That takes a couple
of seconds in Python, too. Just building the map takes 1.95s for Python,
3.6s (including GC) with strict ByteStrings, 4.2s with lazy ByteStrings and
6s with plain Strings here.

So I'd go with strict ByteStrings, although that takes a little more memory
than lazy, but waay less than Strings.

> (c) Using Sets for the edit permutations probably isn't worth it
> (although I found using plain lists made it about a second slower)

Might make a difference once you need to take two edit steps on a not very
short word.

>
> (b) is difficult because I've used matching patterns plus list
> comprehensions to generate the potential edits, and I really like how
> elegantly it pans out that way. Because ByteStrings are not lists, I
> can't see a way to keep the current structure and use them.

Train with ByteStrings, then do the edits on Strings and pack for lookup.

>
> The code is at [3] (link to version at time of post). Profiling [4]
> shows:
>
> $ ./spelling becuase +RTS -p
> becuase -> because
> $ cat spelling.prof
> total time = 4.02 secs (201 ticks @ 20 ms)
> total alloc = 1,544,257,792 bytes (excludes profiling overheads)
>
> COST CENTRE MODULE %time %alloc
>
> train Main 52.7 19.7
> readFile Main 28.9 8.6
> wordsBy Main 10.9 49.5
> toLower Main 7.0 21.8
> ...
>
> So it appears that "train" (building the freq map) and "readFile" in
> "nwords" are the places to hone.

readFile does not appear in my profile.
If you insert an SCC for updateMap,

where updateMap model word = {-# SCC "updateMap" #-} insertWith' (+) word 1 model

, you'll see that the really bad citizen is updateMap (splitWords is rather
bad, too, together they take some 95% of the time in that profile).

But once you start needing two edits (try korrekt), correct and edits1
start to show up. That shows that Norvig's algorithm isn't really good.
With two edit steps, you create a _lot_ of strings you need to look up, far
more than there are in the map. That takes time. It'll be better to scan
the map for entries with an edit distance (< 3) if you have a good method
to check that
(http://old.nabble.com/haskell-in-online-contests-td26546989.html contains
pointers for that).
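
Schematically (a sketch only; withinTwo stands for whatever cheap distance test you pick, e.g. a lazily cut-off Levenshtein):

import qualified Data.ByteString.Char8 as B
import qualified Data.Map as Map

-- scan the known-word map with a cheap "is within two edits" test,
-- instead of generating and looking up every two-step edit
closeKnown :: (B.ByteString -> B.ByteString -> Bool)
           -> Map.Map B.ByteString Int -> B.ByteString -> [(B.ByteString, Int)]
closeKnown withinTwo wordCounts w =
  [(k, c) | (k, c) <- Map.assocs wordCounts, withinTwo w k]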

Another thing is

allWords = keysSet wordCounts

Ouch. For each correction, you construct that set anew. Just use member
from Data.Map instead of Data.Set.member and look up the words in the map.
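
In other words, roughly (a sketch; types as in the ByteString rewrite below):

import qualified Data.ByteString.Char8 as B
import qualified Data.Map as Map

-- membership test against the frequency map itself, no separate key set
known :: Map.Map B.ByteString Int -> [B.ByteString] -> [B.ByteString]
known wordCounts = filter (`Map.member` wordCounts)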

> I will look at using Bloom Filters or
> Tries instead of Data.Map, but I wonder if readFile should be taking
> nearly 30% of the run time, even for a 6MB file?

No way. But it doesn't seem to, from my GHC's point of view.

>
> Sorry to dump such a long post on the list -- I'll understand if no one
> can be bothered rehashing this. But, in summary I'd like to know:
>
> (a) how could I use ByteStrings for this to speed up I/O and reduce
> memory usage without losing the nice readability?

A small rewrite of your code follows. I would have designed it slightly differently if using ByteStrings from the beginning; the packing in known and known_edits2 isn't too beautiful.

----------------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import Prelude hiding (or)   -- the local 'or' below would otherwise clash with Prelude.or
import qualified Data.ByteString.Char8 as B
import Data.Char (toLower)
import Data.Map (Map, findWithDefault, insertWith', member)
import qualified Data.Map as Map (empty)
import Data.Set as Set (Set, fromList, toList, fold, null)
import Data.List (inits, tails, foldl')
import System.Environment (getArgs)

dataFile = "big.txt"
alphabet = "abcdefghijklmnopqrstuvwxyz"

-- lower-case the corpus and split it into alphabetic words
splitWords :: B.ByteString -> [B.ByteString]
splitWords = filter (not . B.null) . B.splitWith isNogud . B.map toLower

isNogud :: Char -> Bool
isNogud c = c < 'a' || 'z' < c

-- build the frequency map with a strict left fold and strict insertion
train :: [B.ByteString] -> Map B.ByteString Int
train = foldl' updateMap Map.empty
  where
    updateMap model word = insertWith' (+) word 1 model

nwords :: IO (Map B.ByteString Int)
nwords = return . train . splitWords =<< B.readFile dataFile

-- all single-edit variations of a word
edits1 :: String -> [String]
edits1 s = toList . fromList $ deletes ++ transposes ++ replaces ++ inserts
  where
    deletes    = [a ++ bs | (a, _:bs) <- splits]
    transposes = [a ++ (b2:b1:bs) | (a, b1:b2:bs) <- splits]
    replaces   = [a ++ (c:bs) | (a, _:bs) <- splits, c <- alphabet]
    inserts    = [a ++ (c:b) | (a, b) <- splits, c <- alphabet]
    splits     = zip (inits s) (tails s)

correct :: Map B.ByteString Int -> String -> String
correct wordCounts word = B.unpack . fst $ fold maxCount (B.pack "?", 0) candidates
  where
    candidates :: Set B.ByteString
    candidates =
      known [word] `or` ((known $ edits1 word) `or` known_edits2 word)

    -- edits are generated on Strings and packed for the lookup
    known_edits2 :: String -> Set B.ByteString
    known_edits2 w =
      fromList [w3 | w1 <- edits1 w, w2 <- edits1 w1
                   , let w3 = B.pack w2, w3 `member` wordCounts]

    known :: [String] -> Set B.ByteString
    known ws = fromList [w | w <- map B.pack ws, w `member` wordCounts]

    maxCount :: B.ByteString -> (B.ByteString, Int) -> (B.ByteString, Int)
    maxCount word current@(_, currentMax)
      | count > currentMax = (word, count)
      | otherwise          = current
      where
        count = findWithDefault 1 word wordCounts

-- first non-empty of the two sets
or :: Set B.ByteString -> Set B.ByteString -> Set B.ByteString
or a b | Set.null a = b
       | otherwise  = a

main :: IO ()
main = do
    args <- getArgs
    wordCounts <- nwords
    mapM_ (printCorrect wordCounts) args
  where
    printCorrect :: Map B.ByteString Int -> String -> IO ()
    printCorrect wordCounts word =
      putStrLn $ word ++ " -> " ++ correct wordCounts word
----------------------------------------------------------------------

>
> (b) should readFile be so slow?
>
> (c) any other tips

Choose a better algorithm for the two-edit case.

>
> Possibly all my questions could be answered if someone has the code from
> the old posts.
>
> Cheers,
>
> Matthew.


Matthew Phillips

Jan 24, 2010, 12:21:17 AM
to haskel...@haskell.org
Thanks very much Daniel for giving my (amateurish!) exercise such an in-depth look-over. Comments inline below.

On 23/01/2010, at 12:11 AM, Daniel Fischer wrote:

> On Friday 22 January 2010 07:51:27, Matthew Phillips wrote:
>> Hello all,
>>
>> sorry to bring up an old chestnut, but I'm trying to improve my
>> Haskell-fu by writing a small program, and chose Peter Norvig's spelling
>> checker as my exercise material (http://norvig.com/spell-correct.html).
>>
>> While I've gotten it working, and found it quite illuminating, I also
>> found it to be very slow. And then I discovered that, of course,
>> others have been here well before me ([1] and [2]). Those discussions
>> were very interesting, but the code they refer to is mostly not
>> available, so the most I've gotten out of it so far is that:
>>
>> (a) I should be using strict left folds and strict Map insertions for
>> the word frequency map (this shaved off about a second: ~5s -> ~4s for a
>> single word on my 2.4GHz MacBook Pro, GHC 6.10.4) (b) I should probably
>> be using ByteStrings
>
> That does help, but the worst part is building the map. That takes a couple
> of seconds in Python, too. Just building the map takes 1.95s for Python,
> 3.6s (including GC) with strict ByteStrings, 4.2s with lazy ByteStrings and
> 6s with plain Strings here.

I get the Python version running in about 1s, compared to 5.6s for the Haskell:

$ time python spelling.py
because

real 0m1.071s
user 0m0.821s
sys 0m0.139s

$ time ./spelling becuase
becuase -> because

real 0m5.589s
user 0m4.554s
sys 0m0.307s

And, strangely, the rewrite you provided (I called it "spelling_df") runs a fair bit slower:

$ time ./spelling_df becuase
becuase -> because

real 0m8.087s
user 0m6.966s
sys 0m0.193s

$ time ./spelling korrekt
korrekt -> correct

real 0m5.970s
user 0m4.885s
sys 0m0.300s

$ time ./spelling_df korrekt
korrekt -> correct

real 0m8.616s
user 0m7.538s
sys 0m0.187s

>> The code is at [3] (link to version at time of post). Profiling [4]

>> shows:
>>
>> $ ./spelling becuase +RTS -p
>> becuase -> because
>> $ cat spelling.prof
>> total time = 4.02 secs (201 ticks @ 20 ms)
>> total alloc = 1,544,257,792 bytes (excludes profiling overheads)
>>
>> COST CENTRE MODULE %time %alloc
>>
>> train Main 52.7 19.7
>> readFile Main 28.9 8.6
>> wordsBy Main 10.9 49.5
>> toLower Main 7.0 21.8
>> ...
>>
>> So it appears that "train" (building the freq map) and "readFile" in
>> "nwords" are the places to hone.
>
> readFile does not appear in my profile.

Apologies, I should have said that I'd inserted some SCCs to try to tease out the cost of readFile (i.e. {-# SCC "readFile" #-}).

> If you insert an SCC for updateMap,
>
> where updateMap model word = {-# SCC "updateMap" #-} insertWith' (+) word 1 model
>
> , you'll see that the really bad citizen is updateMap (splitWords is rather
> bad, too, together they take some 95% of the time in that profile).

Maybe I'm doing this wrong, but I see "splitWords" in spelling_df taking 80% of runtime. Adding SCC's like this:

splitWords = {-# SCC "filter" #-} filter (not . B.null) . {-# SCC "splitWith" #-} B.splitWith isNogud . {-# SCC "toLower" #-} B.map toLower

gives me:

splitWords Main 216 1 0.0 0.0 78.6 91.8
filter Main 217 1 1.9 3.0 78.6 91.8
splitWith Main 218 1 28.4 36.8 76.7 88.8
isNogud Main 221 6488666 4.2 4.1 4.2 4.1
toLower Main 219 1 44.2 47.9 44.2 47.9

i.e. it seems that "splitWith" and "toLower" (!) are the culprits. Why, I have no idea.

Am I reading this wrong?

> But once you start needing two edits (try korrekt), correct and edits1
> start to show up. That shows that Norvig's algorithm isn't really good.
> With two edit steps, you create a _lot_ of strings you need to look up, far
> more than there are in the map. That takes time. It'll be better to scan
> the map for entries with an edit distance (< 3) if you have a good method
> to check that
> (http://old.nabble.com/haskell-in-online-contests-td26546989.html contains
> pointers for that).

Will have a look at that: it looks like it'll be very informative.

> Another thing is
>
> allWords = keysSet wordCounts
>
> Ouch. For each correction, you construct that set anew. Just use member
> from Data.Map instead of Data.Set.member and look up the words in the map.

Whoops! Just to be clear though: Haskell will memoise the result of "allWords" for a given invocation of "correct"? So this would only make a large difference for multiple corrections? (which I wasn't worrying about for the moment). The change seems to wipe off about 0.2s on average.

>> I will look at using Bloom Filters or
>> Tries instead of Data.Map, but I wonder if readFile should be taking
>> nearly 30% of the run time, even for a 6MB file?
>
> No way. But it doesn't seem to, from my GHC's point of view.

Just to be sure I wasn't using the SCC incorrectly, I split out the readFile call into "myReadFile". The prof line comes out as:

myReadFile Main 210 1 35.8 8.6 35.8 8.6

i.e. 35.8% of runtime.

>> Sorry to dump such a long post on the list -- I'll understand if no one
>> can be bothered rehashing this. But, in summary I'd like to know:
>>
>> (a) how could I use ByteStrings for this to speed up I/O and reduce
>> memory usage without losing the nice readability?

And thanks for the other small tips (e.g. findWithDefault), and I didn't know you could use let the way you did either.

Daniel Fischer

Jan 24, 2010, 6:54:20 AM
to Matthew Phillips, haskel...@haskell.org
On Sunday 24 January 2010 06:19:58, Matthew Phillips wrote:
> Thanks very much Daniel for giving my (amateurish!) exercise such an
> in-depth look-over. Comments inline below.
>
> On 23/01/2010, at 12:11 AM, Daniel Fischer wrote:
> >
> > That does help, but the worst part is building the map. That takes a
> > couple of seconds in Python, too. Just building the map takes 1.95s
> > for Python, 3.6s (including GC) with strict ByteStrings, 4.2s with
> > lazy ByteStrings and 6s with plain Strings here.

Correction: I had taken the total time of a no-argument run as the time it took Python to build the map; adding timing around the map-building shows it takes ~1.45 seconds. I'm clueless as to what Python needs the remaining half second for.

>
> I get the Python version running in about 1s, compared to 5.6s for the
> Haskell:

$ python --version
Python 2.6

With psyco.full():
$ time python ./norvig.py becuase
Trained in 1.16967606544 seconds
/norvig.py
because
1.52user 0.08system 0:01.60elapsed 100%CPU

without psyco:
$ time python ./norvig.py becuase
Trained in 1.45706319809 seconds
/norvig.py
because
1.95user 0.08system 0:02.03elapsed 100%CPU
$ time python ./norvig.py
Trained in 1.46250891685 seconds
/norvig.py
1.95user 0.09system 0:02.04elapsed 100%CPU

I think I know what happened here:

$ ghc -fforce-recomp --make matthew -o matthew0
[1 of 1] Compiling Main ( matthew.hs, matthew.o )
Linking matthew0 ...
$ ghc -O2 -fforce-recomp --make matthew -o matthew2
[1 of 1] Compiling Main ( matthew.hs, matthew.o )
Linking matthew2 ...
$ time ./matthew0 becuase
becuase -> because
7.07user 0.21system 0:07.28elapsed 99%CPU
$ time ./matthew2 becuase
becuase -> because
6.01user 0.19system 0:06.21elapsed 100%CPU

$ ghc -fforce-recomp --make spellingBS -o spelling0
[1 of 1] Compiling Main ( spellingBS.hs, spellingBS.o )
Linking spelling0 ...
$ ghc -O2 -fforce-recomp --make spellingBS -o spelling2
[1 of 1] Compiling Main ( spellingBS.hs, spellingBS.o )
Linking spelling2 ...
$ time ./spelling0 becuase
becuase -> because
9.78user 0.09system 0:09.87elapsed 100%CPU
$ time ./spelling2 becuase
becuase -> because
3.57user 0.03system 0:03.60elapsed 100%CPU

I habitually compile all code with -O2, unless I have a specific reason not
to. I tend to forget that some compile without optimisations.
For some kinds of code that makes hardly any difference, for others it
makes a big difference.

*** Don't even think of using ByteStrings without optimising. ***

> >
> > readFile does not appear in my profile.
>
> Apologies, I should have said that I'd inserted some SCCs to try to
> tease out the cost of readFile (i.e. {-# SCC "readFile" #-}).
>
> > If you insert an SCC for updateMap,
> >
> > where updateMap model word = {-# SCC "updateMap" #-} insertWith' (+) word 1 model
> >
> > , you'll see that the really bad citizen is updateMap (splitWords is
> > rather bad, too, together they take some 95% of the time in that
> > profile).
>
> Maybe I'm doing this wrong, but I see "splitWords" in spelling_df taking
> 80% of runtime. Adding SCC's like this:
>
> splitWords = {-# SCC "filter" #-} filter (not . B.null) . {-# SCC "splitWith" #-} B.splitWith isNogud . {-# SCC "toLower" #-} B.map toLower
>
> gives me:
>
> splitWords Main 216 1 0.0 0.0 78.6 91.8
> filter Main 217 1 1.9 3.0 78.6 91.8
> splitWith Main 218 1 28.4 36.8 76.7 88.8
> isNogud Main 221 6488666 4.2 4.1 4.2 4.1
> toLower Main 219 1 44.2 47.9 44.2 47.9
>
> i.e. it seems that "splitWith" and "toLower" (!) are the culprits. Why,
> I have no idea.
>
> Am I reading this wrong?
>

No, you're just compiling it wrong :)
If I profile without optimising, I get
Sun Jan 24 11:37 2010 Time and Allocation Profiling Report (Final)

pspellingBS0 +RTS -P -RTS becuase

total time = 16.46 secs (823 ticks @ 20 ms)
total alloc = 4,088,410,184 bytes (excludes profiling overheads)

COST CENTRE MODULE %time %alloc ticks bytes

toLower Main 36.3 45.9 299 468806205
splitWith Main 30.6 33.9 252 346657565
train Main 25.8 13.5 212 138466403
isNogud Main 4.3 3.8 35 38931996
filter Main 2.1 2.7 17 27466769

which is compatible with your results.
With -O2:
Sun Jan 24 11:33 2010 Time and Allocation Profiling Report (Final)

pspellingBS +RTS -P -RTS becuase

total time = 5.66 secs (283 ticks @ 20 ms)
total alloc = 708,686,372 bytes (excludes profiling overheads)

COST CENTRE MODULE %time %alloc ticks bytes

updateMap Main 68.6 76.9 194 136314135
toLower Main 16.6 0.9 47 1622182
splitWith Main 6.7 14.3 19 25366835
train Main 3.5 2.5 10 4362826
filter Main 2.1 4.4 6 7736998
splitWords Main 1.8 0.0 5 0

which gives a completely different picture.

Still, toLower takes a significant part of the time; we can drastically reduce that by exploiting the fact that we don't need to handle the whole Unicode complexity:

Sun Jan 24 11:53 2010 Time and Allocation Profiling Report (Final)

pspellingBSW +RTS -P -RTS becuase

total time = 4.72 secs (236 ticks @ 20 ms)
total alloc = 708,686,372 bytes (excludes profiling overheads)

COST CENTRE MODULE %time %alloc ticks bytes

updateMap Main 76.3 76.9 180 136314135
splitWith Main 9.7 14.3 23 25366835
filter Main 5.1 4.4 12 7736998
train Main 3.0 2.5 7 4362826
splitWords Main 2.1 0.0 5 0
isNogud Main 2.1 0.0 5 0
mkLow Main 1.7 0.9 4 1622182

And, without profiling:
$ time ./spellingBSW becuase
becuase -> because
2.84user 0.03system 0:02.88elapsed 99%CPU

Finally, building the set of two-step edits takes longer than the
additional lookups:

$ time ./spellingBSW becuase
becuase -> because
2.84user 0.03system 0:02.88elapsed 99%CPU
$ time ./spellingBSW korrekt
korrekt -> correct
3.50user 0.02system 0:03.52elapsed 100%CPU

vs.

$ time ./spellingBSWL becuase
becuase -> because
2.79user 0.04system 0:02.83elapsed 100%CPU
$ time ./spellingBSWL3 korrekt
korrekt -> correct
3.20user 0.02system 0:03.23elapsed 99%CPU

Which is reached with


----------------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import Prelude hiding (or)   -- the local 'or' below would otherwise clash with Prelude.or
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString as BS
import Data.Bits
import Data.Word (Word8)
import Data.Char (toLower)
import Data.Map (Map, findWithDefault, insertWith', keysSet, empty, member)
import qualified Data.Map as Map (lookup, empty, size)
import Data.Set (toList, fromList)
import Data.List (inits, tails, foldl')
import System.Environment (getArgs)

dataFile = "big.txt"
alphabet = "abcdefghijklmnopqrstuvwxyz"

splitWords :: B.ByteString -> [B.ByteString]
splitWords = {-# SCC "filter" #-} filter (not . BS.null) . {-# SCC "splitWith" #-} BS.splitWith isNogud . {-# SCC "mkLow" #-} BS.map mkLow

-- quick and dirty toLower for ASCII letters, working on the raw bytes
mkLow :: Word8 -> Word8
mkLow x = x .|. 32

isNogud :: Word8 -> Bool
isNogud c = c < 97 || 122 < c

train :: [B.ByteString] -> Map B.ByteString Int
train = foldl' updateMap Map.empty
  where updateMap model word = {-# SCC "updateMap" #-} insertWith' (+) word 1 model

-- the ($!) forces training to happen right away, not at the first lookup
nwords :: IO (Map B.ByteString Int)
nwords = (return $!) . train . splitWords =<< B.readFile dataFile

edits1 :: String -> [String]
edits1 s = deletes ++ transposes ++ replaces ++ inserts
  where
    deletes    = [a ++ bs | (a, _:bs) <- splits]
    transposes = [a ++ (b2:b1:bs) | (a, b1:b2:bs) <- splits]
    replaces   = [a ++ (c:bs) | (a, _:bs) <- splits, c <- alphabet]
    inserts    = [a ++ (c:b) | (a, b) <- splits, c <- alphabet]
    splits     = zip (inits s) (tails s)

correct :: Map B.ByteString Int -> String -> String
correct wordCounts word = B.unpack $ maxCount (B.pack "?") 0 candidates
  where
    candidates :: [B.ByteString]
    candidates =
      known [word] `or` ((known e1) `or` known_edits2)

    e1 :: [String]
    e1 = toList . fromList $ edits1 word

    known_edits2 :: [B.ByteString]
    known_edits2 =
      [w3 | w1 <- e1, w2 <- edits1 w1, let w3 = B.pack w2, w3 `member` wordCounts]

    known :: [String] -> [B.ByteString]
    known ws = {-# SCC "known" #-} [w | w <- map B.pack ws, w `member` wordCounts]

    -- tail-recursive search for the most frequent candidate
    maxCount :: B.ByteString -> Int -> [B.ByteString] -> B.ByteString
    maxCount best cmax (word:more)
      | cmax < count = maxCount word count more
      | otherwise    = maxCount best cmax more
      where
        count = findWithDefault 1 word wordCounts
    maxCount best _ _ = best

-- first non-empty of the two lists
or :: [B.ByteString] -> [B.ByteString] -> [B.ByteString]
or a b | null a    = b
       | otherwise = a

main :: IO ()
main = do
    args <- getArgs
    wordCounts <- nwords
    mapM_ (printCorrect wordCounts) args
  where
    printCorrect :: Map B.ByteString Int -> String -> IO ()
    printCorrect wordCounts word =
      putStrLn $ word ++ " -> " ++ correct wordCounts word
----------------------------------------------------------------------

> > But once you start needing two edits (try korrekt), correct and edits1


> > start to show up. That shows that Norvig's algorithm isn't really
> > good. With two edit steps, you create a _lot_ of strings you need to
> > look up, far more than there are in the map. That takes time. It'll be
> > better to scan the map for entries with an edit distance (< 3) if you
> > have a good method to check that

Indeed:
$ time ./nLDBSWSpelling becuase
becuase -> because
2.84user 0.02system 0:02.86elapsed 100%CPU
$ time ./nLDBSWSpelling korrekt
korrekt -> correct
2.83user 0.05system 0:02.88elapsed 100%CPU

> > (http://old.nabble.com/haskell-in-online-contests-td26546989.html
> > contains pointers for that).
>
> Will have a look at that: it looks like it'll be very informative.
>
> > Another thing is
> >
> > allWords = keysSet wordCounts
> >
> > Ouch. For each correction, you construct that set anew. Just use
> > member from Data.Map instead of Data.Set.member and look up the words
> > in the map.
>
> Whoops! Just to be clear though: Haskell will memoise the result of
> "allWords" for a given invocation of "correct"?

Yes. But not across different invocations.

> So this would only make a large difference for multiple corrections?

Right. But that's the interesting use case, isn't it?

> (which I wasn't worrying about for the moment).
> The change seems to wipe off about 0.2s on average.
>

Which is pretty bad for multiple corrections.

> >> I will look at using Bloom Filters or
> >> Tries instead of Data.Map, but I wonder if readFile should be taking
> >> nearly 30% of the run time, even for a 6MB file?
> >
> > No way. But it doesn't seem to, from my GHC's point of view.
>
> Just to be sure I wasn't using the SCC incorrectly, I split out the
> readFile call into "myReadFile". The prof line comes out as:
>
> myReadFile Main 210 1 35.8 8.6 35.8 8.6
>
> i.e. 35.8% of runtime.
>

Can I see the exact code which gives that profile?
Somehow, things which shouldn't be must be attributed to readFile.

Matthew Phillips

Jan 24, 2010, 11:35:22 PM
to Daniel Fischer, haskel...@haskell.org
On 24/01/2010, at 10:22 PM, Daniel Fischer wrote:

<...>

> I think I know what happened here:
>
> $ ghc -fforce-recomp --make matthew -o matthew0

<...>

> I habitually compile all code with -O2, unless I have a specific reason not
> to. I tend to forget that some compile without optimisations.
> For some kinds of code that makes hardly any difference, for others it
> makes a big difference.

I used the flags "-funbox-strict-fields -fvia-C -optc-O2", but *not* -O2. Whoops! I could kick myself: I blindly assumed that -optc-O2 would turn on optimisation, but of course that's just the flag for GCC.

$ time ./spelling becuase
becuase -> because

real 0m4.467s
user 0m3.865s
sys 0m0.280s

$ time ./spelling_df becuase
becuase -> because

real 0m2.422s
user 0m2.198s
sys 0m0.081s

Your previous version is close to twice as fast, and now only 2.5 times slower than Python.

<snipped new version of code with toLower removed>

With your suggested changes, your latest version on my machine:

$ time ./spelling_df becuase
becuase -> because

real 0m1.731s
user 0m1.572s
sys 0m0.056s

Now, we're getting close: 4.7s -> 2.3s -> 1.7s.

>>> But once you start needing two edits (try korrekt), correct and edits1
>>> start to show up. That shows that Norvig's algorithm isn't really
>>> good. With two edit steps, you create a _lot_ of strings you need to
>>> look up, far more than there are in the map. That takes time. It'll be
>>> better to scan the map for entries with an edit distance (< 3) if you
>>> have a good method to check that
>
> Indeed:
> $ time ./nLDBSWSpelling becuase
> becuase -> because
> 2.84user 0.02system 0:02.86elapsed 100%CPU
> $ time ./nLDBSWSpelling korrekt
> korrekt -> correct
> 2.83user 0.05system 0:02.88elapsed 100%CPU

Not sure if I see what you're saying here: do you mean to point out the 2.86 vs 2.88 elapsed?

>>> Another thing is
>>>
>>> allWords = keysSet wordCounts
>>>
>>> Ouch. For each correction, you construct that set anew. Just use
>>> member from Data.Map instead of Data.Set.member and look up the words
>>> in the map.
>>
>> Whoops! Just to be clear though: Haskell will memoise the result of
>> "allWords" for a given invocation of "correct"?
>
> Yes. But not across different invocations.
>
>> So this would only make a large difference for multiple corrections?
>
> Right. But that's the interesting use case, isn't it?

It will be when I get the rest of it working, yes :)

>>>> I will look at using Bloom Filters or
>>>> Tries instead of Data.Map, but I wonder if readFile should be taking
>>>> nearly 30% of the run time, even for a 6MB file?
>>>
>>> No way. But it doesn't seem to, from my GHC's point of view.
>>
>> Just to be sure I wasn't using the SCC incorrectly, I split out the
>> readFile call into "myReadFile". The prof line comes out as:
>>
>> myReadFile Main 210 1 35.8 8.6 35.8 8.6
>>
>> i.e. 35.8% of runtime.
>>
>
> Can I see the exact code which gives that profile?
> Somehow, things which shouldn't be must be attributed to readFile.

The version at this link has myReadFile split out.

http://github.com/scramjet/spelling/blob/31071edb2166b2bc4d350358900d975228fd43b9/spelling.hs

Doing the same to your version has the same result:

myReadFile Main 210 1 45.7 9.6 45.7 9.6

It does seem that the profiler is wrong or misleading somehow.

Two other quick questions:

(1) you added parentheses to "candidates":

candidates = known [word] `or` ((known e1) `or` known_edits2)

The "or"'s should short circuit so that if "known [word]" is non-empty, the others shouldn't be evaluated. I read the above as forcing evaluation of the second "or" first: am I wrong?

(2) you eliminated the "fold" in "correct" in favour of a tail-recursive search in "maxCount": was this for style or performance reasons (or both :)?

Cheers,

Matthew.

Daniel Fischer

Jan 25, 2010, 2:22:13 AM
to Matthew Phillips, haskel...@haskell.org
On Monday 25 January 2010 05:34:50, Matthew Phillips wrote:
> On 24/01/2010, at 10:22 PM, Daniel Fischer wrote:
>
> <...>
>
> > I think I know what happened here:
> >
> > $ ghc -fforce-recomp --make matthew -o matthew0
>
> <...>
>
> > I habitually compile all code with -O2, unless I have a specific
> > reason not to. I tend to forget that some compile without
> > optimisations. For some kinds of code that makes hardly any
> > difference, for others it makes a big difference.
>
> I used the flags "-funbox-strict-fields -fvia-C -optc-O2", but *not*
> -O2. Whoops! I could kick myself: I blindly assumed that -optc-O2 would
> turn on optimisation, but of course that's just the flag for GCC.

By the way, compiling via C nowadays hardly ever produces faster code than
the native code generator (most times I tried since 6.10, if there was a
difference, native was faster).

Well, above the code, I had the times for Norvig's algorithm (creating all
two-step edits and checking which are known):

> > And, without profiling:
> > $ time ./spellingBSW becuase
> > becuase -> because


> > 2.84user 0.03system 0:02.88elapsed 99%CPU
> >
> > Finally, building the set of two-step edits takes longer than the
> > additional lookups:
> >

> > $ time ./spellingBSW becuase
> > becuase -> because


> > 2.84user 0.03system 0:02.88elapsed 99%CPU
> > $ time ./spellingBSW korrekt
> > korrekt -> correct
> > 3.50user 0.02system 0:03.52elapsed 100%CPU
> >
> > vs.
> >
> > $ time ./spellingBSWL becuase
> > becuase -> because
> > 2.79user 0.04system 0:02.83elapsed 100%CPU
> > $ time ./spellingBSWL3 korrekt
> > korrekt -> correct
> > 3.20user 0.02system 0:03.23elapsed 99%CPU

For "becuase", which is a one-step edit, all take the same time (within
normal fluctuation), 2.8x seconds.
For the two-step edit "korrekt", the version building Sets takes 3.5
seconds, the version which doesn't build a Set of two-step edits takes 3.2
seconds, *and the version scanning the Map of known words for entries with
a Levenshtein distance [modified to account for transpositions] less than
3, takes 2.8y seconds, practically the same time as for the one-step edit*.

It becomes more obvious, perhaps, if we test a couple of two-steppers:

Lazy Levenshtein:

$ time ./nLDBSWSpelling becrase korrekt halmos territoir gutzenperg
becrase -> because
korrekt -> correct
halmos -> holmes
territoir -> territory
gutzenperg -> gutenberg
2.94user 0.03system 0:02.97elapsed 100%CPU

just something like 0.1 - 0.15 seconds more than for "becuase", which makes
0.02 - 0.04 seconds per two-edit correction.

Sets:

$ time ./spellingBSW becrase korrekt halmos territoir gutzenperg
becrase -> because
korrekt -> correct
halmos -> holmes
territoir -> territory
gutzenperg -> gutenberg
7.48user 0.03system 0:07.55elapsed 99%CPU

Good grief! That takes almost a second per two-edit correction.

List:

$ time ./spellingBSWL3 becrase korrekt halmos territoir gutzenperg
becrase -> because
korrekt -> correct
halmos -> holmes
territoir -> territory
gutzenperg -> gutenberg
5.54user 0.04system 0:05.58elapsed 99%CPU

Better, but still bad, roughly half a second per correction.

Python without psyco:

$ time python ./norvig.py becrase korrekt halmos territoir gutzenperg
Trained in 1.46993017197 seconds
because
correct
holmes
territory
gutenberg
3.00user 0.08system 0:03.09elapsed 99%CPU


$ time python ./norvig.py becuase

Trained in 1.49027395248 seconds
because
1.46user 0.08system 0:01.55elapsed 100%CPU

about 0.3 seconds per correction

and with:

$ time python ./norvig.py becrase korrekt halmos territoir gutzenperg
Trained in 1.22417902946 seconds
because
correct
holmes
territory
gutenberg
2.12user 0.09system 0:02.21elapsed 100%CPU
$ time python ./norvig.py becuase
Trained in 1.17486715317 seconds
because
1.14user 0.08system 0:01.23elapsed 99%CPU

about 0.2 seconds per correction.

(Just for the record, I found out what Python did in the half second I
couldn't account for: it looked for matches for "./norvig.py", I forgot
that Python passes the name of the script in argv[0] - d'oh)

Strange. Doing that here has no such effect. The only line in the profile
(ByteString) where myReadFile occurs is

myReadFile Main 260 1 0.0 0.9 0.0 0.9 0 6507348

(snipped whitespace), entered once, allocated 6.5 million bytes, took not
measurable time.

Both, compiled via C and with the native code generator.

With String IO, it costs 4.6% time and 8.1% allocation.

>
> Two other quick questions:
>
> (1) you added parentheses to "candidates":
>
> candidates = known [word] `or` ((known e1) `or` known_edits2)
>
> The "or"'s should short circuit so that if "known [word]" is non-empty,
> the others shouldn't be evaluated. I read the above as forcing
> evaluation of the second "or" first: am I wrong?

Au contraire,
"Any operator lacking a fixity declaration is assumed to be infixl 9"
says the report, so without parentheses, it's parsed as

(known [word] `or` known e1) `or` known_edits2

and both 'or's must be evaluated even if word is known. Since 'or' is lazy
in its second argument, if we associate it

known [word] `or` (known e1 `or` known_edits2)

, in case of a known word, we needn't look at the parenthesis at all (it
will probably not make a measurable difference in running time unless you
check a couple of million known words, but still, it feels lazier). For a
local definition, I thought an explicit fixity declaration was overkill.
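
It would just be, e.g. (a sketch, mirroring the fixity of (||)):

infixr 2 `or`   -- right-associative: a `or` b `or` c parses as a `or` (b `or` c)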

>
> (2) you eliminated the "fold" in "correct" in favour of a tail-recursive
> search in "maxCount": was this for style or performance reasons (or both
> :)?

Performance, kind of. Since the lists we fold over are in fact short, it
doesn't really matter, but if they were long, there'd be the risk of
unnecessarily allocating lots of pairs. I'm rather confident that with -O2,
GHC will eliminate the pairs anyway, but without looking at the core, I'm
not entirely sure.
However, in fact it was an unthought-through rewrite because I just had
someone with a stack overflow in spite of foldl' due to the laziness of the
data constructors[*]. So I made sure that that couldn't happen, without
looking at the folded function to see if that already prevents it. And in
fact, it does, so using a foldl' if you use lists instead of Sets is fine,
with respect to style, even preferable.
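
The list-based foldl' variant would look roughly like this (a sketch; bestCandidate is just an illustrative name, wordCounts the frequency map from the code above):

import qualified Data.ByteString.Char8 as B
import Data.List (foldl')
import Data.Map (Map, findWithDefault)

bestCandidate :: Map B.ByteString Int -> [B.ByteString] -> B.ByteString
bestCandidate wordCounts = fst . foldl' step (B.pack "?", 0)
  where
    -- the comparison in the guard forces both the count and the running
    -- maximum at each step, so no thunks pile up in the accumulator
    step cur@(_, cmax) w
      | count > cmax = (w, count)
      | otherwise    = cur
      where
        count = findWithDefault 1 w wordCounts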

[*] classic example: why will

average xs = sum / len
  where
    (sum,len) = foldl' accum (0,0) xs
    accum (sm,ln) x = (sm+x,ln+1)

cause a stack overflow for long lists?

>
> Cheers,
>
> Matthew.

Eduard Sergeev

Jan 26, 2010, 8:11:25 AM
to haskel...@haskell.org

Matthew Phillips-5 wrote:
> I also found it to be very slow.

My variant: http://a-ejeli-tak.livejournal.com/1326.html (Spellchecker in
Haskell).
The String version runs in 2.5 sec, ByteString in 1.2 sec (just for one word,
i.e. essentially just to build the tree). 8 sec to check an input of 400 words
(copied from Norvig's example). I think laziness helps here to avoid
unnecessary checks (once the first match is found). Haven't tried it on larger
data sets, nor tried to optimize it. Cheated on dictionary parsing though...

--
View this message in context: http://old.nabble.com/Spelling-checker-exercise-tp27269320p27322382.html
Sent from the Haskell - Haskell-Cafe mailing list archive at Nabble.com.

Daniel Fischer

Jan 26, 2010, 10:15:54 AM
to haskel...@haskell.org
On Tuesday 26 January 2010 14:11:06, Eduard Sergeev wrote:
> Matthew Phillips-5 wrote:
> > I also found it to be very slow.
>
> My variant: http://a-ejeli-tak.livejournal.com/1326.html Spellchecker
> in Haskell
> String version runs in 2.5 sec, ByteString in 1.2 sec (just for one word
> e.g. just to build the tree). 8 sec to check input of 400 words (copied
> from Norvig's example).

Slower here: the time for building the set is approximately equal to that of
the frequency map [either String or ByteString], lookup is a little slower if
one edit is needed, much faster if two are needed (of course).
But the lazy Levenshtein distance is much faster again, for the 'tests2'
data from Norvig's http://norvig.com/spell.py (400 words),

$ xargs -a tdata.txt time ./nLDBSWSpelling > /dev/null
4.50user 0.03system 0:04.53elapsed 100%CPU
$ time ./esergSpellBS big.txt tdata.txt > /dev/null
28.23user 0.09system 0:28.32elapsed 100%CPU

surprisingly (?), your plain String version is faster for that than your
ByteString version:

$ time ./esergSpellS big.txt tdata.txt > /dev/null
25.07user 0.10system 0:25.18elapsed 99%CPU

> I think laziness helps here to avoid unnecessary
> checks (once the first match is found).

But that's the point, these checks aren't unnecessary (unless the word
under inspection is known). You want to propose the most likely correct
word.

If your input is "arthetic", should you return "aesthetic" just because
it's the first of the (at least four) correct words with edit distance 2[*]
produced by your arbitrary ordering of edit steps, or because it's the
lexicographically smallest?

I think you shouldn't.

> Haven't tried it on larger data sets, nor tried to optimize it.
> Cheated on dictionary parsing though...

[*] The others I know are "bathetic", "pathetic" and "arthritic" - without
context, I'd go for "arthritic" because I think spelling errors are more
common in the middle of a word than at the beginning or at the end, but
plain frequency analysis of the corpus suggests "pathetic".

Eduard Sergeev

unread,
Jan 26, 2010, 10:47:05 AM1/26/10
to haskel...@haskell.org

Daniel Fischer-4 wrote:
>
> But that's the point, these checks aren't unnecessary (unless the word
> under inspection is known). You want to propose the most likely correct
> word.

I just wanted to rewrite Norvig's original Python code in Haskell :) (maybe
I misunderstood the algorithm?). Of course it is far from being able to
produce the 'most likely correct' result.

Btw, where can I find the source for this super-fast 'nLDBSWSpelling'
variant?

--
View this message in context: http://old.nabble.com/Spelling-checker-exercise-tp27269320p27324740.html


Sent from the Haskell - Haskell-Cafe mailing list archive at Nabble.com.


Daniel Fischer

unread,
Jan 26, 2010, 3:08:24 PM1/26/10
to haskel...@haskell.org, Eduard Sergeev
On Tuesday 26 January 2010 16:46:42, Eduard Sergeev wrote:
> Daniel Fischer-4 wrote:
> > But that's the point, these checks aren't unnecessary (unless the word
> > under inspection is known). You want to propose the most likely
> > correct word.
>
> I just wanted to rewrite Norvig's original Python code in Haskell :)

> (maybe I misunderstood the algorithm?).

Seems so.

NWORDS is the frequency map built from the corpus.

return max(candidates, key=NWORDS.get)

returns the candidate with the highest value in NWORDS, i.e. the candidate
that occurred most often in the corpus (if there are several with the same
highest count, I think the one found first is taken; the order in which an
iterator traverses a Python set is not specified, IIRC, so it might be any
of those).
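
In Haskell terms, that line corresponds to something like (a sketch; best is just an illustrative name):

import Data.List (maximumBy)
import Data.Ord (comparing)
import qualified Data.Map as Map

-- pick the candidate that occurred most often in the corpus
best :: Ord a => Map.Map a Int -> [a] -> a
best nwords = maximumBy (comparing (\w -> Map.findWithDefault 0 w nwords))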

> Of course it is far from being able to produce 'most likely correct'
> result.

Even taking word frequency into account doesn't get really close.
You'd have to take into account that some errors are more common than
others (e.g. award a penalty for words starting with a different letter,
substitution cost should be lower for letters adjacent on common keyboards
than for letters far apart, but it should also be lower for letter pairs of
similar sound [e <-> i, d <-> t and so on], insertion/deletion cost should
be lower for double letters ["diging" is more likely to be a misspelling of
"digging" than of "diving", although g and v are neighbours on qwerty and
qwertz keyboards], -able <-> -ible confusion is extremely common).

It's really hairy. But a combination of edit distance and word frequency is
a good start.
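
Purely as an illustration of one such weight (hypothetical code, not anything used in this thread), a substitution cost that is cheaper for keys sitting next to each other on a QWERTY row:

import Data.List (elemIndex)

qwertyRows :: [String]
qwertyRows = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

-- True if the two letters are horizontal neighbours on some row
adjacent :: Char -> Char -> Bool
adjacent a b = any near qwertyRows
  where
    near row = case (elemIndex a row, elemIndex b row) of
                 (Just i, Just j) -> abs (i - j) == 1
                 _                -> False

-- a likely finger slip costs less than an arbitrary substitution
substCost :: Char -> Char -> Int
substCost a b
  | a == b       = 0
  | adjacent a b = 1
  | otherwise    = 2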

>
> Btw, where can I find the source for this super-fast 'nLDBSWSpelling'
> variant?

Nowhere, unless you come over with a sixpack or two ;)
It originated from a contest-related question (codechef, www.codechef.com, a
fork or similar of SPOJ) at the end of November.
To not spoil the contest, I didn't post the code then. When I first
mentioned the idea in this thread, I hadn't ported the code to the current
setting yet, so I couldn't post it even if I'd wanted to; besides, I didn't
want to distract from the topic of porting Norvig's algorithm.

But since you ask and it's been long enough ago (and not directly
applicable to the contest), here comes the modified source. I've added
comments and a few further improvements; the time for the 400 words is now
4.02user 0.04system 0:04.07elapsed 100%CPU
2.8s of which is for building the map, so on average 3 milliseconds per correction :D

----------------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import Data.ByteString.Unsafe (unsafeIndex)
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString as BS

import Data.Char (toLower)
import Data.Map (Map, findWithDefault, insertWith', member, assocs, empty)

import Data.List (inits, tails, foldl')
import System.Environment (getArgs)

import Data.Word (Word8)
import Data.Bits ((.|.))

dataFile = "big.txt"
alphabet = "abcdefghijklmnopqrstuvwxyz"

infixl 9 !

{-# INLINE (!) #-}
(!) :: B.ByteString -> Int -> Word8
(!) = unsafeIndex

{-
Lazily calculate Levenshtein distance, cut off at 3, modified
to have transpositions count as one edit.
-}
distance :: B.ByteString -> B.ByteString -> Int
distance start target = go 0 m n
  where
    m = B.length start
    n = B.length target
    go l i j
      {- if number of edits so far + difference of lengths left is larger
         than 2, the total number of edits will be at least 3 -}
      | l+i > j+2 || l+j > i+2 = 3
      {- if start is completely consumed, we need j additional inserts -}
      | i == 0 = l+j
      {- if target is completely consumed, we need i additional deletions -}
      | j == 0 = l+i
      {- no edit nor branch if we look at identical letters -}
      | a == b = go l (i-1) (j-1)
      | otherwise =
          let -- replace
              x = go (l+1) (i-1) (j-1)
              -- insert
              y = go (l+1) i (j-1)
              -- delete
              z = go (l+1) (i-1) j
              -- transpose
              w = go (l+1) (i-2) (j-2)
              -- but only if the letters match
              t | i > 1 && j > 1 && b == start!(i-2) && a == target!(j-2) = w
                | otherwise = 3
          in case compare i j of
               -- if there's more of target left than of start, a deletion
               -- can't give a path of length < 3, since after that we'd
               -- need at least two inserts
               LT -> t `seq` x `seq` y `seq` min x (min y t)
               -- if both remaining segments have the same length,
               -- we must try all edit steps
               EQ -> t `seq` x `seq` y `seq` z `seq` min x (min y (min t z))
               -- if there's more of start left than of target, an
               -- insert would be pointless
               GT -> t `seq` x `seq` z `seq` min x (min z t)
      where
        a = start!(i-1)
        b = target!(j-1)

splitWords :: B.ByteString -> [B.ByteString]
splitWords = filter (not . BS.null) . BS.splitWith isNogud . BS.map mkLow

{- quick and dirty toLower for ASCII letters -}
mkLow :: Word8 -> Word8
mkLow x = x .|. 32

{- not a lowercase ASCII letter -}
isNogud :: Word8 -> Bool
isNogud c = c < 97 || 122 < c

{- build map (word -> how often seen) -}
train :: [B.ByteString] -> Map B.ByteString Int
train = foldl' updateMap empty
  where updateMap model word = insertWith' (+) word 1 model

{- read corpus and build map -}
nwords :: IO (Map B.ByteString Int)
nwords = (return $!) . train . splitWords =<< B.readFile dataFile

{- single edit modifications, don't reproduce original word -}
edits1 :: String -> [String]
edits1 s = deletes ++ transposes ++ replaces ++ inserts
  where
    deletes    = [a ++ bs | (a, _:bs) <- splits]
    transposes = [a ++ (b2:b1:bs) | (a, b1:b2:bs) <- splits, b1 /= b2]
    replaces   = [a ++ (c:bs) | (a, l:bs) <- splits, c <- alphabet, c /= l]
    inserts    = [a ++ (c:b) | (a, b) <- splits, c <- alphabet]
    splits     = zip (inits s) (tails s)

correct :: Map B.ByteString Int -> String -> String
correct wordCounts word
    -- known word, trivial case
  | wrd `member` wordCounts = word
    -- no known single edit modification, so scan corpus
    -- of known words for entries of distance 2
  | null ed1 = mxBy2 qm 0 (assocs wordCounts)
    -- at least one known single edit modification, look for
    -- most frequent of them
  | otherwise = mxBy qm 0 ed1
  where
    wrd = B.pack word
    qm = B.pack "?"
    -- list of known single edit modifications and their count
    ed1 = [(pw,c) | w <- edits1 word
                  , let { pw = B.pack w
                        ; c = findWithDefault 0 pw wordCounts }
                  , c > 0]
    mxBy w _ [] = B.unpack w
    mxBy w m ((n,c):ps)
      | m < c = mxBy n c ps       -- new highest count
      | otherwise = mxBy w m ps
    -- if we land here, all known words have a distance of at least 2,
    -- we want the one with distance 2 and the highest count among those
    -- (if there are any), we start with the unknown-marker and count 0
    mxBy2 w _ [] = B.unpack w
    mxBy2 w f ((n,c):ps)
        -- if the new word's count isn't larger than the best we've
        -- found so far, we can discard it immediately
        -- otherwise, calculate distance, if that's larger than 2,
        -- discard the word, otherwise we've found a new best
      | c <= f || d > 2 = mxBy2 w f ps
      | otherwise = mxBy2 n c ps
      where
        d = distance wrd n

main :: IO ()
main = do
    args <- getArgs
    wordCounts <- nwords
    mapM_ (printCorrect wordCounts) $ map (map toLower) args
  where
    printCorrect :: Map B.ByteString Int -> String -> IO ()
    printCorrect wordCounts word =
      putStrLn $ word ++ " -> " ++ correct wordCounts word
----------------------------------------------------------------------


Matthew Phillips

Feb 8, 2010, 4:58:45 PM
to haskel...@haskell.org
Just wanted to follow up on this, after getting distracted by less important things.

Thanks to Daniel's suggestions re the ByteString variant, the fairly faithful Haskell translation of Norvig's Python clocks in at about 1.7 times slower than spelling.py, most of which is down to the fact that "train" doesn't have an O(1) hashmap to use.

I tried using Data.Trie rather than Data.Map, but that made it slightly slower. A counting Bloom filter might be better, but that's going further than I have time for right now.
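
For what it's worth, switching train to a hash map would only be a small change; a sketch, assuming the unordered-containers package (Data.HashMap.Strict's insertWith evaluates the inserted value):

import qualified Data.ByteString.Char8 as B
import qualified Data.HashMap.Strict as HM
import Data.List (foldl')

train :: [B.ByteString] -> HM.HashMap B.ByteString Int
train = foldl' updateMap HM.empty
  where
    updateMap model word = HM.insertWith (+) word 1 model

The lookups elsewhere would then need the corresponding HashMap functions.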

So, in "real" usage this simple version would probably be usable, since train would be done once. Of course there are far cleverer algorithms, but I think the version I have so far retains the value that Norvig's has as a learning example.

The latest version I came up with is here:

http://github.com/scramjet/spelling/blob/master/spelling.hs

Cheers, and thanks again,

Matthew.

P.S.

> Daniel:

> [*] classic example: why will
>
> average xs = sum / len
>   where
>     (sum,len) = foldl' accum (0,0) xs
>     accum (sm,ln) x = (sm+x,ln+1)
>
> cause a stack overflow for long lists?

You gave a strong hint before this, so I'd guess it's due to the lazy tuple creation in "accum"?

Daniel Fischer

Feb 8, 2010, 5:33:05 PM
to Matthew Phillips, haskel...@haskell.org
On Monday 08 February 2010 22:58:03, Matthew Phillips wrote:
> P.S.
>
> > Daniel:
> >
> > [*] classic example: why will
> >
> > average xs = sum / len
> >   where
> >     (sum,len) = foldl' accum (0,0) xs
> >     accum (sm,ln) x = (sm+x,ln+1)

> >
> > cause a stack overflow for long lists?
>
> You gave a strong hint before this, so I'd guess it's due to the lazy
> tuple creation in "accum"?

Well, the tuples are forced, so in that sense, tuple creation is not lazy.
But the components of the tuple are *not* forced, so they become large
thunks and in that sense, tuple creation is lazy and that's indeed the
cause of the stack overflow.
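
The usual fix, for the record, is to force the components as well, e.g. (a sketch):

{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

average :: [Double] -> Double
average xs = sm / ln
  where
    (sm, ln) = foldl' accum (0, 0) xs
    -- the bangs force both running totals at every step,
    -- so no chain of unevaluated (+) thunks can build up
    accum (!s, !l) x = (s + x, l + 1)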

I think you get full marks for this :)
