[Haskell-cafe] Fast number parsing with strict bytestrings [Was: Re: Seemingly subtle change causes large performance variation]

Showing 1-6 of 6 messages
[Haskell-cafe] Fast number parsing with strict bytestrings [Was: Re: Seemingly subtle change causes large performance variation] Donald Bruce Stewart 6/7/07 6:43 PM
mdanish:
> Hello,
>
> I've been playing with the INTEST problem on SPOJ which demonstrates
> the ability to write a program which processes large quantities of
> input data.  http://www.spoj.pl/problems/INTEST/
 
> But when I make a slight modification, the program chews up a ton more memory
> and takes more time:
>
> import Control.Monad
> import Data.Maybe
> import qualified Data.ByteString.Char8 as B
>
> divisibleBy :: Int -> Int -> Bool
> a `divisibleBy` n = a `rem` n == 0
>
> main :: IO ()
> main = do
>     [n,k] <- (map int . B.split ' ') `fmap` B.getLine :: IO [Int]
>
>     let
>         doLine :: Int -> Int -> IO Int
>         doLine r _ = B.getLine >>= return . testDiv r
>         -- 'return' moved here      ^^


What follows is a solution to the original question, and then a dramatic
rewrite, showing the fastest way (that I know of) to parse \n separated
lists of numbers in Haskell.  The results should outperform C fairly
well.


*** Solution 1: don't be so lazy in the fold.

First, look at that lazy fold. A simple fix, try being explict about forcing
your accumulator:

        doLine :: Int -> Int -> IO Int
        doLine r _ = B.getLine >>= \s -> return $! testDiv r s


And some timing data:

Original:
    $ time ./A < in
    29359
    ./A < in  1.52s user 0.06s system 93% cpu 1.679 total

Too lazy:
    $ time ./B < in
    29359
    ./B < in  3.84s user 0.26s system 82% cpu 4.957 total

Hand back some strictness hints:
    $ time ./D < in
    29359
    ./D < in  1.52s user 0.03s system 94% cpu 1.637 total


*** Solution 2: use lazy bytestrings to avoid gunky IO


Now, however, I'd give up on that explict getLine stuff, and use a lazy
bytestring. Something like this:

    import Data.Maybe
    import Data.List
    import qualified Data.ByteString.Lazy.Char8 as L

    main :: IO ()
    main = do
        (l:ls) <- L.lines `fmap` L.getContents -- done with IO now.
        let [n,k] = map int (L.split ' ' l)
        print . foldl' (test k) 0 . map int . take n $ ls

    test :: Int -> Int -> Int -> Int
    test k acc n | n `divisibleBy` k = acc+1
                 | otherwise         = acc

    int :: L.ByteString -> Int
    int = fst . fromJust . L.readInt

    divisibleBy :: Int -> Int -> Bool
    a `divisibleBy` n = a `rem` n == 0


The general rule for bytestring loops is to avoid IO, and to use lazy
bytestrings if you need 'lines'. Also, program in a high level, using
combinators, rather than your own loops, so that fusion will kick in (we
get some list fusion here).

And running it:

    $ time ./C < in
    29359
    ./C < in  1.22s user 0.04s system 94% cpu 1.335 total

Ok, faster, and cleaner. Avoid mixing IO into your code!


*** Solution 3: 4x faster by processing strict cache chunks


Now the fun part.

The following code is the fastest way I know to process lists of numbers
(in any language). Its' based on similar code I wrote for the language
shootout.  The key trick is to use lazy bytestrings *only* as a method
for filling the cache with newline-aligned chunks of numbers. Once
you've got that perfectly-sized chunk, walk its lines, and process them.
This is all done in Haskell, and relies on an understanding of the low
level details of bytestring optimisations.

The general framework could be reused for any code that needs to process
a list of numbers in a file, where you care about speed.

It performs as follows:

    $ time ./F < in
    29359
    ./F < in  0.24s user 0.01s system 76% cpu 0.327 total

Pretty fast..

Previous experience[1] indicates it is pretty hard to write a C line
parsing program[2] that that run this fast.  And the code, with comments:

1. http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=ghc&id=0
2. http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=gcc&id=2


    {-# OPTIONS -fbang-patterns #-}

    import Data.Char
    import Data.Maybe
    import Data.ByteString.Base
    import qualified Data.ByteString.Char8      as S
    import qualified Data.ByteString.Lazy.Char8 as L

    main = do
        ss <- L.getContents -- done with IO now.

        let (l,ls) = L.break (=='\n') ss

            -- don't need count, we're allocating lazily
            k      = fst . fromJust . L.readInt . last . L.split ' ' $ l

            file   = L.toChunks (L.tail ls) -- a lazy list of strict cache chunks

        print $ process k 0 file

    divisibleBy :: Int -> Int -> Bool
    a `divisibleBy` n = a `rem` n == 0

    -- ---------------------------------------------------------------------
    --
    -- Optimised parsing of strict bytestrings representing \n separated numbers
    --

    --
    -- we have the file as a list of cache chunks
    -- align them on \n boundaries, and process each chunk separately
    -- when the next chunk is demanded, it will be read in.
    --
    process :: Int -> Int -> [S.ByteString] -> Int
    process k i []      = i
    process k !i (s:t:ts) | S.last s /= '\n' = process k (add k i s') ts'
      where
        (s',r) = S.breakEnd (=='\n') s
        ts'    = (S.append r t) : ts        -- join chunks on line boundaries

    process k i (s: ss) = process k (add k i s) ss

    --
    -- process a single cache-sized chunk of numbers, \n aligned
    --
    add :: Int -> Int -> S.ByteString -> Int
    add k i s | S.null s  = i
              | otherwise = test k i (parse x) xs
      where (x,xs) = uncons s

    --
    -- process a single line, until \n
    --
    test :: Int -> Int -> Int -> ByteString -> Int
    test k i !n t
        | y == '\n' = -- done reading the line, process it:
            if n `divisibleBy` k then add k (i+1) ys
                                 else add k i     ys
       | otherwise = test k i n' ys
      where (y,ys) = uncons t
            n'     = parse y + 10 * n

    parse c  = ord c - ord '0'

    -- fastest way to take the head of a strict bytestring
    uncons s = (w2c (unsafeHead s), unsafeTail s)


Cheers,
  Don
_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Fast number parsing with strict bytestrings [Was: Re: Seemingly subtle change causes large performance variation] Donald Bruce Stewart 6/7/07 6:59 PM
dons:

> mdanish:
> > Hello,
> >
> > I've been playing with the INTEST problem on SPOJ which demonstrates
> > the ability to write a program which processes large quantities of
> > input data.  http://www.spoj.pl/problems/INTEST/
>  
> > But when I make a slight modification, the program chews up a ton more memory
> > and takes more time:
> >
> > import Control.Monad
> > import Data.Maybe
> > import qualified Data.ByteString.Char8 as B
> >
> > divisibleBy :: Int -> Int -> Bool
> > a `divisibleBy` n = a `rem` n == 0
> >
> > main :: IO ()
> > main = do
> >     [n,k] <- (map int . B.split ' ') `fmap` B.getLine :: IO [Int]
> >
> >     let
> >         doLine :: Int -> Int -> IO Int
> >         doLine r _ = B.getLine >>= return . testDiv r
> >         -- 'return' moved here      ^^
>


And just following up with some GC statistics:

Original,

    95% cpu 1.668 total

    <<ghc: 258766440 bytes,
           452 GCs,
           3036/3036 avg/max bytes residency (1 samples),
           3M in use, 0.00 INIT (0.00 elapsed),
           1.51 MUT (1.63 elapsed),
           0.01 GC (0.03 elapsed) :ghc>>

Too lazy:

    96% cpu 4.219 total

    <<ghc: 278683532 bytes,
           495 GCs,
   -->     14729345/52642396 avg/max bytes residency (7 samples),
   -->     85M in use,
           0.00 INIT (0.00 elapsed),
           1.68 MUT (1.81 elapsed),
   -->     2.07 GC (2.36 elapsed) :ghc>>

    (clear space leak)

Fixing above program with $!:

    94% cpu 1.656 total
    <<ghc: 257394052 bytes
           451 GCs,
    -->    2288/2288 avg/max bytes residency (1 samples),
    -->    1M in use,
           0.00 INIT (0.00 elapsed),
           1.49 MUT (1.64 elapsed),
           0.01 GC (0.01 elapsed) :ghc>>

Using lazy bytestrings for pure processing:

    90% cpu 1.424 total
    <<ghc: 219403252 bytes,
           410 GCs,
           70527/74236 avg/max bytes residency (10 samples),
           2M in use,
           0.00 INIT (0.00 elapsed),
           1.25 MUT (1.40 elapsed),
           0.01 GC (0.01 elapsed) :ghc>>

And the killer strict chunk parser:

    78% cpu 0.327 total
    <<ghc: 20685092 bytes,
    -->    38 GCs,
    -->    81348/81348 avg/max bytes residency (1 samples),
           2M in use,
           0.00 INIT (0.00 elapsed),
    -->    0.21 MUT (0.32 elapsed),
           0.00 GC (0.00 elapsed) :ghc>>

Very little data shuffled around in the last one.

-- Don


_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Fast number parsing with strict bytestrings [Was: Re: Seemingly subtle change causes large performance variation] Donald Bruce Stewart 6/7/07 8:50 PM
dons:

> dons:
> > mdanish:
> > > Hello,
> > >
> > > I've been playing with the INTEST problem on SPOJ which demonstrates
> > > the ability to write a program which processes large quantities of
> > > input data.  http://www.spoj.pl/problems/INTEST/
> >  
> > > But when I make a slight modification, the program chews up a ton more memory
> > > and takes more time:
> > >
> > > import Control.Monad
> > > import Data.Maybe
> > > import qualified Data.ByteString.Char8 as B
> > >
> > > divisibleBy :: Int -> Int -> Bool
> > > a `divisibleBy` n = a `rem` n == 0
> > >
> > > main :: IO ()
> > > main = do
> > >     [n,k] <- (map int . B.split ' ') `fmap` B.getLine :: IO [Int]
> > >
> > >     let
> > >         doLine :: Int -> Int -> IO Int
> > >         doLine r _ = B.getLine >>= return . testDiv r
> > >         -- 'return' moved here      ^^
> >
>
>
> Original,
>
>     95% cpu 1.668 total
>
>     <<ghc: 258766440 bytes,
>            452 GCs,
>            3036/3036 avg/max bytes residency (1 samples),
>            3M in use, 0.00 INIT (0.00 elapsed),
>            1.51 MUT (1.63 elapsed),
>            0.01 GC (0.03 elapsed) :ghc>>
 
> And the killer strict chunk parser:
>
>     78% cpu 0.327 total
>     <<ghc: 20685092 bytes,
>     -->    38 GCs,
>     -->    81348/81348 avg/max bytes residency (1 samples),
>            2M in use,
>            0.00 INIT (0.00 elapsed),
>     -->    0.21 MUT (0.32 elapsed),
>            0.00 GC (0.00 elapsed) :ghc>>
>

I note there was a missing constructor specialisation happening in the
calls to 'add', in the good program. We can fix that with some well
place inline pragma:

    add :: Int -> Int -> S.ByteString -> Int
    add k i s = if S.null s then i else test k i (parse x) xs

      where (x,xs) = uncons s
    {-# INLINE add #-}

Before, GHC -ddump-simpl-stats reported:

        22 RuleFired
        2 SC:$wprocess1
        4 SC:$wprocess2
        2 SC:comb1

After:
        24 RuleFired
        4 SC:$wprocess1
        4 SC:$wprocess2
        2 SC:comb1

And timing stats:

    $ time ./F < in
    29359
    ./F < in  0.20s user 0.04s system 81% cpu 0.288 total

So some 10% better.  It's often a good idea to inline non-recursive wrapper
functions like this, in bytestring code.

-- Don
_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Fast number parsing with strict bytestrings [Was: Re: Seemingly subtle change causes large performance variation] Bulat Ziganshin 6/8/07 3:28 AM
Hello Donald,

Friday, June 8, 2007, 5:42:41 AM, you wrote:

> Previous experience[1] indicates it is pretty hard to write a C line
> parsing program[2] that that run this fast.  And the code, with comments:

[2] uses gets() function while your haskell code read whole buffer
each time. that is the obvious source of slowness

--
Best regards,
 Bulat                            mailto:Bulat.Z...@gmail.com

_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: [Haskell-cafe] Fast number parsing with strict bytestrings [Was: Re: Seemingly subtle change causes large performance variation] Conal Elliott 6/8/07 10:47 AM
Lovely!

Perhaps a stylistic shift would encourage writing this sort of elegant,
fusion-friendly code.

    -- Generalized version of "interact".  Encapsulates data getter &
putter.
    genInteract :: IO i -> (o -> IO ()) -> ((i -> o) -> IO ())
    genInteract get put = \ f -> get >>= put . f

The intention here is that i and o are pure value types (no IO).  Solutions
2 and three use the following specialization.

    -- Lazy ByteString in and showable out
    lGetPrint :: Show o => (L.ByteString -> o) -> IO ()
    lGetPrint = genInteract L.getContents print

Then rewrite solution 2 as follows:

    main = lGetPrint f
     where
       f contents = foldl' (test k) 0 . map int . take n $ ls
         where
           (l:ls) = L.lines contents


           [n,k]  = map int (L.split ' ' l)

And solution 3 similarly.

Plug: for additional examples and a more general approach to separating out
IO, my blog post "separating IO from logic -- example
<http://conal-elliott.blogspot.com/2007/02/separating-io-from-logic-example.html>",
which uses the TV <http://www.haskell.org/haskellwiki/TV> library.

Re: [Haskell-cafe] Fast number parsing with strict bytestrings [Was: Re: Seemingly subtle change causes large performance variation] Donald Bruce Stewart 6/8/07 10:10 PM
conal:

>
>    Lovely!
>
>    Perhaps a stylistic shift would encourage writing this sort
>    of elegant, fusion-friendly code.
>        -- Generalized version of "interact".  Encapsulates data
>    getter & putter.
>        genInteract :: IO i -> (o -> IO ()) -> ((i -> o) -> IO
>    ())
>        genInteract get put = \ f -> get >>= put . f

Yes, I think this is a good idea. Perhaps even a small module
for doing fast folds over lists of lines (abstracting out the
cache/demand/traversal code from solution #3.)

Also, I realise I'd missed another optimisatoin for #3, minimal copying
to do the alignment. Gains another 10%:

    process k !i (s:t:ts) | S.last s /= '\n' = process k (add k i s') ts'
      where
        (s',r)  = S.breakEnd (=='\n') s
        (r',rs) = S.break    (=='\n') t
        ts'     = S.concat [r,r',S.singleton '\n'] : unsafeTail rs : ts

Here, we copy only the smallest amount from the end of the current
chunk, and the start of the next chunk, to do \n aligning. Gets us down
to:

    $ time ./G < in
    29359
    ./G < in  0.19s user 0.02s system 96% cpu 0.210 total

And the code:

    {-# OPTIONS -fbang-patterns #-}

    import Data.Char
    import Data.Maybe
    import Data.ByteString.Base
    import qualified Data.ByteString.Char8      as S
    import qualified Data.ByteString.Lazy.Char8 as L

    main = do
        ss <- L.getContents -- done with IO now.

        let (l,ls) = L.break (=='\n') ss

            -- don't need count, we're allocating lazily
            k      = fst . fromJust . L.readInt . last . L.split ' ' $ l

            -- a lazy list of strict chunks


            file   = L.toChunks (L.tail ls)

        print $ process k 0 file

    divisibleBy :: Int -> Int -> Bool


    a `divisibleBy` n = a `rem` n == 0

    -- ---------------------------------------------------------------------


    --
    -- Optimised parsing of strict bytestrings representing \n separated numbers
    --

    --
    -- we have the file as a list of cache chunks
    -- align them on \n boundaries, and process each chunk separately
    -- when the next chunk is demanded, it will be read in.
    --
    process :: Int -> Int -> [S.ByteString] -> Int
    process k i []      = i

    process k !i (s:t:ts) | S.last s /= '\n' = process k (add k i s') ts'
      where
        (s',r)  = S.breakEnd (=='\n') s
        (r',rs) = S.break    (=='\n') t
        ts'     = S.concat [r,r',S.singleton '\n'] : unsafeTail rs : ts

    process k i (s: ss) = process k (add k i s) ss

    --
    -- process a single cache-sized chunk of numbers, \n aligned
    --
    add :: Int -> Int -> S.ByteString -> Int
    add k i s = if S.null s then i else test k i (parse x) xs
      where (x,xs) = uncons s
    {-# INLINE add #-}

    --


    -- process a single line, until \n
    --
    test :: Int -> Int -> Int -> ByteString -> Int
    test k i !n t
        | y == '\n' = -- done reading the line, process it:
            if n `divisibleBy` k then add k (i+1) ys
                                 else add k i     ys
        | otherwise = test k i n' ys
      where (y,ys) = uncons t
            n'     = parse y + 10 * n

    parse c  = ord c - ord '0'

    -- fastest way to take the head of a strict bytestring
    uncons s = (w2c (unsafeHead s), unsafeTail s)

_______________________________________________


Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe