[Haskell-cafe] not enough fusion?

35 views
Skip to first unread message

Johannes Waldmann

unread,
Jun 24, 2012, 8:04:47 PM6/24/12
to haskel...@haskell.org
Dear all,

while doing some benchmarking (*)
I noticed that function s1 is considerably faster than s2
(but I wanted s2 because it looks more natural)
(for n = 10000, s1 takes 20 s, s2 takes 13 s; compiled by ghc-7.4.2 -O2)

s1 :: Int -> Int
s1 n = sum $ do
x <- [ 0 .. n-1 ]
return $ sum $ do
y <- [ 0 .. n-1 ]
return $ gcd x y

s2 :: Int -> Int
s2 n = sum $ do
x <- [ 0 .. n-1 ]
y <- [ 0 .. n-1 ]
return $ gcd x y

I was expecting that in both programs,
all lists will be fused away (are they?)
so the code generator essentially can produce straightforward
assembly code (no allocations, no closures, etc.)


For reference, I also wrote the equivalent imperative program
(two nested loops, one accumulator for the sum)
(with the straightforward recursive gcd)
and runtimes are (for same input as above)

C/gcc: 7.3 s , Java: 7.7 s, C#/Mono: 8.7 s


So, they sort of agree with each other, but disagree with ghc.
Where does the factor 2 come from? Lists? Laziness?
Does ghc turn the tail recursion (in gcd) into a loop? (gcc does).
(I am looking at -ddump-asm but can't quite see through it.)


(*) benchmarking to show that today's compilers are clever enough
such that the choice of paradigm/language does not really matter
for this kind of low-level programming.






_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Dmitry Olshansky

unread,
Jun 25, 2012, 5:09:30 AM6/25/12
to Johannes Waldmann, haskel...@haskell.org
s1 ~ sum $ map (sum . flip map [0..n] . gcd) [0..n]
s2 ~ sum $ concatMap (flip map [0..n] . gcd) [0..n] 

There are some posts from Joachim Breitner investigated fusion for concatMap:



2012/6/25 Johannes Waldmann <wald...@imn.htwk-leipzig.de>

Lorenzo Bolla

unread,
Jun 25, 2012, 5:33:32 AM6/25/12
to Dmitry Olshansky, Johannes Waldmann, haskel...@haskell.org
I wonder why this performs really badly, though (I would expect it to be the same as s2):

s3 :: Int -> Int
s3 n = sum [gcd x y | x <- [ 0 .. n-1 ], y <- [ 0 .. n-1 ]]

From the links posted by Dmitry, it might be that the code generated is made of 2 recursive calls: in fact, what I observe is a "stack space overflow" error on runtime...

L.

Dmitry Olshansky

unread,
Jun 25, 2012, 6:55:03 AM6/25/12
to Lorenzo Bolla, Johannes Waldmann, haskel...@haskell.org
In my test it works ~20% faster than s2 and ~20% slower than s1.
Did you use -O2 flag?

2012/6/25 Lorenzo Bolla <lbo...@gmail.com>

Lorenzo Bolla

unread,
Jun 25, 2012, 7:04:06 AM6/25/12
to Dmitry Olshansky, Johannes Waldmann, haskel...@haskell.org
You are right, probably I didn't because I cannot reproduce it now.
Sorry for the noise.
(Anyway, I am still surprised that list-comprehension gives a different result from do-notation in the list monad.)

L.

Duncan Coutts

unread,
Jun 25, 2012, 7:09:28 AM6/25/12
to Johannes Waldmann, haskel...@haskell.org
On 25 June 2012 02:04, Johannes Waldmann <wald...@imn.htwk-leipzig.de> wrote:
> Dear all,
>
> while doing some benchmarking (*)
> I noticed that function  s1  is considerably faster than  s2
> (but I wanted  s2  because it looks more natural)
> (for n = 10000,  s1 takes 20 s, s2 takes 13 s; compiled by ghc-7.4.2 -O2)
>
> s1 :: Int -> Int
> s1 n = sum $ do
>        x <- [ 0 .. n-1 ]
>        return $ sum $ do
>            y <- [ 0 .. n-1 ]
>            return $ gcd x y
>
> s2 :: Int -> Int
> s2 n = sum $ do
>      x <- [ 0 .. n-1 ]
>      y <- [ 0 .. n-1 ]
>      return $ gcd x y
>
> I was expecting that in both programs,
> all lists will be fused away (are they?)

Almost certainly not.

> so the code generator essentially can produce straightforward
> assembly code (no allocations, no closures, etc.)

Unless it changed recently, sum does not fuse (as it is currently
defined, using the current implementation of foldr/build fusion).
Also, lists built using do notation do not (I think) translate into
instances of foldr and build, only list comprehension syntax.

On the first point: sum is a foldl and the current implementation of
foldr/build fusion does not cope well with foldl. While foldl can be
defined in terms of foldr the result is lots of closure allocations.
This could in principle be fixed with an arity raising transformation,
and GHC now does have a simple implementation of this transformation,
so it may be possible to get sum as a foldl to fuse. I'm not sure that
anyone has yet tried changing the sum implementation to try to get it
to fuse. It would be an interesting experiment.

On the second point: ghc has a special desugaring for list
comprehensions (in -O mode) where it turns them into uses of foldr and
build. On the other hand, do notation desugars into bind and return.
I'm not sure how well the result fuses, it uses: foldr ((++) . k) [].

You can find out, just look at the core. If all goes well then you
should see a single list being built and then consumed by sum.

Duncan

Dominic Steinitz

unread,
Jun 27, 2012, 3:58:26 AM6/27/12
to haskel...@haskell.org
Duncan Coutts <duncan.coutts <at> googlemail.com> writes:

> This could in principle be fixed with an arity raising transformation,

Do you have a reference to arity raising transformations?

Thanks, Dominic.

Thomas Schilling

unread,
Jun 27, 2012, 3:19:38 PM6/27/12
to Dominic Steinitz, haskel...@haskell.org
It's described in Andy Gill's PhD thesis (which describes the
foldr/build fusion).
http://ittc.ku.edu/~andygill/paper.php?label=GillPhD96 Section 4.4
describes the basic ideas. There aren't any further details, though.

Max's Strict Core paper also describes it a bit (Section 6):
http://www.cl.cam.ac.uk/~mb566/papers/tacc-hs09.pdf
--
Push the envelope. Watch it bend.

Bertram Felgenhauer

unread,
Jul 2, 2012, 7:36:00 AM7/2/12
to haskel...@haskell.org
Hi,

Johannes Waldmann wrote:
> s2 :: Int -> Int
> s2 n = sum $ do
> x <- [ 0 .. n-1 ]
> y <- [ 0 .. n-1 ]
> return $ gcd x y

This code shows some interesting behaviour: its runtime depends heavily
on the allocation area size.

For comparison, with main = print $ s1 10000 I see the following GC
statistics.

# ./a.out +RTS -A512k -s
15,201,891,976 bytes allocated in the heap
4,272,144 bytes copied during GC
Total time 9.97s ( 10.00s elapsed)
%GC time 1.1% (1.1% elapsed)

For s2, using main = print $ s2 10000, I get

# ./s2 +RTS -A512k -s
20,801,251,976 bytes allocated in the heap
3,438,870,504 bytes copied during GC
Total time 15.90s ( 15.95s elapsed)
%GC time 34.3% (34.4% elapsed)
# ./s2 +RTS -A1m -s
20,801,251,976 bytes allocated in the heap
2,724,903,032 bytes copied during GC
Total time 14.74s ( 14.80s elapsed)
%GC time 29.2% (29.3% elapsed)
# ./s2 +RTS -A2m -s
20,801,251,976 bytes allocated in the heap
20,311,952 bytes copied during GC
Total time 10.74s ( 10.78s elapsed)
%GC time 1.2% (1.2% elapsed)
# ./a.out +RTS -A2052k -s
20,801,251,976 bytes allocated in the heap
1,812,776 bytes copied during GC
Total time 10.35s ( 10.38s elapsed)
%GC time 0.8% (0.8% elapsed)

Note that the number of bytes copied during GC drops by three orders of
magnitude when we increase the allocation area size from 1 MB to 2 MB,
and another order of magnitude when adding an additional 4kB to that.

Why does this happen? The reason is a failure of generational GC
heuristics. If we look at the core, we find that the inner loop (which
generates a list that is consumed by sum) translates to something like

nums = [0..9999]

go xs0 = case xs0 of
[] -> []
x : xs -> let
go' ys0 = case ys0 of
[] -> []
y : ys -> GHC.Real.gcdInt x y : go' ys
in
case go' nums of
[] -> go xs
(z : zs) -> z : zs ++ go xs

At a first glance, this looks fine - there's one big chunk of fixed data
(the nums list) that will end up in the old generation. The rest of the
data is consumed as it is created. However, this is not quite true: The
thunk for the second argument to (++) (representing 'go xs') is already
allocated on the heap when the first element of the result of (++), i.e.,
the first element of zs, is consumed. While zs is processed, this thunk
is never touched. If it survives a whole GC-mutator cycle, then the next
garbage collection will consider the thunk mature and put it into the
old generation. But when the code starts evaluating this 'go xs' call,
it produces a long list, all of which is being kept alive by the (now
updated) thunk in generation 1, and as a result will be copied during
GC, until the next major GC.

So the observed behaviour hinges on the question whether the go xs
thunk can survive processing the zs list. The memory allocated during
this time is constant - besides the list, some memory is allocated
for thunks in go', and for intermediate Integers[*] in gcdInt. If the
allocation area size exceeds this constant, then the program will
run as expected. Note that every time a 'go xs' thunk survives, a lot
of extra work is caused for the GC -- this explains the sharp increase
in bytes copied observed above.

Bertram

[*] gcdInt really ought to only do a single allocation for the result,
with an inlining opportunity to get rid of that as well. It is defined
in GHC.Real as

gcdInt :: Int -> Int -> Int
gcdInt a b = fromIntegral (gcdInteger (fromIntegral a) (fromIntegral b))

which used to optimize to a single low-level GHC.Integer.Type.gcdInt call
in ghc 7.2. With 7.4 and HEAD, integerToInt and smallInteger are no longer
inlined, resulting in worse code. See also

http://hackage.haskell.org/trac/ghc/ticket/7041
Reply all
Reply to author
Forward
0 new messages