[Haskell-cafe] newbie optimization question

Prabhakar Ragde

unread,

Oct 28, 2007, 9:14:02 AM10/28/07

to haskel...@haskell.org

For the purposes of learning, I am trying to optimize some variation of
the following code for computing all perfect numbers less than 10000.

divisors i = [j | j<-[1..i-1], i `mod` j == 0]
main = print [i | i<-[1..10000], i == sum (divisors i)]

I know this is mathematically stupid, but the point is to do a moderate
nested-loops computation. On my 2.33GHz dual-core MacBookPro, the
obvious C program takes about .3 seconds, and a compiled OCaML program
(tail recursion, no lists) about .33 seconds. The above takes about 4
seconds.

I've tried using foldl', and doing explicit tail recursion with strict
accumulators, but I can't get the running time below 3 seconds. Is it
possible to come within striking distance of the other languages?
Thanks. --PR
_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Jaak Randmets

unread,

Oct 28, 2007, 9:57:54 AM10/28/07

to Prabhakar Ragde, haskel...@haskell.org

On 10/28/07, Prabhakar Ragde <plr...@uwaterloo.ca> wrote:
> For the purposes of learning, I am trying to optimize some variation of
> the following code for computing all perfect numbers less than 10000.
>
> divisors i = [j | j<-[1..i-1], i `mod` j == 0]
> main = print [i | i<-[1..10000], i == sum (divisors i)]
>
> I know this is mathematically stupid, but the point is to do a moderate
> nested-loops computation. On my 2.33GHz dual-core MacBookPro, the
> obvious C program takes about .3 seconds, and a compiled OCaML program
> (tail recursion, no lists) about .33 seconds. The above takes about 4
> seconds.
>

You could try giving divisors type signature:
divisors :: Int -> [Int]

Prabhakar Ragde

unread,

Oct 28, 2007, 10:24:12 AM10/28/07

to Jaak Randmets, haskel...@haskell.org

Jaak Randmets wrote:
> On 10/28/07, Prabhakar Ragde <plr...@uwaterloo.ca> wrote:
>> For the purposes of learning, I am trying to optimize some variation of
>> the following code for computing all perfect numbers less than 10000.
>>
>> divisors i = [j | j<-[1..i-1], i `mod` j == 0]
>> main = print [i | i<-[1..10000], i == sum (divisors i)]
>>
>> I know this is mathematically stupid, but the point is to do a moderate
>> nested-loops computation. On my 2.33GHz dual-core MacBookPro, the
>> obvious C program takes about .3 seconds, and a compiled OCaML program
>> (tail recursion, no lists) about .33 seconds. The above takes about 4
>> seconds.
>>
>
> You could try giving divisors type signature:
> divisors :: Int -> [Int]

Thank you. That brings the time down to 0.5 seconds. I'm glad it was
something as simple as that. --PR

jerzy.kar...@info.unicaen.fr

unread,

Oct 28, 2007, 11:01:22 AM10/28/07

to haskel...@haskell.org

Prabhakar Ragde writes:

> For the purposes of learning, I am trying to optimize some variation of
> the following code for computing all perfect numbers less than 10000.
>
> divisors i = [j | j<-[1..i-1], i `mod` j == 0]
> main = print [i | i<-[1..10000], i == sum (divisors i)]
>
> I know this is mathematically stupid, but the point is to do a moderate
> nested-loops computation. On my 2.33GHz dual-core MacBookPro, the obvious
> C program takes about .3 seconds, and a compiled OCaML program (tail
> recursion, no lists) about .33 seconds. The above takes about 4 seconds.
>
> I've tried using foldl', and doing explicit tail recursion with strict
> accumulators, but I can't get the running time below 3 seconds. Is it
> possible to come within striking distance of the other languages? Thanks.

Just a trivial comment...

1. Don't speak about comparing *languages* when you compare *algorithms*,
and in particular data structures.
2. Please, DO code the above in C, using linked lists. Compare then.

3. Check the influence of bare printing, separated from the computation.

Jerzy Karczmarczuk

Prabhakar Ragde

unread,

Oct 28, 2007, 11:27:29 AM10/28/07

to haskel...@haskell.org

Jerzy Karczmarczuk wrote:

> Just a trivial comment...
>
> 1. Don't speak about comparing *languages* when you compare *algorithms*,
> and in particular data structures.
> 2. Please, DO code the above in C, using linked lists. Compare then.
>
> 3. Check the influence of bare printing, separated from the computation.

Isn't GHC clever enough to optimize away the entire computation if there
is no I/O?

I take your points, though I am trying to do something admittedly vague,
which is to compare fairly naive yet reasonably idiomatic programs that
might be written by a beginning undergraduate. This particular task is
already biased towards C, so it is reassuring that code that could be
written less than fifty pages into Hutton is competitive with code that
could be written -- hmmm, sixty pages into K&R, and a hundred pages into
the C text I plan to use. --PR

Stefan O'Rear

unread,

Oct 28, 2007, 12:39:36 PM10/28/07

to Prabhakar Ragde, haskel...@haskell.org

On Sun, Oct 28, 2007 at 11:26:46AM -0400, Prabhakar Ragde wrote:
> Jerzy Karczmarczuk wrote:
>
>> Just a trivial comment... 1. Don't speak about comparing *languages* when
>> you compare *algorithms*,
>> and in particular data structures.
>> 2. Please, DO code the above in C, using linked lists. Compare then. 3.
>> Check the influence of bare printing, separated from the computation.
>
> Isn't GHC clever enough to optimize away the entire computation if there is
> no I/O?

Yes, but GHC is not clever enough to solve the perfect number
classification problem. 'length' will suffice, and is prefered for most
enumeration benchmarks.

Note, however, that there are a grand total of 4 perfect numbers less
than 10,000. Not exactly an IO problem.

Stefan

signature.asc

Isaac Dupree

unread,

Oct 28, 2007, 12:59:29 PM10/28/07

to Prabhakar Ragde, haskel...@haskell.org

Prabhakar Ragde wrote:
>> You could try giving divisors type signature:
>> divisors :: Int -> [Int]
>
> Thank you. That brings the time down to 0.5 seconds. I'm glad it was
> something as simple as that. --PR

I assume GHC was smart enough to do inlining and such in this case, so
the difference is that it defaulted to Integer, whose operations are
somewhat slower.

Isaac

jerzy.kar...@info.unicaen.fr

unread,

Oct 28, 2007, 2:58:43 PM10/28/07

to haskel...@haskell.org

Stefan O'Rear adds to the dialogue:

> Prabhakar Ragde wrote:
>> Jerzy Karczmarczuk wrote:
>>
>>> Just a trivial comment... 1. Don't speak about comparing *languages* when
>>> you compare *algorithms*,
>>> and in particular data structures.
>>> 2. Please, DO code the above in C, using linked lists. Compare then. 3.
>>> Check the influence of bare printing, separated from the computation.
>>
>> Isn't GHC clever enough to optimize away the entire computation if there is
>> no I/O?
>
> Yes, but GHC is not clever enough to solve the perfect number
> classification problem. 'length' will suffice, and is prefered for most

> enumeratioon benchmarks.

My point didn't concern that point. Haskell compiler cannot change an
algorithm using lists into something which deals with indexable arrays,
usually faster. Indexing may be faster than the indirection, and the
allocation of memory costs. And there is laziness...
That's why I proposed to check what happens if one uses linked links in
"C". Well, the follow-ups seem to suggest that the main time eater was the
overloading. I must say that I am really astonished. It is hard to believe
that such a signature can make a factor of 8. Never seen that before.

Jerzy Karczmarczuk

Don Stewart

unread,

Oct 28, 2007, 3:01:46 PM10/28/07

to jerzy.kar...@info.unicaen.fr, haskel...@haskell.org

jerzy.karczmarczuk:

> Stefan O'Rear adds to the dialogue:
>
> >Prabhakar Ragde wrote:
> >>Jerzy Karczmarczuk wrote:
> >>
> >>>Just a trivial comment... 1. Don't speak about comparing *languages*
> >>>when you compare *algorithms*,
> >>> and in particular data structures.
> >>>2. Please, DO code the above in C, using linked lists. Compare then. 3.
> >>>Check the influence of bare printing, separated from the computation.
> >>
> >>Isn't GHC clever enough to optimize away the entire computation if there
> >>is no I/O?
> >
> >Yes, but GHC is not clever enough to solve the perfect number
> >classification problem. 'length' will suffice, and is prefered for most
> >enumeratioon benchmarks.
>
> My point didn't concern that point. Haskell compiler cannot change an
> algorithm using lists into something which deals with indexable arrays,
> usually faster. Indexing may be faster than the indirection, and the
> allocation of memory costs. And there is laziness...
> That's why I proposed to check what happens if one uses linked links in
> "C". Well, the follow-ups seem to suggest that the main time eater was the
> overloading. I must say that I am really astonished. It is hard to believe
> that such a signature can make a factor of 8. Never seen that before.

That fits with my experience writing low level numeric code -- Integer
can be a killer.

-- Don

Derek Elkins

unread,

Oct 28, 2007, 3:03:09 PM10/28/07

to Prabhakar Ragde, haskel...@haskell.org

On Sun, 2007-10-28 at 10:23 -0400, Prabhakar Ragde wrote:
> Jaak Randmets wrote:
> > On 10/28/07, Prabhakar Ragde <plr...@uwaterloo.ca> wrote:
> >> For the purposes of learning, I am trying to optimize some variation of
> >> the following code for computing all perfect numbers less than 10000.
> >>
> >> divisors i = [j | j<-[1..i-1], i `mod` j == 0]
> >> main = print [i | i<-[1..10000], i == sum (divisors i)]
> >>
> >> I know this is mathematically stupid, but the point is to do a moderate
> >> nested-loops computation. On my 2.33GHz dual-core MacBookPro, the
> >> obvious C program takes about .3 seconds, and a compiled OCaML program
> >> (tail recursion, no lists) about .33 seconds. The above takes about 4
> >> seconds.
> >>
> >
> > You could try giving divisors type signature:
> > divisors :: Int -> [Int]
>
> Thank you. That brings the time down to 0.5 seconds. I'm glad it was
> something as simple as that. --PR

I'm not sure if Jaak Randmets explained this, but the reason this makes
a difference is that Haskell defaults to Integer, an arbitrary precision
integer type, that is far more costly than an Int.

Derek Elkins

unread,

Oct 28, 2007, 3:10:42 PM10/28/07

to Don Stewart, jerzy.kar...@info.unicaen.fr, haskel...@haskell.org

Inline machine operations v. out-of-line calls to an arbitrary precision
integer C library: there shouldn't be any surprise here. I could make
it even slower by making a Nat type and giving it the type Nat -> [Nat].

Also, this is a well known issue. It is common for people to, for
example, write naive fib and then say that Haskell is useless because
it's orders of magnitude slower than naive fib in C or whatever. Then
you tell them to add an Int -> Int type signature and all of a sudden
Haskell is beating C soundly.

Daniel Fischer

unread,

Oct 28, 2007, 3:40:00 PM10/28/07

to Derek Elkins, Haskell-cafe

Am Sonntag, 28. Oktober 2007 20:09 schrieb Derek Elkins:
<snip>

> >
> > That fits with my experience writing low level numeric code -- Integer
> > can be a killer.
>
> Inline machine operations v. out-of-line calls to an arbitrary precision
> integer C library: there shouldn't be any surprise here.

Obviously. However, what if 32 bit integers aren't sufficient?
What perpetually puzzles me is that in C long long int has very good
performance, *much* faster than gmp, in Haskell, on my computer, Int64 is
hardly faster than Integer.
So I stick to Integer mostly, it may be slow but it's correct.

Cheers,
Daniel

Tillmann Rendel

unread,

Oct 28, 2007, 3:41:03 PM10/28/07

to jerzy.kar...@info.unicaen.fr, haskell-cafe

Hi,

Prabhakar Ragde wrote:
> divisors i = [j | j<-[1..i-1], i `mod` j == 0]
> main = print [i | i<-[1..10000], i == sum (divisors i)]

Jerzy Karczmarczuk wrote:
> My point didn't concern that point. Haskell compiler cannot change an
> algorithm using lists into something which deals with indexable arrays,
> usually faster. Indexing may be faster than the indirection, and the
> allocation of memory costs. And there is laziness...

This may be true, but it isn't relevant in this case, since the "obvious
c program" doesn't need any arrays, only two loops:

for (int i = 1; i <= 10000; i++) {
int sum = 0;
for (int j = 1; j < i; j++)
if (i % j == 0)
sum += i;
if (sum == i)
print(i);
}

Loops can be expressed with lazy lists in Haskell. Therefore, the
presented Haskell program is perfectly equivalent to the "obvious c
program".

Tillmann Rendel

Ryan Dickie

unread,

Oct 28, 2007, 4:00:56 PM10/28/07

to Prabhakar Ragde, haskel...@haskell.org

One thing I've noticed is that turning on optimizations significantly
increases the speed of haskell code. Are you comparing code between
languages with -O2 or without opts?

On 10/28/07, Prabhakar Ragde <plr...@uwaterloo.ca> wrote:
>

Don Stewart

unread,

Oct 28, 2007, 4:26:04 PM10/28/07

to Tillmann Rendel, jerzy.kar...@info.unicaen.fr, haskell-cafe, Duncan Coutts

rendel:

> Prabhakar Ragde wrote:
> >divisors i = [j | j<-[1..i-1], i `mod` j == 0]
> >main = print [i | i<-[1..10000], i == sum (divisors i)]
>
> Jerzy Karczmarczuk wrote:
> >My point didn't concern that point. Haskell compiler cannot change an
> >algorithm using lists into something which deals with indexable arrays,
> >usually faster. Indexing may be faster than the indirection, and the
> >allocation of memory costs. And there is laziness...
>
> This may be true, but it isn't relevant in this case, since the "obvious
> c program" doesn't need any arrays, only two loops:
>
> for (int i = 1; i <= 10000; i++) {
> int sum = 0;
> for (int j = 1; j < i; j++)
> if (i % j == 0)
> sum += i;
> if (sum == i)
> print(i);
> }
>
> Loops can be expressed with lazy lists in Haskell. Therefore, the
> presented Haskell program is perfectly equivalent to the "obvious c
> program".

So what we would hope is that GHC could transform a set of composed lazy list functions
into a doubly nested strict loop in Int#...

Let's see if we can get that result from GHC, using a bit of fusion.

First, to explain what is happening, let's first replace the `mod` call with
`rem`, which is faster, and then desugar the list comprehension and
enumerations syntax, to expose the underlying program:

default(Int)

divisors i = filter (\j -> i `rem`j == 0) (enumFromTo 1 (i-1))
main = print $ filter (\i -> i == sum (divisors i)) (enumFromTo 1 10000)

Looking at this we see some good chances for fusion to take place: the
enumFromTo should fuse twice with 'filter', using build/foldr fusion.

And with stream fusion, the left fold 'sum' should also fuse with pipeline that
results from divisors. So my prediction would be that this program would run
slightly faster with stream fusion. Let's see...

Compiling with -O2 and ghc 6.8 snapshot, with build/foldr fusion, we see
two fusion sites, as expected, and a spec-constr of 'sum:

$ ghc-6.9.20070916 A.hs -O2 -ddump-simpl-stats
RuleFired
1 SPEC Data.List.sum
2 fold/build

Good, running this:

$ time ./A-stream
[6,28,496,8128]
./A-stream 1.29s user 0.02s system 99% cpu 1.319 total

Now we can try with stream fusion, using the stream-fusible list library
here:
http://www.cse.unsw.edu.au/~dons/code/streams/list/

To use these list functions in preference to the default, we have to
import:

import Prelude hiding (filter,sum,enumFromTo)
import Data.List.Stream

and since the base library doesn't include stream fusible enumFromTo,
we'll have to write our own definition in terms of stream:

import Data.Stream (enumFromToInt,unstream)
enumFromTo i j = unstream (enumFromToInt i j)

Ok, that's easy. Compiling this, we hope to see 3 fusion sites, and all
heap-allocated Ints removed:

$ ghc-6.9.20070916 A.hs -O2 -ddump-simpl-stats -package list
RuleFired
2 filter -> fusible
1 sumInt -> fusible
1 sum spec Int
3 STREAM stream/unstream fusion

Terrific! The 'sum' was specialised to Int, then translated to a stream
version, the two filters also were translated, then 3 fusion sites were found
and fused. Our program should now run faster:

$ time ./A-stream
[6,28,496,8128]
./A-stream 1.23s user 0.01s system 99% cpu 1.251 total

And so it does, with no list allocated for the sum loop. In fact the entire program reduces
to a strict unboxed nested loop:

unfold = Int# -> [Int]
wloop_sum_sV5 :: Int# -> Int# -> Int#

So almost identical types to the C program (bar for the return [Int]).

Finally, we can manually translate the C code into a confusing set of nested
loops with interleaved IO,

main = loop 1
where
loop !i | i > 10000 = return ()
| otherwise = if i == go i 0 1 then print i >> loop (i+1)
else loop (i+1)

go !i !s !j | j <= i-1 = if i `rem` j == 0 then go i (s+j) (j+1)
else go i s (j+1)
| otherwise = s

And we get *no speed benefit* at all!

time ./A-loop 1.24s user 0.00s system 98% cpu 1.256 total

So the lesson is: write in a high level style, and the compiler can do the work
for you. Or, GHC is pretty smart on high level code.

-- Don

Stefan O'Rear

unread,

Oct 28, 2007, 4:28:18 PM10/28/07

to Daniel Fischer, Haskell-cafe

On Sun, Oct 28, 2007 at 08:40:28PM +0100, Daniel Fischer wrote:
> Am Sonntag, 28. Oktober 2007 20:09 schrieb Derek Elkins:
> <snip>
> > >
> > > That fits with my experience writing low level numeric code -- Integer
> > > can be a killer.
> >
> > Inline machine operations v. out-of-line calls to an arbitrary precision
> > integer C library: there shouldn't be any surprise here.
>
> Obviously. However, what if 32 bit integers aren't sufficient?
> What perpetually puzzles me is that in C long long int has very good
> performance, *much* faster than gmp, in Haskell, on my computer, Int64 is
> hardly faster than Integer.
> So I stick to Integer mostly, it may be slow but it's correct.

Int64 in Glasgow Haskell is not implemented for speed - it uses the FFI
to call a number of addition/subtraction/whatever primitives in the
runtime. If this is a problem for you, file a feature request on the
GHC bugtracker.

Stefan

signature.asc

Stefan O'Rear

unread,

Oct 28, 2007, 4:31:15 PM10/28/07

to Don Stewart, Duncan Coutts, jerzy.kar...@info.unicaen.fr, haskell-cafe

On Sun, Oct 28, 2007 at 01:25:19PM -0700, Don Stewart wrote:
> Finally, we can manually translate the C code into a confusing set of nested
> loops with interleaved IO,
>
> main = loop 1
> where
> loop !i | i > 10000 = return ()
> | otherwise = if i == go i 0 1 then print i >> loop (i+1)
> else loop (i+1)
>
> go !i !s !j | j <= i-1 = if i `rem` j == 0 then go i (s+j) (j+1)
> else go i s (j+1)
> | otherwise = s
>
> And we get *no speed benefit* at all!
>
> time ./A-loop 1.24s user 0.00s system 98% cpu 1.256 total
>
> So the lesson is: write in a high level style, and the compiler can do the work
> for you. Or, GHC is pretty smart on high level code.

IO blocks unboxing in GHC. How fast is your mock-C code refactored to
do IO outside of the loops only?

Stefan

signature.asc

Don Stewart

unread,

Oct 28, 2007, 4:43:51 PM10/28/07

to Stefan O'Rear, haskell-cafe

stefanor:

It doesn't! The above code yields:

Main.$wloop :: GHC.Prim.Int#
-> GHC.Prim.State# GHC.Prim.RealWorld
-> (# GHC.Prim.State# GHC.Prim.RealWorld, () #)

$wgo_rMK :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
where
$s$wgo_rMI :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#

Stefan O'Rear

unread,

Oct 28, 2007, 4:54:05 PM10/28/07

to Don Stewart, haskell-cafe

On Sun, Oct 28, 2007 at 01:43:07PM -0700, Don Stewart wrote:
> stefanor:

> > IO blocks unboxing in GHC. How fast is your mock-C code refactored to
> > do IO outside of the loops only?
>
> It doesn't! The above code yields:
>
> Main.$wloop :: GHC.Prim.Int#
> -> GHC.Prim.State# GHC.Prim.RealWorld
> -> (# GHC.Prim.State# GHC.Prim.RealWorld, () #)
>
> $wgo_rMK :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
> where
> $s$wgo_rMI :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#

Ah, interesting. I was reading something too general from
http://hackage.haskell.org/trac/ghc/ticket/1592#comment:5.

Stefan

signature.asc

Tim Chevalier

unread,

Oct 28, 2007, 5:03:00 PM10/28/07

to Stefan O'Rear, haskell-cafe

Right, unboxing is successful here because the arguments are marked as
strict with (!) annotations. The IO hack described in that bug report
would only be a problem if the arguments were not marked strict, and
were used in code that has to be executed after an IO call.

Although I don't think all of the strictness annotations are even
necessary here. In any case, loop must evaluate i before the call to
print; there would only be an issue if it called an IO function before
doing anything that demands i.

Cheers,
Tim

--
Tim Chevalier * catamorphism.org * Often in error, never in doubt
"Accordingly, computer scientists commonly choose models which have
bottoms, but prefer them topless." -- Davey & Priestley, _Introduction
to Lattices and Order_

Don Stewart

unread,

Oct 28, 2007, 5:07:48 PM10/28/07

to Stefan O'Rear, haskell-cafe

dons:

> stefanor:
> > On Sun, Oct 28, 2007 at 01:25:19PM -0700, Don Stewart wrote:
> > > Finally, we can manually translate the C code into a confusing set of nested
> > > loops with interleaved IO,
> > >
> > > main = loop 1
> > > where
> > > loop !i | i > 10000 = return ()
> > > | otherwise = if i == go i 0 1 then print i >> loop (i+1)
> > > else loop (i+1)
> > >
> > > go !i !s !j | j <= i-1 = if i `rem` j == 0 then go i (s+j) (j+1)
> > > else go i s (j+1)
> > > | otherwise = s
> > >
> > > And we get *no speed benefit* at all!
> > >
> > > time ./A-loop 1.24s user 0.00s system 98% cpu 1.256 total
> > >
> > > So the lesson is: write in a high level style, and the compiler can do the work
> > > for you. Or, GHC is pretty smart on high level code.
> >

Oh, and we can fuse with the print loop too, yielding an entire program
of type Int# -> IO (), and really no intermediate lists (even for the return
list).

Again, we need to use the fusible version of mapM_:

import Prelude hiding (filter,sum,enumFromTo,mapM_,sequence_,map,foldr)
import Data.List.Stream
import Data.Stream (enumFromToInt,unstream)

enumFromTo i j = unstream (enumFromToInt i j)

mapM_ f as = sequence_ (map f as)
sequence_ ms = foldr (>>) (return ()) ms
-- ^ fuse happily

default(Int)

main = mapM_ print $ filter (\i -> i == sum (divisors i)) (enumFromTo 1 10000)

divisors i = filter (\j -> i `rem`j == 0) (enumFromTo 1 (i-1))

And we see the map and foldr fuse in sequence, which in turn fuses with the
filter (and rest of the program):

18 RuleFired
5 STREAM stream/unstream fusion
2 filter -> fusible
1 foldr -> fusible
1 map -> fusible
1 sumInt -> fusible

The program flattens to a single nested loop,

Main.$wloop_foldr :: GHC.Prim.Int#

-> GHC.Prim.State# GHC.Prim.RealWorld
-> (# GHC.Prim.State# GHC.Prim.RealWorld, () #)

which really is equivalent in terms of control flow and intermediate structures
to the C program.

-- Don

Peter Hercek

unread,

Oct 28, 2007, 5:39:01 PM10/28/07

to haskel...@haskell.org

Daniel Fischer wrote:
> What perpetually puzzles me is that in C long long int has very good
> performance, *much* faster than gmp, in Haskell, on my computer, Int64 is
> hardly faster than Integer.

I tried the example with Int64 and Integer. The integer version
was actually quicker ... which is the reason I decided to post
the results.

C++ version times: 1.125; 1.109; 1.125
Int32 cpu times: 3.203; 3.172; 3.172
Int64 cpu times: 11.734; 11.797; 11.844
Integer cpu times: 9.609; 9.609; 9.500

Interesting that Int64 is *slower* than Integer.

On the other side the C version is not that much quicker. I guess
the Haskell version is using generic versions of mod and sum
(since they are from a library) which would mean indirect calls.
The Haskell version probably also creates the list nodes ...
even when they get almost immediately garbage collected.

Thanks for pointing out Int64 sucks so much :)

Peter.

Don Stewart

unread,

Oct 28, 2007, 6:03:18 PM10/28/07

to Peter Hercek, haskel...@haskell.org

peter:

> Daniel Fischer wrote:
> >What perpetually puzzles me is that in C long long int has very good
> >performance, *much* faster than gmp, in Haskell, on my computer, Int64 is
> >hardly faster than Integer.
>
> I tried the example with Int64 and Integer. The integer version
> was actually quicker ... which is the reason I decided to post
> the results.
>
> C++ version times: 1.125; 1.109; 1.125
> Int32 cpu times: 3.203; 3.172; 3.172
> Int64 cpu times: 11.734; 11.797; 11.844
> Integer cpu times: 9.609; 9.609; 9.500
>
> Interesting that Int64 is *slower* than Integer.
>
> On the other side the C version is not that much quicker. I guess
> the Haskell version is using generic versions of mod and sum
> (since they are from a library) which would mean indirect calls.
> The Haskell version probably also creates the list nodes ...
> even when they get almost immediately garbage collected.
>
> Thanks for pointing out Int64 sucks so much :)

With -O2 ghc will only have a function call to 'sum', with -O2 and the
list stream fusion library, there are no indirect calls and the entire
program is specialised to a nested loop.

Do you have your C++ program handy? I'd expect the fully fused version
to be somewhere between 1 and 2x slower.

-- Don

Peter Hercek

unread,

Oct 28, 2007, 6:16:25 PM10/28/07

to haskel...@haskell.org

Peter Hercek wrote:
> C++ version times: 1.125; 1.109; 1.125
> Int32 cpu times: 3.203; 3.172; 3.172
> Int64 cpu times: 11.734; 11.797; 11.844
> Integer cpu times: 9.609; 9.609; 9.500

Ooops, my results ware wrong (nonoptimizing ms cl
compiler used and I used -O instead of -O2 in ghc).

C++ version times: 1.109; 1.125; 1.125
Int32 cpu times: 1.359; 1.359; 1.375
Int64 cpu times: 11.688; 11.719; 11.766
Integer cpu times: 9.719; 9.703; 9.703

Great result from ghc.

Don Stewart

unread,

Oct 28, 2007, 6:22:54 PM10/28/07

to Peter Hercek, haskel...@haskell.org

peter:

> Peter Hercek wrote:
> >C++ version times: 1.125; 1.109; 1.125
> >Int32 cpu times: 3.203; 3.172; 3.172
> >Int64 cpu times: 11.734; 11.797; 11.844
> >Integer cpu times: 9.609; 9.609; 9.500
>
> Ooops, my results ware wrong (nonoptimizing ms cl
> compiler used and I used -O instead of -O2 in ghc).
>
> C++ version times: 1.109; 1.125; 1.125
> Int32 cpu times: 1.359; 1.359; 1.375
> Int64 cpu times: 11.688; 11.719; 11.766
> Integer cpu times: 9.719; 9.703; 9.703
>
> Great result from ghc.

What Haskell program were you using for this test? The original
naive/high level implementation?

-- Don

Peter Hercek

unread,

Oct 28, 2007, 6:35:08 PM10/28/07

to haskel...@haskell.org

Don Stewart wrote:
>> C++ version times: 1.109; 1.125; 1.125
>> Int32 cpu times: 1.359; 1.359; 1.375
>> Int64 cpu times: 11.688; 11.719; 11.766
>> Integer cpu times: 9.719; 9.703; 9.703
>>
>> Great result from ghc.
>
> What Haskell program were you using for this test? The original
> naive/high level implementation?
>
> -- Don

Here it is (I played only with the types for divisors and perfect;
bottom is there from my previous tests, but it should not matter):
<---cut--->

module Main (bottom, divisors, perfect, main) where
import Data.Int

bottom = error "_|_"

divisors :: Int -> [Int]

divisors i = [j | j<-[1..i-1], i `mod` j == 0]

perfect :: [Int]
perfect = [i | i<-[1..10000], i == sum (divisors i)]

main = print perfect

<---cut--->
and here is the C++ version:
<---cut--->

#include <iostream>
using namespace std;

int main() {

for (int i = 1; i <= 10000; i++) {
int sum = 0;
for (int j = 1; j < i; j++)
if (i % j == 0)

sum += j;
if (sum == i)
cout << i << " ";
}
return 0;
}

<---cut--->

OS winxp 64 bit
ghc v. 6.6.1 (optins -O2)
MS cl.exe version 13.10.3077 (options /G7 /MD)

Peter.

Peter Hercek

unread,

Oct 28, 2007, 7:03:37 PM10/28/07

to haskel...@haskell.org

Peter Hercek wrote:
> MS cl.exe version 13.10.3077 (options /G7 /MD)

And I had cl options wrong too - I need to start to
optimize not only to set the optimization target.
/G7 /MD -> 1.109; 1.125; 1.125
/Ox /G7 /MD -> 0.922; 0.984; 0.984

Still it does not change the results too much.

Derek Elkins

unread,

Oct 28, 2007, 7:48:18 PM10/28/07

to Peter Hercek, haskel...@haskell.org

On Sun, 2007-10-28 at 23:34 +0100, Peter Hercek wrote:
> Don Stewart wrote:
> >> C++ version times: 1.109; 1.125; 1.125
> >> Int32 cpu times: 1.359; 1.359; 1.375
> >> Int64 cpu times: 11.688; 11.719; 11.766
> >> Integer cpu times: 9.719; 9.703; 9.703
> >>
> >> Great result from ghc.
> >
> > What Haskell program were you using for this test? The original
> > naive/high level implementation?
> >
> > -- Don
>
> Here it is (I played only with the types for divisors and perfect;
> bottom is there from my previous tests, but it should not matter):
> <---cut--->
>
> module Main (bottom, divisors, perfect, main) where
> import Data.Int
>
> bottom = error "_|_"
>
> divisors :: Int -> [Int]
> divisors i = [j | j<-[1..i-1], i `mod` j == 0]
>
> perfect :: [Int]
> perfect = [i | i<-[1..10000], i == sum (divisors i)]
>
> main = print perfect

Try with rem instead of mod.

(What the heck is with bottom?)

a...@spamcop.net

unread,

Oct 28, 2007, 9:27:02 PM10/28/07

to haskel...@haskell.org

G'day all.

Quoting Don Stewart <do...@galois.com>:

> That fits with my experience writing low level numeric code -- Integer
> can be a killer.

Mind you, if you're intending to work with large integers or rationals,
Integer is great! The bottleneck is almost always "show".

Cheers,
Andrew Bromage

Don Stewart

unread,

Oct 28, 2007, 11:05:24 PM10/28/07

to Peter Hercek, haskel...@haskell.org

peter:

> Don Stewart wrote:
> >>C++ version times: 1.109; 1.125; 1.125
> >>Int32 cpu times: 1.359; 1.359; 1.375
> >>Int64 cpu times: 11.688; 11.719; 11.766
> >>Integer cpu times: 9.719; 9.703; 9.703
> >>
> >>Great result from ghc.
> >
> >What Haskell program were you using for this test? The original
> >naive/high level implementation?
> >
> >-- Don
>
> Here it is (I played only with the types for divisors and perfect;
> bottom is there from my previous tests, but it should not matter):
> <---cut--->
>
> module Main (bottom, divisors, perfect, main) where
> import Data.Int
>
> bottom = error "_|_"
>
> divisors :: Int -> [Int]
> divisors i = [j | j<-[1..i-1], i `mod` j == 0]
>
> perfect :: [Int]
> perfect = [i | i<-[1..10000], i == sum (divisors i)]
>

This should be a little faster , as sum will fuse,

perfect :: [Int]
perfect = [i | i<-[1..10000], i == sum' (divisors i)]

where sum' = foldr (+) 0

Peter Hercek

unread,

Oct 29, 2007, 3:49:44 AM10/29/07

to haskel...@haskell.org

Don Stewart wrote:
>> perfect :: [Int]
>> perfect = [i | i<-[1..10000], i == sum (divisors i)]
>>
>
> This should be a little faster , as sum will fuse,
>
> perfect :: [Int]
> perfect = [i | i<-[1..10000], i == sum' (divisors i)]
> where sum' = foldr (+) 0

sum' did not help. Times are about the same with Int type.

Peter.

Peter Hercek

unread,

Oct 29, 2007, 4:20:32 AM10/29/07

to haskel...@haskell.org

Derek Elkins wrote:
>
> Try with rem instead of mod.
>
> (What the heck is with bottom?)

The bottom was there by error and I was lazy to redo
the tests so I rather posted exactly what I was
doing. I do not know the compiler that good to be
absolutely sure it cannot have impact on result
... so I rather did not doctor what I did :-)

Ok, rem did help quite a bit. Any comments why it is so?

Here are summary of results for those interested. I run
all the tests once again. Haskell was about 13% slower
than C++.

MS cl.exe options: /Ox /G7 /MD
ghc options: -O2

C++ version: 1.000; 0.984; 0.984
Haskell version specialized to Int: 1.125; 1.125; 1.109
Haskell version specialized to Integer: 8.781; 8.813; 8.813
Haskell version specialized to Int64: 9.781; 9.766; 9.831

The code:

% cat b.hs
module Main (divisors, perfect, main) where
import Data.Int

divisors :: Int -> [Int]
divisors i = [j | j<-[1..i-1], i `rem` j == 0]

perfect :: [Int]
perfect = [i | i<-[1..10000], i == sum (divisors i)]

main = print perfect

% cat b.cpp

#include <iostream>
using namespace std;

int main() {
for (int i = 1; i <= 10000; i++) {
int sum = 0;
for (int j = 1; j < i; j++)
if (i % j == 0)
sum += j;
if (sum == i)
cout << i << " ";
}
return 0;
}

%

Rodrigo Queiro

unread,

Oct 29, 2007, 6:01:32 AM10/29/07

to Peter Hercek, haskel...@haskell.org

rem is faster because it has slightly different behaviour to mod, and
there happens to be an intel instruction that maps more directly to
rem than to mod, thus making it much faster on intel processors.

Why do you expose perfect and divisors? Maybe if you just expose main,
perfect and divisors will be inlined (although this will only save
10,000 function entries, so will probably have negligible effect).

Rodrigo

Peter Hercek

unread,

Oct 29, 2007, 6:45:15 AM10/29/07

to haskel...@haskell.org

Rodrigo Queiro wrote:
> Why do you expose perfect and divisors? Maybe if you just expose main,
> perfect and divisors will be inlined (although this will only save
> 10,000 function entries, so will probably have negligible effect).

I exposed them so that I can check types in ghci.
Hiding them does not seem to have noticeable effect.

Thanks,
Peter.

Prabhakar Ragde

unread,

Oct 29, 2007, 8:17:39 AM10/29/07

to Ryan Dickie, haskel...@haskell.org

Ryan Dickie wrote:
> One thing I've noticed is that turning on optimizations significantly
> increases the speed of haskell code. Are you comparing code between
> languages with -O2 or without opts?

I had done no optimization, but neither -O nor -O2 make a significant
difference in either the C or Haskell programs. But using `rem` instead
of `mod`, together with the type annotation, makes the two programs take
pretty much the same amount of time. --PR

Dusan Kolar

unread,

Oct 29, 2007, 8:50:23 AM10/29/07

to Peter Hercek, haskel...@haskell.org

Hello all,

just to compare the stuff, I get quite other results being on other
OS. Thus, the result of C++ compiler may not be that interesting as I do
not have the one presented below.

My machine:
Linux 2.6.23-ARCH #1 SMP PREEMPT Mon Oct 22 12:50:26 CEST 2007 x86_64
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz GenuineIntel GNU/Linux

Compilers:
g++ --version
g++ (GCC) 4.2.2
Copyright (C) 2007 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

ghc --version
The Glorious Glasgow Haskell Compilation System, version 6.6.1

Measurement:
compiled with ghc -O2
time ./mainInteger
real 0m4.866s
user 0m4.843s
sys 0m0.020s

compiled with ghc -O2
time ./mainInt64
real 0m2.213s
user 0m2.210s
sys 0m0.003s

compiled with ghc -O2
time ./mainInt
real 0m1.149s
user 0m1.143s
sys 0m0.003s

compiled with g++ -O3
time ./mainC
real 0m0.271s
user 0m0.270s
sys 0m0.000s

I don't know what is the reason, but the difference between Int, Int64
and Integer is not that dramatic as in example below, nevertheless, the
difference between GHC and GNU C++ is very bad.... :-\

Dusan

--

Dusan Kolar tel: +420 54 114 1238
UIFS FIT VUT Brno fax: +420 54 114 1270
Bozetechova 2 e-mail: ko...@fit.vutbr.cz
Brno 612 66
Czech Republic

--

Simon Marlow

unread,

Oct 29, 2007, 9:08:55 AM10/29/07

to Peter Hercek, haskel...@haskell.org

Peter Hercek wrote:
> Daniel Fischer wrote:
>> What perpetually puzzles me is that in C long long int has very good
>> performance, *much* faster than gmp, in Haskell, on my computer, Int64
>> is hardly faster than Integer.
>
> I tried the example with Int64 and Integer. The integer version
> was actually quicker ... which is the reason I decided to post
> the results.
>
> C++ version times: 1.125; 1.109; 1.125
> Int32 cpu times: 3.203; 3.172; 3.172
> Int64 cpu times: 11.734; 11.797; 11.844
> Integer cpu times: 9.609; 9.609; 9.500
>
> Interesting that Int64 is *slower* than Integer.

I can believe that. Integer is actually optimised for small values:
there's a specialised representation for values that fit in a single word
that avoids calling out to the GMP library.

As Stefan pointed out, there's a lot of room to improve the performance of
Int64, it's just never been a high priority.

Cheers,
Simon

Daniel Fischer

unread,

Oct 29, 2007, 9:27:00 AM10/29/07

to Dusan Kolar, Haskell-cafe

Am Montag, 29. Oktober 2007 13:49 schrieb Dusan Kolar:
> Hello all,
>
> just to compare the stuff, I get quite other results being on other
> OS. Thus, the result of C++ compiler may not be that interesting as I do
> not have the one presented below.

Just to chime in, my results with the code below:
dafis@linux:~> uname -a
Linux linux 2.4.20-4GB-athlon #1 Mon Mar 17 17:56:47 UTC 2003 i686 unknown
unknown GNU/Linux

on a 1200 MHz Duron

g++ is version 3.3, C++ code compiled with -O3, Haskell with -O2 (GHC 6.6.1)
dafis@linux:~> time ./mainC
6 28 496 8128
real 0m1.945s
user 0m1.910s
sys 0m0.010s
dafis@linux:~> time ./mainInt
[6,28,496,8128]

real 0m2.407s
user 0m2.300s
sys 0m0.010s
dafis@linux:~> time ./mainInt64
[6,28,496,8128]

real 0m24.009s
user 0m23.900s
sys 0m0.050s
dafis@linux:~> time ./mainInteger
[6,28,496,8128]

real 0m21.555s
user 0m20.870s
sys 0m0.010s

So Int is not so much slower than C, Int64 and Integer dramatically slower
with Integer beating Int64 here, too.

Cheers,
Daniel

> > The code:

Peter Hercek

unread,

Oct 29, 2007, 9:33:57 AM10/29/07

to haskel...@haskell.org

OK, if somebody wants to speculate (and if it even makes sense for
such a microbenchmark) here are some more data.
Except different OS and C++ compiler also processor is different.
On my side it was AMD Athlon 64 X2 4800+ (2.4GHz, 2x1MiB L2 cache
non-shared; C&Q was not switched off during the test). The system has
2GiB RAM. The C++ version had working set about 1.7 MiB, ghc version
had it about 2 times bigger.

Peter.

Isaac Dupree

unread,

Oct 29, 2007, 9:54:45 AM10/29/07

to Don Stewart, Duncan Coutts, jerzy.kar...@info.unicaen.fr, haskell-cafe

Don Stewart wrote:
> default(Int)
>
> divisors i = filter (\j -> i `rem`j == 0) (enumFromTo 1 (i-1))
> main = print $ filter (\i -> i == sum (divisors i)) (enumFromTo 1 10000)

..

> So almost identical types to the C program (bar for the return [Int]).
>
> Finally, we can manually translate the C code into a confusing set of nested
> loops with interleaved IO,

how lazy is `print` supposed to be? If it's strict, we need to return /
accumulate a list (and in this case, that is not very time-consuming
because there are only four numbers in the list)

from http://haskell.org/onlinereport/standard-prelude.html

print x = putStrLn (show x)

show on lists is lazy and doesn't matter that it's not for Ints

putStrLn :: String -> IO ()
putStrLn s = do putStr s
putStr "\n"

putStr :: String -> IO ()
putStr s = mapM_ putChar s

okay, mapM_ makes it lazy.

putChar :: Char -> IO ()
putChar = primPutChar

(It's easy to get confused with OS buffering effects too...)

so it should ALL be able to fuse, with no intermediate lists, I think
(not sure about [Char] from show... or whether fusion works with IO
sequencing...)

Isaac