divisors i = [j | j<-[1..i-1], i `mod` j == 0]
main = print [i | i<-[1..10000], i == sum (divisors i)]
I know this is mathematically stupid, but the point is to do a moderate
nested-loops computation. On my 2.33GHz dual-core MacBookPro, the
obvious C program takes about .3 seconds, and a compiled OCaML program
(tail recursion, no lists) about .33 seconds. The above takes about 4
seconds.
I've tried using foldl', and doing explicit tail recursion with strict
accumulators, but I can't get the running time below 3 seconds. Is it
possible to come within striking distance of the other languages?
Thanks. --PR
_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
You could try giving divisors type signature:
divisors :: Int -> [Int]
Thank you. That brings the time down to 0.5 seconds. I'm glad it was
something as simple as that. --PR
> For the purposes of learning, I am trying to optimize some variation of
> the following code for computing all perfect numbers less than 10000.
>
> divisors i = [j | j<-[1..i-1], i `mod` j == 0]
> main = print [i | i<-[1..10000], i == sum (divisors i)]
>
> I know this is mathematically stupid, but the point is to do a moderate
> nested-loops computation. On my 2.33GHz dual-core MacBookPro, the obvious
> C program takes about .3 seconds, and a compiled OCaML program (tail
> recursion, no lists) about .33 seconds. The above takes about 4 seconds.
>
> I've tried using foldl', and doing explicit tail recursion with strict
> accumulators, but I can't get the running time below 3 seconds. Is it
> possible to come within striking distance of the other languages? Thanks.
Just a trivial comment...
1. Don't speak about comparing *languages* when you compare *algorithms*,
and in particular data structures.
2. Please, DO code the above in C, using linked lists. Compare then.
3. Check the influence of bare printing, separated from the computation.
Jerzy Karczmarczuk
> Just a trivial comment...
>
> 1. Don't speak about comparing *languages* when you compare *algorithms*,
> and in particular data structures.
> 2. Please, DO code the above in C, using linked lists. Compare then.
>
> 3. Check the influence of bare printing, separated from the computation.
Isn't GHC clever enough to optimize away the entire computation if there
is no I/O?
I take your points, though I am trying to do something admittedly vague,
which is to compare fairly naive yet reasonably idiomatic programs that
might be written by a beginning undergraduate. This particular task is
already biased towards C, so it is reassuring that code that could be
written less than fifty pages into Hutton is competitive with code that
could be written -- hmmm, sixty pages into K&R, and a hundred pages into
the C text I plan to use. --PR
Yes, but GHC is not clever enough to solve the perfect number
classification problem. 'length' will suffice, and is prefered for most
enumeration benchmarks.
Note, however, that there are a grand total of 4 perfect numbers less
than 10,000. Not exactly an IO problem.
Stefan
I assume GHC was smart enough to do inlining and such in this case, so
the difference is that it defaulted to Integer, whose operations are
somewhat slower.
Isaac
> Prabhakar Ragde wrote:
>> Jerzy Karczmarczuk wrote:
>>
>>> Just a trivial comment... 1. Don't speak about comparing *languages* when
>>> you compare *algorithms*,
>>> and in particular data structures.
>>> 2. Please, DO code the above in C, using linked lists. Compare then. 3.
>>> Check the influence of bare printing, separated from the computation.
>>
>> Isn't GHC clever enough to optimize away the entire computation if there is
>> no I/O?
>
> Yes, but GHC is not clever enough to solve the perfect number
> classification problem. 'length' will suffice, and is prefered for most
> enumeratioon benchmarks.
My point didn't concern that point. Haskell compiler cannot change an
algorithm using lists into something which deals with indexable arrays,
usually faster. Indexing may be faster than the indirection, and the
allocation of memory costs. And there is laziness...
That's why I proposed to check what happens if one uses linked links in
"C". Well, the follow-ups seem to suggest that the main time eater was the
overloading. I must say that I am really astonished. It is hard to believe
that such a signature can make a factor of 8. Never seen that before.
Jerzy Karczmarczuk
That fits with my experience writing low level numeric code -- Integer
can be a killer.
-- Don
I'm not sure if Jaak Randmets explained this, but the reason this makes
a difference is that Haskell defaults to Integer, an arbitrary precision
integer type, that is far more costly than an Int.
Inline machine operations v. out-of-line calls to an arbitrary precision
integer C library: there shouldn't be any surprise here. I could make
it even slower by making a Nat type and giving it the type Nat -> [Nat].
Also, this is a well known issue. It is common for people to, for
example, write naive fib and then say that Haskell is useless because
it's orders of magnitude slower than naive fib in C or whatever. Then
you tell them to add an Int -> Int type signature and all of a sudden
Haskell is beating C soundly.
Obviously. However, what if 32 bit integers aren't sufficient?
What perpetually puzzles me is that in C long long int has very good
performance, *much* faster than gmp, in Haskell, on my computer, Int64 is
hardly faster than Integer.
So I stick to Integer mostly, it may be slow but it's correct.
Cheers,
Daniel
Prabhakar Ragde wrote:
> divisors i = [j | j<-[1..i-1], i `mod` j == 0]
> main = print [i | i<-[1..10000], i == sum (divisors i)]
Jerzy Karczmarczuk wrote:
> My point didn't concern that point. Haskell compiler cannot change an
> algorithm using lists into something which deals with indexable arrays,
> usually faster. Indexing may be faster than the indirection, and the
> allocation of memory costs. And there is laziness...
This may be true, but it isn't relevant in this case, since the "obvious
c program" doesn't need any arrays, only two loops:
for (int i = 1; i <= 10000; i++) {
int sum = 0;
for (int j = 1; j < i; j++)
if (i % j == 0)
sum += i;
if (sum == i)
print(i);
}
Loops can be expressed with lazy lists in Haskell. Therefore, the
presented Haskell program is perfectly equivalent to the "obvious c
program".
Tillmann Rendel
On 10/28/07, Prabhakar Ragde <plr...@uwaterloo.ca> wrote:
>
So what we would hope is that GHC could transform a set of composed lazy list functions
into a doubly nested strict loop in Int#...
Let's see if we can get that result from GHC, using a bit of fusion.
First, to explain what is happening, let's first replace the `mod` call with
`rem`, which is faster, and then desugar the list comprehension and
enumerations syntax, to expose the underlying program:
default(Int)
divisors i = filter (\j -> i `rem`j == 0) (enumFromTo 1 (i-1))
main = print $ filter (\i -> i == sum (divisors i)) (enumFromTo 1 10000)
Looking at this we see some good chances for fusion to take place: the
enumFromTo should fuse twice with 'filter', using build/foldr fusion.
And with stream fusion, the left fold 'sum' should also fuse with pipeline that
results from divisors. So my prediction would be that this program would run
slightly faster with stream fusion. Let's see...
Compiling with -O2 and ghc 6.8 snapshot, with build/foldr fusion, we see
two fusion sites, as expected, and a spec-constr of 'sum:
$ ghc-6.9.20070916 A.hs -O2 -ddump-simpl-stats
RuleFired
1 SPEC Data.List.sum
2 fold/build
Good, running this:
$ time ./A-stream
[6,28,496,8128]
./A-stream 1.29s user 0.02s system 99% cpu 1.319 total
Now we can try with stream fusion, using the stream-fusible list library
here:
http://www.cse.unsw.edu.au/~dons/code/streams/list/
To use these list functions in preference to the default, we have to
import:
import Prelude hiding (filter,sum,enumFromTo)
import Data.List.Stream
and since the base library doesn't include stream fusible enumFromTo,
we'll have to write our own definition in terms of stream:
import Data.Stream (enumFromToInt,unstream)
enumFromTo i j = unstream (enumFromToInt i j)
Ok, that's easy. Compiling this, we hope to see 3 fusion sites, and all
heap-allocated Ints removed:
$ ghc-6.9.20070916 A.hs -O2 -ddump-simpl-stats -package list
RuleFired
2 filter -> fusible
1 sumInt -> fusible
1 sum spec Int
3 STREAM stream/unstream fusion
Terrific! The 'sum' was specialised to Int, then translated to a stream
version, the two filters also were translated, then 3 fusion sites were found
and fused. Our program should now run faster:
$ time ./A-stream
[6,28,496,8128]
./A-stream 1.23s user 0.01s system 99% cpu 1.251 total
And so it does, with no list allocated for the sum loop. In fact the entire program reduces
to a strict unboxed nested loop:
unfold = Int# -> [Int]
wloop_sum_sV5 :: Int# -> Int# -> Int#
So almost identical types to the C program (bar for the return [Int]).
Finally, we can manually translate the C code into a confusing set of nested
loops with interleaved IO,
main = loop 1
where
loop !i | i > 10000 = return ()
| otherwise = if i == go i 0 1 then print i >> loop (i+1)
else loop (i+1)
go !i !s !j | j <= i-1 = if i `rem` j == 0 then go i (s+j) (j+1)
else go i s (j+1)
| otherwise = s
And we get *no speed benefit* at all!
time ./A-loop 1.24s user 0.00s system 98% cpu 1.256 total
So the lesson is: write in a high level style, and the compiler can do the work
for you. Or, GHC is pretty smart on high level code.
-- Don
Int64 in Glasgow Haskell is not implemented for speed - it uses the FFI
to call a number of addition/subtraction/whatever primitives in the
runtime. If this is a problem for you, file a feature request on the
GHC bugtracker.
Stefan
IO blocks unboxing in GHC. How fast is your mock-C code refactored to
do IO outside of the loops only?
Stefan
It doesn't! The above code yields:
Main.$wloop :: GHC.Prim.Int#
-> GHC.Prim.State# GHC.Prim.RealWorld
-> (# GHC.Prim.State# GHC.Prim.RealWorld, () #)
$wgo_rMK :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
where
$s$wgo_rMI :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
Ah, interesting. I was reading something too general from
http://hackage.haskell.org/trac/ghc/ticket/1592#comment:5.
Stefan
Right, unboxing is successful here because the arguments are marked as
strict with (!) annotations. The IO hack described in that bug report
would only be a problem if the arguments were not marked strict, and
were used in code that has to be executed after an IO call.
Although I don't think all of the strictness annotations are even
necessary here. In any case, loop must evaluate i before the call to
print; there would only be an issue if it called an IO function before
doing anything that demands i.
Cheers,
Tim
--
Tim Chevalier * catamorphism.org * Often in error, never in doubt
"Accordingly, computer scientists commonly choose models which have
bottoms, but prefer them topless." -- Davey & Priestley, _Introduction
to Lattices and Order_
Oh, and we can fuse with the print loop too, yielding an entire program
of type Int# -> IO (), and really no intermediate lists (even for the return
list).
Again, we need to use the fusible version of mapM_:
import Prelude hiding (filter,sum,enumFromTo,mapM_,sequence_,map,foldr)
import Data.List.Stream
import Data.Stream (enumFromToInt,unstream)
enumFromTo i j = unstream (enumFromToInt i j)
mapM_ f as = sequence_ (map f as)
sequence_ ms = foldr (>>) (return ()) ms
-- ^ fuse happily
default(Int)
main = mapM_ print $ filter (\i -> i == sum (divisors i)) (enumFromTo 1 10000)
divisors i = filter (\j -> i `rem`j == 0) (enumFromTo 1 (i-1))
And we see the map and foldr fuse in sequence, which in turn fuses with the
filter (and rest of the program):
18 RuleFired
5 STREAM stream/unstream fusion
2 filter -> fusible
1 foldr -> fusible
1 map -> fusible
1 sumInt -> fusible
The program flattens to a single nested loop,
Main.$wloop_foldr :: GHC.Prim.Int#
-> GHC.Prim.State# GHC.Prim.RealWorld
-> (# GHC.Prim.State# GHC.Prim.RealWorld, () #)
which really is equivalent in terms of control flow and intermediate structures
to the C program.
-- Don
I tried the example with Int64 and Integer. The integer version
was actually quicker ... which is the reason I decided to post
the results.
C++ version times: 1.125; 1.109; 1.125
Int32 cpu times: 3.203; 3.172; 3.172
Int64 cpu times: 11.734; 11.797; 11.844
Integer cpu times: 9.609; 9.609; 9.500
Interesting that Int64 is *slower* than Integer.
On the other side the C version is not that much quicker. I guess
the Haskell version is using generic versions of mod and sum
(since they are from a library) which would mean indirect calls.
The Haskell version probably also creates the list nodes ...
even when they get almost immediately garbage collected.
Thanks for pointing out Int64 sucks so much :)
Peter.
With -O2 ghc will only have a function call to 'sum', with -O2 and the
list stream fusion library, there are no indirect calls and the entire
program is specialised to a nested loop.
Do you have your C++ program handy? I'd expect the fully fused version
to be somewhere between 1 and 2x slower.
-- Don
Ooops, my results ware wrong (nonoptimizing ms cl
compiler used and I used -O instead of -O2 in ghc).
C++ version times: 1.109; 1.125; 1.125
Int32 cpu times: 1.359; 1.359; 1.375
Int64 cpu times: 11.688; 11.719; 11.766
Integer cpu times: 9.719; 9.703; 9.703
Great result from ghc.
What Haskell program were you using for this test? The original
naive/high level implementation?
-- Don
Here it is (I played only with the types for divisors and perfect;
bottom is there from my previous tests, but it should not matter):
<---cut--->
module Main (bottom, divisors, perfect, main) where
import Data.Int
bottom = error "_|_"
divisors :: Int -> [Int]
divisors i = [j | j<-[1..i-1], i `mod` j == 0]
perfect :: [Int]
perfect = [i | i<-[1..10000], i == sum (divisors i)]
main = print perfect
<---cut--->
and here is the C++ version:
<---cut--->
#include <iostream>
using namespace std;
int main() {
for (int i = 1; i <= 10000; i++) {
int sum = 0;
for (int j = 1; j < i; j++)
if (i % j == 0)
sum += j;
if (sum == i)
cout << i << " ";
}
return 0;
}
<---cut--->
OS winxp 64 bit
ghc v. 6.6.1 (optins -O2)
MS cl.exe version 13.10.3077 (options /G7 /MD)
Peter.
And I had cl options wrong too - I need to start to
optimize not only to set the optimization target.
/G7 /MD -> 1.109; 1.125; 1.125
/Ox /G7 /MD -> 0.922; 0.984; 0.984
Still it does not change the results too much.
Try with rem instead of mod.
(What the heck is with bottom?)
Quoting Don Stewart <do...@galois.com>:
> That fits with my experience writing low level numeric code -- Integer
> can be a killer.
Mind you, if you're intending to work with large integers or rationals,
Integer is great! The bottleneck is almost always "show".
Cheers,
Andrew Bromage
This should be a little faster , as sum will fuse,
perfect :: [Int]
perfect = [i | i<-[1..10000], i == sum' (divisors i)]
where sum' = foldr (+) 0
sum' did not help. Times are about the same with Int type.
Peter.
The bottom was there by error and I was lazy to redo
the tests so I rather posted exactly what I was
doing. I do not know the compiler that good to be
absolutely sure it cannot have impact on result
... so I rather did not doctor what I did :-)
Ok, rem did help quite a bit. Any comments why it is so?
Here are summary of results for those interested. I run
all the tests once again. Haskell was about 13% slower
than C++.
MS cl.exe options: /Ox /G7 /MD
ghc options: -O2
C++ version: 1.000; 0.984; 0.984
Haskell version specialized to Int: 1.125; 1.125; 1.109
Haskell version specialized to Integer: 8.781; 8.813; 8.813
Haskell version specialized to Int64: 9.781; 9.766; 9.831
The code:
% cat b.hs
module Main (divisors, perfect, main) where
import Data.Int
divisors :: Int -> [Int]
divisors i = [j | j<-[1..i-1], i `rem` j == 0]
perfect :: [Int]
perfect = [i | i<-[1..10000], i == sum (divisors i)]
main = print perfect
% cat b.cpp
#include <iostream>
using namespace std;
int main() {
for (int i = 1; i <= 10000; i++) {
int sum = 0;
for (int j = 1; j < i; j++)
if (i % j == 0)
sum += j;
if (sum == i)
cout << i << " ";
}
return 0;
}
%
Why do you expose perfect and divisors? Maybe if you just expose main,
perfect and divisors will be inlined (although this will only save
10,000 function entries, so will probably have negligible effect).
Rodrigo
I exposed them so that I can check types in ghci.
Hiding them does not seem to have noticeable effect.
Thanks,
Peter.
I had done no optimization, but neither -O nor -O2 make a significant
difference in either the C or Haskell programs. But using `rem` instead
of `mod`, together with the type annotation, makes the two programs take
pretty much the same amount of time. --PR
just to compare the stuff, I get quite other results being on other
OS. Thus, the result of C++ compiler may not be that interesting as I do
not have the one presented below.
My machine:
Linux 2.6.23-ARCH #1 SMP PREEMPT Mon Oct 22 12:50:26 CEST 2007 x86_64
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz GenuineIntel GNU/Linux
Compilers:
g++ --version
g++ (GCC) 4.2.2
Copyright (C) 2007 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
ghc --version
The Glorious Glasgow Haskell Compilation System, version 6.6.1
Measurement:
compiled with ghc -O2
time ./mainInteger
real 0m4.866s
user 0m4.843s
sys 0m0.020s
compiled with ghc -O2
time ./mainInt64
real 0m2.213s
user 0m2.210s
sys 0m0.003s
compiled with ghc -O2
time ./mainInt
real 0m1.149s
user 0m1.143s
sys 0m0.003s
compiled with g++ -O3
time ./mainC
real 0m0.271s
user 0m0.270s
sys 0m0.000s
I don't know what is the reason, but the difference between Int, Int64
and Integer is not that dramatic as in example below, nevertheless, the
difference between GHC and GNU C++ is very bad.... :-\
Dusan
--
Dusan Kolar tel: +420 54 114 1238
UIFS FIT VUT Brno fax: +420 54 114 1270
Bozetechova 2 e-mail: ko...@fit.vutbr.cz
Brno 612 66
Czech Republic
--
I can believe that. Integer is actually optimised for small values:
there's a specialised representation for values that fit in a single word
that avoids calling out to the GMP library.
As Stefan pointed out, there's a lot of room to improve the performance of
Int64, it's just never been a high priority.
Cheers,
Simon
Just to chime in, my results with the code below:
dafis@linux:~> uname -a
Linux linux 2.4.20-4GB-athlon #1 Mon Mar 17 17:56:47 UTC 2003 i686 unknown
unknown GNU/Linux
on a 1200 MHz Duron
g++ is version 3.3, C++ code compiled with -O3, Haskell with -O2 (GHC 6.6.1)
dafis@linux:~> time ./mainC
6 28 496 8128
real 0m1.945s
user 0m1.910s
sys 0m0.010s
dafis@linux:~> time ./mainInt
[6,28,496,8128]
real 0m2.407s
user 0m2.300s
sys 0m0.010s
dafis@linux:~> time ./mainInt64
[6,28,496,8128]
real 0m24.009s
user 0m23.900s
sys 0m0.050s
dafis@linux:~> time ./mainInteger
[6,28,496,8128]
real 0m21.555s
user 0m20.870s
sys 0m0.010s
So Int is not so much slower than C, Int64 and Integer dramatically slower
with Integer beating Int64 here, too.
Cheers,
Daniel
> > The code:
Peter.
..
> So almost identical types to the C program (bar for the return [Int]).
>
> Finally, we can manually translate the C code into a confusing set of nested
> loops with interleaved IO,
how lazy is `print` supposed to be? If it's strict, we need to return /
accumulate a list (and in this case, that is not very time-consuming
because there are only four numbers in the list)
from http://haskell.org/onlinereport/standard-prelude.html
print x = putStrLn (show x)
show on lists is lazy and doesn't matter that it's not for Ints
putStrLn :: String -> IO ()
putStrLn s = do putStr s
putStr "\n"
putStr :: String -> IO ()
putStr s = mapM_ putChar s
okay, mapM_ makes it lazy.
putChar :: Char -> IO ()
putChar = primPutChar
(It's easy to get confused with OS buffering effects too...)
so it should ALL be able to fuse, with no intermediate lists, I think
(not sure about [Char] from show... or whether fusion works with IO
sequencing...)
Isaac