enumFromTo m n = unfoldr f m where f k | k <= n = Just (k,k+1) | otherwise = Nothing
main = print . sum $ enumFromTo 1 (10^9 :: Int)
---- snip ----
dolio@zeke % time ./Test 500000000500000000 /Test 3.12s user 0.03s system 80% cpu 3.922 total dolio@zeke % time ./Test-sum0 500000000500000000 /Test-sum0 3.47s user 0.02s system 80% cpu 4.348 total dolio@zeke % time ./Test-sum0 500000000500000000 /Test-sum0 3.60s user 0.02s system 90% cpu 4.009 total dolio@zeke % time ./Test 500000000500000000 /Test 3.11s user 0.02s system 81% cpu 3.846 total
---- snip ----
"Test-sum0" is with the sum0 function
"Test" is the code at the top of this mail.
-fvia-c -optc-O3 didn't seem to make a big difference with either Haskell example, so they're both with the default backend.
Your C++ code runs slowly on my system (around 1 second), but that's because it uses 32-bit ints, I guess (switching to long int sped it up).
Sorry for replying to myself, but I got suspicious about the 6ms runtime of the 64-bit C++ code on my machine. So I looked at the assembly and found this:
I'm no assembly guru, but that makes me think that there's no actual computation going on in the runtime for the 64-bit C++ program, whereas the 32-bit one is clearly doing work on my system, since it takes around 1 second.
Not that I'd be sad if GHC could reduce that whole constant at compile time, but GCC isn't doing 1 billion adds in 6 (or even 60) milliseconds. _______________________________________________ Haskell-Cafe mailing list Haskell-C...@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Nice! Now we know that gcc can calculate faster than Haskell can calculate and print. Next time, use exitWith, please.
-- (c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited.
When I change the C++ program into: int n; scanf("%d", &n);
for(i=0; i<n;i++) { sum += i;
}
GCC need 100 milliseconds on my 3.0GHz new Xeon with loop unrolling enabled.
Without loop unrolling GCC needs about 635ms
Visual C++ does it in 577 ms, generating the following code:
loop: add rbx,rax inc rax cmp rax,rcx jl loop
GHC with -O2 -fvia-c (the fastest I could make it) needs
13075 for the naive sum 2100 ms with sum0 2018 ms using the stream-fusion
Interesting to see that the stream-fusion was slower when not doing -fvia-c (more than twice as slow with -O)
So GHC is about 3 to 4 times slower as Visual C++ / GCC without loop unrolling, which is not too bad since GHC does not perform register optimization and loop unrolling yet no?
> Nice! Now we know that gcc can calculate faster than Haskell can > calculate and print. Next time, use exitWith, please.
> -- > (c) this sig last receiving data processing entity. Inspect headers > for copyright history. All rights reserved. Copying, hiring, renting, > performance and/or quoting of this signature prohibited.
>>>>> "Peter" == Peter Verswyvelen <bugf...@gmail.com> writes:
Peter> So GHC is about 3 to 4 times slower as Visual C++ / GCC Peter> without loop unrolling, which is not too bad since GHC does Peter> not perform register optimization and loop unrolling yet Peter> no?
I would call it rather poor.
And I don't accept a since of that form as valid mitigation. -- Colin Adams Preston Lancashire _______________________________________________ Haskell-Cafe mailing list Haskell-C...@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
> Friday, February 20, 2009, 5:44:44 PM, you wrote:
> > Nice! Now we know that gcc can calculate faster than Haskell can > > calculate and print. Next time, use exitWith, please.
> it was done in order to simplify sources. are you really believe that > ghc needs more than 1 millisecond to print one number? :)
Well, I know that (Show a) is about as slow as you can get.
-- (c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited.
Well C# does it with a for loop in 2300ms, and when using a IEnumerable sequence it needs 19936ms. Very much like the Haskell code. But of course the Haskell code could optimize the sum I guess, I assume it is using the lazy version of sum by default.
Anyway it was more of a question. Does GHC perform register allocation (e.g. using graph colouring) and loop unrolling?
On Fri, Feb 20, 2009 at 4:22 PM, Colin Paul Adams <co...@colina.demon.co.uk>wrote:
> >>>>> "Peter" == Peter Verswyvelen <bugf...@gmail.com> writes:
> Peter> So GHC is about 3 to 4 times slower as Visual C++ / GCC > Peter> without loop unrolling, which is not too bad since GHC does > Peter> not perform register optimization and loop unrolling yet > Peter> no?
> I would call it rather poor.
> And I don't accept a since of that form as valid mitigation. > -- > Colin Adams > Preston Lancashire
> Well C# does it with a for loop in 2300ms, and when using a > IEnumerable sequence it needs 19936ms. Very much like the Haskell > code. But of course the Haskell code could optimize the sum I guess, > I assume it is using the lazy version of sum by default.
the question is what is the natural for every language
> Anyway it was more of a question. Does GHC perform register > allocation (e.g. using graph colouring) and loop unrolling?
afaik, ghc can be compared with 20-years old C compilers. it uses registers for performing tight loops but has very simple register allocation procedure. also it doesn't unroll loops
> On Fri, Feb 20, 2009 at 4:22 PM, Colin Paul Adams <co...@colina.demon.co.uk> wrote:
>>>>>> "Peter" == Peter Verswyvelen <bugf...@gmail.com> writes:
> Peter> So GHC is about 3 to 4 times slower as Visual C++ / GCC > Peter> without loop unrolling, which is not too bad since GHC does > Peter> not perform register optimization and loop unrolling yet > Peter> no?
> I would call it rather poor.
> And I don't accept a since of that form as valid mitigation. > -- > Colin Adams > Preston Lancashire
-- Best regards, Bulat mailto:Bulat.Zigans...@gmail.com
On Fri, Feb 20, 2009 at 6:39 AM, Dan Doel <dan.d...@gmail.com> wrote: > Sorry for replying to myself, but I got suspicious about the 6ms runtime of > the 64-bit C++ code on my machine. So I looked at the assembly and found > this:
> I'm no assembly guru, but that makes me think that there's no actual > computation going on in the runtime for the 64-bit C++ program, whereas the > 32-bit one is clearly doing work on my system, since it takes around 1 > second.
> Not that I'd be sad if GHC could reduce that whole constant at compile > time, > but GCC isn't doing 1 billion adds in 6 (or even 60) milliseconds.
The GCC optimizer must know that you can't return a value to user space of that large as a return result.
In Haskell you're printing it... why not print it in C++?
> Friday, February 20, 2009, 6:34:04 PM, you wrote:
> > Well C# does it with a for loop in 2300ms, and when using a > > IEnumerable sequence it needs__19936ms. Very much like the Haskell > > code. But of course the Haskell code could optimize the sum I guess, > > I assume it is using the lazy version of sum by default.
> the question is what is the natural for every language
> > Anyway it was more of a question.__Does GHC perform register > > allocation (e.g. using graph colouring) __and loop unrolling?
> afaik, ghc can be compared with 20-years old C compilers. it uses > registers for performing tight loops but has very simple register > allocation procedure. also it doesn't unroll loops
hmmm... do we have magic-hash vector types and folds and maps on them? I'm only asking because gcc fails to use _anything_ but plain registers.
-- (c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited.
On Friday 20 February 2009 10:52:03 am David Leimbach wrote:
> The GCC optimizer must know that you can't return a value to user space of > that large as a return result.
> In Haskell you're printing it... why not print it in C++?
I actually changed my local copy to print out the result (since I wanted to make sure it was using 64 bit ints). It didn't make a difference in the timing (of either the 32 or 64 bit version). _______________________________________________ Haskell-Cafe mailing list Haskell-C...@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Now, given GHC gets most of the way there -- I think this might make a good bug report against GHC head, so we can see if the new register allocator helps any.
>> main = print $ sum[1..10^9::Int] > This won't be comparable to your loop below, as 'sum' is a left fold > (which doesn't fuse under build/foldr). > You should use the list implementation from the stream-fusion package (or > uvector) if you're expecting it to fuse to the following loop:
it was comparison of native haskell, low-level haskell (which is harder to write than native C) and native C. stream-fusion and any other packages provides libraries for some tasks but they can't make faster maps, for example. so i used plain list
> Which seems ... OK.
really? :D
> Well, that's a bit different. It doesn't print the result, and it returns a different > results on 64 bit....
doesn't matter for testing speed
> I don't get anything near the 0.062s which is interesting.
it was beautiful gcc optimization - it added 8 values at once. with xor results are:
xor.hs 12.605 xor-fast.hs 1.856 xor.cpp 0.339
> The print statement slows things down, I guess...
are you really believe that printing one number needs so much time? :)
why not compare to ghc -O0? also you can disable loop unrolling in gcc and unroll loops manually in haskell. or you can generate asm code on the fly. there are plenty of tricks to "prove" that gcc generates bad code :D
> So. some lessons. GHC is around 3-4x slower on this tight loop. (Which isn't as > bad as it used to be).
really? what i see: low-level haskell code is usually 3 times harder to write and 3 times slower than gcc code. native haskell code is tens to thousands times slower than C code (just recall that real programs use type classes and monads in addition to laziness)
> That's actually a worse margin than any current shootout program, where we are no > worse than 2.9 slower on larger things:
1) most benchmarks there depend on libraries speed. in one test, for example, php is winner 2) for the sum program ghc libs was modified to win in benchmark 3) the remaining 1 or 2 programs that measure speed of ghc-generated code was hardly optimized using low-level code, so they don't have anything common with real haskell code most of us write every day
> Now, given GHC gets most of the way there -- I think this might make a good bug > report against GHC head, so we can see if the new register allocator helps any.
you mean that 6.11 includes new allocator? in that case you can test it too
i believe that ghc developers are able to test sum performance without my bugreports :D
-- Best regards, Bulat mailto:Bulat.Zigans...@gmail.com
> Friday, February 20, 2009, 7:41:33 PM, you wrote:
> >> main = print $ sum[1..10^9::Int]
> > This won't be comparable to your loop below, as 'sum' is a left fold > > (which doesn't fuse under build/foldr).
> > You should use the list implementation from the stream-fusion package (or > > uvector) if you're expecting it to fuse to the following loop:
> it was comparison of native haskell, low-level haskell (which is > harder to write than native C) and native C. stream-fusion and any > other packages provides libraries for some tasks but they can't make faster > maps, for example. so i used plain list
Hmm? Maybe you're not familiar with the state of the art?
$ time ./A 500000000500000000 ./A 0.97s user 0.01s system 99% cpu 0.982 total
Now, (trying to avoid the baiting...) this is actually *very* interesting. Why is this faster than the manual recursion we did earlier why do we get better assembly? Again, if you stick to specifics, there's some interesting things we can learn here.
> > Which seems ... OK.
> really? :D
No, see above.
> > I don't get anything near the 0.062s which is interesting.
> it was beautiful gcc optimization - it added 8 values at once. with > xor results are:
> why not compare to ghc -O0? also you can disable loop unrolling in gcc > and unroll loops manually in haskell. or you can generate asm code on > the fly. there are plenty of tricks to "prove" that gcc generates bad > code :D
No, we want to show (I imagine) that GHC is within a factor or two of "C". I usually set my benchmark to beat gcc -O0 fwiw, and then to hope to be within 2x of optimised C. I'm not sure what you're standards are.
> > So. some lessons. GHC is around 3-4x slower on this tight loop. (Which isn't as > > bad as it used to be).
> really? what i see: low-level haskell code is usually 3 times harder > to write and 3 times slower than gcc code. native haskell code is tens > to thousands times slower than C code (just recall that real programs > use type classes and monads in addition to laziness)
"thousands times", now you're just undermining your own credibility here. Stick to what you can measure. If anything we'd expect GCC's magic loop skillz to be less useful on large code bases.
> > That's actually a worse margin than any current shootout program, where we are no > > worse than 2.9 slower on larger things:
> 1) most benchmarks there depend on libraries speed. in one test, for > example, php is winner > 2) for the sum program ghc libs was modified to win in benchmark
It is interesting that the < 2.9x slower in the shootout is pretty much what we found in this benchmark too.
> 3) the remaining 1 or 2 programs that measure speed of ghc-generated > code was hardly optimized using low-level code, so they don't have > anything common with real haskell code most of us write every day
Depends on where you work.
> > Now, given GHC gets most of the way there -- I think this might make a good bug > > report against GHC head, so we can see if the new register allocator helps any.
> you mean that 6.11 includes new allocator? in that case you can > test it too
> i believe that ghc developers are able to test sum performance without my > bugreports :D
No! This is not how open source works! You *should submit bug reports* and *analysis*. It is so so much more useful than complaining and throwing stones.
Don Stewart <d...@galois.com> wrote: > No! This is not how open source works! You *should submit bug > reports* and *analysis*. It is so so much more useful than > complaining and throwing stones.
Exactly. I don't know where, but I read that the vast majorities of Linux bugs are reported, nailed, and then fixed, by at least three different persons: The first reports a misbehaviour, the second manages to find it surfacing in a certain line of code, the third instantly knows how to make it go away.
-- (c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited.