
benchmarks? java vs .net


King Raz

May 30, 2008, 11:28:07 PM
The shootout site has benchmarks comparing different languages. It
includes C# Mono vs Java but not C# .NET vs Java. So I went through
all the benchmarks on the site ...

http://kingrazi.blogspot.com/2008/05/shootout-c-net-vs-java-benchmarks.html

Just to keep the post on topic for my friends at comp.lang.c++: how do
I play default Windows sounds with C++?

Andrea Francia

May 31, 2008, 2:09:12 PM

This guy has trolled the comp.lang.java.programmer and comp.lang.c++
newsgroups for many weeks with pretentious posts about the performance
comparison of C++ and Java.
The comparison was done unscientifically, with the sole purpose of
getting replies.
To get replies, he/she sends many replies to each message he/she receives.

To reach this target, he/she has changed his/her email address many
times to avoid killfiles.

I suggest that people who are not interested, and who have Thunderbird,
install the "Right click ignore/watch thread" extension and use it to
ignore this thread.

--
Andrea Francia
http://andreafrancia.blogspot.com/

Razii

May 31, 2008, 3:24:10 PM
On Sat, 31 May 2008 18:09:12 GMT, Andrea Francia
<andrea....@REMOVE-FROM-HERE.ohoihihoihoih.TO-HERE.gmx.it> wrote:

>This guy had trolled the comp.lang.java.programmer and comp.lang.c++
>newsgroup for many weeks with pretentious post about the performance
>comparison of C++ and Java.

This has got to be the funniest post. She (or is that he?) cross-posts
(spams?) three newsgroups with the subject "Please don't feed the
troll", and this post was the only response to my thread. Thanks for
feeding me. She/He also quotes my entire post, so anyone who missed my
first post has now read it :) Now I know at least one person read my
post; that is the brilliant Andrea Francia, for Italy :)

Were you born this stupid or did someone hit you with a baseball bat?

Razii

May 31, 2008, 8:40:20 PM
On Sat, 31 May 2008 14:24:10 -0500, Razii <fg...@mail.com> wrote:

>This has got to be the funniest post. She (or is that he?) cross-posts
>(spams?) three newsgroups with the subject "Please don't feed the
>troll", and this post was the only response to my thread. Thanks for
>feeding me. She/He also quotes my entire post, so anyone who missed my
>first post has now read it :) Now I know at least one person read my
>post; that is the brilliant Andrea Francia, for Italy :)

Arrrgh. I meant "from Italy"

In any case, if you think I am a troll, why would you feed me by
responding to my post? Why would you quote my entire post, advertising
it for free? People who missed the first post must have read it by now,
thanks to Andrea Francia.

Well, it seems the programming-related newsgroups have some of
the dumbest people on USENET. I wonder why?

Jeff Higgins

May 31, 2008, 8:45:57 PM

Andrea Francia wrote:

> I suggest...

Thanks for the respam. I'd never have been aware of this
post had it not been for your announcement and sage advice.

I'm sure k.rzi thanks you for reposting a link to his site.


Jon Harrop

Jun 2, 2008, 5:22:29 AM

Yes. I have ported a variety of benchmarks and found there is no significant
difference between the performance of .NET and Java. The one result that
stood out was .NET being 2x faster at the Mersenne Twister PRNG.

--
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com/products/?u

Razii

Jun 2, 2008, 4:56:24 PM
On Mon, 02 Jun 2008 10:22:29 +0100, Jon Harrop <j...@ffconsultancy.com>
wrote:

>Yes. I have ported a variety of benchmarks and found there is no significant
>difference between the performance of .NET and Java. The one result that
>stood out was .NET being 2x faster at the Mersenne Twister PRNG.

.NET is twice as slow in four benchmarks: binarytrees, mandelbrot,
regexdna, sumcol. .NET was twice as fast only in the trig-related
benchmark.


Jon Harrop

Jun 2, 2008, 8:23:00 PM

You have not optimized the .NET code:

On the mandelbrot benchmark, most of the time is spent doing unbuffered IO.
Use buffered IO and .NET becomes ~10% faster than Java.

On the regexdna benchmark, the regular expressions are not being compiled.
Ask for compilation and .NET is <20% slower than Java.

The sumcol benchmark is also using unbuffered IO.

So neither VM is significantly faster overall.

Razii

Jun 2, 2008, 9:21:04 PM
On Tue, 03 Jun 2008 01:23:00 +0100, Jon Harrop <j...@ffconsultancy.com>
wrote:

>On the mandelbrot benchmark, most of the time is spent doing unbuffered IO.
>Use buffered IO and .NET becomes ~10% faster than Java.

The output was directed to /NULL so I don't see how buffering makes
the difference. In any case, post the version that makes it 10%
faster. Right now it's 2 times slower.

>On the regexdna benchmark, the regular expressions are not being compiled.
>Ask for compilation and the .NET is <20% slower than Java.

The version that compiles the regex is slower than the version that
doesn't compile on Mono:

http://shootout.alioth.debian.org/gp4/benchmark.php?test=regexdna&lang=all

That's because the regular expression is used only once and compiling
it makes it slower. I doubt the compiled version will be faster on
.NET. It will be slower.

>The sumcol benchmark is also using unbuffered IO.

Post the right version that buffers then.

>So neither VM is significantly faster overall.

.NET is 2 times slower in 4 of the benchmarks. Post the version that
makes it faster so we can verify. You haven't done it yet.

Razii

Jun 2, 2008, 9:35:44 PM
On Mon, 02 Jun 2008 20:21:04 -0500, Razii <klgd...@mail.com> wrote:

>The version that compiles regex is slower that then the version that
>doesn't compile on Mono
>
>http://shootout.alioth.debian.org/gp4/benchmark.php?test=regexdna&lang=all

OK, I did try the compiled version on .NET and it was faster than the
uncompiled version.

9.539s (uncompiled)
5.165s (compiled)

The Java version was 3.956s.
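For comparison, the Java analogue of the compile-once advice is reusing a `Pattern` instead of recompiling it per call; a hypothetical sketch (the class, method, and test string are invented for illustration, this is not the regexdna code):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Java's analogue of the compiled-regex advice: compile a Pattern once
// and reuse it, instead of recompiling on every call.
public class RegexReuse {
    private static final Pattern VOWEL = Pattern.compile("[aeiou]");

    static int countVowels(String s) {
        Matcher m = VOWEL.matcher(s);   // cheap: reuses the compiled Pattern
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    public static void main(String[] args) {
        System.out.println(countVowels("agggtaaa"));
    }
}
```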


Razii

Jun 2, 2008, 10:03:12 PM
On Tue, 03 Jun 2008 01:23:00 +0100, Jon Harrop <j...@ffconsultancy.com>
wrote:

>The sumcol benchmark is also using unbuffered IO.

I changed the line to

using (StreamReader r = new StreamReader(Console.OpenStandardInput(8192)))

Buffering 8192 bytes now?

It made no difference.

4.861s (unbuffered)
4.839s (buffered)

still two times slower.

Razii

Jun 2, 2008, 10:06:57 PM
On Tue, 03 Jun 2008 01:23:00 +0100, Jon Harrop <j...@ffconsultancy.com>
wrote:

>On the mandelbrot benchmark, most of the time is spent doing unbuffered IO.
>Use buffered IO and .NET becomes ~10% faster than Java.

Doesn't this line

Stream s = Console.OpenStandardOutput(1024);

from

http://shootout.alioth.debian.org/gp4/benchmark.php?test=mandelbrot&lang=csharp&id=2

mean it's already buffering 1024 bytes?

vs this Java version

http://shootout.alioth.debian.org/gp4/benchmark.php?test=mandelbrot&lang=javaxx&id=0

it's 3.4 times slower.

Mark Thornton

Jun 3, 2008, 2:55:07 AM

Rather amusing really as unintentional use of unbuffered IO is a
frequent cause of Java benchmarks running more slowly than they should.
It seems that .NET copied that characteristic as well.

Mark Thornton
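The cost being described here is easy to see by counting how often an unbuffered stream hits its sink versus a buffered one; a hypothetical Java sketch (the `CountingStream` class is invented for illustration):

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Counts how many times the underlying sink is actually hit.
class CountingStream extends OutputStream {
    int calls = 0;
    @Override public void write(int b) { calls++; }
    @Override public void write(byte[] b, int off, int len) { calls++; }
}

public class BufferDemo {
    public static void main(String[] args) throws IOException {
        int n = 100000;

        // Unbuffered: one call into the sink per byte written.
        CountingStream raw = new CountingStream();
        for (int i = 0; i < n; i++) raw.write('x');

        // Buffered: the same writes collapse into a handful of bulk calls.
        CountingStream sink = new CountingStream();
        OutputStream buf = new BufferedOutputStream(sink, 8192);
        for (int i = 0; i < n; i++) buf.write('x');
        buf.flush();                    // drain the final partial buffer

        System.out.println(raw.calls);  // 100000 single-byte calls
        System.out.println(sink.calls); // 13 bulk calls (ceil(100000/8192))
    }
}
```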

Jon Harrop

Jun 3, 2008, 4:12:37 AM
Mark Thornton wrote:
> Rather amusing really as unintentional use of unbuffered IO is a
> frequent cause of Java benchmarks running more slowly than they should.
> It seems that .NET copied that characteristic as well.

Yes. I've no idea why they do that. Isn't buffered IO a better default?!

Razii

Jun 3, 2008, 7:49:28 AM
On Tue, 03 Jun 2008 07:55:07 +0100, Mark Thornton
<mtho...@optrak.co.uk> wrote:

>Rather amusing really as unintentional use of unbuffered IO is a
>frequent cause of Java benchmarks running more slowly than they should.
>It seems that .NET copied that characteristic as well.

Harpo was wrong though. mandelbrot is buffered

Stream s = Console.OpenStandardOutput(1024);

It's still 3.4 times slower.

Verify it yourself.

http://shootout.alioth.debian.org/gp4/benchmark.php?test=mandelbrot&lang=csharp&id=2

vs

http://shootout.alioth.debian.org/gp4/benchmark.php?test=mandelbrot&lang=javaxx&id=0


3.4 times slower.

Jon Harrop

Jun 3, 2008, 11:08:17 AM
Razii wrote:
> On Tue, 03 Jun 2008 01:23:00 +0100, Jon Harrop <j...@ffconsultancy.com>
> wrote:
>>On the mandelbrot benchmark, most of the time is spent doing unbuffered
>>IO. Use buffered IO and .NET becomes ~10% faster than Java.
>
> The output was directed to /NULL so I don't see how buffering makes
> the difference. In any case, post the version that makes it 10%
> faster. Right now it's 2 times slower.

Change:

Stream s = Console.OpenStandardOutput(1024);

to:

BufferedStream s = new BufferedStream(Console.OpenStandardOutput());

and the entire program becomes 2.5x faster.

>>On the regexdna benchmark, the regular expressions are not being compiled.
>>Ask for compilation and the .NET is <20% slower than Java.
>

> Ok, I did try compiled version on .NET and it was faster than
> uncompiled version.
>
> 9.539s (uncompiled)
> 5.165s (compiled)
>
> the java version was 3.956s

Yes.

>>So neither VM is significantly faster overall.
>
> .NET is 2 times slower in 4 of the benchmarks.

Java is only 2x faster in one benchmark (binarytrees) here, and that is not a
well-formed benchmark. Java is still over 2x slower on both partialsums and
the Mersenne Twister.

So you cannot reasonably conclude that Java is faster.

Jon Harrop

Jun 3, 2008, 11:18:34 AM
Razii wrote:
> On Tue, 03 Jun 2008 07:55:07 +0100, Mark Thornton
> <mtho...@optrak.co.uk> wrote:
>>Rather amusing really as unintentional use of unbuffered IO is a
>>frequent cause of Java benchmarks running more slowly than they should.
>>It seems that .NET copied that characteristic as well.
>
> Harpo was wrong though. mandelbrot is buffered
>
> Stream s = Console.OpenStandardOutput(1024);

Use:

BufferedStream s = new BufferedStream(Console.OpenStandardOutput());

> It's still 3.4 times slower.

Not if you optimize it properly.

> Verify it yourself.

I did.

There is no merit in comparing unoptimized code on .NET with optimized Java.

Java is also not significantly faster on any of the SciMark2 benchmarks. The
only significant difference I have seen is on the Mersenne Twister PRNG,
where Java is 2x slower than .NET.
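For readers who want to reproduce the PRNG comparison: the Mersenne Twister under discussion is presumably the standard MT19937. A minimal Java sketch using the reference constants and default seed (illustrative code, not an implementation from either benchmark suite):

```java
// Minimal MT19937 (Mersenne Twister) sketch, standard reference constants.
public class MersenneTwister {
    private final int[] mt = new int[624];
    private int mti;

    public MersenneTwister(int seed) {
        mt[0] = seed;
        for (mti = 1; mti < 624; mti++)
            mt[mti] = 1812433253 * (mt[mti - 1] ^ (mt[mti - 1] >>> 30)) + mti;
    }

    public int nextInt() {
        if (mti >= 624) {               // regenerate the whole state block
            for (int i = 0; i < 624; i++) {
                int y = (mt[i] & 0x80000000) | (mt[(i + 1) % 624] & 0x7fffffff);
                mt[i] = mt[(i + 397) % 624] ^ (y >>> 1)
                        ^ ((y & 1) != 0 ? 0x9908b0df : 0);
            }
            mti = 0;
        }
        int y = mt[mti++];
        y ^= y >>> 11;                  // tempering
        y ^= (y << 7) & 0x9d2c5680;
        y ^= (y << 15) & 0xefc60000;
        y ^= y >>> 18;
        return y;
    }

    public static void main(String[] args) {
        MersenneTwister r = new MersenneTwister(5489);  // reference default seed
        System.out.println(r.nextInt() & 0xFFFFFFFFL);  // print as unsigned
    }
}
```

With the reference seed 5489 the first 32-bit output should be 3499211612, which makes it easy to check a port before timing it.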

John B. Matthews

Jun 3, 2008, 11:33:35 AM
In article <epba4495e658920oe...@4ax.com>,
Razii <klgfsdf...@mail.com> wrote:

[...]
> Verify it yourself.
>
> http://shootout.alioth.debian.org/gp4/benchmark.php?test=mandelbrot&lang=csharp&id=2
>
> vs
>
> http://shootout.alioth.debian.org/gp4/benchmark.php?test=mandelbrot&lang=javaxx&id=0
>
> 3.4 times slower.

Indeed, I see C#/Mono = 7.15s and Java = 3.22s. I would say that Java
took (3.22s / 7.15s = 0.450) less than half as long as C#/Mono, or that
Java was (7.15s / 3.22s = 2.220) over twice as fast. I don't see where
3.4 comes from. Just asking.

John
--
John B. Matthews
trashgod at gmail dot com
home dot woh dot rr dot com slash jbmatthews

Jon Harrop

Jun 3, 2008, 11:39:25 AM
John B. Matthews wrote:
> Indeed, I see C#/Mono = 7.15s and Java = 3.22s. I would say that Java
> took (3.22s / 7.15s = 0.450) less that half as long as C#/Mono or that
> Java was (7.15s / 3.22s = 2.220) over twice as fast. I don't see where
> 3.4 comes from. Just asking.

FWIW, F#/Mono is 3x slower than F#/.NET on the SciMark2 benchmark. I'd like
to know how performance compares between these platforms for parallel code.

Does Java have anything comparable to .NET's Task Parallel Library?

Jon Skeet [C# MVP]

Jun 3, 2008, 12:19:22 PM
On Jun 3, 4:39 pm, Jon Harrop <j...@ffconsultancy.com> wrote:
> FWIW, F#/Mono is 3x slower than F#/.NET on the SciMark2 benchmark. I'd like
> to know how performance compares between these platforms for parallel code.

If Razii is genuinely comparing Java with Mono (I haven't looked at
any of the figures) it's a silly test to start with (unless you're
specifically interested in Mono, of course). The vast majority of C#
code runs on .NET rather than Mono - and while I applaud the Mono
team's work, I seriously doubt that it has quite as much effort going
into it as Microsoft is putting into .NET. I'd expect .NET to
outperform Mono, and on microbenchmarks like these the difference could
be quite significant in some cases.

> Does Java have anything comparable to .NET's Task Parallel Library?

I haven't used it, but I've heard about Doug Lea's Fork/Join
framework:
http://gee.cs.oswego.edu/dl/papers/fj.pdf

One problem with creating anything like Parallel Extensions (which I
assume is what you meant - AFAIK the TPL is just part of Parallel
Extensions, although I could be misunderstanding you completely) for
Java is that it doesn't (currently) have closures other than anonymous
inner classes which are ugly as hell.

The state of closures in Java 7 is currently up in the air.

Jon
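The anonymous-inner-class ceremony being complained about looks like this; a hypothetical sketch (the `runTwice` helper and the captured value are invented for illustration):

```java
import java.util.concurrent.Callable;

// The pre-closures Java style that any fork/join-like API forces on
// callers: a full anonymous class where a closure would be one expression.
public class InnerClassDemo {
    static int runTwice(Callable<Integer> task) throws Exception {
        return task.call() + task.call();
    }

    public static void main(String[] args) throws Exception {
        final int x = 20;  // captured locals must be (effectively) final
        int result = runTwice(new Callable<Integer>() {
            @Override public Integer call() {  // several lines of ceremony
                return x;                      // ...for one expression
            }
        });
        System.out.println(result);
    }
}
```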

Razii

Jun 3, 2008, 12:49:33 PM
On Tue, 03 Jun 2008 16:08:17 +0100, Jon Harrop <j...@ffconsultancy.com>
wrote:

> Java is still over 2x slower on both partialsums and
>the Mersenne Twister.

The partialsums benchmark is only slower due to the trig accuracy
required by Java's specification. Your C# results are not accurate
enough to satisfy Java's specification. It's easy to solve this
problem. Try this Java version of partialsums. It's two times faster.

http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsums&lang=java&id=4

In binarytrees C# is twice as slow. You haven't posted any workaround
yet.

Razii

Jun 3, 2008, 12:56:13 PM
On Tue, 03 Jun 2008 11:33:35 -0400, "John B. Matthews"
<nos...@nospam.com> wrote:

>I don't see where
>3.4 comes from. Just asking.

Did you use -server flag? It's 7.578 vs 2.198 on my computer.
7.578 / 2.198 = 3.44


Razii

Jun 3, 2008, 1:06:11 PM
On Tue, 03 Jun 2008 16:08:17 +0100, Jon Harrop <j...@ffconsultancy.com>
wrote:

> BufferedStream s = new BufferedStream(Console.OpenStandardOutput());
>
>and the entire program becomes 2.5x faster.

Yes, that was faster. I submitted it to the shootout site. I am not sure
if it will be faster on Mono too, but I will let Gouy test it.


Razii

Jun 3, 2008, 1:16:51 PM
On Tue, 03 Jun 2008 16:18:34 +0100, Jon Harrop <j...@ffconsultancy.com>
wrote:

>There is no merit in comparing unoptimized code on .NET with optimized Java.

That's why I posted it here, so you can "optimize it". You fixed
mandelbrot, but C# is still twice as slow in three other benchmarks:

binarytrees
(the command line argument should be -server -Xms64m binarytrees)

http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=javaxx&id=2
vs
http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=csharp&id=0


revcomp
http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=revcomp&lang=javaxx&id=4
vs
http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=revcomp&lang=csharp&id=2


sumcol
http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=sumcol&lang=javaxx&id=4
vs
http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=sumcol&lang=csharp&id=0


also, C# is significantly slower in recursion

http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=recursive&lang=javaxx&id=0
vs
http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=recursive&lang=csharp&id=0


>I have seen is on the Mersenne Twister PRNG where Java is 2x slower than .NET.

Post the benchmark. Let me see...


Razii

Jun 3, 2008, 1:31:31 PM
On Tue, 3 Jun 2008 09:19:22 -0700 (PDT), "Jon Skeet [C# MVP]"
<sk...@pobox.com> wrote:

>If Razii is genuinely comparing Java with Mono (I haven't looked at
>any of the figures) it's a silly test to start with (unless you're
>specifically interested in Mono, of course).

If you have no clue what this thread is about, why post at all?

Jon Skeet [C# MVP]

Jun 3, 2008, 1:51:10 PM
Razii <klj...@mail.com> wrote:
> >If Razii is genuinely comparing Java with Mono (I haven't looked at
> >any of the figures) it's a silly test to start with (unless you're
> >specifically interested in Mono, of course).
>
> If you have no clue what is this thread about, why post at all?

Well gee, you keep posting URLs comparing Mono with Java. That's *not*
comparing .NET with Java - and is thus relatively pointless IMO. (It's
also made more pointless by you not mentioning in your blog post which
version of Java you're using, or which version of .NET.)

To make the conversation significant (well, as significant as this kind
of thing can be):

1) Don't bother posting about the Mono benchmarks at all. Keep it to
.NET and Java.
2) Explain the precise runtime environments.

The IO suggestions so far seem to be an indication of the general merit
of the benchmark, mind you.

--
Jon Skeet - <sk...@pobox.com>
Web site: http://www.pobox.com/~skeet
Blog: http://www.msmvps.com/jon.skeet
C# in Depth: http://csharpindepth.com

Razii

Jun 3, 2008, 2:20:39 PM
On Tue, 3 Jun 2008 18:51:10 +0100, Jon Skeet [C# MVP]
<sk...@pobox.com> wrote:

>Well gee, you keep posting URLs comparing Mono with Java. That's *not*
>comparing .NET with Java - and is thus relatively pointless IMO. (It's
>also made more pointless by you not mentioning in your blog post which
>version of Java you're using, or which version of .NET.)

The link I posted was this:

http://kingrazi.blogspot.com/

It's clear what it is about.


Jon Harrop

Jun 3, 2008, 2:18:50 PM

It makes no difference on Mono and the shootout does not test .NET.

Razii

Jun 3, 2008, 2:24:15 PM
On Tue, 3 Jun 2008 18:51:10 +0100, Jon Skeet [C# MVP]
<sk...@pobox.com> wrote:

>The IO suggestions so far seem to be an indication of the general merit
>of the benchmark, mind you.

The criteria are the same for both sides. For now C# is slower by a wide
margin in these benchmarks. How about fixing them? If you can't, I
will conclude C# is just slower.

Mark Thornton

Jun 3, 2008, 2:32:38 PM
Jon Harrop wrote:
> Mark Thornton wrote:
>> Rather amusing really as unintentional use of unbuffered IO is a
>> frequent cause of Java benchmarks running more slowly than they should.
>> It seems that .NET copied that characteristic as well.
>
> Yes. I've no idea why they do that. Isn't buffered IO a better default?!
>

I think the reasoning is simplicity --- you create what you want through
composition of a smaller number of classes. Otherwise you need more
classes or extra options on constructors to provide for all possible
cases. Unfortunately they spoiled this a bit by providing a few
composites (java.io.FileReader for example). It is also a bit confusing
that InputStreamReader includes some buffering, but adding buffering to
an already buffered stream doesn't usually increase the cost by much.

Mark Thornton
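The composition style described above looks like this in practice; a small sketch (the input data is invented for illustration) that layers decoding and buffering by wrapping:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Composition by wrapping: bytes -> character decoding -> buffering,
// each layer added with a wrapper instead of one many-option constructor.
public class ComposeDemo {
    public static void main(String[] args) throws IOException {
        InputStream bytes = new ByteArrayInputStream(
                "10\n20\n30\n".getBytes(StandardCharsets.US_ASCII));
        BufferedReader r = new BufferedReader(              // adds buffering
                new InputStreamReader(bytes, StandardCharsets.US_ASCII));
        int sum = 0;
        String line;
        while ((line = r.readLine()) != null)
            sum += Integer.parseInt(line);
        System.out.println(sum);
    }
}
```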

Jon Harrop

Jun 3, 2008, 2:33:41 PM
Razii wrote:
> The criteria is same for both sides. For now C# is slower by wide
> margin in these benchmarks. How about fixing them?

I did.

> If you can't, I will conclude C# is just slower.

You had drawn that conclusion before you even tried to measure anything.

Mark Thornton

Jun 3, 2008, 2:37:53 PM
Jon Skeet [C# MVP] wrote:
>
>> Does Java have anything comparable to .NET's Task Parallel Library?
>
> I haven't used it, but I've heard about Doug Lea's Fork/Join
> framework:
> http://gee.cs.oswego.edu/dl/papers/fj.pdf
>
I have used it and find it works quite well even without closures. I
haven't used .NET so can't compare them.

For more information and downloads of the package:
http://g.oswego.edu/dl/concurrency-interest/

Mark Thornton
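Without the add-on fork/join package, a coarse-grained version of the same idea can be sketched with the bundled java.util.concurrent (hypothetical example: chunking a range sum across a fixed thread pool):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Divide-the-range parallelism with a fixed pool; fork/join offers
// finer-grained work splitting, but the shape is the same.
public class ParallelSum {
    public static void main(String[] args) throws Exception {
        final int n = 100000, chunks = 4;
        ExecutorService pool = Executors.newFixedThreadPool(chunks);
        List<Future<Long>> parts = new ArrayList<>();
        for (int c = 0; c < chunks; c++) {
            final int lo = c * n / chunks + 1, hi = (c + 1) * n / chunks;
            parts.add(pool.submit(new Callable<Long>() {
                public Long call() {        // sums lo..hi inclusive
                    long s = 0;
                    for (int i = lo; i <= hi; i++) s += i;
                    return s;
                }
            }));
        }
        long total = 0;
        for (Future<Long> f : parts) total += f.get();
        pool.shutdown();
        System.out.println(total);          // sum of 1..100000
    }
}
```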

Jon Skeet [C# MVP]

Jun 3, 2008, 2:37:47 PM

That was the link you posted *initially*. Since then you have posted
multiple links to http://shootout.alioth.debian.org which is about
Mono.

Jon Harrop

Jun 3, 2008, 2:37:29 PM
Mark Thornton wrote:
> Jon Skeet [C# MVP] wrote:
>>> Does Java have anything comparable to .NET's Task Parallel Library?
>>
>> I haven't used it, but I've heard about Doug Lea's Fork/Join
>> framework:
>> http://gee.cs.oswego.edu/dl/papers/fj.pdf
>>
> I have used it and find it works quite well even without closures. I
> haven't used .NET so can't compare them.

The Task Parallel Library is awesome, even though it is only a CTP for now.
I highly recommend it. For example, read the article:

"Surviving in the multicore era with Microsoft's ParallelFX" - The F#.NET
Journal, 30th April 2008
http://www.ffconsultancy.com/products/fsharp_journal/?cs

Jon Harrop

Jun 3, 2008, 2:37:57 PM
Razii wrote:
> On Tue, 03 Jun 2008 16:08:17 +0100, Jon Harrop <j...@ffconsultancy.com>
> wrote:
>> Java is still over 2x slower on both partialsums and
>>the Mersenne Twister.
>
> The partialsums is only slower due to trig accuracy required by Java's
> specification. Your C# results are not accurate enough to satisfy
> Java's specification. It's easy to solve this problem. Try this Java
> version of partialsums. It's two times faster.

But still 2x slower than everything else and now 2x more bloated as well:

C 1.300s
C# 1.363s
Java 2.750s
Java 3.840s (original)

> In binarytrees C# is twice slower. You haven't posted any workaround
> yet.

Optimizing binarytrees is trivial because the benchmark is fundamentally
flawed. I described this in detail on the shootout mailing list and urged
the maintainers to pursue well-defined benchmarks, but they did not take
heed.

Razii

Jun 3, 2008, 2:50:17 PM
On Tue, 03 Jun 2008 19:37:57 +0100, Jon Harrop <j...@ffconsultancy.com>
wrote:

>But still 2x slower than everything else and now 2x more bloated as well:
>
>C 1.300s
>C# 1.363s
>Java 2.750s

That's not what I get:

time java -server partialsums 2500000

0m1.697s

and for Mono I get

0m1.525s

Jon Skeet [C# MVP]

Jun 3, 2008, 2:50:21 PM
Razii <klg...@mail.com> wrote:
> >The IO suggestions so far seem to be an indication of the general merit
> >of the benchmark, mind you.
>
> The criteria is same for both sides.

Um, the C# version (as posted on the site you've linked to below) uses
unbuffered IO while the Java version doesn't. They're just not doing
the same thing. Even if they were, that wouldn't give any meaningful
data about which language/platform is faster "in general". It might
help if you happen to know that your code is going to require a huge
amount of recursion - to the extent that it's going to be the
performance bottleneck - *and* performance is your primary concern.

> For now C# is slower by a wide margin in these benchmarks. How about
> fixing them? If you can't, I will conclude C# is just slower.

Feel free to draw that conclusion, while the rest of us instead
conclude that you can't claim that C# (which is a *language* after all,
not a platform) is slower or faster than another language.

Even running .NET rather than Mono will only go so far - you'd need to
run it on both 64 bit and 32 bit platforms, as the runtimes do
different things. (IIRC the 64 bit CLR uses tail recursion more heavily
than the 32 bit CLR, for instance.)

If I were to find a really slow JVM, would that prove that Java is a
slower *language* than C#?

I can easily write a benchmark where Java (under Hotspot) "beats" .NET:
just call a very small virtual method which is never overridden.
Hotspot will inline the call aggressively, as it is able to "undo" the
optimisation later on. .NET, with a single-pass JIT, won't do that.
Does that prove that Java is a faster language? No, not at all. It just
shows one situation where it works better. It pretty much has to,
given that far more methods tend to be virtual in Java code than in
.NET code. I can probably find similar situations where .NET stomps
over Java (being able to create custom value types would probably be a
good starting point - force heap allocation and garbage collection in
Java but not .NET; even though it's fast, it's not free) - but that
wouldn't prove that .NET is faster than Java, either.

Jon Harrop

Jun 3, 2008, 2:48:16 PM

Yes. Mono is extremely slow. Almost as slow as Java on this benchmark. As we
have seen, .NET is much faster...

Razii

Jun 3, 2008, 2:55:26 PM
On Tue, 03 Jun 2008 19:33:41 +0100, Jon Harrop <j...@ffconsultancy.com>
wrote:

>> If you can't, I will conclude C# is just slower.
>
>You had drawn that conclusion before you even tried to measure anything.

What the heck? I measured everything and posted the link. That's how
the thread started.

BufferedStream didn't help in sumcol

using (StreamReader r = new StreamReader(new BufferedStream(Console.OpenStandardInput())))

makes no difference.

The only fix you have made so far is to mandelbrot. What about the 4
others I mentioned?

Jon Skeet [C# MVP]

Jun 3, 2008, 2:55:25 PM
Mark Thornton <mtho...@optrak.co.uk> wrote:
> > I haven't used it, but I've heard about Doug Lea's Fork/Join
> > framework:
> > http://gee.cs.oswego.edu/dl/papers/fj.pdf
>
> I have used it and find it works quite well even without closures.

Cool - that's good to hear. I'd be very surprised if closures didn't
make it even simpler to use (possibly with a bit of refactoring towards
single-method interfaces if necessary).

Given the rest of Doug Lea's work, I'd expect it to be good :)

> I haven't used .NET so can't compare them.

Here's a little example from a Mandelbrot benchmark I was writing a few
weeks ago (oh the irony).

Simple initial code:

public override void Generate()
{
    int index = 0;
    for (int row = 0; row < Height; row++)
    {
        for (int col = 0; col < Width; col++)
        {
            Data[index++] = ComputeMandelbrotIndex(row, col);
        }
    }
}

Now using Parallel Extensions, and the lambda expressions of C# 3:

public override void Generate()
{
    Parallel.For(0, Height, row =>
    {
        int index = row * Width;
        for (int col = 0; col < Width; col++)
        {
            Data[index++] = ComputeMandelbrotIndex(row, col);
        }
    });
}

I can't imagine many transformations being simpler than that - and the
results are great.

See http://preview.tinyurl.com/58vfav for rather more :)

> For more information and downloads of the package:
> http://g.oswego.edu/dl/concurrency-interest/

Thanks, will take a look.

Razii

Jun 3, 2008, 2:57:42 PM
On Tue, 3 Jun 2008 19:37:47 +0100, Jon Skeet [C# MVP]
<sk...@pobox.com> wrote:

>That was the link you posted *initially*. Since then you have posted
>multiple links to http://shootout.alioth.debian.org which is about
>Mono.

Huh? I will be polite for now and just say you didn't read the thread
carefully. The benchmarks I tested are at that site. I don't have Mono.
I only have .NET.


Razii

Jun 3, 2008, 2:59:09 PM
On Tue, 03 Jun 2008 19:37:57 +0100, Jon Harrop <j...@ffconsultancy.com>
wrote:

>But still 2x slower than everything else and now 2x more bloated as well:

The FastMath class can be separated and used instead of Java's Math in
all cases. The bloat is ZERO.


Jon Skeet [C# MVP]

Jun 3, 2008, 2:57:57 PM
Jon Harrop <j...@ffconsultancy.com> wrote:
> Mark Thornton wrote:
> > Jon Skeet [C# MVP] wrote:
> >>> Does Java have anything comparable to .NET's Task Parallel Library?
> >>
> >> I haven't used it, but I've heard about Doug Lea's Fork/Join
> >> framework:
> >> http://gee.cs.oswego.edu/dl/papers/fj.pdf
> >>
> > I have used it and find it works quite well even without closures. I
> > haven't used .NET so can't compare them.
>
> The Task Parallel Library is awesome, even though it is only a CTP for now.

Agreed. Now is a particularly good time to get into it, as the second
CTP has just been released:

http://blogs.msdn.com/pfxteam/archive/2008/06/02/8567093.aspx

> I highly recommend it. For example, read the article:
>
> "Surviving in the multicore era with Microsoft's ParallelFX" - The F#.NET
> Journal, 30th April 2008
> http://www.ffconsultancy.com/products/fsharp_journal/?cs

Or not, given that that's subscription only...

Razii

Jun 3, 2008, 3:00:07 PM
On Tue, 03 Jun 2008 19:48:16 +0100, Jon Harrop <j...@ffconsultancy.com>
wrote:

>Yes. Mono is extremely slow. Almost as slow as Java on this benchmark. As we
>have seen, .NET is much faster...

That was a typo. I don't have Mono. I get that with .NET.


Razii

Jun 3, 2008, 3:04:31 PM
On Tue, 3 Jun 2008 19:50:21 +0100, Jon Skeet [C# MVP]
<sk...@pobox.com> wrote:

>Um, the C# version (as posted on the site you've linked to below) uses
>unbuffered IO while the Java version doesn't. They're just not doing
>the same thing.

What benchmark are you talking about? Changing the line to

using (StreamReader r = new StreamReader(new BufferedStream(Console.OpenStandardInput())))

didn't make a difference in the sum-file benchmark.

By the way, StreamTokenizer is not buffered. It reads a byte at a time
from the stream. It may be that System.in is buffered by default in
Java.
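If StreamTokenizer pulls one character at a time, the fix on the Java side is to hand it a buffered Reader; a sumcol-style sketch with invented input (the class name and data are illustrative only):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;

// StreamTokenizer reads one character at a time from whatever it wraps,
// so wrap the source in a BufferedReader before tokenizing.
public class TokenSum {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new StringReader("10\n-3\n35\n"));
        StreamTokenizer st = new StreamTokenizer(in);  // parses numbers by default
        int sum = 0;
        while (st.nextToken() != StreamTokenizer.TT_EOF)
            if (st.ttype == StreamTokenizer.TT_NUMBER)
                sum += (int) st.nval;
        System.out.println(sum);
    }
}
```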

Jon Harrop

Jun 3, 2008, 3:03:49 PM
Razii wrote:
> On Tue, 03 Jun 2008 19:33:41 +0100, Jon Harrop <j...@ffconsultancy.com>
> wrote:
>>> If you can't, I will conclude C# is just slower.
>>
>>You had drawn that conclusion before you even tried to measure anything.
>
> What the heck? I measured everything...

You only measured unoptimized C#. I have optimized the C# for you, but your
response has only been to cherry-pick results that back the conclusion you
wanted to draw.

> The only fix you have made so far is to mandelbrot.

Have you already forgotten that you were not compiling the regular
expressions in the C# implementation of regexdna?

> What about the 4 other I mentioned?

You said ".NET is twice slower in four benchmarks: binarytrees, mandelbrot,
regexdna, sumcol." I have already corrected two of your results (mandelbrot
and regexdna), provided two more counterexamples where .NET is >2x
faster than Java, and explained why binarytrees is flawed.

Razii

Jun 3, 2008, 3:08:35 PM

Jon Harrop

Jun 3, 2008, 3:05:33 PM

Java is 2x slower than .NET on my AMD64 for this benchmark. What CPU are you
using?

Jon Harrop

Jun 3, 2008, 3:07:00 PM

So having to write your own basic trig functions in Java so that you can be
only 2x slower than other languages doesn't bother you?

Jon Skeet [C# MVP]

Jun 3, 2008, 3:11:45 PM
Razii <klg...@mail.com> wrote:
> >That was the link you posted *initially*. Since then you have posted
> >multiple links to http://shootout.alioth.debian.org which is about
> >Mono.
>
> Huh? I will be polite for now and just say you didn't read the thread
> carefully. The benchmarks I tested are at that site. I don't have Mono.
> I only have .NET.

Let's look at the most recent set of links you posted to "prove" that
Java is faster than C#:

http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=javaxx&id=2

http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=binarytrees&lang=csharp&id=0


revcomp
http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=revcomp&lang=javaxx&id=4
vs
http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=revcomp&lang=csharp&id=2


sumcol
http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=sumcol&lang=javaxx&id=4
vs
http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=sumcol&lang=csharp&id=0

http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=recursive&lang=javaxx&id=0
vs
http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=recursive&lang=csharp&id=0


Now, you're using *those results* to form conclusions, right? If not,
there was little point in posting them. However, those results are of
Mono, not .NET.

This is why I've urged consistency. What's the point in debating Java
vs .NET if you keep posting links to results of Java vs Mono?

Razii
Jun 3, 2008, 3:21:11 PM
On Tue, 03 Jun 2008 20:05:33 +0100, Jon Harrop <j...@ffconsultancy.com>
wrote:

>Java is 2x slower than .NET on my AMD64 for this benchmark. What CPU are you
>using?

AMD Phenom (3 cores)

Are you sure you used this version?

http://shootout.alioth.debian.org/gp4sandbox/benchmark.php?test=partialsums&lang=java&id=4

$ time partialsums 2500000 (.NET)
3.000000000 (2/3)^k
3160.817621887 k^-0.5
0.999999600 1/k(k+1)
30.314541510 Flint Hills
42.995233998 Cookson Hills
15.309017155 Harmonic
1.644933667 Riemann Zeta
0.693146981 Alternating Harmonic
0.785398063 Gregory

real 0m1.530s
user 0m0.000s
sys 0m0.031s

$ time java -server partialsums 2500000
3.000000000 (2/3)^k
3160.817621887 k^-0.5
0.999999600 1/k(k+1)
30.314541510 Flint Hills
42.995233998 Cookson Hills
15.309017155 Harmonic
1.644933667 Riemann Zeta
0.693146981 Alternating Harmonic
0.785398063 Gregory

real 0m1.697s
user 0m0.000s
sys 0m0.031s

Jon Skeet [C# MVP]
Jun 3, 2008, 3:23:08 PM
Razii <klg...@mail.com> wrote:
> On Tue, 3 Jun 2008 19:50:21 +0100, Jon Skeet [C# MVP]
> <sk...@pobox.com> wrote:
>
> >Um, the C# version (as posted on the site you've linked to below) uses
> >unbuffered IO while the Java version doesn't. They're just not doing
> >the same thing.
>
> What benchmark are you talking about?

The mandelbrot one. Look back in the thread from where I first talked
about IO, and you'll see it's the last one referenced.

> Changing the line to
>
> using (StreamReader r = new StreamReader(new
> BufferedStream(Console.OpenStandardInput())))
>
> didn't make a difference in sum-file benchmark.

The benchmark I wasn't referring to? Oh.

> By the way, StreamTokenizer is not buffered. It reads a byte each time
> from the stream. It may be that System.in is buffered by default in
> Java.

Which benchmark are you talking about now? sum-file uses BufferedReader
but not StreamTokenizer as far as I can see.

It would be interesting to know what the sum-file benchmark is trying
to measure - there are six things off the top of my head:

1) File IO
2) Conversion from binary to text
3) Splitting a stream of textual data into lines
4) Parsing text into integers
5) Integer addition
6) Writing the result to the console

The benchmark gives no information as to which of those is the
bottleneck for any particular platform.

I note, by the way, that the Java version of sum-col assumes that using
the platform default encoding is good enough. The C# version always
uses UTF-8. Depending on what the default encoding is, that may or may
not be significant - but it does show that the programs are not doing
the same thing.
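To make the comparison apples-to-apples, the Java entry could pin the encoding explicitly, as the C# version does. A minimal sketch (not the shootout's actual code; `StandardCharsets` is from later JDKs, the string literal `"UTF-8"` works on older ones):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class SumCol {
    // Sums one integer per line, as the sum-col style benchmark does.
    static long sum(BufferedReader in) throws IOException {
        long total = 0;
        String line;
        while ((line = in.readLine()) != null) {
            total += Integer.parseInt(line.trim());
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Decode stdin as UTF-8 explicitly, matching the C# entry,
        // rather than relying on the platform default encoding.
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        System.out.println(sum(in));
    }
}
```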


How did you measure your .NET results, by the way? By starting the
process thousands of times, as shown on the other web site, or by
running the same code within the process many times?

Jon Skeet [C# MVP]
Jun 3, 2008, 3:27:19 PM
Razii <klg...@mail.com> wrote:
> $ time partialsums 2500000 (.NET)
> 3.000000000 (2/3)^k
> 3160.817621887 k^-0.5
> 0.999999600 1/k(k+1)
> 30.314541510 Flint Hills
> 42.995233998 Cookson Hills
> 15.309017155 Harmonic
> 1.644933667 Riemann Zeta
> 0.693146981 Alternating Harmonic
> 0.785398063 Gregory
>
> real 0m1.530s
> user 0m0.000s
> sys 0m0.031s

That looks very much like Unix output rather than Windows. Are you
running under cygwin or something similar?

It would really help if you gave a full explanation of how *you* (as
opposed to the shoot-out site) are running your tests.

(I'd also argue that benchmarks taking only a second and a half to
complete aren't likely to have a good startup vs running program
balance.)

Mark Thornton
Jun 3, 2008, 3:36:42 PM
Jon Skeet [C# MVP] wrote:

There are some FJ examples here:
http://artisans-serverintellect-com.si-eioswww6.com/default.asp?W32

The recursive matrix multiplication is my contribution.

Mark Thornton

Jon Skeet [C# MVP]
Jun 3, 2008, 3:44:01 PM
Mark Thornton <mtho...@optrak.co.uk> wrote:

<snip>

> There are some FJ examples here:
> http://artisans-serverintellect-com.si-eioswww6.com/default.asp?W32
>
> The recursive matrix multiplication is my contribution.

Right. It still looks a bit involved - which is only to be expected,
really. I haven't taken it in fully yet though - will aim to do so at a
later date. Might try porting to Parallel Extensions...

Razii
Jun 3, 2008, 3:52:47 PM
On Tue, 03 Jun 2008 20:07:00 +0100, Jon Harrop <j...@ffconsultancy.com>
wrote:

>So having to write your own basic trig functions in Java so that you can be


>only 2x slower than other languages doesn't bother you?

It's not 2x slower with my trig function on my computer.

On the other hand, C# results are not accurate for trig functions.

using System;
using System.IO;

class Test
{
static void Main(){
Console.WriteLine(Math.Sin (1e15));
}
}


The answer I get from C# is: 0.858272132476373

with Java I get 0.8582727931702359

Checking the answer with maple, I get
0.8582727931702358355238863908484066466002034
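For reference, the Java side of the same one-liner. Java's `Math.sin` is specified to be within 1 ulp of the exact result, which requires full argument reduction even for huge inputs:

```java
public class SinCheck {
    public static void main(String[] args) {
        // Java's Math.sin spec (1 ulp, semi-monotonic) forces proper
        // argument reduction, so this matches the Maple value above
        // to double precision.
        System.out.println(Math.sin(1e15));  // 0.8582727931702359, per the thread
    }
}
```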

How are you going to fix C# in this case?

Does it bother you that C# gave wrong answer?

Razii
Jun 3, 2008, 3:54:48 PM
On Tue, 3 Jun 2008 20:11:45 +0100, Jon Skeet [C# MVP]
<sk...@pobox.com> wrote:

>Now, you're using *those results* to form conclusions, right? If not,
>there was little point in posting them. However, those results are of
>Mono, not .NET.

Is this guy Jon Skeet really this stupid?

Let me know so I can add him to ignore list.


Razii
Jun 3, 2008, 3:55:31 PM
On Tue, 3 Jun 2008 20:27:19 +0100, Jon Skeet [C# MVP]
<sk...@pobox.com> wrote:

>That looks very much like Unix output rather than Windows. Are you
>running under cygwin or something similar?

Yes, I am running cygwin.

Razii
Jun 3, 2008, 3:59:57 PM
On Tue, 03 Jun 2008 20:03:49 +0100, Jon Harrop <j...@ffconsultancy.com>
wrote:

>You said ".NET is twice slower in four benchmarks: binarytrees, mandelbrot,


>regexdna, sumcol.". I have already corrected two of your results (mandelbrot
>and regexdna) and provided two more counter examples where .NET is >2x
>faster than Java and explained why binarytrees is flawed.

Yes, you did correct two of them. I added recursive to the list, so
there are 3 more to go.

With FastMath, C# is not much faster in partialsums on my computer.
You haven't posted the other prime benchmark yet so I can't comment on
that.

Razii
Jun 3, 2008, 4:03:14 PM
On Tue, 3 Jun 2008 20:23:08 +0100, Jon Skeet [C# MVP]
<sk...@pobox.com> wrote:

>The mandelbrot one. Look back in the thread from where I first talked
>about IO, and you'll see it's the last one referenced.

Yes, that was fixed by Harpo a while ago.

>Which benchmark are you talking about now? sum-file uses BufferedReader
>but not StreamTokenizer as far as I can see.

I suspect System.in is already buffered by default.

>1) File IO
>2) Conversion from binary to text
>3) Splitting a stream of textual data into lines
>4) Parsing text into integers
>5) Integer addition
>6) Writing the result to the console

(1), (2), (4), (5), (6) ?


Jon Skeet [C# MVP]
Jun 3, 2008, 4:04:47 PM
Razii <klg...@mail.com> wrote:
> >That looks very much like Unix output rather than Windows. Are you
> >running under cygwin or something similar?
>
> Yes, I am running cygwin.

Has it occurred to you that that may have an effect on the results?

The benchmarks would be *much* better in my view if each benchmark
started up a single process and ran with it for a significant amount of
time. After all, that's far more realistic in terms of how almost all
software is actually run in the real world.
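A minimal in-process harness of the kind suggested here, sketched in Java (names are illustrative): warm up first so JIT compilation happens outside the measurement, then time repeated runs inside a single VM, excluding startup cost entirely.

```java
public class Harness {
    // Runs the task once to warm the JIT, then reports the best of
    // N timed runs, all inside one process/VM.
    public static double timeSeconds(Runnable task, int iters) {
        task.run();                       // warmup run (triggers JIT)
        long best = Long.MAX_VALUE;
        for (int i = 0; i < iters; i++) {
            long t0 = System.nanoTime();
            task.run();
            long dt = System.nanoTime() - t0;
            if (dt < best) best = dt;
        }
        return best / 1e9;                // best observed time, in seconds
    }
}
```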

Jon Skeet [C# MVP]
Jun 3, 2008, 4:09:04 PM
Razii <klg...@mail.com> wrote:
> >The mandelbrot one. Look back in the thread from where I first talked
> >about IO, and you'll see it's the last one referenced.
>
> Yes, that was fixed by Harpo a while ago.

And yet you seem to be completely ignoring my point: a shootout which
has had such a trivial but significant flaw for a significant time is
suspect in its methodology. (Heck, the measurement style is dodgy to
start with. Including the startup cost in the test is pretty crazy.)



> >Which benchmark are you talking about now? sum-file uses BufferedReader
> >but not StreamTokenizer as far as I can see.
>
> I suspect System.in is already buffered by default.

In what way does your sentence relate to my two?

> >1) File IO
> >2) Conversion from binary to text
> >3) Splitting a stream of textual data into lines
> >4) Parsing text into integers
> >5) Integer addition
> >6) Writing the result to the console
>
> (1), (2), (4), (5), (6) ?

What is your list of numbers meant to signify?

Razii
Jun 3, 2008, 4:14:52 PM
Starting new thread, since that thread got too large (same applies to
C++).

On Tue, 03 Jun 2008 20:07:00 +0100, Jon Harrop <j...@ffconsultancy.com>
wrote:

>So having to write your own basic trig functions in Java so that you can be


>only 2x slower than other languages doesn't bother you?

It's not 2x slower with my trig function on my computer.

Mark Thornton
Jun 3, 2008, 4:37:46 PM
Jon Skeet [C# MVP] wrote:
> Mark Thornton <mtho...@optrak.co.uk> wrote:
>
> <snip>
>
>> There are some FJ examples here:
>> http://artisans-serverintellect-com.si-eioswww6.com/default.asp?W32
>>
>> The recursive matrix multiplication is my contribution.
>
> Right. It still looks a bit involved - which is only to be expected,
> really. I haven't taken it in fully yet though - will aim to do so at a
> later date. Might try porting to Parallel Extensions...
>

Frameworks like Fork/Join can only go so far in easing tasks like matrix
multiplication. Given that my Java code won't be making best use of
SSE, I was quite pleased with the performance of the result.
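The divide-and-conquer shape FJ handles can be sketched with a simple recursive sum (a minimal illustration, not Mark's matrix code; API names are from the `java.util.concurrent` fork/join version that later shipped in Java 7, while the 2008 jsr166y preview differs slightly):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Fork/join divide-and-conquer: split until the range is small enough,
// fork one half so idle workers can steal it, compute the other inline.
public class FjSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000;
    private final long[] data;
    private final int lo, hi;

    public FjSum(long[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;
        FjSum left = new FjSum(data, lo, mid);
        FjSum right = new FjSum(data, mid, hi);
        left.fork();                        // queue left half for stealing
        return right.compute() + left.join();
    }

    public static long sum(long[] data) {
        return new ForkJoinPool().invoke(new FjSum(data, 0, data.length));
    }
}
```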

Mark Thornton

Barry Kelly
Jun 3, 2008, 5:27:12 PM
Razii wrote:

> Is this guy Jon Skeet really this stupid?
>
> Let me know so I can add him to ignore list.

Razii, if you hope to ever have any positive effect in your
communication - e.g. convince anyone that you have a clue - you'll have
to work much harder not to look like a naive 13 year old with
undeveloped social skills.

Plonk etc.

-- Barry

--
http://barrkel.blogspot.com/

Barry Kelly
Jun 3, 2008, 5:32:41 PM
Jon Skeet wrote:

> Now using Parallel Extensions, and the lamdba expressions of C# 3:
>
> public override void Generate()
> {
> Parallel.For(0, Height, row =>
> {
> int index = row * Width;
> for (int col = 0; col < Width; col++)
> {
> Data[index++] = ComputeMandelbrotIndex(row, col);
> }
> });
> }
>
> I can't imagine many transformations being simpler than that - and the
> results are great.

How have you found Parallel.For to perform?

I've found that my own naive (i.e. quick hack with little thought)
implementations of Parallel.For that dispatch over a fixed number of
threads (one thread per core) outperform the ones in the TPL by quite a
bit - TPL seemed to have had around ~20% overhead in the particular
microbenchmark I was trying to scale over my quad core.

I must dig it out and analyse the discrepancy, and blog about it, when I
find time.

Razii
Jun 3, 2008, 5:46:19 PM
On Tue, 03 Jun 2008 22:27:12 +0100, Barry Kelly
<barry....@gmail.com> wrote:

>Plonk etc.

You must be a newbie, moron. No one ever plonks me. You can try
though, as you did.

As for Jon Skeet, what kind of reply do you expect? I explained to him
that I don't have Mono, and that the benchmarks I am using are at the
shootout site (not that I am using Mono myself). The link to the blog I
posted also makes that clear, yet the guy continued with the same line
that I am comparing Mono. Is he stupid, or does he just not take the
time to read carefully? Take your pick.


Jon Skeet [C# MVP]
Jun 3, 2008, 5:46:32 PM
Barry Kelly <barry....@gmail.com> wrote:
> > I can't imagine many transformations being simpler than that - and the
> > results are great.
>
> How have you found Parallel.For to perform?

Very well - see my last few blog posts.

> I've found that my own naive (i.e. quick hack with little thought)
> implementations of Parallel.For that dispatch over a fixed number of
> threads (one thread per core) outperform the ones in the TPL by quite a
> bit - TPL seemed to have had around ~20% overhead in the particular
> microbenchmark I was trying to scale over my quad core.
>
> I must dig it out and analyse the discrepancy, and blog about it, when I
> find time.

I wouldn't be surprised if some individual test cases did badly in
microbenchmarks. Without even looking at your code, my guess is that
you may find your code doesn't work as well when there's other load on
the processor (e.g. effectively taking up one core with a different
process) whereas work stealing should (I believe) cope with that
reasonably well.

But as I point out on the most recent blog entry, accurately
benchmarking this kind of thing and being able to interpret the results
sensibly is basically beyond my current understanding (and time to
spend on it).

Razii
Jun 3, 2008, 6:15:09 PM
On Tue, 3 Jun 2008 21:04:47 +0100, Jon Skeet [C# MVP]
<sk...@pobox.com> wrote:

>Has it occurred to you that that may have an effect on the results?

No, it won't have a major effect.

Razii
Jun 3, 2008, 6:59:42 PM
On Tue, 03 Jun 2008 15:14:52 -0500, Razii <klg...@mail.com> wrote:

>The answer I get from C# is: 0.858272132476373
>
>with Java I get 0.8582727931702359
>
>Checking the answer with maple, I get
>0.8582727931702358355238863908484066466002034
>
>How are you going to fix C# in this case?
>
>Does it bother you that C# gave wrong answer?

Console.WriteLine(Math.Sin (1e7));

0.420547793190771 (C# with .NET)
0.42054779319078249129850658974095 (right answer)

Console.WriteLine(Math.Sin (1e10));

-0.48750602507627 (C# with .NET)
-0.48750602508751069152779429434811 (right answer)

Console.WriteLine(Math.Sin (1e15));

0.858272132476373 (C# with .NET)
0.8582727931702358355238863908484 (right answer)

So C# doesn't get the 15-17 digit accuracy of double for sin and cos.
That's the only reason it's faster in the partialsums benchmark. By using
the FastMath class, the times for both are about the same on my computer.

On the other hand, you still need to 'optimize' 4 benchmarks where
.NET is much slower: binarytrees, sum-col, recursive, revcomp.


Barry Kelly
Jun 3, 2008, 8:29:26 PM
Jon Skeet wrote:

> Barry Kelly <barry....@gmail.com> wrote:
> > > I can't imagine many transformations being simpler than that - and the
> > > results are great.
> >
> > How have you found Parallel.For to perform?
>
> Very well - see my last few blog posts.

I've been reading - I'm more wondering "compared to what", of course :)

> I wouldn't be surprised if some individual test cases did badly in
> microbenchmarks. Without even looking at your code, my guess is that
> you may find your code doesn't work as well when there's other load on
> the processor (e.g. effectively taking up one core with a different
> process) whereas work stealing should (I believe) cope with that
> reasonably well.

My code does naive work-stealing too, though through a centralized
queue, rather than having a per-thread queue that bored threads wander
away from - that would avoid some cache sloshing etc.
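A fixed-thread parallel-for in the same naive spirit can be sketched in Java as well (hypothetical code, not Barry's C# below, and using modern Java for brevity): one contiguous chunk per core, no work stealing, so it degrades when another process hogs a core.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.IntConsumer;

public class NaiveParallel {
    // Splits [start, limit) into one contiguous chunk per core and
    // waits for all chunks to finish; no stealing between workers.
    public static void parallelFor(int start, int limit, IntConsumer body) {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int chunk = Math.max(1, (limit - start + threads - 1) / threads);
        for (int lo = start; lo < limit; lo += chunk) {
            final int from = lo, to = Math.min(limit, lo + chunk);
            pool.execute(() -> {
                for (int i = from; i < to; i++) body.accept(i);
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```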

> But as I point out on the most recent blog entry, accurately
> benchmarking this kind of thing and being able to interpret the results
> sensibly is basically beyond my current understanding (and time to
> spend on it).

For the interest of others, I have appended my code to this post. Here
are some of the results I got:

(Running .NET 3.5 on XP SP3 on Q6600 @ 2.8GHz w/ 3.25GB RAM)

- When all cores are fairly idle (no significant activity):

---8<---
Switching Overhead (Naive) : 0.078 (fastest 0.078)
Switching Overhead (TPL) : 0.045 (fastest 0.042)
Lots of small tasks (Naive) : 0.682 (fastest 0.679)
Lots of small tasks (TPL) : 0.081 (fastest 0.081)
Few big tasks (Naive) : 0.967 (fastest 0.966)
Few big tasks (TPL) : 1.259 (fastest 1.253)
--->8---

This shows that TPL's switching overhead is definitely lower than my
"dumbass pool" when the body of the loop is just sleeping, and it is far,
far more efficient for lots and lots of small tasks - but it seems less
efficient, a lot less efficient, for bigger tasks.

I can see obvious ways to improve my Naive parallel-for

- per-thread queues with bored threads going work stealing
- take advantage of ParallelFor's knowledge of total job size by adding
lots of work to per-thread queues in one go, rather than bottlenecking
on synchronization overhead for each single iteration

Of course, the whole point of my naive algorithm is that it is naive, so
I didn't do any of this. It's a 30-minute hack. I just find it's
interesting that TPL can, on occasion, take close to 30% more time than
a straightforward naive implementation, depending on work chunk size.

Running 1 instance of super-pi to eat up 1 core, leaving only 3 cores
free:

---8<---
Switching Overhead (Naive) : 0.078 (fastest 0.078)
Switching Overhead (TPL) : 0.045 (fastest 0.041)
Lots of small tasks (Naive) : 0.734 (fastest 0.708)
Lots of small tasks (TPL) : 0.087 (fastest 0.086)
Few big tasks (Naive) : 0.983 (fastest 0.976)
Few big tasks (TPL) : 1.317 (fastest 1.294)
--->8---

Running 2 instances of super-pi, leaving 2 cores free:

---8<---
Switching Overhead (Naive) : 0.078 (fastest 0.078)
Switching Overhead (TPL) : 0.045 (fastest 0.042)
Lots of small tasks (Naive) : 1.063 (fastest 1.021)
Lots of small tasks (TPL) : 0.094 (fastest 0.094)
Few big tasks (Naive) : 1.158 (fastest 1.147)
Few big tasks (TPL) : 1.376 (fastest 1.368)
--->8---

Here, TPL seems to be catching up with my Naive implementation the more
is loaded on other cores, but it's still a fair ways off.


Code follows:

using System;
using System.Collections.Generic;
using System.Text;
using System.Threading;
using System.Diagnostics;
using System.Runtime.CompilerServices;

// So dumb it's a stack, not a queue, but we never promised
// anything about order.
class DumbassQueue<T>
{
class Item
{
public Item next;
public T value;
}

// volatile to get most recent head in non-blocking loops.
volatile Item _head;

Semaphore _itemCount;
Semaphore _emptyCount;

public DumbassQueue(int maxCount)
{
_itemCount = new Semaphore(0, maxCount);
_emptyCount = new Semaphore(maxCount, maxCount);
}

public T Dequeue()
{
_itemCount.WaitOne();
try
{
for (;;)
{
Item result = _head;
if (Interlocked.CompareExchange(ref _head,
result.next, result) == result)
return result.value;
}
}
finally
{
_emptyCount.Release();
}
}

public void Enqueue(T item)
{
_emptyCount.WaitOne();
try
{
Item added = new Item();
added.value = item;
for (;;)
{
added.next = _head;
if (Interlocked.CompareExchange(ref _head, added,
added.next) == added.next)
break;
}
}
finally
{
_itemCount.Release();
}
}
}

class DumbassPool
{
Thread[] _threads;
DumbassQueue<Action> _queue;

public DumbassPool()
{
_threads = new Thread[Environment.ProcessorCount];
_queue = new DumbassQueue<Action>(
Environment.ProcessorCount * 2);
Thread.MemoryBarrier();
for (int i = 0; i < _threads.Length; ++i)
{
_threads[i] = new Thread(ThreadMain);
_threads[i].IsBackground = true;
_threads[i].Start();
}
}

void ThreadMain()
{
for (;;)
_queue.Dequeue()();
}

public void ParallelFor(int start, int limit, Action<int> body)
{
using (DoneWaiter waiter = new DoneWaiter(limit - start))
{
for (int i = start; i < limit; ++i)
{
int index = i;
_queue.Enqueue(delegate
{
body(index);
waiter.Signal();
});
}
}
}

class DoneWaiter : IDisposable
{
int _taskCount;
ManualResetEvent _done;

public DoneWaiter(int taskCount)
{
_taskCount = taskCount;
_done = new ManualResetEvent(false);
}

public void Dispose()
{
_done.WaitOne();
_done.Close();
}

public void Signal()
{
if (Interlocked.Decrement(ref _taskCount) == 0)
_done.Set();
}
}
}

class Program
{
[MethodImpl(MethodImplOptions.NoInlining)]
static void Use(int x)
{
if (x == -42)
Console.WriteLine("Got the magic number");
}

static void Time(string title, Action proc)
{
double totalTime = 0;
double fastest = double.MaxValue;
int iterCount = 3;

// warmup
proc();

for (int i = 0; i < iterCount; ++i)
{
Stopwatch w = Stopwatch.StartNew();
proc();
double time = w.ElapsedTicks / (double) Stopwatch.Frequency;
totalTime += time;
if (time < fastest)
fastest = time;
}

Console.WriteLine("{0,-30}: {1,6:f3} (fastest {2,6:f3})",
title, totalTime / iterCount, fastest);
}

static void Main(string[] args)
{
DumbassPool pool = new DumbassPool();

Time("Switching Overhead (Naive)", delegate
{
pool.ParallelFor(0, 20, i =>
{
Thread.Sleep(10);
});
});

Time("Switching Overhead (TPL)", delegate
{
Parallel.For(0, 20, i =>
{
Thread.Sleep(10);
});
});

Time("Lots of small tasks (Naive)", delegate
{
pool.ParallelFor(0, 100000, i =>
{
for (int j = 0; j < 1000; ++j)
Use(j);
});
});

Time("Lots of small tasks (TPL)", delegate
{
Parallel.For(0, 100000, i =>
{
for (int j = 0; j < 1000; ++j)
Use(j);
});
});

Time("Few big tasks (Naive)", delegate
{
pool.ParallelFor(0, 10, i =>
{
for (int j = 0; j < 100000000; ++j)
Use(j);
});
});

Time("Few big tasks (TPL)", delegate
{
Parallel.For(0, 10, i =>
{
for (int j = 0; j < 100000000; ++j)
Use(j);
});
});
}
}

Lew
Jun 3, 2008, 9:52:02 PM
Jon Skeet [C# MVP] wrote:
> One problem with creating anything like Parallel Extensions (which I
> assume is what you meant - AFAIK the TPL is just part of Parallel
> Extensions, although I could be misunderstanding you completely) for
> Java is that it doesn't (currently) have closures other than anonymous
> inner classes which are ugly as hell.

I personally like anonymous (and named) inner classes in lieu of closures.

I take more the attitude of the maintenance programmer than of the initial
developer. The very verbosity of inner classes helps me understand what's
going on.

> The state of closures in Java 7 is currently up in the air.

In part because a lot of Java mavens feel similarly, or they subscribe to Josh
Bloch's opinion of, "When in doubt, leave it out", or they have similar
concerns as he about what the closure idiom would do to other long-standing
Java idioms and semantics.

Some of those in favor of closures, by my general impression of other threads
that have discussed this, tend to take as axiomatic their value equations that
make closures seem desirable, and to dismiss the concerns of those opposed to
closures as somehow benighted. The trouble with that is that both sides then
flame each other for failing to understand their values.

I see the value of closures, and while they are marginally more powerful than
inner classes, their cost to the Java zeitgeist could be as high as Mr. Bloch
and others feel, and I also find in practice that they are too terse for the
maintenance programmer, compared to the inner class idiom. YMMV, of course,
since this seems to me a matter of style.

The opponents to closures in Java point out that inner classes get the job
done in practice, so that the need for closures is small, while the complexity
and difficulties of it are large.

It's harder to retrofit such a feature to a language that was deliberately
designed to exclude it than to put it in the beginning into a language that
was designed to use it. It's also unnecessary to add closures to Java.

--
Lew

Lew
Jun 3, 2008, 9:53:12 PM
Mark Thornton wrote:
> Given that my Java code won't be making best use of SSE

Are you sure of that?

If true, it's one more area for Hotspot authors to approach.

--
Lew

Jon Skeet [C# MVP]
Jun 4, 2008, 1:57:40 AM
On Jun 3, 11:15 pm, Razii <klgf...@mail.com> wrote:
> <sk...@pobox.com> wrote:
> >Has it occurred to you that that may have an effect on the results?
>
> No, it won't have a major effect.

And you know this because...?

Just claiming that it won't have an effect isn't exactly compelling
evidence.

Jon

Jon Skeet [C# MVP]
Jun 4, 2008, 2:28:10 AM
On Jun 4, 1:29 am, Barry Kelly <barry.j.ke...@gmail.com> wrote:

<snip>

> Of course, the whole point of my naive algorithm is that it is naive, so
> I didn't do any of this. It's a 30-minute hack. I just find it's
> interesting that TPL can, on occasion, take close to 30% more time than
> a straightforward naive implementation, depending on work chunk size.

Out of interest, which release of Parallel Extensions did you measure
against? Have you installed the CTP from Monday yet? I'd be interested
to hear whether that is better or worse than the December one in your
case.

Also, have you posted on the PFX forum about this? I'm sure the team
would be interested in examining it.

Jon

Jon Harrop
Jun 4, 2008, 3:11:54 AM
Jon Skeet [C# MVP] wrote:
> Razii <klg...@mail.com> wrote:
>> >The mandelbrot one. Look back in the thread from where I first talked
>> >about IO, and you'll see it's the last one referenced.
>>
>> Yes, that was fixed by Harpo a while ago.
>
> And yet you seem to be completely ignoring my point: a shootout which
> has had such a trivial but significant flaw for a significant time is
> suspect in its methodology. (Heck, the measurement style is dodgy to
> start with. Including the startup cost in the test is pretty crazy.)

Yes. The problems with Mandelbrot are tiny compared to the problems with
binarytrees though. If you want to benchmark properly then you need to put
a lot more effort into creating the tasks than simply timing a one-liner
you downloaded from the 'net without making any attempt to optimize it.

--
Dr Jon D Harrop, Flying Frog Consultancy
http://www.ffconsultancy.com/products/?u

Jon Skeet [C# MVP]
Jun 4, 2008, 3:36:15 AM
On Jun 4, 2:52 am, Lew <con...@lewscanon.com.invalid> wrote:
> > One problem with creating anything like Parallel Extensions (which I
> > assume is what you meant - AFAIK the TPL is just part of Parallel
> > Extensions, although I could be misunderstanding you completely) for
> > Java is that it doesn't (currently) have closures other than anonymous
> > inner classes which are ugly as hell.
>
> I personally like anonymous (and named) inner classes in lieu of closures, myself.
>
> I take more the attitude of the maintenance programmer than the initial
> developer. The very verbosity of inner classes help me understand what's
> going on.

I can't see why - it's just more fluff that gets in the way of the
actual logic.

For instance, I find it much easier to understand this:

people.Where(person => person.Age > 18);

than this:

people.where(new Predicate()
{
public boolean match(Person person)
{
return person.Age > 18;
}
});

It's entirely reasonable to chain together several method calls with
lambda expressions and still end up with readable code. By the time
you've got more than a couple of anonymous inner classes in a single
statement, I think the readability ends up being lost.

> > The state of closures in Java 7 is currently up in the air.
>
> In part because a lot of Java mavens feel similarly, or they subscribe to Josh
> Bloch's opinion of, "When in doubt, leave it out", or they have similar
> concerns as he about what the closure idiom would do to other long-standing
> Java idioms and semantics.

I have concerns over some of the proposals under consideration. I
personally don't like the idea of closures being able to return beyond
their own call, if you see what I mean - I think of closures as a way
of expressing logic for "normal" methods more easily, rather than a
way of introducing whole new control structures.

None of the proposals currently on the table is entirely satisfactory
to me, and I'd like to see some more with possibly more limited scope
but simpler syntax. However, I *do* find closures incredibly useful as
a tool to have available, and I really don't like all the extra cruft
that anonymous inner classes requires.

> Some of those in favor of closures, by my general impression of other threads
> that have discussed this, tend to take as axiomatic their value equations that
> make closures seem desirable, and to dismiss the concerns of those opposed to
> closures as somehow benighted. The trouble with that is that both sides then
> flame each other for failing to understand their values.

Not having been in the discussion, I can certainly see how that would
happen. There's truth in Blub's Paradox, but there's also certainly
value in keeping things simple. It's highly unlikely that everyone is
going to end up happy, I'd say.

> I see the value of closures, and while they are marginally more powerful than
> inner classes, their cost to the Java zeitgeist could be as high as Mr. Bloch
> and others feel, and I also find in practice that they are too terse for the
> maintenance programmer, compared to the inner class idiom. YMMV, of course,
> since this seems to me a matter of style.

It's absolutely a matter of style - but I think it opens up a style
which ends up simplifying code significantly precisely *because* it's
terse.

It certainly takes a little bit of extra up-front learning for
developers (whether maintenance or not) but I personally think the pay-
back is massive.

> The opponents to closures in Java point out that inner classes get the job
> done in practice, so that the need for closures is small, while the complexity
> and difficulties of it are large.

Anonymous inner classes get the job done in a way which discourages a
lot of the more effective ways of using closures, IMO. Yes, they
achieve largely the same goals (I need to consider the value of
capturing variables vs values, compared with the cost in complexity -
it *does* trip people up in C#) but it's the extra cruft that I have a
problem with.

> It's harder to retrofit such a feature to a language that was deliberately
> designed to exclude it than to put it in the beginning into a language that
> was designed to use it.

Certainly. But taking C# as an example, closures weren't available *at
all* in C# 1 (not even in an anonymous inner class manner) - and yet
have been introduced in a very elegant manner.

> It's also unnecessary to add closures to Java.

It was also "unnecessary" to add generics, or the enhanced for loop,
or enums though. "Unnecessary" != "not useful".

This is now wildly off the topic of Razii's benchmarks, but I'd be
very interested in further discussions, particularly across both C#
and Java. It's been a while since I've been a regular poster on
comp.lang.java.programmer, but do you think a new thread discussing
closures and cross-posted between
microsoft.public.dotnet.languages.csharp and comp.lang.java.programmer
would be useful? If the topic's been done to death on the Java groups
already, that's fine - I just wonder whether the C# crowd might have
some interesting perspectives to bring in. There's always the *risk*
that it would turn into yet another language oneupmanship contest of
course, but we could try to make it clear that that's not the point of
the thread...

Alternatively, if you'd be happy to take this up on email instead,
please drop me a mail (sk...@pobox.com)

If you're sick of the whole debate, I quite understand - and thanks
for the previous post.

Jon

Jon Harrop
Jun 4, 2008, 4:04:01 AM
Razii wrote:
> Console.WriteLine(Math.Sin (1e15));
>
> 0.858272132476373 (C# with .NET)
> 0.8582727931702358355238863908484 (right answer)
>
> So C# doesn't get 15-17 digit accuracy of double for sin and cos.

For double precision floating point your argument of 1e15 to sin has already
lost all precision because it is fourteen orders of magnitude outside the
primary domain of this trig function.
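This is easy to check in Java, where StrictMath delegates to fdlibm's full-precision argument reduction (a quick sketch; Math.sin is specified to stay within 1 ulp of the true result, so it gives the same digits here):

```java
public class SinPrecision {
    public static void main(String[] args) {
        // fdlibm performs full-precision argument reduction,
        // so sin(1e15) is still accurate to about 1 ulp.
        System.out.println(StrictMath.sin(1e15)); // ~0.8582727931702358
        // Math.sin must be within 1 ulp of the exact result, so it
        // cannot return 0.858272132476373 as .NET does above.
        System.out.println(Math.sin(1e15));
    }
}
```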

> That's the only reason it's faster in partialsums benchmark.

That is pure speculation.

> By using FastMath class, the times for both is about same on my computer.

Java is still 2x slower here, and on other benchmarks too.

Jon Harrop

unread,
Jun 4, 2008, 4:12:10 AM6/4/08
to
Barry Kelly wrote:
> Jon Skeet wrote:
>> Now using Parallel Extensions, and the lamdba expressions of C# 3:
>>
>> public override void Generate()
>> {
>> Parallel.For(0, Height, row =>
>> {
>> int index = row * Width;
>> for (int col = 0; col < Width; col++)
>> {
>> Data[index++] = ComputeMandelbrotIndex(row, col);
>> }
>> });
>> }
>>
>> I can't imagine many transformations being simpler than that - and the
>> results are great.
>
> How have you found Parallel.For to perform?

Yes, but you have to write your code properly and all of the freely
available examples I have seen (like Jon's above) are poorly written.
Indeed, even the examples that come from the TPL documentation share the
same problems.

The F#.NET Journal article I cited describes how you can parallelize
routines adaptively in order to amortize the cost of queuing and achieve
near-optimal performance across a much wider variety of tasks. We have
started to convert our products (e.g. F# for Numerics and F# for
Visualization) to use the TPL and the results are already extremely
compelling.

Jon Skeet [C# MVP]

unread,
Jun 4, 2008, 4:39:30 AM6/4/08
to
On Jun 4, 9:12 am, Jon Harrop <j...@ffconsultancy.com> wrote:

<snip>

> >> I can't imagine many transformations being simpler than that - and the
> >> results are great.
>
> > How have you found Parallel.For to perform?
>
> Yes, but you have to write your code properly and all of the freely
> available examples I have seen (like Jon's above) are poorly written.

Can you elaborate on that? I'm not going to try to claim the example I
gave is going to be optimal. However, I find it important that it's a
*very* simple transformation. I suspect that most of the more optimal
transformations are also more complicated - and thus likely to be
beyond a large segment of the developer population. (I'm not trying to
be patronising - it seems to me to be pretty obvious that the general
level of understanding of threading complexities is very low. I'd
consider myself to know "a bit more than most" but I still get very
concerned when things get complicated.) I like the idea of "here's a
really simple transformation which can give you pretty good
parallelisation - more powerful and complex building blocks are also
available if you need to squeeze every bit of performance out".

Of course, if you've got examples of transformations which are about
as simple, but give even better results, that would be fabulous.

Jon

Jon Harrop

unread,
Jun 4, 2008, 5:48:19 AM6/4/08
to
Jon Skeet [C# MVP] wrote:
> On Jun 4, 9:12 am, Jon Harrop <j...@ffconsultancy.com> wrote:
>> >> I can't imagine many transformations being simpler than that - and the
>> >> results are great.
>>
>> > How have you found Parallel.For to perform?
>>
>> Yes, but you have to write your code properly and all of the freely
>> available examples I have seen (like Jon's above) are poorly written.
>
> Can you elaborate on that? I'm not going to try to claim the example I
> gave is going to be optimal. However, I find it important that it's a
> *very* simple transformation.

The problem is that it is a simple transformation that introduces massive
performance degradation in a variety of cases that are likely to be of
practical importance.

> I suspect that most of the more optimal
> transformations are also more complicated - and thus likely to be
> beyond a large segment of the developer population. (I'm not trying to
> be patronising - it seems to me to be pretty obvious that the general
> level of understanding of threading complexities is very low. I'd
> consider myself to know "a bit more than most" but I still get very
> concerned when things get complicated.)

The problem is not related to threading. It is related to the fact that the
complexity of your algorithm now has more variables, because it is a
function of how you dice the computation between parallel computation
units, the cost of spawning parallelism, and the characteristics of the
communication fabric.

You must take that into account when slicing up the computations or you will
end up spawning parallelism for trivial computations which is often vastly
slower than doing them in serial. Blindly replacing "for"
with "Parallel.For" is a bad idea, particularly in library code.
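The cutoff idea can be sketched in Java with the fork/join framework (from later JDKs than the ones benchmarked in this thread): ranges below a tuned threshold run serially, so trivial chunks never pay the task-spawning cost. The threshold value and the loop body are placeholders, not measured numbers.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// A parallel for-loop with a sequential cutoff: small ranges are
// computed serially instead of being forked as separate tasks.
class ChunkedFor extends RecursiveAction {
    static final int CUTOFF = 10_000; // tune to amortize task overhead
    final int lo, hi;
    final double[] out;

    ChunkedFor(int lo, int hi, double[] out) {
        this.lo = lo;
        this.hi = hi;
        this.out = out;
    }

    @Override
    protected void compute() {
        if (hi - lo <= CUTOFF) {
            for (int i = lo; i < hi; i++)
                out[i] = Math.sqrt(i); // stand-in for the real loop body
        } else {
            int mid = (lo + hi) >>> 1; // split the range and fork both halves
            invokeAll(new ChunkedFor(lo, mid, out),
                      new ChunkedFor(mid, hi, out));
        }
    }

    public static void main(String[] args) {
        double[] out = new double[1_000_000];
        new ForkJoinPool().invoke(new ChunkedFor(0, out.length, out));
        System.out.println(out[out.length - 1]);
    }
}
```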

> I like the idea of "here's a
> really simple transformation which can give you pretty good
> parallelisation - more powerful and complex building blocks are also
> available if you need to squeeze every bit of performance out".

This is more about avoiding performance pitfalls rather than squeezing
performance out. If you follow the style you promoted then you can easily
make a wide variety of functions 100x slower.

> Of course, if you've got examples of transformations which are about
> as simple, but give even better results, that would be fabulous.

This was all described in detail with measurements in the article I cited
but, as you say, the information is not free. I'll propose to MS that they
get us to write a whitepaper on the subject for them so that you can read
it for free. Good idea. :-)

In the most advanced cases that I have studied you even augment data
structures with information that conveys how computations should be divided
across the data structure. We have started introducing such augmentations
into the data structures provided by our libraries because we believe it is
essential to fully exploit multicores in order to survive commercially.

Roedy Green

unread,
Jun 4, 2008, 6:34:49 AM6/4/08
to
On Fri, 30 May 2008 22:28:07 -0500, King Raz <klgfg...@mail.com>
wrote, quoted or indirectly quoted someone who said :

>Just to keep the post on topic for my friends at comp.lang.c++, how do
>I play default windows sounds with C++?

see http://mindprod.com/products.html#HONK

C++ source code included.
--

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Jon Skeet [C# MVP]

unread,
Jun 4, 2008, 7:26:59 AM6/4/08
to
On Jun 4, 10:48 am, Jon Harrop <j...@ffconsultancy.com> wrote:
> > Can you elaborate on that? I'm not going to try to claim the example I
> > gave is going to be optimal. However, I find it important that it's a
> > *very* simple transformation.
>
> The problem is that it is a simple transformation that introduces massive
> performance degradation in a variety of cases that are likely to be of
> practical importance.

Doesn't that just mean you have to apply it selectively?

<snip>

> You must take that into account when slicing up the computations or you will
> end up spawning parallelism for trivial computations which is often vastly
> slower than doing them in serial. Blindly replacing "for"
> with "Parallel.For" is a bad idea, particularly in library code.

So don't do it blindly :)

I wouldn't start replacing every for loop with Parallel.For, or every
foreach loop with Parallel.ForEach - I'd do it based on evidence and
thought.

> > I like the idea of "here's a
> > really simple transformation which can give you pretty good
> > parallelisation - more powerful and complex building blocks are also
> > available if you need to squeeze every bit of performance out".
>
> This is more about avoiding performance pitfalls rather than squeezing
> performance out. If you follow the style you promoted then you can easily
> make a wide variety of functions 100x slower.

Again, that's just a case of not applying it blindly, isn't it?

> > Of course, if you've got examples of transformations which are about
> > as simple, but give even better results, that would be fabulous.
>
> This was all described in detail with measurements in the article I cited
> but, as you say, the information is not free. I'll propose to MS that they
> get us to write a whitepaper on the subject for them so that you can read
> it for free. Good idea. :-)

I suspect you'd get a lot more subscriptions if you'd give away a few
good sample articles - as well as that being a Good Thing in general.

> In the most advanced cases that I have studied you even augment data
> structures with information that conveys how computations should be divided
> across the data structure. We have started introducing such augmentations
> into the data structures provided by our libraries because we believe it is
> essential to fully exploit multicores in order to survive commercially.

It really depends on what you're doing. As ever, one size doesn't fit
all in software engineering.

Jon

Jon Harrop

unread,
Jun 4, 2008, 8:38:52 AM6/4/08
to
Jon Skeet [C# MVP] wrote:
> On Jun 4, 10:48 am, Jon Harrop <j...@ffconsultancy.com> wrote:
>> > Can you elaborate on that? I'm not going to try to claim the example I
>> > gave is going to be optimal. However, I find it important that it's a
>> > *very* simple transformation.
>>
>> The problem is that it is a simple transformation that introduces massive
>> performance degradation in a variety of cases that are likely to be of
>> practical importance.
>
> Doesn't that just mean you have to apply it selectively?

Not quite, no. It means you must automate the selective application so that
it can be done at run time, i.e. make it adaptive.

>> You must take that into account when slicing up the computations or you
>> will end up spawning parallelism for trivial computations which is often
>> vastly slower than doing them in serial. Blindly replacing "for"
>> with "Parallel.For" is a bad idea, particularly in library code.
>
> So don't do it blindly :)
>
> I wouldn't start replacing every for loop with Parallel.For, or every
> foreach loop with Parallel.ForEach - I'd do it based on evidence and
> thought.

The problem is that part of the "evidence" is a function of the problem
being solved and, therefore, is only available at run time. So you cannot
do it statically (with changes to the source code) to good effect.

>> > I like the idea of "here's a
>> > really simple transformation which can give you pretty good
>> > parallelisation - more powerful and complex building blocks are also
>> > available if you need to squeeze every bit of performance out".
>>
>> This is more about avoiding performance pitfalls rather than squeezing
>> performance out. If you follow the style you promoted then you can easily
>> make a wide variety of functions 100x slower.
>
> Again, that's just a case of not applying it blindly, isn't it?

There is a substantial gap between "not blindly" and exploiting knowledge of
modern multicore machines in the context of the performance characteristics
of the Task Parallel Library.

>> > Of course, if you've got examples of transformations which are about
>> > as simple, but give even better results, that would be fabulous.
>>
>> This was all described in detail with measurements in the article I cited
>> but, as you say, the information is not free. I'll propose to MS that
>> they get us to write a whitepaper on the subject for them so that you can
>> read it for free. Good idea. :-)
>
> I suspect you'd get a lot more subscriptions if you'd give away a few
> good sample articles - as well as that being a Good Thing in general.

I suspect we would get fewer subscriptions if we made even more content
freely available.

>> In the most advanced cases that I have studied you even augment data
>> structures with information that conveys how computations should be
>> divided across the data structure. We have started introducing such
>> augmentations into the data structures provided by our libraries because
>> we believe it is essential to fully exploit multicores in order to
>> survive commercially.
>
> It really depends on what you're doing. As ever, one size doesn't fit
> all in software engineering.

There is still a lot of room for improvement though.

Jon Skeet [C# MVP]

unread,
Jun 4, 2008, 9:09:54 AM6/4/08
to
On Jun 4, 1:38 pm, Jon Harrop <j...@ffconsultancy.com> wrote:
> > Doesn't that just mean you have to apply it selectively?
>
> Not quite, no. It means you must automate the selective application so that
> it can be done at run time, i.e. make it adaptive.

Sometimes that's the case - but in many business cases, I believe it
can be statically decided, making development simpler at the risk that
in some cases you may not take the optimal course.

> > I wouldn't start replacing every for loop with Parallel.For, or every
> > foreach loop with Parallel.ForEach - I'd do it based on evidence and
> > thought.
>
> The problem is that part of the "evidence" is a function of the problem
> being solved and, therefore, is only available at run time. So you cannot
> do it statically (with changes to the source code) to good effect.

In some cases. In others it's pretty obvious, IMO. Take the Mandelbrot
case: on a single core, I don't believe Parallel.For will hurt much (a
case I should test some time...) but obviously it won't help. On
multiple cores, it will improve things significantly vs a non-parallel
implementation. There may be other solutions which will improve the
performance more, but only by introducing more complexity. The balance
between complexity and improvement depends on many things, of course -
including the comfort zone of the developers involved.

> > I suspect you'd get a lot more subscriptions if you'd give away a few
> > good sample articles - as well as that being a Good Thing in general.
>
> I suspect we would get fewer subscriptions if we made even more content
> freely available.

When you say "even more" content, it's not like the web site is
currently bursting at the seams with free samples. With so much free
content (much of it pretty good) available online these days, I'd want
more evidence than a single F# tutorial before spending £39 on a
subscription. Obviously it's your call though.

> > It really depends on what you're doing. As ever, one size doesn't fit
> > all in software engineering.
>
> There is still a lot of room for improvement though.

Sure - but I believe that *just* Parallel.For and Parallel.ForEach is
a quite staggering improvement for the relatively parallelism-scared
developer who wants to get more bang-for-the-buck with the minimum of
effort.

Jon

Barry Kelly

unread,
Jun 4, 2008, 9:24:00 AM6/4/08
to
Jon Skeet [C# MVP] wrote:

> On Jun 4, 1:29 am, Barry Kelly <barry.j.ke...@gmail.com> wrote:
>
> <snip>
>
> > Of course, the whole point of my naive algorithm is that it is naive, so
> > I didn't do any of this. It's a 30-minute hack. I just find it's
> > interesting that TPL can, on occasion, take close to 30% more time than
> > a straightforward naive implementation, depending on work chunk size.
>
> Out of interest, which release of Parallel Extensions did you measure
> against? Have you installed the CTP from Monday yet?

Yes, that's what I installed last night to test with.

I do note this blog posting:

http://blogs.msdn.com/pfxteam/archive/2008/03/12/8179013.aspx

> I'd be interested
> to hear whether that is better or worse than the December one in your
> case.

The two aren't compatible when installed side by side, unfortunately, so
I'd have to uninstall and reinstall. If I write it up I'll do that.

> Also, have you posted on the PFX forum about this? I'm sure the team
> would be interested in examining it.

I'm doing that now.

Barry Kelly

unread,
Jun 4, 2008, 9:36:28 AM6/4/08
to
(Sorry for the dup to folks not following the Microsoft NNTP server - it
appears to have been censoring the use of a synonym for "donkey"):

Jon Skeet wrote:

> Barry Kelly <barry....@gmail.com> wrote:
> > > I can't imagine many transformations being simpler than that - and the
> > > results are great.
> >
> > How have you found Parallel.For to perform?
>
> Very well - see my last few blog posts.

I've been reading - I'm more wondering "compared to what", of course :)

> I wouldn't be surprised if some individual test cases did badly in
> microbenchmarks. Without even looking at your code, my guess is that
> you may find your code doesn't work as well when there's other load on
> the processor (e.g. effectively taking up one core with a different
> process) whereas work stealing should (I believe) cope with that
> reasonably well.

My code does naive work-stealing too, though through a centralized
queue, rather than having a per-thread queue that bored threads wander
away from - that would avoid some cache sloshing etc.

> But as I point out on the most recent blog entry, accurately
> benchmarking this kind of thing and being able to interpret the results
> sensibly is basically beyond my current understanding (and time to
> spend on it).

For the interest of others, I have appended my code to this post. Here
are some of the results I got:

(Running .NET 3.5 on XP SP3 on Q6600 @ 2.8GHz w/ 3.25GB RAM)

- When all cores are fairly idle (no significant activity):

---8<---
Switching Overhead (Naive) : 0.078 (fastest 0.078)
Switching Overhead (TPL) : 0.045 (fastest 0.042)
Lots of small tasks (Naive) : 0.682 (fastest 0.679)
Lots of small tasks (TPL) : 0.081 (fastest 0.081)
Few big tasks (Naive) : 0.967 (fastest 0.966)
Few big tasks (TPL) : 1.259 (fastest 1.253)
--->8---

This shows that TPL's switching overhead is definitely lower than my
"naive pool" when the body of the loop is just sleeping, and it is far,
far more efficient for lots and lots of small tasks, but it seems less
efficient - a lot less efficient - for bigger tasks.

I can see obvious ways to improve my Naive parallel-for

- per-thread queues with bored threads going work stealing
- take advantage of ParallelFor's knowledge of total job size by adding
lots of work to per-thread queues in one go, rather than bottlenecking
on synchronization overhead for each single iteration

Of course, the whole point of my naive algorithm is that it is naive, so
I didn't do any of this. It's a 30-minute hack. I just find it's
interesting that TPL can, on occasion, take close to 30% more time than
a straightforward naive implementation, depending on work chunk size.

Running 1 instance of super-pi to eat up 1 core, leaving only 3 cores
free:


Code follows:

class NaiveQueue<T>
{
class Item
{
public Item next;
public T value;
}

// volatile to get most recent head in non-blocking loops.
volatile Item _head;

Semaphore _itemCount;
Semaphore _emptyCount;

public NaiveQueue(int maxCount)
{
_itemCount = new Semaphore(0, maxCount);
_emptyCount = new Semaphore(maxCount, maxCount);
}

public T Dequeue()
{
_itemCount.WaitOne();
try
{
for (;;)
{
Item result = _head;
if (Interlocked.CompareExchange(ref _head,
result.next, result) == result)
return result.value;
}
}
finally
{
_emptyCount.Release();
}
}

public void Enqueue(T item)
{
_emptyCount.WaitOne();
try
{
Item added = new Item();
added.value = item;
for (;;)
{
added.next = _head;
if (Interlocked.CompareExchange(ref _head, added,
added.next) == added.next)
break;
}
}
finally
{
_itemCount.Release();
}
}
}

class NaivePool
{
Thread[] _threads;
NaiveQueue<Action> _queue;

public NaivePool()
{
_threads = new Thread[Environment.ProcessorCount];

_queue = new NaiveQueue<Action>(

NaivePool pool = new NaivePool();

-- Barry

--
http://barrkel.blogspot.com/

Barry Kelly

unread,
Jun 4, 2008, 9:59:18 AM6/4/08
to
I wrote a reply, but it never showed up here in my newsreader. It's
online, however:

http://groups.google.com/group/microsoft.public.dotnet.languages.csharp/msg/588e9162b5bae717

(Apologies if other folks are seeing my message repeatedly.)

Jon Harrop

unread,
Jun 4, 2008, 10:11:32 AM6/4/08
to
Jon Skeet [C# MVP] wrote:
> On Jun 4, 1:38 pm, Jon Harrop <j...@ffconsultancy.com> wrote:
>> > Doesn't that just mean you have to apply it selectively?
>>
>> Not quite, no. It means you must automate the selective application so
>> that it can be done at run time, i.e. make it adaptive.
>
> Sometimes that's the case - but in many business cases, I believe it
> can be statically decided, making development simpler at the risk that
> in some cases you may not take the optimal course.

Again, the problem isn't that you fail to take the optimal course. The
problem is that you silently introduce massive performance degradations
because you don't understand when Parallel.For is slow.

At the very least you want to document the pathological cases but you can't
even do that without more information.

>> > I wouldn't start replacing every for loop with Parallel.For, or every
>> > foreach loop with Parallel.ForEach - I'd do it based on evidence and
>> > thought.
>>
>> The problem is that part of the "evidence" is a function of the problem
>> being solved and, therefore, is only available at run time. So you cannot
>> do it statically (with changes to the source code) to good effect.
>
> In some cases. In others it's pretty obvious, IMO. Take the Mandelbrot
> case: on a single core, I don't believe Parallel.For will hurt much (a
> case I should test some time...) but obviously it won't help. On
> multiple cores, it will improve things significantly vs a non-parallel
> implementation.

No. That is exactly what I was saying was wrong. Your implementation has
introduced massive performance degradations and you don't even know when
they appear. That is a serious problem, IMHO.

> There may be other solutions which will improve the
> performance more, but only by introducing more complexity. The balance
> between complexity and improvement depends on many things, of course -
> including the comfort zone of the developers involved.

Yes.

>> > I suspect you'd get a lot more subscriptions if you'd give away a few
>> > good sample articles - as well as that being a Good Thing in general.
>>
>> I suspect we would get fewer subscriptions if we made even more content
>> freely available.
>
> When you say "even more" content, it's not like the web site is
> currently bursting at the seams with free samples. With so much free
> content (much of it pretty good) available online these days, I'd want
> more evidence than a single F# tutorial before spending £39 on a
> subscription. Obviously it's your call though.

May I ask which articles you would most like for free?

http://www.ffconsultancy.com/products/fsharp_journal/free/introduction.html
http://www.ffconsultancy.com/dotnet/fsharp/rule30/index.html
http://www.ffconsultancy.com/dotnet/fsharp/teapot/index.html
http://www.ffconsultancy.com/dotnet/fsharp/raytracer/index.html
http://www.ffconsultancy.com/dotnet/fsharp/sudoku/index.html

I'll also upload the code from my forthcoming book F# for Scientists ASAP.

>> > It really depends on what you're doing. As ever, one size doesn't fit
>> > all in software engineering.
>>
>> There is still a lot of room for improvement though.
>
> Sure - but I believe that *just* Parallel.For and Parallel.ForEach is
> a quite staggering improvement for the relatively parallelism-scared
> developer who wants to get more bang-for-the-buck with the minimum of
> effort.

They must be made aware of when Parallel.For kills performance or they will
surely start going in the wrong direction.

Jon Skeet [C# MVP]

unread,
Jun 4, 2008, 10:36:34 AM6/4/08
to
On Jun 4, 3:11 pm, Jon Harrop <j...@ffconsultancy.com> wrote:
> >> Not quite, no. It means you must automate the selective application so
> >> that it can be done at run time, i.e. make it adaptive.
>
> > Sometimes that's the case - but in many business cases, I believe it
> > can be statically decided, making development simpler at the risk that
> > in some cases you may not take the optimal course.
>
> Again, the problem isn't that you fail to take the optimal course. The
> problem is that you silently introduce massive performance degradations
> because you don't understand when Parallel.For is slow.

I know that it will introduce significant degradations when the body
of the loop is very quick, thus dwarfed by the time taken to invoke a
delegate and the time taken to schedule threads etc.

Are there other situations you know about where the performance is
terrible? In many cases (not all by a long way, but many) the
developer will *know* that the body of the loop is a significant
amount of work.

> > In some cases. In others it's pretty obvious, IMO. Take the Mandelbrot
> > case: on a single core, I don't believe Parallel.For will hurt much (a
> > case I should test some time...) but obviously it won't help. On
> > multiple cores, it will improve things significantly vs a non-parallel
> > implementation.
>
> No. That is exactly what I was saying was wrong. Your implementation has
> introduced massive performance degradations and you don't even know when
> they appear. That is a serious problem, IMHO.

So do *you* know when they appear? In the Mandelbrot case, I'd expect
the ParallelFor version to be a lot slower for situations where
computing each row is a trivial amount of work compared to the
scheduling involved - i.e. when the number of columns is tiny. I'm
happy enough to ignore that case - any Mandelbrot image with so few
columns is going to be uninteresting anyway.

> > When you say "even more" content, it's not like the web site is
> > currently bursting at the seams with free samples. With so much free
> > content (much of it pretty good) available online these days, I'd want
> > more evidence than a single F# tutorial before spending £39 on a
> > subscription. Obviously it's your call though.
>
> May I ask which articles you would most like for free?

The ParallelFX and SciMark articles would be nice. Are the paid
articles longer/meatier than the free ones?

> I'll also upload the code from my forthcoming book F# for Scientists ASAP.

Cool.

> > Sure - but I believe that *just* Parallel.For and Parallel.ForEach is
> > a quite staggering improvement for the relatively parallelism-scared
> > developer who wants to get more bang-for-the-buck with the minimum of
> > effort.
>
> They must be made aware of when Parallel.For kills performance or they will
> surely start going in the wrong direction.

Agreed. If that really *is* a case of "don't do it when the body of
the loop is going to execute really quickly anyway" then that's easy
enough to understand. If there are more complicated situations where
using Parallel.For introduces degradations (rather than just not being
optimal) then I'd be really interested to hear about them.

Jon

Razii

unread,
Jun 4, 2008, 12:42:20 PM6/4/08
to
On Wed, 04 Jun 2008 09:04:01 +0100, Jon Harrop <j...@ffconsultancy.com>
wrote:

>> That's the only reason it's faster in partialsums benchmark.
>
>That is pure speculation.

It's not. I posted the evidence. C# gets the wrong result.

>> By using FastMath class, the times for both is about same on my computer.
>
>Java is still 2x slower here on this and other benchmarks.

It's not on my comp when using the FastMath class. It's not on the
shootout site when they are using the FastMath class. You are the only
one making the claim.

http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsums&lang=all

C++ GNU g++ 4.05 seconds
Java 6 4.89 seconds.

You haven't posted the other "benchmark" yet so there is no way to
confirm your claim. Also, you have yet to optimize 4 benchmarks where
.NET is much slower: binarytrees, sum-col, recursive, revcomp.

Razii

unread,
Jun 4, 2008, 12:43:50 PM6/4/08
to
On Tue, 3 Jun 2008 22:57:40 -0700 (PDT), "Jon Skeet [C# MVP]"
<sk...@pobox.com> wrote:

>And you know this because...?

The one who asserts must prove. Post proof that any of these
benchmarks give different timing on Cygwin than on the Windows console.
Until you do that, I will assume it has no effect.

Razii

unread,
Jun 4, 2008, 12:46:04 PM6/4/08
to
On Wed, 04 Jun 2008 08:11:54 +0100, Jon Harrop <j...@ffconsultancy.com>
wrote:

>Yes. The problems with Mandelbrot are tiny compared to the problems with


>binarytrees though. If you want to benchmark properly then you need to put
>a lot more effort into creating the tasks than simply timing a one-liner
>you downloaded from the 'net without making any attempt to optimize it.

How about you optimize them? Here are 4 benchmarks where .NET is much
slower: binarytrees, sum-col, recursive, revcomp.

Jon Harrop

unread,
Jun 4, 2008, 12:45:53 PM6/4/08
to

Oh, wait a minute. You're not even using .NET properly? No wonder your
results are all screwed.

.NET is now several times faster than Java on many of these benchmarks (and
I have many others where that is also the case).

Jon Harrop

unread,
Jun 4, 2008, 12:46:18 PM6/4/08
to
Razii wrote:
> On the other hand, you still need to 'optimize' 4 benchmarks where
> .NET is much slower: binarytrees, sum-col, recursive, revcomp.

I have already addressed binarytrees elsewhere. I have now taken a look at
sumcol and recursive too. The recursive benchmark in F# is only 20% slower
than Java, so C# is being inefficient here and that has nothing to do
with .NET itself.

As for sumcol you, again, have a Java program that is doing something
completely different to the C# program. Specifically, the Java is using a
lexer to parse the input whereas the C# is reading lines in as strings and
parsing them, which incurs a huge amount of allocation that is completely
unnecessary.

So I wrote a simple lexer in F# that is 2.5x faster than Java (1s vs 2.5s)
on this benchmark as well. Here is my code:

let rec parse (ch : System.IO.FileStream) total accu =
let c = ch.ReadByte()
if c >= 48 && c <= 57 then
parse ch total (10 * accu + c - 48)
else
if c = 45 then parse_neg ch (total + accu) 0 else
if c >= 0 then parse ch (total + accu) 0 else total
and parse_neg ch total accu =
let c = ch.ReadByte()
if c >= 48 && c <= 57 then
parse_neg ch total (10 * accu + c - 48)
else
if c = 45 then parse_neg ch (total - accu) 0 else
if c >= 0 then parse ch (total - accu) 0 else total

do
use ch = System.IO.File.OpenRead(@"input.txt")
printf "%d\n" (parse ch 0 0)

As an aside, you cannot even write that directly in Java because it still
lacks tail calls.
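You can of course write the equivalent byte-level lexer in Java with an explicit loop instead of tail calls - a hypothetical sketch, factored as a method so it can read from any stream rather than only a named file:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class SumCol {
    // Sum one integer per line, reading raw bytes rather than
    // allocating a String for every line.
    static long sum(InputStream in) throws IOException {
        long total = 0, accu = 0;
        boolean neg = false;
        int c;
        while ((c = in.read()) != -1) {
            if (c >= '0' && c <= '9') {
                accu = 10 * accu + (c - '0');
            } else if (c == '-') {
                neg = true;
            } else { // any delimiter ends the current number
                total += neg ? -accu : accu;
                accu = 0;
                neg = false;
            }
        }
        return total + (neg ? -accu : accu);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "1\n2\n-3\n10\n".getBytes("US-ASCII");
        System.out.println(sum(new ByteArrayInputStream(data))); // prints 10
    }
}
```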

Jon Harrop

unread,
Jun 4, 2008, 12:52:31 PM6/4/08
to
Razii wrote:
> On Wed, 04 Jun 2008 08:11:54 +0100, Jon Harrop <j...@ffconsultancy.com>
> wrote:
>>Yes. The problems with Mandelbrot are tiny compared to the problems with
>>binarytrees though. If you want to benchmark properly then you need to put
>>a lot more effort into creating the tasks than simply timing a one-liner
>>you downloaded from the 'net without making any attempt to optimize it.
>
> How about you optimize them?

I already did. You are just ignoring my improvements because you don't want
to concede that Java is now much slower on several benchmarks.

> Here are 4 benchmarks where .NET is much
> slower: binarytrees, sum-col, recursive, revcomp.

Java is 2.5x slower on sumcol as well, only 20% slower on recursive,
binarytrees is still ill-defined and I haven't looked at revcomp yet.

Jon Harrop

unread,
Jun 4, 2008, 12:53:19 PM6/4/08
to
Razii wrote:
> On Wed, 04 Jun 2008 09:04:01 +0100, Jon Harrop <j...@ffconsultancy.com>
> wrote:
>>> That's the only reason it's faster in partialsums benchmark.
>>
>>That is pure speculation.
>
> It's not. I posted the evidence. C# gets wrong result.

That says nothing about why it is faster.

>>> By using FastMath class, the times for both is about same on my
>>> computer.
>>
>>Java is still 2x slower here on this and other benchmarks.
>
> It's not on on my comp when using FastMath class. It's not on shootout
> site when they are using FastMath class. You are the only one making
> the claim.
>
>
http://shootout.alioth.debian.org/gp4/benchmark.php?test=partialsums&lang=all
>
> C++ GNU g++ 4.05 seconds
> Java 6 4.89 seconds.
>
> You haven't posted the other "benchmark" yet so there is no way to
> confirm your claim.

Mersenne Twister is freely available. Just time the generation of 10^8
int32s:

C: 0.93s
F#: 1.8s
Java: 3.5s

and note how Java is uniquely slow.

> Also, you have yet to optimize 4 benchmarks where
> .NET is much slower: binarytrees, sum-col, recursive, revcomp.

binarytrees: benchmark is flawed so I can make it as fast as you like.

sumcol: .NET is 2.5x faster than Java.

recursive: .NET is only 20% slower than Java.

revcomp: I have not yet examined. Perhaps this will turn out to be the only
benchmark where Java is significantly faster (but I seriously doubt it).

Jon Skeet [C# MVP]

unread,
Jun 4, 2008, 1:00:57 PM6/4/08
to
On Jun 4, 5:43 pm, Razii <nikjhlgf...@mail.com> wrote:
> <sk...@pobox.com> wrote:
> >And you know this because...?
>
> The one who asserts must prove.

I absolutely agree. Here's what I wrote:

<quote>Has it occurred to you that that may have an effect on the
results?</quote>

I don't see an assertion there - merely speculation that it *may* have
an effect.

Now here's what you wrote:

<quote>No, it won't have a major effect.</quote>

*That's* an assertion. Therefore by your own instructions, it's up to
you to prove your assertion.

If you're running a program in a non-standard way (and running .NET
under Cygwin certainly isn't the usual environment) then it's up to
you to investigate whether or not that has any effect.

Jon

Jon Harrop

unread,
Jun 4, 2008, 1:03:54 PM6/4/08
to
Jon Harrop wrote:
> revcomp: I have not yet examined. Perhaps this will turn out to be the
> only benchmark where Java is significantly faster (but I seriously doubt
> it).

I just had another look at revcomp and it has exactly the same problems as
mandelbrot and sumcol. Specifically, it is reading inefficiently from stdin
when idiomatic .NET reads from a file (which is 6x faster) and then it
outputs to an unbuffered stream (which is 2x slower).

No doubt if you make these trivial optimizations again the .NET will kick
the crap out of Java on this benchmark as well...
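The buffering half of that claim is easy to reproduce in Java: wrapping the raw stream in a BufferedOutputStream turns per-write system calls into bulk flushes. The file name and buffer size below are illustrative, not taken from the benchmark:

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;

public class BufferedWrite {
    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        // Without the BufferedOutputStream wrapper, each println
        // could reach the OS as its own small write.
        try (PrintStream out = new PrintStream(
                new BufferedOutputStream(
                    new FileOutputStream("out.txt"), 64 * 1024))) {
            for (int i = 0; i < 100_000; i++) {
                out.println(i);
            }
        }
        System.out.printf("wrote 100000 lines in %d ms%n",
                (System.nanoTime() - start) / 1_000_000);
    }
}
```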
