
Symbolic benchmark


Jon Harrop

unread,
Dec 10, 2007, 4:38:35 AM12/10/07
to

Following recent discussions here about benchmarking and when Java's
performance is worst, I proposed the idea of symbolic computation because
this entails very high allocation rates for short-lived objects when
written in an idiomatic functional style and Java should be uniquely bad at
this. While you might be able to work around the functional style using
imperative constructs, it will be extremely tedious and error-prone for
such applications.

So, perhaps someone would like to translate this simple symbolic simplifier
into Java:

http://www.lambdassociates.org/studies/study10.htm

Here is the OCaml implementation:

let rec ( +: ) f g = match f, g with
| `Int n, `Int m -> `Int (n +/ m)
| `Int (Int 0), e | e, `Int (Int 0) -> e
| f, `Add(g, h) -> f +: g +: h
| f, g -> `Add(f, g)

let rec ( *: ) f g = match f, g with
| `Int n, `Int m -> `Int (n */ m)
| `Int (Int 0), e | e, `Int (Int 0) -> `Int (Int 0)
| `Int (Int 1), e | e, `Int (Int 1) -> e
| f, `Mul(g, h) -> f *: g *: h
| f, g -> `Mul(f, g)

let rec simplify = function
| `Int _ | `Var _ as f -> f
| `Add (f, g) -> simplify f +: simplify g
| `Mul (f, g) -> simplify f *: simplify g

OCaml is not terribly efficient at this benchmark but I believe it will be
many times faster than any Java solution.

The program simply traverses a symbolic expression, reconstructing it whilst
applying some simple rewrite rules to simplify it. For example, "(a+0)*1*b"
becomes "a*b". The applications are obvious: parsers, compilers,
interpreters, scientific computing and computer algebra.

Note that the symbolic expression is immutable: you must not destroy your
input! Also, numbers are all arbitrary precision integers.
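
(As a rough illustration for Java readers, a minimal, hypothetical transliteration of the addition rule above might look like the following; multiplication is analogous. The class and method names are invented for this sketch, it uses plain int rather than arbitrary-precision integers, and it is only meant to show the allocation pattern: every step that cannot be reduced builds a fresh, short-lived node.)

abstract class Expr {
    abstract Expr simplify();
}

final class Num extends Expr {
    final int n;
    Num(int n) { this.n = n; }
    Expr simplify() { return this; }
}

final class Var extends Expr {
    final String name;
    Var(String name) { this.name = name; }
    Expr simplify() { return this; }
}

final class Add extends Expr {
    final Expr l, r;
    Add(Expr l, Expr r) { this.l = l; this.r = r; }
    Expr simplify() { return add(l.simplify(), r.simplify()); }
    static Expr add(Expr f, Expr g) {
        if (f instanceof Num && ((Num) f).n == 0) return g;     // 0 + e -> e
        if (g instanceof Num && ((Num) g).n == 0) return f;     // e + 0 -> e
        if (f instanceof Num && g instanceof Num)
            return new Num(((Num) f).n + ((Num) g).n);          // fold constants
        if (g instanceof Add)                                    // f + (g + h) -> (f + g) + h
            return add(add(f, ((Add) g).l), ((Add) g).r);
        return new Add(f, g);                                    // fresh, short-lived node
    }
}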

--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?u

Lew

unread,
Dec 10, 2007, 10:27:38 AM12/10/07
to
Jon Harrop wrote:
> Following recent discussions here about benchmarking and when Java's
> performance is worst, I proposed the idea of symbolic computation because
> this entails very high allocation rates for short-lived objects when
> written in an idiomatic functional style and Java should be uniquely bad at
> this.

Java is very good at high allocation rates for short-lived objects; it's
optimized for it. Don't believe Jon's rants, people.

--
Lew

Jon Harrop

unread,
Dec 10, 2007, 5:32:21 PM12/10/07
to

Prove it by implementing that benchmark in Java.

Daniel Pitts

unread,
Dec 10, 2007, 6:48:20 PM12/10/07
to
Jon Harrop wrote:
> Lew wrote:
>> Jon Harrop wrote:
>>> Following recent discussions here about benchmarking and when Java's
>>> performance is worst, I proposed the idea of symbolic computation because
>>> this entails very high allocation rates for short-lived objects when
>>> written in an idiomatic functional style and Java should be uniquely bad
>>> at this.
>> Java is very good at high allocation rates for short-lived objects; it's
>> optimized for it. Don't believe Jon's rants, people.
>
> Prove it by implementing that benchmark in Java.
>
Actually, why don't you prove it (or disprove it). I'm sure most of the
Java advocates here have better things to do than chase your wild goose.

--
Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>

Steve Wampler

unread,
Dec 10, 2007, 7:18:41 PM12/10/07
to
Jon Harrop wrote:
> Lew wrote:
>> Jon Harrop wrote:
>>> Following recent discussions here about benchmarking and when Java's
>>> performance is worst, I proposed the idea of symbolic computation because
>>> this entails very high allocation rates for short-lived objects when
>>> written in an idiomatic functional style and Java should be uniquely bad
>>> at this.
>> Java is very good at high allocation rates for short-lived objects; it's
>> optimized for it. Don't believe Jon's rants, people.
>
> Prove it by implementing that benchmark in Java.

Since we've been talking about C/C++ and Java, got a version in C or C++? :)


--
Steve Wampler -- swam...@noao.edu
The gods that smiled on your birth are now laughing out loud.

Jon Harrop

unread,
Dec 10, 2007, 7:43:10 PM12/10/07
to

Because you would only blame Java's poor performance on me rather than Java.

Jon Harrop

unread,
Dec 10, 2007, 7:46:08 PM12/10/07
to
Steve Wampler wrote:

> Jon Harrop wrote:
>> Prove it by implementing that benchmark in Java.
>
> Since we've been talking about C/C++ and Java, got a version in C or C++?
> :)

I posted something similar here (for derivatives rather than
simplification):

http://www.codecodex.com/wiki/index.php?title=Derivative

but this problem really requires a GC to do it properly because you must be
able to handle shared subexpressions.

#include <iostream>
#include <string>

using namespace std;

class Var;

class Base {
public:
virtual ~Base() {};
virtual const Base *clone() = 0;
virtual const Base *d(const string &v) const = 0;
virtual ostream &print(ostream &o) const = 0;
};

ostream &operator<<(ostream &o, const Base &e) { e.print(o); return o; }

class Int : public Base {
const int n;
public:
Int(int m) : n(m) {}
~Int() {}
const Base *clone() { return new Int(n); }
const Base *d(const string &v) const { return new Int(0); }
ostream &print(ostream &o) const { return o << n; }
};

class Var : public Base {
const string var;
public:
Var(string v) : var(v) {}
~Var() {}
const Base *clone() { return new Var(var); }
const Base *d(const string &v) const { return new Int(var == v ? 1 : 0); }
ostream &print(ostream &o) const { return o << var; }
};

class Plus : public Base {
const Base *e1, *e2;
public:
Plus(const Base *a, const Base *b) : e1(a), e2(b) {}
~Plus() { delete e1; delete e2; }
const Base *clone() { return new Plus(e1, e2); }
const Base *d(const string &v) const { return new Plus(e1->d(v), e2->d(v)); }
ostream &print(ostream &o) const
{ return o << "(" << *e1 << " + " << *e2 << ")"; }
};

class Times : public Base {
const Base *e1, *e2;
public:
Times(const Base *a, const Base *b) : e1(a), e2(b) {}
~Times() { delete e1; delete e2; }
const Base *clone() { return new Times(e1, e2); }
const Base *d(const string &v) const
{ return new Plus(new Times(e1, e2->d(v)), new Times(e1->d(v), e2)); }
ostream &print(ostream &o) const { return o << "(" << *e1 << " * " << *e2 << ")"; }
};

class Expr {
public:
Base *e;
Expr(Base *a) : e(a) {}
};

const Expr operator+(const Expr e1, const Expr e2)
{ return Expr(new Plus(e1.e->clone(), e2.e->clone())); }
const Expr operator*(const Expr e1, const Expr e2)
{ return Expr(new Times(e1.e->clone(), e2.e->clone())); }

ostream &operator<<(ostream &o, const Expr e) { return o << *e.e; }

int main() {
Var vx("x"), va("a"), vb("b"), vc("c");
Expr x(&vx), a(&va), b(&vb), c(&vc);
Expr e = a*x*x + b*x + c;
cout << "d(" << *(e.e) << ", x) = " << *(e.e->d("x")) << endl;
return 0;
}

Lew

unread,
Dec 10, 2007, 9:28:07 PM12/10/07
to
Jon Harrop wrote:
> Daniel Pitts wrote:
>> Jon Harrop wrote:
>>> Lew wrote:
>>>> Jon Harrop wrote:
>>>>> Following recent discussions here about benchmarking and when Java's
>>>>> performance is worst, I proposed the idea of symbolic computation
>>>>> because this entails very high allocation rates for short-lived objects
>>>>> when written in an idiomatic functional style and Java should be
>>>>> uniquely bad at this.
>>>> Java is very good at high allocation rates for short-lived objects; it's
>>>> optimized for it. Don't believe Jon's rants, people.
>>> Prove it by implementing that benchmark in Java.
>> Actually, why don't you prove it (or disprove it). I'm sure most of the
>> Java advocates here have better things to do than chase your wild goose.
>
> Because you would only blame Java's poor performance on me rather than Java.

Oh, so when it's time to shit or get off the pot, you just attack the person
and (likely inaccurately) predict bad behavior on their part to excuse your
lack of intellectual honesty, Mr. Cambridge Don.

--
Lew

Jon Harrop

unread,
Dec 11, 2007, 5:03:29 AM12/11/07
to
Lew wrote:
> Oh, so when it's time to shit or get off the pot, you just attack the
> person and (likely inaccurately) predict bad behavior on their part to
> excuse your lack of intellectual honesty, Mr. Cambridge Don.

Sounds like you can't write an efficient Java implementation of this
benchmark either.

Roger Lindsjö

unread,
Dec 11, 2007, 7:36:59 AM12/11/07
to
Jon Harrop wrote:
> Following recent discussions here about benchmarking and when Java's
> performance is worst, I proposed the idea of symbolic computation because
> this entails very high allocation rates for short-lived objects when
> written in an idiomatic functional style and Java should be uniquely bad at
> this. While you might be able to work around the functional style using
> imperative constructs, it will be extremely tedious and error-prone for
> such applications.
>
> So, perhaps someone would like to translate this simple symbolic simplifier
> into Java:

I'll post the code tonight. I have not tried to reduce the line count.

My program first prints the expression, then runs 10 loops of 10000000
iterations each and prints the duration for each loop. At the end it prints
the final reduction. The machine is a 2.4GHz Core 2 Quad.

[* x [+ [+ [* 12 0] [+ 23 8]] y]]
1355
1281
1275
1273
1277
1284
1279
1279
1272
1274
[* x [+ 31 y]]

I'm used to Ocaml, but it looked like Haskell, so I think my simplify is
correct.

> The program simply traverses a symbolic expression, reconstructing it whilst
> applying some simple rewrite rules to simplify it. For example, "(a+0)*1*b"
> becomes "a*b". The applications are obvious: parsers, compilers,
> interpreters, scientific computing and computer algebra.
>
> Note that the symbolic expression is immutable: you must not destroy your
> input! Also, numbers are all arbitrary precision integers.

The last rule, that it should handle arbitrary-precision integers, made
me use BigIntegers; with int the program ran about twice as fast.

//Roger Lindsjö

Lew

unread,
Dec 11, 2007, 9:32:16 AM12/11/07
to
Jon Harrop wrote:
> Lew wrote:
>> Oh, so when it's time to shit or get off the pot, you just attack the
>> person and (likely inaccurately) predict bad behavior on their part to
>> excuse your lack of intellectual honesty, Mr. Cambridge Don.
>
> Sounds like you can't write an efficient Java implementation of this
> benchmark either.

Yada, yada, Mr. Cambridge Almighty Don.

--
Lew

Daniel Pitts

unread,
Dec 11, 2007, 1:37:06 PM12/11/07
to
Jon Harrop wrote:
> Lew wrote:
>> Oh, so when it's time to shit or get off the pot, you just attack the
>> person and (likely inaccurately) predict bad behavior on their part to
>> excuse your lack of intellectual honesty, Mr. Cambridge Don.
>
> Sounds like you can't write an efficient Java implementation of this
> benchmark either.
>
What's with the personal attacks here? No one accused you of being unable.

I personally said I was unwilling to at this time.

Roger Lindsjö

unread,
Dec 11, 2007, 2:52:23 PM12/11/07
to
Roger Lindsjö wrote:
> I'm used to Ocaml, but it looked like Haskell, so I think my simplify is
> correct.

Of course I meant to say "I'm not used to Ocaml..."

//Roger Lindsjö

Jon Harrop

unread,
Dec 11, 2007, 3:18:47 PM12/11/07
to

Can we have the Java code? :-)

Roger Lindsjö

unread,
Dec 11, 2007, 6:08:02 PM12/11/07
to
Before being too enthusiastic: I did not have the associativity rules
implemented; now my times are about 1.6 seconds for 10000000 iterations.

Also, while testing I noticed that with the rules (at least the way I
implemented them) [+ [+ x 3] [+ 1 y]] gets simplified to
[+ [+ [+ x 3] 1] y]. Jon, is that true for your implementation also? (No
need to post code if it does not fully work.)

Jon, feel free to email me, just use my first name instead of news.nospam.

//Roger Lindsjö

Jon Harrop

unread,
Dec 12, 2007, 8:16:12 AM12/12/07
to
Roger Lindsjö wrote:
> Before being too enthusiastic: I did not have the associativity rules
> implemented; now my times are about 1.6 seconds for 10000000 iterations.
>
> Also, while testing I noticed that with the rules (at least the way I
> implemented them) [+ [+ x 3] [+ 1 y]] gets simplified to
> [+ [+ [+ x 3] 1] y]. Jon, is that true for your implementation also? (No
> need to post code if it does not fully work.)

That is the same, yes.

Chris Dollin

unread,
Dec 12, 2007, 8:58:21 AM12/12/07
to
Jon Harrop wrote:

> Following recent discussions here about benchmarking and when Java's
> performance is worst, I proposed the idea of symbolic computation because
> this entails very high allocation rates for short-lived objects when
> written in an idiomatic functional style and Java should be uniquely bad at
> this. While you might be able to work around the functional style using
> imperative constructs, it will be extremely tedious and error-prone for
> such applications.
>
> So, perhaps someone would like to translate this simple symbolic simplifier
> into Java:
>
> http://www.lambdassociates.org/studies/study10.htm

[[CAVEAT: hacked together over a lunchtime or so, in no way any
official statement of anything from anyone, etc.]]

OK, my broken-handed implementation running on a 3.06GHz Xeon with 2GB
memory (not that that matters much, since the JVM is only using the
default which is more like 64M) using FC5 with

Java(TM) SE Runtime Environment (build 1.6.0-b105)
Java HotSpot(TM) Client VM (build 1.6.0-b105, mixed mode, sharing)

and running (of course) with -server takes 2.498s, 2.94s, 3.002s,
for the 10000000-times loop (ie neglecting start-up times, which
according to `time` are about 0.2s). I don't know why there's so
much variability in the reported times.

Drat. OK, those times are using BigDecimal, not BigInteger, which
gets me 3.258s, 3.281s, 3.207s as the loop times. (Using plain
`int` gets those times down to 1.986s, 1.62s, 1.987s, as a measure
of the work done by the big-integer arithmetics.)

-server about halves the runtime. My initial -server-free code took
about 9s [with BigDecimal], but some straightforward optimisation work
(none of which comes close to the obvious caching cheat; just reordering
the tests and ensuring that the important values 0 and 1 were testable
with ==) reduced that.

I tried the obvious tactic of going for double-dispatches to avoid
multiple consecutive instanceof tests, but that made it /slower/,
perhaps because hotspot couldn't collapse code across method
boundaries? I don't really mind because in this case the multiple
dispatches made the code rather more long-winded than it already
was.

Count the line-size yourselves and subtract lines you deem not
relevant to the benchmark.

;;; -- code starteth here ----------------------------------

package temp;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertSame;

import java.math.BigDecimal;
import java.util.StringTokenizer;

import org.junit.Test;

/*


let rec ( +: ) f g = match f, g with
| `Int n, `Int m -> `Int (n +/ m)
| `Int (Int 0), e | e, `Int (Int 0) -> e
| f, `Add(g, h) -> f +: g +: h
| f, g -> `Add(f, g)

let rec ( *: ) f g = match f, g with
| `Int n, `Int m -> `Int (n */ m)
| `Int (Int 0), e | e, `Int (Int 0) -> `Int (Int 0)
| `Int (Int 1), e | e, `Int (Int 1) -> e
| f, `Mul(g, h) -> f *: g *: h
| f, g -> `Mul(f, g)

let rec simplify = function
| `Int _ | `Var _ as f -> f
| `Add (f, g) -> simplify f +: simplify g
| `Mul (f, g) -> simplify f *: simplify g

*/
public class SimplifyViaInstanceOf
{ @Test public void ensureParseCompleteConstants()
{
assertSame( zero, parse( "0" ) );
assertSame( one, parse( "1" ) );
assertEquals( new Val( new BigDecimal( "17" ) ), parse( "17" ) );
assertEquals( new Val( new BigDecimal( "1234567890" ) ), parse( "1234567890" ) );
}

@Test public void ensureParseCompleteNames()
{
assertEquals( new Sym( "x" ), parse( "x" ) );
assertEquals( new Sym( "y" ), parse( "y" ) );
assertEquals( new Sym( "calico" ), parse( "calico" ) );
assertEquals( new Sym( "whoot" ), parse( "whoot" ) );
}

@Test public void ensureParseSimpleExpression()
{
assertEquals( new Add( new Sym( "x" ), new Sym( "y" ) ), parse( "(x+y)" ) );
assertEquals( new Add( one, new Sym( "y" ) ), parse( "(1+y)" ) );
assertEquals( new Add( zero, new Sym( "y" ) ), parse( "(0+y)" ) );
assertEquals( new Add( new Sym( "x" ), zero ), parse( "(x+0)" ) );
assertEquals( new Mul( new Sym( "x" ), new Sym( "y" ) ), parse( "(x*y)" ) );
assertEquals( new Mul( new Sym( "z" ), zero ), parse( "(z*0)" ) );
assertEquals( new Mul( zero, new Sym( "x" ) ), parse( "(0*x)" ) );
}

@Test public void ensureParseCompoundExpression()
{
assertEquals( new Mul( parse( "(x+y)" ), one ), parse( "((x+y)*1)" ) );
assertEquals( new Mul( parse( "(x+y)" ), zero ), parse( "((x+y)*0)" ) );
assertEquals( new Mul( parse( "(x+y)" ), new Sym( "x" ) ), parse( "((x+y)*x)" ) );
assertEquals( new Mul( parse( "(x+y)"), parse( "(0*1)" ) ), parse( "((x+y)*(0*1))" ) );
}

private Exp parse( String string )
{
StringTokenizer st = new StringTokenizer( string, "+*()", true );
return parse( st, st.nextToken() );
}

private Exp parse( StringTokenizer st, String current )
{
char ch = current.charAt(0);
if (ch == '(')
{
Exp L = parse( st, st.nextToken() );
ch = st.nextToken().charAt( 0 );
Exp R = parse( st, st.nextToken() );
String ket = st.nextToken();
if (!ket.equals( ")" )) throw new RuntimeException( "ket needed: " + ket );
return ch == '*' ? new Mul( L, R ) : new Add( L, R );
}
else if (Character.isDigit( ch ))
return Val.create( new BigDecimal( current ) );
else if (current.equals( "*" ) || current.equals( "+" ) || current.equals( ")" ))
throw new RuntimeException( "misplaced punctuation: " + current );
else
return new Sym( current );
}

@Test public void ensureSimplifiesAddZero()
{
assertEquals( parse( "x" ), parse( "(x+0)" ).simplify() );
assertEquals( parse( "x" ), parse( "(0+x)" ).simplify() );
}

@Test public void ensureAddsLiterals()
{
assertEquals( parse( "3" ), parse( "(1+2)" ).simplify() );
assertSame( zero, parse( "0+0" ).simplify() );
}

@Test public void ensureMultipliesLiterals()
{
assertEquals( parse( "15" ), parse( "(5*3)" ).simplify() );
assertSame( zero, parse( "(0*0)" ).simplify() );
assertSame( zero, parse( "(x*0)" ).simplify() );
assertSame( zero, parse( "(0*y)" ).simplify() );
}

@Test public void ensureSimplifiesMulZero()
{
assertSame( zero, parse( "(x*0)" ).simplify() );
assertSame( zero, parse( "(0*x)" ).simplify() );
}

@Test public void ensureSimplifiesMulOne()
{
assertEquals( parse( "x" ), parse( "(x*1)" ).simplify() );
assertEquals( parse( "x" ), parse( "(1*x)" ).simplify() );
}

@Test public void ensureLeftAssociatesAdds()
{
assertEquals( parse( "((x+y)+z)" ), parse( "(x+(y+z))" ).simplify() );
}

@Test public void ensureLeftAssociatesMuls()
{
assertEquals( parse( "((x*y)*z)" ), parse( "(x*(y*z))" ).simplify() );
}

static abstract class Exp
{
abstract Exp simplify();
}

static final Val zero = new Val( BigDecimal.ZERO );
static final Val one = new Val( BigDecimal.ONE );

static class Val extends Exp
{
final BigDecimal val;

Val( BigDecimal val ) { this.val = val; }

Exp simplify() { return this; }

public String toString() { return val.toString(); }

public boolean equals( Object other )
{ return other instanceof Val && val.equals( ((Val) other).val ); }

public static Exp create( BigDecimal d )
{ return d.equals( BigDecimal.ZERO ) ? zero : d.equals( BigDecimal.ONE ) ? one : new Val( d ); }
}

static class Sym extends Exp
{
String name;
Sym( String name ) { this.name = name; }
Exp simplify() { return this; }
public boolean equals( Object other )
{ return other instanceof Sym && name.equals( ((Sym) other ).name ); }
public String toString() { return name; }
}

static abstract class Inf extends Exp
{
final Exp L, R;
Inf( Exp L, Exp R ) { this.L = L; this.R = R; }
}

static class Add extends Inf
{
Add( Exp L, Exp R ) { super( L, R ); }
Exp simplify()
{
return sum( L.simplify(), R.simplify() );
}
private Exp sum( Exp sL, Exp R )
{
if (sL == zero) return R;
if (R == zero) return sL;
if (sL instanceof Val && R instanceof Val) return Val.create( ((Val) sL).val.add( ((Val) R).val ) );
if (R instanceof Add) return sum( sum( sL, ((Add) R).L ), ((Add) R).R );
return new Add( sL, R );
}
public String toString() { return "(" + L + " + " + R + ")"; }

public boolean equals( Object other )
{ return other instanceof Add && same( (Add) other ); }

private boolean same( Add other )
{ return L.equals( other.L ) && R.equals( other.R ); }
}

static class Mul extends Inf
{
Mul( Exp L, Exp R ) { super( L, R ); }
Exp simplify(){ return prod( L.simplify(), R.simplify() ); }
private Exp prod( Exp L, Exp R )
{
if (L == zero || R == zero) return zero;
if (L == one) return R;
if (R == one) return L;
if (L instanceof Val && R instanceof Val) return Val.create( ((Val) L).val.multiply( ((Val) R).val ) );
if (R instanceof Mul) return prod( prod( L, ((Mul) R).L ), ((Mul) R).R );
return new Mul( L, R );
}
public String toString() { return "(" + L + " * " + R + ")"; }

public boolean equals( Object other )
{ return other instanceof Mul && same( (Mul) other ); }

private boolean same( Mul other )
{ return L.equals( other.L ) && R.equals( other.R ); }
}

public static void main( String [] args )
{
// [* x [+ [+ [* 12 0] [+ 23 8]] y]]
Exp a28_8 = new Add( Val.create( new BigDecimal( 23 ) ), Val.create( new BigDecimal( 8 ) ) );
Exp m12_0 = new Mul( Val.create( new BigDecimal( 12 ) ), Val.create( new BigDecimal( 0 ) ) );
Exp aAB = new Add( m12_0, a28_8 );
Exp aABY = new Add( aAB, new Sym( "y" ) );
Exp it = new Mul( new Sym( "x" ), aABY );
Exp itx = new SimplifyViaInstanceOf().parse( "(x*(((12*0)+(23+8))+y))" );
System.err.println( ">> original expression: " + it );
System.err.println( ">> original expression: " + itx );
System.err.println( ">> simplified expression: " + itx.simplify() );
long base = System.currentTimeMillis();
for (int i = 0; i < 10000000; i += 1) it.simplify();
long after = System.currentTimeMillis();
System.err.println( ">> took " + (after - base)/1000.0 + "s" );
}
}


--
Hewlett-Packard Limited registered office: Cain Road, Bracknell,
registered no: 690597 England Berks RG12 1HN

Jon Harrop

unread,
Dec 12, 2007, 10:44:32 AM12/12/07
to
Chris Dollin wrote:
> OK, my broken-handed implementation running on a 3.06GHz Xeon with 2GB
> memory (not that that matters much, since the JVM is only using the
> default which is more like 64M) using FC5 with
>
> Java(TM) SE Runtime Environment (build 1.6.0-b105)
> Java HotSpot(TM) Client VM (build 1.6.0-b105, mixed mode, sharing)
>
> and running (of course) with -server takes 2.498s, 2.94s, 3.002s,
> for the 10000000-times loop (ie neglecting start-up times, which
> according to `time` are about 0.2s). I don't know why there's so
> much variability in the reported times.

Great.

> Drat. OK, those times are using BigDecimal, not BigInteger, which
> gets me 3.258s, 3.281s, 3.207s as the loop times. (Using plain
> `int` gets those times down to 1.986s, 1.62s, 1.987s, as a measure
> of the work done by the big-integer arithmetics.)

Let's forget about the arbitrary-precision stuff and focus on the rest for
now. On my machine, your original code gives 2.853s. Changing to
machine-precision ints, that drops to 1.508s. The equivalent OCaml takes
only 0.711s.

> -server about halves the runtime. My initial -server-free code took
> about 9s [with BigDecimal], but some straightforward optimisation work
> (none of which comes close to the obvious caching cheat; just reordering
> the tests and ensuring that the important values 0 and 1 were testable
> with ==) reduced that.

Yes, the OCaml compiler will automatically factor common subexpressions like
the constant integer expressions.

> I tried the obvious tactic of going for double-dispatches to avoid
> multiple consecutive instanceof tests, but that made it /slower/,
> perhaps because hotspot couldn't collapse code across method
> boundaries? I don't really mind because in this case the multiple
> dispatches made the code rather more long-winded than it already
> was.

Interesting.

> Count the line-size yourselves and subtract lines you deem not
> relevant to the benchmark.

I get 175 LOC for Roger's Java code, 117 LOC for your Java and 40 LOC for my
OCaml.

Roger Lindsjö

unread,
Dec 12, 2007, 3:49:15 PM12/12/07
to
I have been talking to Jon off line and I used Chris Dollin's idea of
using cached values for 0 and 1. (Chris's and my code were actually quite
similar from the start, and on par for speed, but after using Chris's idea
of cached 0 and 1 mine ends up 10% faster or so.) I have changed some names
to look more like Chris's code to make them easier to compare. Initially I
did not use instanceof but instead gave each element a type field which I
could use to check the type. This was actually slower than instanceof
(which surprised me).

Using BigInteger or BigDecimal made no difference in my test, so I used
BigDecimal.

The machine I used was a 2.4GHz Core 2 Quad with 2GB memory, and while
running any of the programs they seemed to use 100% of one core.

java Simplify 10000000 3


[* x [+ [+ [* 12 0] [+ 23 8]] y]]

Bench 0
Took 1.16 seconds
Bench 1
Took 1.11 seconds
Bench 2
Took 1.104 seconds


[* x [+ 31 y]]

However, after installing Ocaml I compiled Jon's program and ran it. The
output was (I modified it to run several times):
time ./simplify
Took 1.514769s
Took 1.515770s
Took 1.518769s

So now I'm confused: according to Jon this code should be about twice
as fast as the Java code, but in my tests Java is slightly faster. Why
is that? Perhaps I have the wrong compiler; mine says 3.09.3. There
seems to be a 3.10 out, but it wasn't part of the regular Fedora 7
repository. Could upgrading to 3.10 actually make the program at least
twice as fast?

Can anyone explain? Could it be that on my architecture HotSpot is
better at optimizing than the Ocaml compiler?

I then modified my program to use long instead of BigDecimal and got the
following:
java SimplifyLong 10000000 3


[* x [+ [+ [* 12 0] [+ 23 8]] y]]

Bench 0
Took 0.61 seconds
Bench 1
Took 0.549 seconds
Bench 2
Took 0.558 seconds


[* x [+ 31 y]]

Simplify.java
import java.math.BigDecimal;

public class Simplify {
public static void main(String[] args) {
int lim = Integer.parseInt(args[0]);
int nbench = Integer.parseInt(args[1]);

Expr expr =
mul(var("x"),
add(
add(mul(rat(12),rat(0)), add(rat(23),rat(8))),
var("y")
)
);

System.out.println(expr);
Expr reduced = null;
for (int batch = 0; batch < nbench; batch++) {
long start = System.currentTimeMillis();
System.err.println("Bench " + batch);
for (int i = 0; i < lim; i++) {
reduced = expr.simplify();
}
System.err.println("Took " +
(System.currentTimeMillis() - start) / 1000.0 + " seconds");
}
System.out.println(reduced);
}

private static Expr mul(Expr a, Expr b) {
return new Mul(a,b);
}
private static Expr add(Expr a, Expr b) {
return new Add(a,b);
}
private static Expr rat(long a) {
return Val.create(BigDecimal.valueOf(a));
}
private static Expr var(String s) {
return new Sym(s);
}

public static abstract class Expr {
public abstract Expr simplify();
}

public static final class Add extends Expr {
private final Expr l, r;

public Add(Expr l, Expr r) {
this.l = l;
this.r = r;
}

public Expr simplify() {
return add(l.simplify(), r.simplify());
}

private Expr add(Expr l, Expr r) {
if (Val.ZERO == l) {
return r;
} else if (Val.ZERO == r) {
return l;
} else if (l instanceof Val && r instanceof Val) {
return add((Val) l, (Val) r) ;
} else if (r instanceof Add) {
return add(l, (Add) r);
} else {
return new Add(l, r);
}
}

private Expr add(Val l, Val r) {
return Val.create(l.getValue().add(r.getValue()));
}

private Expr add(Expr l, Add r) {
return new Add(new Add(l, r.l), r.r).simplify();
}

public String toString() {
return "[+ " + l + " " + r +"]";
}
}

public static class Mul extends Expr {
private final Expr l, r;
public Mul(Expr l, Expr r) {
this.l = l;
this.r = r;
}

public Expr simplify() {
return mul(l.simplify(), r.simplify());
}

private Expr mul(Expr l, Expr r) {
if (Val.ZERO == l || Val.ZERO == r) {
return Val.ZERO;
} else if (Val.ONE == l) {
return r;
} else if (Val.ONE == r) {
return l;
} else if (l instanceof Val && r instanceof Val) {
return mul((Val) l, (Val) r) ;
} else if (r instanceof Mul) {
return mul(l, (Mul) r);
} else {
return new Mul(l, r);
}
}

private Expr mul(Val l, Val r) {
return Val.create(l.getValue().multiply(r.getValue()));
}

private Expr mul(Expr l, Mul r) {
return new Mul(new Mul(l, r.l), r.r).simplify();
}

public String toString() {
return "[* " + l + " " + r +"]";
}
}

public static class Sym extends Expr {
private final String symbol;

public Sym(String symbol) {
this.symbol = symbol;
}

public Expr simplify() {
return this;
}

public String toString() {
return symbol;
}
}

public static class Val extends Expr {
public static final Val ZERO = new Val(BigDecimal.ZERO);
public static final Val ONE = new Val(BigDecimal.ONE);

private final BigDecimal value;

private Val(BigDecimal value) {
this.value = value;
}

public Expr simplify() {
return this;
}

public static Val create(BigDecimal value) {
if (BigDecimal.ZERO == value || BigDecimal.ZERO.equals(value)) {
return ZERO;
} else if (BigDecimal.ONE == value || BigDecimal.ONE.equals(value)) {
return ONE;
} else {
return new Val(value);
}
}

public BigDecimal getValue() {
return value;
}

public String toString() {
return String.valueOf(value);
}
}
}

//Roger Lindsjö

Steve Wampler

unread,
Dec 12, 2007, 5:49:02 PM12/12/07
to
Jon Harrop wrote:
> Following recent discussions here about benchmarking and when Java's
> performance is worst, I proposed the idea of symbolic computation because
> this entails very high allocation rates for short-lived objects when
> written in an idiomatic functional style and Java should be uniquely bad at
> this. While you might be able to work around the functional style using
> imperative constructs, it will be extremely tedious and error-prone for
> such applications.
>
> So, perhaps someone would like to translate this simple symbolic simplifier
> into Java:
>
> http://www.lambdassociates.org/studies/study10.htm

Here's another Java one. I didn't bother with parsing an
input string because it looks like people were timing
just the time to walk the parse tree and simplify it.

I've been hesitant to post it because it feels like cheating
(and the code isn't very OO - I actually tend to think in
a different language for these puzzles and then cast the
result into Java).

Here are times on my dual-Opteron@2GHz, using BigInteger,
each run does 10,000,000 simplifications and then outputs
the original string and the simplified version (taken
from the last run):

------------------------------------------------------
time java -server Simplify 10000000 10
10000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0090
10000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
10000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
10000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
10000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
10000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
10000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
10000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
10000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0010
10000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
java -server Simplify 10000000 10 0.11s user 0.03s system 103% cpu 0.135 total
------------------------------------------------------

Why so fast? Because the simplified string is built up as the parse
is constructed. So the simplification loop just outputs the result!
I'll grant that this isn't in the spirit of things, but I couldn't see
where it was prohibited :) After all, if you're allowed to simplify
while building the parse tree, this isn't much more of a stretch...

(Naturally, this would work when parsing of input strings is done instead
of hand-building the parse tree, as well.)

-----------------------------------------------------
import java.math.BigInteger;
import java.util.Date;

public class Simplify {

private static BigInteger zero = new BigInteger("0");
private static BigInteger one = new BigInteger("1");

private class Expr {
public String type;
public Expr left;
public Expr right;
public BigInteger iVal;
public String sVal;

public Expr(int nVal) {
type = "int";
iVal = new BigInteger(""+nVal);
sVal = iVal.toString();
}

public Expr(String var) {
type = "var";
sVal = var;
}

public Expr(String nType, Expr nLeft, Expr nRight) {
if (nType == "+") {
if (nLeft.eq(zero)) { mkCopy(nRight); }
else if (nRight.eq(zero)) { mkCopy(nLeft); }
else if ((nLeft.type == "int") && (nRight.type == "int")) {
type = "int";
iVal = nLeft.iVal.add(nRight.iVal);
sVal = iVal.toString();
}
else lassoc(nType, nLeft, nRight);
}
if (nType == "*") {
if (nLeft.eq(one)) { mkCopy(nRight); }
else if (nRight.eq(one)) { mkCopy(nLeft); }
else if ((nLeft.type == "int") && (nRight.type =="int")) {
type = "int";
iVal = nLeft.iVal.multiply(nRight.iVal);
sVal = iVal.toString();
}
else lassoc(nType, nLeft, nRight);
}
}

private void lassoc(String nType, Expr nLeft, Expr nRight) {
if (nRight.type == nType) {
type = nType;
left = new Expr(nType, nLeft, nRight.left);
right = nRight.right;
sVal = "["+nType+" "+left+" "+right+"]";
}
else mkCopy(nType, nLeft, nRight, zero,
"["+nType+" "+nLeft+" "+nRight+"]");
}

private void mkCopy(Expr e) {
mkCopy(e.type, e.left, e.right, e.iVal, e.sVal);
}

private void mkCopy(String nType, Expr nLeft, Expr nRight,
BigInteger nVal, String nsVal) {
type = nType;
left = nLeft;
right = nRight;
iVal = nVal;
sVal = nsVal;
}

public boolean eq(BigInteger i) {
return (type == "int") && (iVal.equals(i));
}

public String toString() { return sVal; }

public String simplify() { return sVal; }
}

public static String simplify(Expr expr) {
return expr.simplify();
}

public void runTest(int limit) {
String s = "[* x [+ [+ [* 12 0][+ 23 8]] y ]]";
System.out.print(""+limit+" '"+s+"' -> ");
Expr e = new Expr("*",
new Expr("x"),
new Expr("+",
new Expr("+",
new Expr("*", new Expr(12), new Expr(0) ),
new Expr("+", new Expr(23), new Expr(8) ) ) ,
new Expr("y") ) );
Date start = new Date();
String newS = null;
for (int i = 0; i < limit; ++i) {
newS = simplify(e);
}
Date stop = new Date();
System.out.println("'"+newS+"':: "+
(stop.getTime() - start.getTime())/1000.0);
}

public static void main(String[] args) {

int limit = 10000000;
if (args.length > 0) limit = Integer.parseInt(args[0]);
int nRuns = 10;
if (args.length > 1) nRuns = Integer.parseInt(args[1]);

Simplify simplify = new Simplify();
for (int i = 0; i < nRuns; ++i) {
simplify.runTest(limit);
}
System.exit(0);
}
}
-----------------------------------------------------

Chris Dollin

unread,
Dec 13, 2007, 3:36:25 AM12/13/07
to
Steve Wampler wrote:
> Here's another Java one. I didn't bother with parsing an
> input string because it looks like people were timing
> just the time to walk the parse tree and simplify it.
>
> I've been hesitant to post it because it feels like cheating
> (and the code isn't very OO - I actually tend to think in
> a different language for these puzzles and then cast the
> result into Java).

> Why so fast? Because the simplified string is built up as the parse
> is constructed. So the simplification loop just outputs the result!
> I'll grant that this isn't in the spirit of things, but I couldn't see
> where it was prohibited :) After all, if you're allowed to simplify
> while building the parse tree, this isn't much more of a stretch...

I thought of that and decided it wasn't in the spirit of the test; it
does show that without the context the problem originally came from,
one doesn't know if it's a realistic optimisation or not, and it shows
how algorithmic changes can beat mere [micro] optimisation. The other
trick I thought of, almost but not quite a similar complete cheat, is
to give each node a slot to hold its simplified form and store it the
first time it gets computed. In this case, that would mean that the
first pass through the loop would do all the computation and the remaining
passes would be trivial ...

[I know Jon wanted immutable nodes, but these /would/ be immutable nodes;
there's no external visible change to the behaviour except performance.
Also, given that Java doesn't even pretend to be functional, disallowing
assignments could be seen as bias ... I do think it's sensible to make
the expression nodes /functionally/ immutable.]
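
(A minimal, hypothetical sketch of that memo-slot idea, reusing the Exp/Sym/Add names from the code above and eliding the rewrite rules: the node computes its simplified form once, stores it, and afterwards simplify() is just a field read, while the node stays functionally immutable.)

abstract class Exp {
    private Exp cached;                              // memo slot, written at most once
    final Exp simplify() {
        if (cached == null) cached = doSimplify();   // first call does the real work
        return cached;                               // later calls return the stored result
    }
    abstract Exp doSimplify();
}

final class Sym extends Exp {
    final String name;
    Sym(String name) { this.name = name; }
    Exp doSimplify() { return this; }
}

final class Add extends Exp {
    final Exp l, r;
    Add(Exp l, Exp r) { this.l = l; this.r = r; }
    Exp doSimplify() {
        Exp sl = l.simplify(), sr = r.simplify();
        // the usual rewrite rules (fold constants, drop zeros, re-associate) go here;
        // this sketch just falls through to building a fresh node
        return new Add(sl, sr);
    }
}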

--
Chris "puns is us" Dollin

Hewlett-Packard Limited Cain Road, Bracknell, registered no:
registered office: Berks RG12 1HN 690597 England

Steve Wampler

unread,
Dec 13, 2007, 7:23:21 AM12/13/07
to
Chris Dollin wrote:
> The other
> trick I thought of, almost but not quite a similar complete cheat, is
> to give each node a slot to hold its simplified form and store it the
> first time it gets computed. In this case, that would mean that the
> first pass through the loop would do all the computation and the remaining
> passes would be trivial ...

Yes, that's essentially what my original (unpublished) solution (which includes
the parsing of the string inside every iteration of the inner timing loop) does. I
wrote it with a string transformation approach instead of thinking of it
as a parse tree approach and used caching to avoid reparsing any
previously parsed substring. Since every string (including the
input!) is a substring of itself, this means that it is slow on the first
iteration and reduces to a cache lookup on iterations 2-10,000,000
for this 'benchmark'.

So, the times, where the input string is passed to the simplifier on
each iteration are roughly 0.12 ms per 10,000,000 iterations. The
parsing code is really a quick and (very) dirty hack, so I decided
not to put it out.

I'm still debating with myself about whether or not saving the
simplified string is a cheat or not - it's just a representation of
the parse tree *at that point*, so if you're allowed to simplify
while building the parse tree, why not save its representation
as well? After all, without it, the challenge really reduces to
how fast you can recurse, concatenate strings, and convert big
integers to strings - not really do symbolic math manipulations.
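
(A minimal, hypothetical sketch of that kind of memoisation, with invented names and the real parse/rewrite pass left as a placeholder: the first call pays the full cost, every later call with the same input string is a single map lookup.)

import java.util.HashMap;
import java.util.Map;

class CachingSimplifier {
    private final Map<String, String> cache = new HashMap<String, String>();

    String simplify(String input) {
        String hit = cache.get(input);
        if (hit != null) return hit;                // iterations 2..N end here
        String result = parseAndSimplify(input);    // full parse + rewrite, first time only
        cache.put(input, result);
        return result;
    }

    // placeholder for the real work: parse the string, apply the rewrite rules,
    // and print the simplified tree back out as a string
    private String parseAndSimplify(String input) {
        return input;
    }
}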

Incidentally, I noticed I had written

public Expr(int i) {...

where it should be:

public Expr(BigInteger i)

instead. (And the hand-built parse tree would have to use "new BigInteger(...)" instead
of int literals). Doesn't affect the timings enough to notice.

Also, the code could be cleaned up...and made more OO. Then it would be pretty much the
same as your solution, of course.

Chris Dollin

unread,
Dec 13, 2007, 8:34:19 AM12/13/07
to
Steve Wampler wrote:

> Chris Dollin wrote:

(stuff about caching, and at the end of Steve's response he wrote:)

> Also, the code could be cleaned up...and made more OO. Then it would be
> pretty much the same as your solution, of course.

Could easily be better. I don't like using `instanceof` -- it doesn't
seem to me to be in the spirit of an OO language. Type decisions should
be made by polymorphism. But, as I said, my attempt at using polymorphism
via multiple-dispatch was slower than my instanceof code. Perhaps I gave
up too early ...

--
Chris "giving in until he gets his own way" Dollin

Roger Lindsjö

unread,
Dec 13, 2007, 10:25:29 AM12/13/07
to
I have done some more tests, using the same Ocaml compiler (3.09.3)
and the same Java (1.6 update 3). The times posted are averages over a
few iterations. The processes were locked to one CPU so that the GC could
not run on a separate CPU, but this didn't make any noticeable
difference. Both machines run recent (but not the same) GNU/Linux kernels
(Fedora Core and Fedora systems).

Machine 1 is a 3.2GHz Pentium with 2 cores (not hyperthreading), a 64-bit
machine (were they called Pentium D?), with 2GB memory.
Java: 5.123 s
Ocaml: 1.142 s

Machine 2 is a 2.4GHz Core 2 Quad also with 2GB memory.
Java: 1.074 s
Ocaml: 1.515 s

The compilers must be very different in how they utilize the cpu
architecture. Sitting on machine 1 (not my primary computer) I'd say
Ocaml is faster, but sitting on machine 2 (my primary machine ;-) ) I'd
say Java is faster.

Perhaps Java is better at utilizing the large cache on machine 2? I
don't know.

//Roger Lindsjö

Steve Wampler

unread,
Dec 13, 2007, 10:55:29 AM12/13/07
to
Roger Lindsjö wrote:
> The compilers must be very different in how they utilize the cpu
> architecture. Sitting on machine 1 (not my primary computer) I'd say
> Ocaml is faster, but sitting on machine 2 (my primary machine ;-) ) I'd
> say Java is faster.

Thanks for testing this - I've been trying to figure out why Jon and I
have been getting such different results. This may well explain it.

Lew

unread,
Dec 13, 2007, 12:03:28 PM12/13/07
to
Steve Wampler wrote:
> Roger Lindsjö wrote:
>> The compilers must be very different in how they utilize the cpu
>> architecture. Sitting on machine 1 (not my primary computer) I'd say
>> Ocaml is faster, but sitting on machine 2 (my primary machine ;-) )
>> I'd say Java is faster.
>
> Thanks for testing this - I've been trying to figure out why Jon and I
> have been getting such different results. This may well explain it.

This supports the points of those who say, "Watch out for benchmarks," and,
"Java can be as fast as C++," and, "Java is slower than C++," and, "It depends."

--
Lew

Jon Harrop

unread,
Dec 13, 2007, 3:36:15 PM12/13/07
to
Chris Dollin wrote:
> [I know Jon wanted immutable nodes...

I don't mind mutable nodes but I don't want simplification to destroy its
input expression.

Jon Harrop

unread,
Dec 13, 2007, 3:39:04 PM12/13/07
to
Steve Wampler wrote:
> Why so fast? Because the simplified string is built up as the parse
> is constructed. So the simplification loop just outputs the result!
> I'll grant that this isn't in the spirit of things, but I couldn't see
> where it was prohibited :) After all, if you're allowed to simplify
> while building the parse tree, this isn't much more of a stretch...

To be honest, I'm not really sure how this relates to the original challenge
or to my OCaml. Specifically, I haven't used any strings or done any
parsing.

The previous Java implementations weren't benchmarking parsing, were they?

Jon Harrop

unread,
Dec 13, 2007, 3:42:12 PM12/13/07
to
Chris Dollin wrote:
> Steve Wampler wrote:
>> Chris Dollin wrote:
>> Also, the code could be cleaned up...and made more OO. Then it would be
>> pretty much the same as your solution, of course.
>
> Could easily be better. I don't like using `instanceof` -- it doesn't
> seem to me to be in the spirit of an OO language. Type decisions should
> be made by polymorphism. But, as I said, my attempt at using polymorphism
> via multiple-dispatch was slower than my instanceof code. Perhaps I gave
> up too early ...

Yes, this is exactly where OOP ceases to be useful. I believe the idiomatic
OOP solution is to implement lots of different member functions reflecting
pieces of the pattern match. For example, you would have an add_int member
function to add an int and specialize the add_int member of the Int class
to create a new Int.

I might have a go but I've no idea what would be efficient in Java.
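
(For what it's worth, a hypothetical sketch of that double-dispatch style for the addition rule only; plus, numPlusThis and exprPlusThis are invented names, multiplication would need analogous hooks, and whether HotSpot runs this faster than instanceof chains is exactly the open question.)

abstract class Expr {
    abstract Expr plus(Expr g);                        // f.plus(g): the `+:` rewrite
    // second-dispatch hooks: "f + this", where f's type is encoded in the method name
    Expr numPlusThis(Num f)   { return exprPlusThis(f); }
    Expr exprPlusThis(Expr f) { return new Add(f, this); }
}

final class Num extends Expr {
    final long n;
    Num(long n) { this.n = n; }
    Expr plus(Expr g) { return n == 0 ? g : g.numPlusThis(this); }      // 0 + e -> e
    Expr numPlusThis(Num f)   { return new Num(f.n + n); }              // fold constants
    Expr exprPlusThis(Expr f) { return n == 0 ? f : new Add(f, this); } // e + 0 -> e
}

final class Var extends Expr {
    final String name;
    Var(String name) { this.name = name; }
    Expr plus(Expr g) { return g.exprPlusThis(this); }
}

final class Add extends Expr {
    final Expr l, r;
    Add(Expr l, Expr r) { this.l = l; this.r = r; }
    Expr plus(Expr g) { return g.exprPlusThis(this); }
    // f + (l + r) -> (f + l) + r, re-associating to the left
    Expr exprPlusThis(Expr f) { return f.plus(l).plus(r); }
}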

Roedy Green

unread,
Dec 13, 2007, 4:04:02 PM12/13/07
to
On Mon, 10 Dec 2007 09:38:35 +0000, Jon Harrop <use...@jdh30.plus.com>
wrote, quoted or indirectly quoted someone who said :

> high allocation rates for short-lived objects when
>written in an idiomatic functional style and Java should be uniquely bad at
>this.

Not at all. Java can allocate objects very quickly; it just peels
them off a giant contiguous block. What takes the time is when you
have a large number of long-lived objects, since they have to be
enumerated each GC pass. Short-lived objects are recovered in zero
time.
--
Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Roedy Green

unread,
Dec 13, 2007, 4:09:44 PM12/13/07
to
On Thu, 13 Dec 2007 16:25:29 +0100, Roger Lindsjö
<news....@tilialacus.net> wrote, quoted or indirectly quoted
someone who said :

>Machine 2 is a 2.4GHz Core 2 Quad also with 2GB memory.


>Java: 1.074 s
>Ocaml: 1.515 s

If you pass me the Java code, I can run it both with java.exe and with
Jet. That ratio would give you another datapoint, compiled Java.

I have a dual core Athlon. I think there are no threads in your
implementation (are there in the Ocaml?), you would expect dual core
to buy you nothing.

Steve Wampler

unread,
Dec 13, 2007, 4:43:02 PM12/13/07
to
Jon Harrop wrote:
> Steve Wampler wrote:
>> Why so fast? Because the simplified string is built up as the parse
>> is constructed. So the simplification loop just outputs the result!
>> I'll grant that this isn't in the spirit of things, but I couldn't see
>> where it was prohibited :) After all, if you're allowed to simplify
>> while building the parse tree, this isn't much more of a stretch...
>
> To be honest, I'm not really sure how this relates to the original challenge
> or to my OCaml. Specifically, I haven't used any strings or done any
> parsing.
>
> The previous Java implementations weren't benchmarking parsing, were they?

That's correct. I had originally assumed that the program should work
for different expressions and not have expression trees hard-coded. So
I took that to mean that some form of string input was needed so that
expressions could be parsed into the expression tree. That led me to
a string transformation solution as a first try. When I saw the C++
solution to the derivative problem you offered as an example of something
similar, I realized that there was no need to read a string and parse it.
But my earlier misinterpretation keeps me coming back to referring to the
hard-coded internal representation as a parse tree (that, and my previous
life writing compilers).

Since the simplification can be done when that hard-coded parse tree
is built, the timing loop to simplify the constructed tree actually
doesn't have to do anything to get the simplified expression at all -
it's already there. (My statement about the loop just outputting the
result isn't true; that's not needed either, after all!) Chris's solution,
for example, just walks the tree and builds a copy of it. But I'm
not convinced that's necessary - you could just return the tree without
a copy. (I went ahead and returned that tree in a string representation,
because that made it easier for me to test things, and I could do that
step quickly.)

[The version I posted had an error in the string representation of the
result of converting right associativity to left (teaches me to code
while battling a mild case of insomnia...). I found and fixed that
while cleaning up the code.]

It would probably be more interesting to include the code that
constructs the expression tree in the timing loop. You may very
well have had that in mind, but that's not what I'd seen tested.

Steve Wampler

unread,
Dec 13, 2007, 4:46:10 PM12/13/07
to
Roedy Green wrote:
> I have a dual core Athlon. I think there are no threads in your
> implementation (are there in the Ocaml?), you would expect dual core
> to buy you nothing.

I don't know how Jet handles the execution environment, but Sun's
JDK runs lots of administrative threads while a program is running.
I would expect some benefit from dual cores with that, but apparently
(from Roger's note) it's not much.

Roger Lindsjö

unread,
Dec 13, 2007, 5:16:57 PM12/13/07
to
Roedy Green wrote:
> On Thu, 13 Dec 2007 16:25:29 +0100, Roger Lindsjö
> <news....@tilialacus.net> wrote, quoted or indirectly quoted
> someone who said :
>
>> Machine 2 is a 2.4GHz Core 2 Quad also with 2GB memory.
>> Java: 1.074 s
>> Ocaml: 1.515 s
>
> If you pass me the Java code, I can run it both with java.exe and with
> Jet. That ratio would give you another datapoint, compiled Java.
>
> I have a dual core Athlon. I think there are no threads in your
> implementation (are there in the Ocaml?), you would expect dual core
> to buy you nothing.

No threading other than what the JVM does internally for housekeeping
such as GC. I even ran restricted to one core (using taskset) since I
was told that Ocaml was single threaded.

The code is 3 posts up, the one I was replying to:
http://groups.google.com/group/comp.lang.java.programmer/msg/b5185f597ba62103

//Roger Lindsjö

Steve Wampler

unread,
Dec 13, 2007, 5:10:56 PM12/13/07
to
Steve Wampler wrote:
> Since the simplification can be done when that hard-coded parse tree
> is built, the timing loop to simplify the constructed tree actually
> doesn't have to do anything to get the simplified expression at all -
> it's already there.

Incidentally, I think Java figured that out also. Here are the times
I get for 10 runs of 2,000,000,000:

->time java -server Simplify 2000000000 10
2000000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.01
2000000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
2000000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
2000000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
2000000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
2000000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
2000000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
2000000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
2000000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0
2000000000 '[* x [+ [+ [* 12 0][+ 23 8]] y ]]' -> '[* x [+ 31 y]]':: 0.0010
java -server Simplify 2000000000 10 0.11s user 0.02s system 102% cpu 0.127 total
->

I suspect that means it's figured out I have a loop invariant. Apparently
it has more trouble figuring that out when you actually do something.
Probably because new objects are built while 'cloning' the tree.
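
(A generic, hypothetical sketch, not code from this thread: one way to keep such a timing loop honest is to make the per-iteration work depend on the loop variable and to fold every result into a value that is printed afterwards, so the JIT can neither hoist the call as a loop invariant nor discard it as dead code. HonestLoop and work() are stand-ins for "build an expression that depends on i and simplify it".)

public class HonestLoop {
    // stand-in for per-iteration work whose input varies and whose result is used
    static int work(int i) {
        return Integer.toString(i).length();
    }

    public static void main(String[] args) {
        int limit = 10000000;
        long sink = 0;
        long start = System.currentTimeMillis();
        for (int i = 0; i < limit; ++i) {
            sink += work(i);             // the result is consumed, so the call is live
        }
        long elapsed = System.currentTimeMillis() - start;
        // printing sink makes the loop's effect observable, so it cannot be removed
        System.out.println("sink = " + sink + ", took " + elapsed / 1000.0 + "s");
    }
}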

Jon Harrop

unread,
Dec 13, 2007, 6:27:16 PM12/13/07
to
Roedy Green wrote:
> On Mon, 10 Dec 2007 09:38:35 +0000, Jon Harrop <use...@jdh30.plus.com>
> wrote, quoted or indirectly quoted someone who said :
>> high allocation rates for short-lived objects when
>> written in an idiomatic functional style and Java should be uniquely bad
>> at this.
>
> Not at all. Java can allocate objects very quickly; it just peels
> them off a giant contiguous block. What takes the time is when you
> have a large number of long-lived objects, since they have to be
> enumerated each GC pass. Short-lived objects are recovered in zero
> time.

Then you'll be able to improve the Java implementations of these benchmarks
so that they aren't up to 5x slower than an unoptimized alternative.

Lew

unread,
Dec 13, 2007, 8:32:32 PM12/13/07
to
Roger Lindsjö wrote:
> No threading other than what the JVM does internally for housekeeping
> such as GC. I even ran restricted to one core (using taskset) since I
> was told that Ocaml was single threaded.

If it's true that Ocaml is single-threaded, then that kills it for a wide
range of applications for which Java is extremely suitable. Performance would
be moot, even if the numbers weren't so close.

--
Lew

Daniel Dyer

unread,
Dec 13, 2007, 9:10:07 PM12/13/07
to

http://caml.inria.fr/pub/ml-archives/caml-list/2002/11/64c14acb90cb14bedb2cacb73338fb15.en.html

I briefly looked at OCaml because I'd heard a lot of good things about it.
The threading thing did put me off. It seems there are issues with its
garbage collector implementation in this respect.

Dan.

--
Daniel Dyer
http://www.uncommons.org

Jon Harrop

unread,
Dec 14, 2007, 4:43:56 AM12/14/07
to

I think there are now two problems:

1. We're measuring bigint arithmetic but we don't really care about it.

2. The input should be such that the evaluation is irreducible. In
particular, the input must not be hardcoded.

We can address these problems quite easily by using machine-precision ints
(whatever they are, we'll just make sure the problem doesn't overflow them)
and doing something juicier with symbolic computation like a big derivative
or multiplying out polynomials.
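
(A hypothetical sketch of point 2, written against the helpers in Roger's earlier-posted Simplify class (rat, var, add, mul) and assuming it is added as a method inside that class: build a random expression tree from a fixed seed, so every language's version can construct the same input while nothing about the answer is hardcoded. Leaf values are kept small so machine-precision ints would suffice.)

// could be dropped into Simplify next to main()
static Expr gen(java.util.Random rnd, int depth) {
    if (depth == 0) {
        switch (rnd.nextInt(3)) {
            case 0:  return rat(rnd.nextInt(10));   // small integer leaf
            case 1:  return var("x");
            default: return var("y");
        }
    }
    Expr l = gen(rnd, depth - 1);
    Expr r = gen(rnd, depth - 1);
    return rnd.nextBoolean() ? add(l, r) : mul(l, r);
}

The timing loop would then call, say, gen(new java.util.Random(42), 12) once (or per iteration) instead of simplifying the same hand-built expression over and over.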

Jon Harrop

unread,
Dec 14, 2007, 4:51:22 AM12/14/07
to

OCaml simply uses processes for concurrency rather than threads. In terms of
capabilities, there is no difference.

l...@mail.com

unread,
Dec 14, 2007, 5:27:01 AM12/14/07
to
On Dec 14, 4:16 am, Roger Lindsjö <news.nos...@tilialacus.net> wrote:
> Roedy Green wrote:
> > On Thu, 13 Dec 2007 16:25:29 +0100, Roger Lindsjö
> > <news.nos...@tilialacus.net> wrote, quoted or indirectly quoted

> > someone who said :
>
> >> Machine 2 is a 2.4GHz Core 2 Quad also with 2GB memory.
> >> Java: 1.074 s
> >> Ocaml: 1.515 s
>
> > If you pass me the Java code, I can run it both with java.exe and with
> > Jet. That ratio would give you another datapoint, compiled Java.
>
> > I have a dual core Athlon. I think there are no threads in your
> > implementation (are there in the Ocaml?), you would expect dual core
> > to buy you nothing.
>
> No threading other than what the JVM does internally for housekeeping
> such as GC. I even ran restricted to one core (using taskset) since I
> was told that Ocaml was single threaded.
>
> The code is 3 posts up, the one I was replying to:
> http://groups.google.com/group/comp.lang.java.programmer/msg/b5185f59...
>
> //Roger Lindsjö

On my 2GHz Celeron the executable produced by the latest Excelsior JET
6.0 is much faster than HotSpot Client VM 1.6.0_3, but slower than the
Server VM, which is almost twice as fast as the Client VM. Here are the
best results, selected from 10 consecutive runs:

HSS : 2.594s
JET : 3.25s
HSC : 5.172s

On a colleague's Core 2 Duo, the situation is more or less the same:

HSS : 1.206s
JET : 1.687s
HSC : 2.118s

Are you guys all using the Server VM in your comparisons?

LDV

Roger Lindsjö

unread,
Dec 14, 2007, 5:46:24 AM12/14/07
to

I'm using the Server VM by default (or rather, the JVM picks it); the output
from just running java contains:
The default VM is server,
because you are running on a server-class machine.


//Roger Lindsjö

Jon Harrop

unread,
Dec 14, 2007, 5:46:18 AM12/14/07
to
Roger Lindsjö wrote:
> I'm using the Server VM by default (or rather, the JVM picks it); the output
> from just running java contains:
> The default VM is server,
> because you are running on a server-class machine.

Me too.

Roger Lindsjö

unread,
Dec 14, 2007, 6:32:45 AM12/14/07
to
Jon Harrop wrote:
> I think there are now two problems:
>
> 1. We're measuring bigint arithmetic but we don't really care about it.

Well, actually I used BigDecimal, but that made no difference for the
test. I also made a version using long instead of BigDecimal. With longs
10000000 iterations took about .54 seconds, so about twice as fast. I'm
not sure how to correctly rewrite the Ocaml program to use the native
int. I'll post my program rewritten for long instead, so if you also
post the Ocaml program, then we can both compare relative performance.

> 2. The input should be such that the evaluation is irreducible. In
> particular, the input must not be hardcoded.
>
> We can address these problems quite easily by using machine-precision ints
> (whatever they are, we'll just make sure the problem doesn't overflow them)
> and doing something juicier with symbolic computation like a big derivative
> or multiplying out polynomials.

I think the current program works, at least I think that my solution
will not allow the HotSpot VM to optimize too much.

//Roger Lindsjö

SimplifyLong.java
public class SimplifyLong {

    public static void main(String[] args) {
        int lim = Integer.parseInt(args[0]);
        int nbench = Integer.parseInt(args[1]);

        Expr expr =
            mul(var("x"),
                add(
                    add(mul(rat(12), rat(0)), add(rat(23), rat(8))),
                    var("y")
                )
            );

        System.out.println(expr);
        Expr reduced = null;
        for (int batch = 0; batch < nbench; batch++) {
            long start = System.currentTimeMillis();
            System.err.println("Bench " + batch);
            for (int i = 0; i < lim; i++) {
                reduced = expr.simplify();
            }
            System.err.println("Took " +
                (System.currentTimeMillis() - start) / 1000.0 + " seconds");
        }
        System.out.println(reduced);
    }

    private static Expr mul(Expr a, Expr b) { return new Mul(a, b); }
    private static Expr add(Expr a, Expr b) { return new Add(a, b); }
    private static Expr rat(long a) { return Val.create(a); }
    private static Expr var(String s) { return new Sym(s); }

    public static abstract class Expr {
        public abstract Expr simplify();
    }

    public static final class Add extends Expr {
        private final Expr l, r;

        public Add(Expr l, Expr r) {
            this.l = l;
            this.r = r;
        }

        public Expr simplify() {
            return add(l.simplify(), r.simplify());
        }

        private Expr add(Expr l, Expr r) {
            if (Val.ZERO == l) {
                return r;
            } else if (Val.ZERO == r) {
                return l;
            } else if (l instanceof Val && r instanceof Val) {
                return add((Val) l, (Val) r);
            } else if (r instanceof Add) {
                return add(l, (Add) r);
            } else {
                return new Add(l, r);
            }
        }

        private Expr add(Val l, Val r) {
            return Val.create(l.getValue() + r.getValue());
        }

        // The next two methods were cut from the archived post; reconstructed
        // by analogy with Mul below and the "[+ ...]" output shown later.
        private Expr add(Expr l, Add r) {
            return new Add(new Add(l, r.l), r.r).simplify();
        }

        public String toString() {
            return "[+ " + l + " " + r + "]";
        }
    }

    public static final class Mul extends Expr {
        // The head of this class was cut from the archived post; reconstructed
        // to mirror Add above and the rules of the original OCaml simplifier.
        private final Expr l, r;

        public Mul(Expr l, Expr r) {
            this.l = l;
            this.r = r;
        }

        public Expr simplify() {
            return mul(l.simplify(), r.simplify());
        }

        private Expr mul(Expr l, Expr r) {
            if (Val.ZERO == l || Val.ZERO == r) {
                return Val.ZERO;
            } else if (Val.ONE == l) {
                return r;
            } else if (Val.ONE == r) {
                return l;
            } else if (l instanceof Val && r instanceof Val) {
                return mul((Val) l, (Val) r);
            } else if (r instanceof Mul) {
                return mul(l, (Mul) r);
            } else {
                return new Mul(l, r);
            }
        }

        private Expr mul(Val l, Val r) {
            return Val.create(l.getValue() * r.getValue());
        }

        private Expr mul(Expr l, Mul r) {
            return new Mul(new Mul(l, r.l), r.r).simplify();
        }

        public String toString() {
            return "[* " + l + " " + r + "]";
        }
    }

    public static class Sym extends Expr {
        private final String symbol;

        public Sym(String symbol) {
            this.symbol = symbol;
        }

        public Expr simplify() {
            return this;
        }

        public String toString() {
            return symbol;
        }
    }

    public static class Val extends Expr {

        public static final Val ZERO = new Val(0);
        public static final Val ONE = new Val(1);

        private final long value;

        private Val(long value) {
            this.value = value;
        }

        public Expr simplify() {
            return this;
        }

        public static Val create(long value) {
            if (0 == value) {
                return ZERO;
            } else if (1 == value) {
                return ONE;
            } else {
                return new Val(value);
            }
        }

        public long getValue() {
            return value;
        }

        // The end of the class was cut from the archived post; the toString
        // and closing braces below are reconstructed.
        public String toString() {
            return Long.toString(value);
        }
    }
}

Lew

unread,
Dec 14, 2007, 8:46:31 AM12/14/07
to
Jon Harrop wrote:
> Lew wrote:
>> Roger Lindsjö wrote:
>>> No threading other than what the JVM does internally for housekeeping
>>> such as GC. I even ran restricted to one core (using taskset) since I
>>> was told that Ocaml was single threaded.
>> If it's true that Ocaml is single-threaded, then that kills it for a wide
>> range of applications for which Java is extremely suitable. Performance
>> would be moot, even if the numbers weren't so close.
>
> OCaml simply uses processes for concurrency rather than threads. In terms of
> capabilities, there is no difference.

I'm not familiar with OCaml. Does it have a library of concurrency
constructs, such as blocking / non-blocking queues, monitors or other locks,
memory-model coordinating constructs cognate to Java's "volatile" and the like
to coordinate shared data?

--
Lew

bugbear

unread,
Dec 14, 2007, 8:58:35 AM12/14/07
to

May I point out the existence of comp.benchmarks ?

BugBear

Jon Harrop

unread,
Dec 14, 2007, 9:50:25 AM12/14/07
to
Lew wrote:
> I'm not familiar with OCaml. Does it have a library of concurrency
> constructs, such as blocking / non-blocking queues, monitors or other
> locks, memory-model coordinating constructs cognate to Java's "volatile"
> and the like to coordinate shared data?

No, OCaml has data parallel and message passing libraries instead. These are
higher level constructs for parallelism and concurrency that abstract away
the need to use the low-level constructs like monitors and locks directly.
This is easier to use and arguably safer (although I have no evidence to
back up the latter in general, outside of Erlang's overwhelming victory for
massively-concurrent applications).

There are complicated trade-offs involved here but, essentially, OCaml
sacrifices high-performance inter-thread communication (that you can
achieve with) for better scaling because its GC requires no garbage
collection.

There is also a new research language called JoCaml that integrates support
for concurrency directly into the OCaml language using the join calculus.

You may also be interested in related efforts such as the recent support for
hinted parallelism in Haskell and F#'s recently introduced asynchronous
workflows.

Chris Dollin

unread,
Dec 14, 2007, 10:34:09 AM12/14/07
to
Jon Harrop wrote:

> There are complicated trade-offs involved here but, essentially, OCaml
> sacrifices high-performance inter-thread communication (that you can
> achieve with) for better scaling because its GC requires no garbage
> collection.

Did something go wrong with that paragraph, possibly just over-ambitions
packing? I can't make consistent sense of it.

--
Chris "perhaps it's the Friday effect" Dollin

Hewlett-Packard Limited registered no: 690597
registered office: Cain Road, Bracknell, Berks RG12 1HN, England


Mark Thornton

unread,
Dec 14, 2007, 3:02:26 PM12/14/07
to
Jon Harrop wrote:

> Chris Dollin wrote:
>> Jon Harrop wrote:
>>> There are complicated trade-offs involved here but, essentially, OCaml
>>> sacrifices high-performance inter-thread communication (that you can
>>> achieve with) for better scaling because its GC requires no garbage
>>> collection.
>> Did something go wrong with that paragraph, possibly just over-ambitions
>> packing? I can't make consistent sense of it.
>
> That was indeed a buggy paragraph. :-)
>
> Multithreaded GC (like the JVM and .NET) provides cheap interthread
> communication. OCaml's approach (independent GCs per process) can make
> interthread communication much more expensive but it scales better.
>

With multicore (shared memory) processors now the norm on the desk top,
the advantage of the Java approach is significant for some desktop
applications. The work I'm doing at the moment has threads sharing data
structures which run to hundreds of megabytes. Combined with the Doug
Lea's Fork/Join library (jsr166y) this is giving me good speed up over
sequential computation.
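
(For readers who haven't used jsr166y: the style it supports is plain
divide-and-conquer. Below is a minimal sketch, written against the form of
the API that was later standardised in java.util.concurrent for Java 7; the
class and method names are from that later version, so the preview package
may differ slightly.)

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    // Sums a slice of an array by splitting it in half until it is small
    // enough to do sequentially; the pool runs the forked halves in parallel.
    class SumTask extends RecursiveTask<Long> {
        private final long[] data;
        private final int lo, hi;

        SumTask(long[] data, int lo, int hi) {
            this.data = data; this.lo = lo; this.hi = hi;
        }

        protected Long compute() {
            if (hi - lo < 10000) {                  // base case: sequential sum
                long sum = 0;
                for (int i = lo; i < hi; i++) sum += data[i];
                return sum;
            }
            int mid = (lo + hi) >>> 1;
            SumTask left = new SumTask(data, lo, mid);
            SumTask right = new SumTask(data, mid, hi);
            left.fork();                            // run left half asynchronously
            return right.compute() + left.join();   // right half here, then wait
        }
    }

    // Usage: long total = new ForkJoinPool().invoke(new SumTask(data, 0, data.length));
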
Separate processes not only impose higher communication overhead, but on
some architectures, the overhead of process creation is substantial.
Much as I would like to wean customers off Windows, my code has to work
efficiently on that platform.

Mark Thornton

Jon Harrop

unread,
Dec 14, 2007, 4:53:48 PM12/14/07
to
Mark Thornton wrote:
> With multicore (shared memory) processors now the norm on the desk top,
> the advantage of the Java approach is significant for some desktop
> applications.

I'm interested in seeing evidence either way. From the benchmarks in this
thread, for example, I attribute the (up to) 5x slower allocation rates in
Java to the fact that its allocator and GC must cope with interthread
issues.

> The work I'm doing at the moment has threads sharing data
> structures which run to hundreds of megabytes. Combined with the Doug
> Lea's Fork/Join library (jsr166y) this is giving me good speed up over
> sequential computation.

Sure. The work I'm doing at the moment has many processes with large shared
data structures.

> Separate processes not only impose higher communication overhead, but on
> some architectures, the overhead of process creation is substantial.
> Much as I would like to wean customers off Windows, my code has to work
> efficiently on that platform.

Yes. Process creation in OCaml takes about 10ms here whereas thread creation
in F# takes 1ms.

Roedy Green

unread,
Dec 14, 2007, 5:11:09 PM12/14/07
to
On Thu, 13 Dec 2007 23:16:57 +0100, Roger Lindsjö
<news....@tilialacus.net> wrote, quoted or indirectly quoted
someone who said :

>>> Machine 2 is a 2.4GHz Core 2 Quad also with 2GB memory.
>>> Java: 1.074 s
>>> Ocaml: 1.515 s

To my surprise Java -server skunked Jet which just edges out java
-client. I gather the run-time knowledge -server can gather is paying
off big time here.

Athlon 64 X2 3800+ dual core, 2 gig ram.

java -client Simplify 10000000 3


[* x [+ [+ [* 12 0] [+ 23 8]] y]]

Bench 0
Took 4.422 seconds
Bench 1
Took 4.728 seconds
Bench 2
Took 4.408 seconds


[* x [+ 31 y]]

using jet
Simplify.exe 10000000 3
[E:\exper]Simplify 10000000 3


[* x [+ [+ [* 12 0] [+ 23 8]] y]]

Bench 0
Took 4.178 seconds
Bench 1
Took 4.193 seconds
Bench 2
Took 4.303 seconds


[* x [+ 31 y]]

java -server Simplify 10000000 3


[* x [+ [+ [* 12 0] [+ 23 8]] y]]

Bench 0
Took 2.91 seconds
Bench 1
Took 2.84 seconds
Bench 2
Took 2.84 seconds


[* x [+ 31 y]]

--

Roedy Green

unread,
Dec 14, 2007, 5:22:35 PM12/14/07
to
On Thu, 13 Dec 2007 23:27:16 +0000, Jon Harrop <use...@jdh30.plus.com>
wrote, quoted or indirectly quoted someone who said :

>> Not at all. Java can allocate objects very quickly; it just peels
>> them off a giant contiguous block. What takes the time is when you
>> have a large number of long lived objects since they have to be
>> enumerated each GC pass. Short lived objects are recovered in zero
>> time.
>
>Then you'll be able to improve the Java implementations of these benchmarks
>so that they aren't up to 5x slower than an unoptimized alternative.

That does not follow. A capability of the JVM implies nothing about
my skill or willingness to optimise your code.

Mark Thornton

unread,
Dec 14, 2007, 6:29:43 PM12/14/07
to
Jon Harrop wrote:
> Mark Thornton wrote:
>> With multicore (shared memory) processors now the norm on the desk top,
>> the advantage of the Java approach is significant for some desktop
>> applications.
>
> I'm interested in seeing evidence either way. From the benchmarks in this
> thread, for example, I attribute the (up to) 5x slower allocation rates in
> Java to the fact that its allocator and GC must cope with interthread
> issues.

The allocator is very trivial, almost all the time is in collection. The
comment seen elsewhere that dead objects are free to collect is not
quite true. If you have very high rates of garbage creation, the nursery
area will fill frequently and cause garbage collection to occur often.
There seems to be at least some fixed cost in gc, so if the live object
count/size is very small, the time for each gc cycle does not tend to
zero. Many cases of high rates of allocating short lived objects have a
recognisable block structure. The current JVM can detect this (escape
analysis) but so far only uses this to eliminate some types of locks.
Perhaps unfortunately a lot of production Java code has been hand
'optimised' to reduce such allocations, which means that when the
compiler people test escape analysis, the returns aren't as good as
hoped (programmers have already done it by hand). This technique is
apparently used in the (not free) JET jvm, and we hope it will finally
appear in Java 7.
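
(To make the "recognisable block structure" point concrete: the pattern
escape analysis targets looks like the hypothetical example below. Neither
Point ever escapes the method, so in principle the JIT could keep the four
doubles in registers and never allocate at all.)

    // Hypothetical illustration only: both allocations are method-local and
    // short-lived, so they are candidates for scalar replacement.
    final class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    class EscapeDemo {
        static double distance(double x1, double y1, double x2, double y2) {
            Point a = new Point(x1, y1);   // never stored anywhere, never returned
            Point b = new Point(x2, y2);
            double dx = a.x - b.x, dy = a.y - b.y;
            return Math.sqrt(dx * dx + dy * dy);
        }
    }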


As for multithread issues, the default garbage collector (note Sun's jvm
comes with a number of collectors which you can select using runtime
options), is a stop the world variety. As all threads are stopped (apart
from the collector itself) interthread issues are non existent. In
threaded applications this means that neither allocation or collection
have threading overheads (no synchronization in either case). OK, it
must take some time to suspend/resume the working threads. Compared to
the costs involved with thread safe heaps in C/C++ this can be a
significant gain. Note that the JVM can use simplified code when running
on a uniprocessor, i.e. multiprocessor costs only occur if you actually
run the application on such a machine.

> Sure. The work I'm doing at the moment has many processes with large
> shared data structures.

It is awkward to include pointers in the shared structures unless you
can ensure that the structure is mapped to the same address in each
process. Easy to arrange in a unix fork environment, more difficult to
ensure on Windows, especially if the processes are not effectively clones.

Incidentally your ray trace code makes unnecessary use of inner classes
as opposed to nested classes. Sprinkling a few 'static' keywords around
chops about 10% off the time of the simple version. Another 10% can be
saved by specifying a larger heap than default (on my machine the
default initial heap is about 5MB). These savings apply to both client
and server JVM's.

Mark Thornton

Jon Harrop

unread,
Dec 14, 2007, 6:31:55 PM12/14/07
to
Roedy Green wrote:
> On Thu, 13 Dec 2007 23:27:16 +0000, Jon Harrop <use...@jdh30.plus.com>
> wrote, quoted or indirectly quoted someone who said :
>>> Not at all. Java can allocate objects very quickly; it just peels
>>> them off a giant contiguous block. What takes the time is when you
>>> have a large number of long lived objects since they have to be
>>> enumerated each GC pass. Short lived objects are recovered in zero
>>> time.
>>
>>Then you'll be able to improve the Java implementations of these
>>benchmarks so that they aren't up to 5x slower than an unoptimized
>>alternative.
>
> That does not follow. A capability of the JVM implies nothing about
> my skill or willingness to optimise your code.

Your statement is contrary to the evidence: Java does not allocate fast.

Jon Harrop

unread,
Dec 14, 2007, 6:49:08 PM12/14/07
to
Mark Thornton wrote:
> As for multithread issues, the default garbage collector (note Sun's jvm
> comes with a number of collectors which you can select using runtime
> options), is a stop the world variety. As all threads are stopped (apart
> from the collector itself) interthread issues are non existent...

Stop-the-world is exactly the kind of interthread issue that I was referring
to.

If you take this symbolic simplifier, for example, and make it run two
simplifications at a time on my dual core then its performance will not
improve by a factor of two if the GC stops all threads while it runs.

In contrast, the OCaml is not only more than twice as fast to start with but
it will scale linearly with the number of cores in this case because the
GCs do not interfere.
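
(For concreteness, "two simplifications at a time" means something like the
sketch below, reusing the Expr and simplify() names from Roger's SimplifyLong
post earlier in the thread; whether the collector's pauses actually prevent a
2x speedup is exactly what would have to be measured.)

    class ParallelBench {
        // Sketch: run the same simplification workload on two threads at once.
        static void benchTwoThreads(final SimplifyLong.Expr expr, final int lim)
                throws InterruptedException {
            Thread[] workers = new Thread[2];
            for (int t = 0; t < workers.length; t++) {
                workers[t] = new Thread(new Runnable() {
                    public void run() {
                        for (int i = 0; i < lim; i++) expr.simplify();
                    }
                });
                workers[t].start();
            }
            for (Thread w : workers) w.join();   // both done before stopping the clock
        }
    }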

> > Sure. The work I'm doing at the moment has many processes with large
> > shared data structures.
>
> It is awkward to include pointers in the shared structures unless you
> can ensure that the structure is mapped to the same address in each
> process.

There are no pointers in these languages though.

> Incidentally your ray trace code makes unnecessary use of inner classes
> as opposed to nested classes. Sprinkling a few 'static' keywords around
> chops about 10% off the time of the simple version. Another 10% can be
> saved by specifying a larger heap than default (on my machine the
> default initial heap is about 5MB). These savings apply to both client
> and server JVM's.

Can you send me your code and I'll update it.

Jon Harrop

unread,
Dec 14, 2007, 7:02:17 PM12/14/07
to
Roedy Green wrote:
> On Thu, 13 Dec 2007 23:16:57 +0100, Roger Lindsjö
> <news....@tilialacus.net> wrote, quoted or indirectly quoted
> someone who said :
>
>>>> Machine 2 is a 2.4GHz Core 2 Quad also with 2GB memory.
>>>> Java: 1.074 s
>>>> Ocaml: 1.515 s
>
> To my surprise Java -server skunked Jet which just edges out java
> -client. I gather the run-time knowledge -server can gather is paying
> off big time here.
>
> Athlon 64 X2 3800+ dual core, 2 gig ram.
>
> java -server Simplify 10000000 3
> [* x [+ [+ [* 12 0] [+ 23 8]] y]]
> Bench 0
> Took 2.91 seconds
> Bench 1
> Took 2.84 seconds
> Bench 2
> Took 2.84 seconds
> [* x [+ 31 y]]

Athlon 64 X2 4400+ dual core, 2Gb RAM

OCaml: 0.695s

Even accounting for CPU speed differences, OCaml is still running 3.5x
faster, primarily because it can allocate much more quickly than Java.

This is also neglecting the facts that Java is using 100x as much RAM and
doubles the load average (i.e. maxes out both of my cores whereas the OCaml
only uses one).

Lew

unread,
Dec 14, 2007, 8:09:43 PM12/14/07
to
Jon Harrop wrote:
> Lew wrote:
>> I'm not familiar with OCaml. Does it have a library of concurrency
>> constructs, such as blocking / non-blocking queues, monitors or other
>> locks, memory-model coordinating constructs cognate to Java's "volatile"
>> and the like to coordinate shared data?
>
> No, OCaml has data parallel and message passing libraries instead. These are
> higher level constructs for parallelism and concurrency that abstract away
> the need to use the low-level constructs like monitors and locks directly.
> This is easier to use and arguably safer (although I have no evidence to
> back up the latter in general, outside of Erlang's overwhelming victory for
> massively-concurrent applications).

Message passing as a primitive coordination and concurrency mechanism is
extremely powerful, especially in an O.S. like QNX that reduces context-switch
times to ridiculously small values. I first encountered it on a minicomputer
in 1981. It makes for very straightforward coding, too.
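
(The nearest off-the-shelf analogue in Java is a blocking queue between
threads; a minimal, self-contained sketch:)

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // One thread sends messages, another receives them; put() and take()
    // block, so the user code needs no explicit locks or volatile fields.
    class MessageDemo {
        public static void main(String[] args) throws InterruptedException {
            final BlockingQueue<String> mailbox = new ArrayBlockingQueue<String>(16);

            Thread consumer = new Thread(new Runnable() {
                public void run() {
                    try {
                        String msg;
                        while (!(msg = mailbox.take()).equals("stop")) {
                            System.out.println("got: " + msg);
                        }
                    } catch (InterruptedException e) {
                        // interrupted: just exit
                    }
                }
            });
            consumer.start();

            mailbox.put("hello");
            mailbox.put("world");
            mailbox.put("stop");
            consumer.join();
        }
    }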

--
Lew

Mark Thornton

unread,
Dec 15, 2007, 4:07:17 AM12/15/07
to
Jon Harrop wrote:
> Mark Thornton wrote:
>> As for multithread issues, the default garbage collector (note Sun's jvm
>> comes with a number of collectors which you can select using runtime
>> options), is a stop the world variety. As all threads are stopped (apart
>> from the collector itself) interthread issues are non existent...
>
> Stop-the-world is exactly the kind of interthread issue that I was referring
> to.
>
> If you take this symbolic simplifier, for example, and make it run two
> simplifications at a time on my dual core then its performance will not
> improve by a factor of two if the GC stops all threads while it runs.
>
> In contrast, the OCaml is not only more than twice as fast to start with but
> it will scale linearly with the number of cores in this case because the
> GCs do not interfere.

Note that Sun's Java 6 also comes with a parallel garbage collector. I
didn't test that because my home machine is an old single core
processor. I notice from other comments that the performance of OCaml
relative to Java seems to depend quite a bit on the specific
environment. Some people are reporting Java as faster.

> Can you send me your code and I'll update it.

Done.

Mark Thornton

Roger Lindsjö

unread,
Dec 15, 2007, 5:17:10 AM12/15/07
to
Jon Harrop wrote:
> Roedy Green wrote:
>> On Thu, 13 Dec 2007 23:27:16 +0000, Jon Harrop <use...@jdh30.plus.com>
>> wrote, quoted or indirectly quoted someone who said :
>>>> Not at all. Java can allocate objects very quickly; it just peels
>>>> them off a giant contiguous block. What takes the time is when you
>>>> have a large number of long lived objects since they have to be
>>>> enumerated each GC pass. Short lived objects are recovered in zero
>>>> time.
>>> Then you'll be able to improve the Java implementations of these
>>> benchmarks so that they aren't up to 5x slower than an unoptimized
>>> alternative.
>> That does not follow. A capability of the JVM implies nothing about
>> my skill or willingness to optimise your code.
>
> Your statement is contrary to the evidence: Java does not allocate fast.

But on my machine it seems quite fast. It's a bit hard to measure the
allocations apart from the rest of the code, but at least Java runs on
par with Ocaml.

With the 0 and 1 cache and using longs I get 3 allocations for
simplifying the expression. This gives about 46.4 clock cycles per
allocation (which of course also includes the rest of the logic). I
guess I could try to write a similar program with no allocations, but
who knows what the Hotspot would optimize that to?

I'm not sure if tuning the GC would improve the performance, the objects
allocated are quite small and should not migrate out of the nursery. For
now I don't think that's an exercise worth doing.

By the way, the raytrace you have been referring to, is it the one on
http://www.ffconsultancy.com/languages/ray_tracer/ ?

//Roger Lindsjö

Roedy Green

unread,
Dec 15, 2007, 1:53:34 PM12/15/07
to
On Fri, 14 Dec 2007 23:31:55 +0000, Jon Harrop <use...@jdh30.plus.com>
wrote, quoted or indirectly quoted someone who said :

>Your statement is contrary to the evidence: Java does not allocate fast.

In a non-GC system, when you allocate an object the allocator looks
around for a suitable sized hole for it. This takes longer than an
addition to allocate the object from a single giant contiguous pool.
The allocation is thus faster.

On free, on a non-CG system, the allocator puts the space back on the
free space chain, and perhaps consolidates it with free space before
and after. In Java, there is no explicit free, but when GC times
comes, the active objects are copied to a new bin. There is ZERO
processing on the dead objects. see
http://mindprod.com/jgloss/garbagecollection.html

The time to complete the GC is proportional to the number of LIVE
objects.

The faster you allocate objects, the more frequently you need GC
passes.

So traditional allocation would be optimal in a system where you
allocated a large number of objects to start and keep them all.

Java allocation would be optimal when you had almost no long term
objects, but many short lived objects, particularly if they are of
assorted sizes.

GC is a whole field of study in itself. Generational collectors
collect objects of different ages in different bins, and GC them
separately. Concurrent collectors can grind away in the background to
avoid pausing the app.
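
(The "giant contiguous pool" point can be shown with a toy sketch -- this is
not the JVM's actual allocator, just the shape of bump-pointer allocation:)

    // Toy illustration: allocating from a contiguous nursery is a bounds
    // check plus a pointer bump; dead objects are never touched, a minor GC
    // just copies the survivors out and resets the pointer.
    class Nursery {
        private final int capacity;
        private int top = 0;

        Nursery(int capacity) { this.capacity = capacity; }

        int allocate(int size) {
            if (top + size > capacity) {
                minorGc();       // evacuate live objects to the survivor space
                top = 0;         // the whole nursery is free again
            }
            int address = top;
            top += size;         // allocation itself is just this addition
            return address;
        }

        private void minorGc() {
            // copy reachable objects elsewhere; cost is proportional to live data
        }
    }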

Jon Harrop

unread,
Dec 15, 2007, 1:45:28 PM12/15/07
to
Lew wrote:

> Jon Harrop wrote:
>> No, OCaml has data parallel and message passing libraries instead. These
>> are higher level constructs for parallelism and concurrency that abstract
>> away the need to use the low-level constructs like monitors and locks
>> directly. This is easier to use and arguably safer (although I have no
>> evidence to back up the latter in general, outside of Erlang's
>> overwhelming victory for massively-concurrent applications).
>
> Message passing as a primitive coordination and concurrency mechanism is
> extremely powerful, especially in an O.S. like QNX that reduces
> context-switch
> times to ridiculously small values. I first encountered it on a
> minicomputer
> in 1981. It makes for very straightforward coding, too.

Exactly, yes. However, OCaml does not do such optimizations (AFAIK). To be
fair, I haven't tried it though.

Roedy Green

unread,
Dec 15, 2007, 1:55:59 PM12/15/07
to
On Fri, 14 Dec 2007 23:31:55 +0000, Jon Harrop <use...@jdh30.plus.com>
wrote, quoted or indirectly quoted someone who said :

> the evidence

What evidence? As you get more familiar with Java, you will discover
that often CPU time is getting chewed up in quite unexpected ways.
Don't leap to conclusions.

You need to examine the generated code, and play with the microtimer.

See http://mindprod.com/jgloss/time.html

Jon Harrop

unread,
Dec 15, 2007, 2:11:18 PM12/15/07
to
Roger Lindsjö wrote:
> But on my machine it seems quite fast. It's a bit hard to measure the
> allocations apart from the rest of the code, but at least Java runs on
> par with Ocaml.

In absolute terms, the fastest program is still the OCaml running on my
machine (0.7s) even though my machine is significantly slower than yours.
Also, we've seen benchmark results from several different machines and the
one where Java is slightly faster than OCaml is an outlier. In most cases,
Java is ~2-3x slower, in at least one case it was 5x slower.

Also, you haven't normalized by the load average, which is 1.0 for OCaml but
~1.4 for Java here.

Incidentally, I suspect memory bandwidth is the problem on most machines. I
think your machine where Java does comparatively well has extremely high
memory bandwidth. My reason for speculating this is that the Java program
is using 100x as much memory as the OCaml.

> With the 0 and 1 cache and using longs I get 3 allocations for
> simplifying the expression. This gives about 46.4 clock cycles per
> allocation (which of course also includes the rest of the logic). I
> gouss I could try to write a similar program with no allocations, but
> who knows what the Hotspot would optimize that to?

We should try it rather than speculate.

Also, the improvement in performance noted by Steve that was attributed to
Hotspot seems to be nothing more than GC interactions that periodically
slow the Java code down. I did 1,000 runs of the benchmark and, when the GC
kicks in, the Java code slows down by almost 2x and the load average
increases to 2.0, i.e. Java is maxing out both of my CPUs and is still
several times slower here.

> I'm not sure if tuning the GC would improve the performance, the objects
> allocated are quite small and should not migrate out of the nursery. For
> now I don't think that's an exercise worth doing.

Exactly, yes. There is no point in avoiding the allocation in OCaml code
because it is extremely fast. In contrast, you can often get huge
performance improvements in Java by manually rewriting code to amortize
allocations.

> By the way, the raytrace you have been referring to, is it the one on
> http://www.ffconsultancy.com/languages/ray_tracer/ ?

Yes. Note how the fastest Java implementation achieves a huge performance
boost by manually working around the allocation of temporaries, which is
very slow in Java.

I'd like to know if the fastest and second fastest implementations of the
Java ray tracer also show a smaller performance gap on your machine (the
one where Java is slightly faster than OCaml).

Mark Thornton

unread,
Dec 15, 2007, 2:56:32 PM12/15/07
to
Roedy Green wrote:
> On Fri, 14 Dec 2007 23:31:55 +0000, Jon Harrop <use...@jdh30.plus.com>
> wrote, quoted or indirectly quoted someone who said :
>
>> Your statement is contrary to the evidence: Java does not allocate fast.
>
> In a non-GC system, when you allocate an object the allocator looks
> around for a suitable sized hole for it. This takes longer than an
> addition to allocate the object from a single giant contiguous pool.
> The allocation is thus faster.
>
> On free, on a non-CG system, the allocator puts the space back on the
> free space chain, and perhaps consolidates it with free space before
> and after. In Java, there is no explicit free, but when GC times
> comes, the active objects are copied to a new bin. There is ZERO
> processing on the dead objects. see
> http://mindprod.com/jgloss/garbagecollection.html
>
> The time to complete the GC is proportional to the number of LIVE
> objects.
>
> The faster you allocate objects, the more frequently you need GC
> passes.

However if you have almost no live objects, the time for a GC pass
doesn't reach zero. We can reduce the number of GC passes by throwing
memory at the problem --- increase the size of the nursery area.
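
(With Sun's JVM that is just a command-line option; the sizes here are
illustrative rather than tuned, -Xmn being the young-generation size:)

    java -server -Xmn256m -Xms512m -Xmx512m Simplify 10000000 3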

Mark Thornton

Roger Lindsjö

unread,
Dec 15, 2007, 7:15:28 PM12/15/07
to
Jon Harrop wrote:
> Roger Lindsjö wrote:
>> But on my machine it seems quite fast. It's a bit hard to measure the
>> allocations apart from the rest of the code, but at least Java runs on
>> par with Ocaml.
>
> In absolute terms, the fastest program is still the OCaml running on my
> machine (0.7s) even though my machine is significantly slower than yours.
> Also, we've seen benchmark results from several different machines and the
> one where Java is slightly faster than OCaml is an outlier. In most cases,
> Java is ~2-3x slower, in at least one case it was 5x slower.
>
> Also, you haven't normalized by the load average, which is 1.0 for OCaml but
> ~1.4 for Java here.

I'm not sure what you mean here, when restricting to one core, both
programs used about 100% and I don't think any other processes(or at
least very few) were scheduled on that core.

> Incidentally, I suspect memory bandwidth is the problem on most machines. I
> think your machine where Java does comparatively well has extremely high
> memory bandwidth. My reason for speculating this is that the Java program
> is using 100x as much memory as the OCaml.

The memory bandwidth might be one reason but memory usage was not a
constraint or a measurement point earlier in the discussion. And 100
times the memory? I'm running a longer test with heapsize of 1.5MB, the
JVM as runtime system take a bit of memory. The resident size of the
Java process is 11MB while the size of the Ocaml is 1MB. I'm measuring
the memory use of the Java process, and during the first 10 minutes no
data has survived to the old heap, a few object temporarily survived a
short while in eden while almost everything seems to die within the
nurseries (this is of course expected behavior).

>> With the 0 and 1 cache and using longs I get 3 allocations for
>> simplifying the expression. This gives about 46.4 clock cycles per
>> allocation (which of course also includes the rest of the logic). I
> guess I could try to write a similar program with no allocations, but
>> who knows what the Hotspot would optimize that to?
>
> We should try it rather than speculate.

I will try to make a cache for all "Val", since we have so few different
values a simple switch should solve that, and then we might be able to
measure how much time a "Val" would take to create/destroy. The problem
is that we don't stress the nurseries with "Val" objects, so they
might behave differently.
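
(Purely as a sketch of what that cache could look like -- the extra constants
below are hypothetical and only cover the handful of values this particular
benchmark ever produces:)

    // Interning cache for Val: the expression only ever yields 0, 1, 8, 12,
    // 23 and 31, so a small switch removes the remaining allocations.
    public static Val create(long value) {
        switch ((int) value) {
            case 0:  return ZERO;
            case 1:  return ONE;
            case 8:  return EIGHT;        // hypothetical constants, declared
            case 12: return TWELVE;       // alongside ZERO and ONE above
            case 23: return TWENTY_THREE;
            case 31: return THIRTY_ONE;
            default: return new Val(value);
        }
    }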

> Also, the improvement in performance noted by Steve that was attributed to
> Hotspot seems to be nothing more than GC interactions that periodically
> slow the Java code down. I did 1,000 runs of the benchmark and, when the GC
> kicks in, the Java code slows down by almost 2x and the load average
> increases to 2.0, i.e. Java is maxing out both of my CPUs and is still
> several times slower here.

For me GC does not "kick in" at all, I have no full GCs at all (so far).
I have 800 minor GCs a second, but you would have to fine tune the
measuring to be able to see the slowdown while they run.

>> I'm not sure if tuning the GC would improve the performance, the objects
>> allocated are quite small and should not migrate out of the nursery. For
>> now I don't think that's an exercise worth doing.
>
> Exactly, yes. There is no point in avoiding the allocation in OCaml code
> because it is extremely fast. In contrast, you can often get huge
> performance improvements in Java by manually rewriting code to amortize
> allocations.

My experience is the opposite, for small, short lived objects it has
been better to create-use-forget than use smart code that reuses
objects. However, once the objects grow larger than a few KB, then they
are directly promoted to the old heap, and for those objects it might be
better to handle them differently. But this could vary between
applications. I normally work with web service application with a lot of
xml processing and generation.

>> By the way, the raytrace you have been referring to, is it the one on
>> http://www.ffconsultancy.com/languages/ray_tracer/ ?
>
> Yes. Note how the fastest Java implementation achieves a huge performance
> boost by manually working around the allocation of temporaries, which is
> very slow in Java.
>
> I'd like to know if the fastest and second fastest implementations of the
> Java ray tracer also show a smaller performance gap on your machine (the
> one where Java is slightly faster than OCaml).

This is my output:
Java 1
real 0m12.624s
user 0m12.264s
sys 0m0.812s

Ocaml 1
real 1m13.171s
user 1m13.112s
sys 0m0.045s

Java 2
real 0m8.911s
user 0m9.186s
sys 0m0.177s

Ocaml 2
real 0m51.302s
user 0m51.261s
sys 0m0.032s

Java 3
real 0m8.012s
user 0m8.304s
sys 0m0.163s

Ocaml 3
real 0m8.200s
user 0m8.181s
sys 0m0.020s

Java 4
real 0m7.962s
user 0m8.255s
sys 0m0.150s

Ocaml 4
real 0m7.402s
user 0m7.378s
sys 0m0.019s

Java 5
real 0m5.557s
user 0m5.837s
sys 0m0.113s

Ocaml 5
real 0m7.350s
user 0m7.342s
sys 0m0.008s

As far as I can see, the slowest Java code takes about 2.5 times longer
to complete than the fastest code.

Oh and why isn't the Ocaml programs saving the image to a file but
instead writes to standard out?

//Roger Lindsjö

Roger Lindsjö

unread,
Dec 15, 2007, 7:23:55 PM12/15/07
to
Sorry, couldn't help myself; here is also the timing from the C++ programs.

C++ 1
real 0m6.557s
user 0m6.542s
sys 0m0.013s

C++ 2
real 0m5.040s
user 0m5.028s
sys 0m0.010s

C++ 3
real 0m4.299s
user 0m4.287s
sys 0m0.007s

C++ 4
real 0m4.074s
user 0m4.068s
sys 0m0.006s

C++ 5
real 0m4.151s
user 0m4.139s
sys 0m0.011s

//Roger Lindsjö

Jon Harrop

unread,
Dec 15, 2007, 7:37:13 PM12/15/07
to
Roger Lindsjö wrote:

> Jon Harrop wrote:
>> Also, you haven't normalized by the load average, which is 1.0 for OCaml
>> but ~1.4 for Java here.
>
> I'm not sure what you mean here, when restricting to one core, both
> programs used about 100% and I don't think any other processes(or at
> least very few) were scheduled on that core.

I'm not sure what you mean by "restricting to one core". The Java is using
both of my cores: one for the main program and the other for the GC.

>> Incidentally, I suspect memory bandwidth is the problem on most machines.
>> I think your machine where Java does comparatively well has extremely
>> high memory bandwidth. My reason for speculating this is that the Java
>> program is using 100x as much memory as the OCaml.
>
> The memory bandwidth might be one reason but memory usage was not a
> constraint or a measurement point earlier in the discussion. And 100
> times the memory? I'm running a longer test with heapsize of 1.5MB, the
> JVM as runtime system take a bit of memory. The resident size of the
> Java process is 11MB while the size of the Ocaml is 1MB.

That's odd. Resident size here is 180Mb for the Java and 1.7Mb for the
OCaml.

>> Exactly, yes. There is no point in avoiding the allocation in OCaml code
>> because it is extremely fast. In contrast, you can often get huge
>> performance improvements in Java my manually rewriting code to amortize
>> allocations.
>
> My experience is the opposite, for small, short lived objects it has
> been better to create-use-forget than use smart code that reuses
> objects. However, once the objects grow larger than a few KB, then they
> are directly promoted to the old heap, and for those objects it might be
> better to handle them differently. But this could vary between
> applications. I normally work with web service application with a lot of
> xml processing and generation.

Interesting. Assuming your XML processing functions don't destroy their
input I'd have expected allocation to be something you'd optimize away as
well.

Interesting. I get:

Java:
18.3124079704
13.3784809113
12.2087609768
12.3188209534
6.49561595917

OCaml:
20.0419669151
11.9526901245
5.76670503616
5.23136401176
3.96777296066

So you're again seeing the exact opposite of what I'd predicted: Java gets
significantly faster but OCaml doesn't on your machine for the last two
implementations.

From the initial OCaml timings I assume you're using 32-bit. I'd be
interested to see 64-bit results as well: they tend to be much faster for
this kind of work.

> Oh and why isn't the Ocaml programs saving the image to a file but
> instead writes to standard out?

They should both be writing to stdout because the original C++ did. Looks
like a bug in the Java implementations...

Jon Harrop

unread,
Dec 15, 2007, 7:42:26 PM12/15/07
to
Roger Lindsjö wrote:
> Sorry, couldn't help myself; here is also the timing from the C++
> programs.

What version of what compiler?

> C++ 1
> real 0m6.557s
> user 0m6.542s
> sys 0m0.013s
>
> C++ 2
> real 0m5.040s
> user 0m5.028s
> sys 0m0.010s
>
> C++ 3
> real 0m4.299s
> user 0m4.287s
> sys 0m0.007s
>
> C++ 4
> real 0m4.074s
> user 0m4.068s
> sys 0m0.006s
>
> C++ 5
> real 0m4.151s
> user 0m4.139s
> sys 0m0.011s

Very interesting. So the only language that runs fast on both of our
computers is C++.

For C++, I get:

7.0633699894
5.21576809883
4.62505602837
4.38395786285
4.0194811821

I also find it amusing that my AMD hardware is faster than your Intel
hardware. ;-)

Jon Harrop

unread,
Dec 15, 2007, 7:43:09 PM12/15/07
to
Roedy Green wrote:
> On Fri, 14 Dec 2007 23:31:55 +0000, Jon Harrop <use...@jdh30.plus.com>
> wrote, quoted or indirectly quoted someone who said :
>> the evidence
>
> What evidence?

These benchmark results.

Jon Harrop

unread,
Dec 15, 2007, 7:46:11 PM12/15/07
to
Roedy Green wrote:
> In a non-GC system, when you allocate an object the allocator looks
> around for a suitable sized hole for it. This takes longer than an
> addition to allocate the object from a single giant contiguous pool.
> The allocation is thus faster.
>
> On free, on a non-CG system, the allocator puts the space back on the
> free space chain, and perhaps consolidates it with free space before
> and after. In Java, there is no explicit free, but when GC times
> comes, the active objects are copied to a new bin. There is ZERO
> processing on the dead objects. see
> http://mindprod.com/jgloss/garbagecollection.html
>
> The time to complete the GC is proportional to the number of LIVE
> objects.
>
> The faster you allocate objects, the more frequently you need GC
> passes.
>
> So traditional allocation would be optimal in a system where you
> allocated a large number of objects to start and keep them all.
>
> Java allocation would be optimal when you had almost no long term
> objects, but many short lived objects, particularly if they are of
> assorted sizes.
>
> GC is a whole field of study in itself. Generational collectors
> collect objects of different ages in different bins, and GC them
> separately. Concurrent collectors can grind away in the background to
> avoid pausing the app.

Sure. None of that compares Java to other garbage collected languages though
(of which there are many) and, consequently, does not support your claim
that Java is comparatively fast at allocating.

Roger Lindsjö

unread,
Dec 16, 2007, 4:50:46 AM12/16/07
to
Jon Harrop wrote:
> Roger Lindsjö wrote:
>> Sorry, couldn't help myself; here is also the timing from the C++
>> programs.
>
> What version of what compiler?

Yes, I should of course have posted that.
g++ (GCC) 4.1.2 20070925 (Red Hat 4.1.2-27)

Well, at least in one of the tests, and comparing our fastest runs, yours
is a whopping 1.3% faster. Averaging out the different runs, mine is
almost 5% faster ;-).

Of course this is a silly comparison, I think they run very comparable
in speed for the C++ code. Being so similar for this code it is quite
interesting that we get so different results for Java vs Ocaml, but
memory bandwidth could be a reason.

//Roger Lindsjö

Roger Lindsjö

unread,
Dec 16, 2007, 5:07:34 AM12/16/07
to
Jon Harrop wrote:
> Roger Lindsjö wrote:
>> Jon Harrop wrote:
>>> Also, you haven't normalized by the load average, which is 1.0 for OCaml
>>> but ~1.4 for Java here.
>> I'm not sure what you mean here, when restricting to one core, both
>> programs used about 100% and I don't think any other processes(or at
>> least very few) were scheduled on that core.
>
> I'm not sure what you mean by "restricting to one core". The Java is using
> both of my cores: one for the main program and the other for the GC.

That's why I limited it to only use one core, using taskset.
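
(Concretely, something like the following -- core number and class/binary
names illustrative:)

    taskset -c 0 java -server SimplifyLong 10000000 3
    taskset -c 0 ./simplify 10000000 3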

>>> Incidentally, I suspect memory bandwidth is the problem on most machines.
>>> I think your machine where Java does comparatively well has extremely
>>> high memory bandwidth. My reason for speculating this is that the Java
>>> program is using 100x as much memory as the OCaml.
>> The memory bandwidth might be one reason but memory usage was not a
>> constraint or a measurement point earlier in the discussion. And 100
>> times the memory? I'm running a longer test with heapsize of 1.5MB, the
>> JVM as runtime system take a bit of memory. The resident size of the
>> Java process is 11MB while the size of the Ocaml is 1MB.
>
> That's odd. Resident size here is 180Mb for the Java and 1.7Mb for the
> OCaml.

I also limited the heap size since Java will grab a heap proportional to
system memory. I ran with a heap of 1.5MB.

>>> Exactly, yes. There is no point in avoiding the allocation in OCaml code
>>> because it is extremely fast. In contrast, you can often get huge
>>> performance improvements in Java by manually rewriting code to amortize
>>> allocations.
>> My experience is the opposite, for small, short lived objects it has
>> been better to create-use-forget than use smart code that reuses
>> objects. However, once the objects grow larger than a few KB, then they
>> are directly promoted to the old heap, and for those objects it might be
>> better to handle them differently. But this could vary between
>> applications. I normally work with web service application with a lot of
>> xml processing and generation.
>
> Interesting. Assuming your XML processing functions don't destroy their
> input I'd have expected allocation to be something you'd optimize away as
> well.

The application gets a lot of small inputs from different sources, they
are parsed to XML, XSL transformations are done and then some other
stuff such as user sessions replicated among all servers etc. We used to
have implementations minimizing allocations, and for Java 1.3 this was a
performance boost. However, after migrating to Java 1.5, then that was
no longer the case, so then we simplified the code.

I'm not about to install a 64bit version of GNU/Linux, perhaps someone
else with similar hardware already has? But why is 64bit faster for this
work?

For the first Ocaml example we get huge differences in speed, my machine
is taking 3.5 times longer than yours while for the rest our machines
are a lot closer.

>> Oh and why isn't the Ocaml programs saving the image to a file but
>> instead writes to standard out?
>
> They should both be writing to stdout because the original C++ did. Looks
> like a bug in the Java implementations...

Ok, I thought it was because you'd need 3rd party modules ;-). That was
before I tried the C++ version and I noticed the same behavior. A
graphics file printed to the console is always a nice treat :-)

I could change the Java implementations, but I doubt that would make a
speed difference. Would you like new versions that behave as the others?

//Roger Lindsjö

Mark Thornton

unread,
Dec 16, 2007, 6:06:14 AM12/16/07
to
Roger Lindsjö wrote:
> The application gets a lot of small inputs from different sources, they
> are parsed to XML, XSL transformations are done and then some other
> stuff such as user sessions replicated among all servers etc. We used to
> have implementations minimizing allocations, and for Java 1.3 this was a
> performance boost. However, after migrating to Java 1.5, then that was
> no longer the case, so then we simplified the code.
>
The amount of 'effort' that can usefully be expended on reducing
allocations has gone down. However I think cases equivalent to basic
complex arithmetic are still noticeably faster if the objects are
eliminated.

Mark Thornton

Jon Harrop

unread,
Dec 16, 2007, 11:04:26 AM12/16/07
to
Roger Lindsjö wrote:
> Jon Harrop wrote:
>> Roger Lindsjö wrote:
>>> Sorry couldn't help my self, here is also the timing form the c++
>>> programs.
>>
>> What version of what compiler?
>
> Yes, I should of course have posted that.
> g++ (GCC) 4.1.2 20070925 (Red Hat 4.1.2-27)

My results are for g++ 4.2.3 and I noticed a significant slow-down moving to
that from 4.1.2. C++ used to be faster than OCaml... :-)

> Of course this is a silly comparison, I think they run very comparable
> in speed for the C++ code. Being so similar for this code it is quite
> interesting that we get so different results for Java vs Ocaml, but
> memory bandwidth could be a reason.

Possibly. The memory bandwidth on this machine is comparatively low but I
don't notice it with OCaml. I wonder if this makes OCaml's performance more
future proof as the memory gap widens...

Jon Harrop

unread,
Dec 16, 2007, 11:19:20 AM12/16/07
to
Roger Lindsjö wrote:
> Jon Harrop wrote:
>> I'm not sure what you mean by "restricting to one core". The Java is
>> using both of my cores: one for the main program and the other for the
>> GC.
>
> That's why I limited it to only use one core, using taskset.

Aha. :-)

>>>> Incidentally, I suspect memory bandwidth is the problem on most
>>>> machines. I think your machine where Java does comparatively well has
>>>> extremely high memory bandwidth. My reason for speculating this is that
>>>> the Java program is using 100x as much memory as the OCaml.
>>> The memory bandwidth might be one reason but memory usage was not a
>>> constraint or a measurement point earlier in the discussion. And 100
>>> times the memory? I'm running a longer test with heapsize of 1.5MB, the
>>> JVM as runtime system take a bit of memory. The resident size of the
>>> Java process is 11MB while the size of the Ocaml is 1MB.
>>
>> That's odd. Resident size here is 180Mb for the Java and 1.7Mb for the
>> OCaml.
>
> I also limited the heap size since Java will grab a heap proportional to
> system memory. I ran with a heap of 1.5MB.

Ah, ok. I didn't realise you were tinkering with the settings. I was just
running with default settings. Interestingly, I can improve the performance
of the OCaml by reducing its minor heap size. It then takes 0.67s.

>> Interesting. Assuming your XML processing functions don't destroy their
>> input I'd have expected allocation to be something you'd optimize away as
>> well.
>
> The application gets a lot of small inputs from different sources, they
> are parsed to XML, XSL transformations are done and then some other
> stuff such as user sessions replicated among all servers etc. We used to
> have implementations minimizing allocations, and for Java 1.3 this was a
> performance boost. However, after migrating to Java 1.5, then that was
> no longer the case, so then we simplified the code.

Very interesting. In that case the JVM might be a much better candidate for
implementing functional programming languages.

>> From the initial OCaml timings I assume you're using 32-bit. I'd be
>> interested to see 64-bit results as well: they tend to be much faster for
>> this kind of work.
>
> I'm not about to install a 64bit version of GNU/Linux, perhaps someone
> else with similar hardware already has? But why is 64bit faster for this
> work?

Generating efficient assembler for floating point intensive code is much
easier with the x86-64 instruction set than the old x86 instruction set.
Consequently, many numerical programs in different languages get ~2x faster
moving to 64-bit.

> For the first Ocaml example we get huge differences in speed, my machine
> is taking 3.5 times longer than yours while for the rest our machines
> are a lot closer.

Yes. The creators of OCaml put a lot more effort into optimizing for the
more modern architecture.

>> They should both be writing to stdout because the original C++ did. Looks
>> like a bug in the Java implementations...
>
> Ok, I thought it was because you'd need 3rd party modules ;-). That was
> before I tried the C++ version and I noticed the same behavior. A
> graphics file printed to the console is always a nice treat :-)

The programs should probably all check that they're not printing to a
console. :-)

> I could change the Java implementations, but I doubt that would make a
> speed difference. Would you like new versions that behave as the others?

Yes please!

Lew

unread,
Dec 16, 2007, 2:03:47 PM12/16/07
to
Mark Thornton wrote:
> The amount of 'effort' that can usefully be expended on reducing
> allocations has gone down. However I think cases equivalent to basic
> complex arithmetic are still noticeably faster if the objects are
> eliminated.

In the special case of relatively small primitive structures such as

public class Complex
{
    private final double re, im;
    public Complex( double r, double i )
    { re = r; im = i; }
    public final double re() { return re; }
    public final double im() { return im; }
}

you will likely find that Hotspot will eliminate the objects for you.

The scenario in complex arithmetic is lots of intermediate values - these have
very short lifespans and live completely within the stack context. Object
creation and accessor calls almost certainly will inline and disappear. The
values could live in registers, thus avoiding even memory overhead.

I chose an example of an immutable object, but simple mutable structures also
optimize.
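
(For instance -- adding a hypothetical multiply helper that is not part of
the class above -- the temporaries in a loop like this are exactly the
objects scalar replacement would make free:)

    // Every intermediate Complex is dead as soon as the next statement runs.
    static Complex times(Complex a, Complex b) {
        return new Complex(a.re() * b.re() - a.im() * b.im(),
                           a.re() * b.im() + a.im() * b.re());
    }

    static Complex power(Complex z, int n) {
        Complex acc = new Complex(1, 0);
        for (int i = 0; i < n; i++) {
            acc = times(acc, z);   // allocates one short-lived temporary per step
        }
        return acc;
    }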

--
Lew

Mark Thornton

unread,
Dec 16, 2007, 2:17:47 PM12/16/07
to
Lew wrote:
> Mark Thornton wrote:
>> The amount of 'effort' that can usefully be expended on reducing
>> allocations has gone down. However I think cases equivalent to basic
>> complex arithmetic are still noticeably faster if the objects are
>> eliminated.
>
> In the special case of relatively small primitive structures such as
>
> public class Complex
> {
> private final double re, im;
> public Complex( double r, double i )
> { re = r; im = i; }
> public final double re() { return re; }
> public final double im() { return im; }
> }
>
> you will likely find that Hotspot will eliminate the objects for you.

As far as I know that optimisation hasn't been implemented in HotSpot
yet. It is planned.

>
> The scenario in complex arithmetic is lots of intermediate values -
> these have very short lifespans and live completely within the stack
> context. Object creation and accessor calls almost certainly will
> inline and disappear.

The accessor calls are inlined, but the object creation remains.

> I chose an example of an immutable object, but simple mutable structures
> also optimize.

Immutable objects where == is a value comparison can be placed directly
in arrays instead of arrays of references. This is important in many
applications of complex values.

Mark Thornton


Lew

unread,
Dec 16, 2007, 3:10:45 PM12/16/07
to
Lew wrote:
>> The scenario in complex arithmetic is lots of intermediate values -
>> these have very short lifespans and live completely within the stack
>> context. Object creation and accessor calls almost certainly will
>> inline and disappear.

Mark Thornton wrote:
> The accessor calls are inlined, but the object creation remains.

According to Sun, they've expanded this capability already in Java 6, at least
if I understand
<http://java.sun.com/performance/reference/whitepapers/6_performance.html#2.1.7>
correctly.

Sun references
<http://www.ssw.uni-linz.ac.at/Research/Papers/Ko05/>
and
<http://www.ssw.uni-linz.ac.at/Research/Papers/Wimmer04Master/>
in the context of the latest HotSpot VM.

Also,
<http://developers.sun.com/learning/javaoneonline/j1sessn.jsp?sessn=TS-3412&yr=2006&track=coreplatform>

particularly pp. 24-42.

--
Lew

Mark Thornton

unread,
Dec 16, 2007, 3:20:04 PM12/16/07
to
Lew wrote:
> Lew wrote:
>>> The scenario in complex arithmetic is lots of intermediate values -
>>> these have very short lifespans and live completely within the stack
>>> context. Object creation and accessor calls almost certainly will
>>> inline and disappear.
>
> Mark Thornton wrote:
>> The accessor calls are inlined, but the object creation remains.
>
> According to Sun, they've expanded this capability already in Java 6, at
> least if I understand
> <http://java.sun.com/performance/reference/whitepapers/6_performance.html#2.1.7>
>
> correctly.

My understanding is that the use of escape analysis for synchronization
optimization made the cut into Java 6, but the use for eliminating
object allocation did not. Much of the development had been done, but it
wasn't considered ready in time.

Mark Thornton

Lew

unread,
Dec 16, 2007, 5:58:53 PM12/16/07
to
Mark Thornton wrote:
> My understanding is that the use of escape analysis for synchronization
> optimization made the cut into Java 6, but the use for eliminating
> object allocation did not. Much of the development had been done, but it
> wasn't considered ready in time.

Undoubtedly your information is better than mine.

I can hardly wait.

--
Lew
