Great! Thanks.
--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]
Brian K. Reid: "In computer science, we stand on each other's feet."
Thanks. I had flagged the original thread as one to go back and read
once the dust had settled, but the web page looks a lot more user-friendly.
There are some details missing from the webpages
1) which C implementation?
2) which Java implementation?
3) what hardware?
For example, using the code from the webpage
1) gcc version 3.3.6 (Gentoo 3.3.6, ssp-3.3.6-1.0, pie-8.7.8)
gcc -pipe -Wall -O3 -fomit-frame-pointer -funroll-loops -march=pentium4
latin.c -o latinc
time ./latinc > /dev/null 2>&1
user 0m0.820s
sys 0m0.000s
2) /sun-jdk-1.5.0.07/bin/javac Latin.java
time java Latin > /dev/null 2>&1
user 0m3.800s
sys 0m0.644s
3) 2GHz Intel P4
> 2) which Java implementation?
>
[peterhickman]$ javac -version
javac 1.5.0_06
Additionally
[peterhickman]$ perl -V
Summary of my perl5 (revision 5 version 8 subversion 6) configuration:
[peterhickman]$ ruby -v
ruby 1.8.4 (2005-12-24) [powerpc-darwin]
> 3) what hardware?
>
>
Macintosh G4 with 1Gb ram
I recall someone stating "benchmarking without analysis is bogus".
As a first step, comment out the print statements
time ./latinc
user 0m0.492s
sys 0m0.004
time java Latin
user 0m0.992s
sys 0m0.052s
With the print statements ~5.4x
without print statements ~2.1x
iirc the Java program is shuffling around double byte unicode chars and
the C program is handling single byte chars.
Hmm, it would look much better for me if you had included Kristof
Bastiaensen's Curry version
into the party... Heck, if that code is not smart then nothing is, and
in a way, it's a much more
interesting question how a compiled functional language compares to
compiled imepative ones
than the thousand time discussed interpreted vs. compiled match-up.
Yes, Curry is anything but mainstream, but you can't say a decent
compiler is not at your disposal.
The Munster CC (which was used by Kristof, and AFAIK that's the most
commonly used implementation) does even have an IDE for OS X.
Regards,
Csaba
IO can be noisy. I say avoid it for any benchmarking since it can
greatly influence timings. Usually the IO is not what you want to
measure so why add this variable into things?
> - If you're building up a large chunk of strings, write to an internal
> buffer and then write out at the end; don't write for every little tiny
> operation. At the very least, use a buffer per-line, rather than a separate
> write for every element on that line.
I just informally thought I would measure a few things involving IO.
I only changed the printing and nothing else:
Unaltered test: ~3.8s
Use of StringBuffer to print out a single row: ~2.1s
Use of StringBuffer for entire run: ~1.5s
Preallocated StringBuffer for entire run: ~1.4s
As you can see IO can have a large affect on clock time. I demonstrated
that in Java's case the IO in the benchmark accounted for over 2/3 of the
wall clock time (which is interesting because a decent chunk that is
left over is JVM startup overhead).
Some stack allocated space will likely improve the C run as well (and in
this case you can output it in a single write system call).
-Tom
--
+ http://www.tc.umn.edu/~enebo +---- mailto:en...@acm.org ----+
| Thomas E Enebo, Protagonist | "Luck favors the prepared |
| | mind." -Louis Pasteur |
While I certainly appreciate the efforts that are going into this, I
can't help feeling it's all completely irrelevant.
We can engage in cross-implementation pissing contests until the cows
come home. None of them help make Ruby any faster.
My question to the community: is there a comprehensive benchmark suite
*for Ruby alone* that we can use to tweak compilation settings, try out
different core algorithms, and improve what is currently an improvable
situation?
If not, would a port of pybench.py be a suitable start?
--
Alex
As you're having so much fun, let me suggest you try converting the
OutputStrings to byte-arrays, and pre-allocating a byte buffer for
output like the approach taken with this program
http://shootout.alioth.debian.org/gp4/benchmark.php?test=fasta&lang=java&id=2
.. Which is extremely funny, since Common Lisp have had wicked fast
virtual machines for the last 15 years (on par with C in performance).
They should catch up with the 20th century first of all. =)
--
Ola Bini (http://ola-bini.blogspot.com)
JvYAML, RbYAML, JRuby and Jatha contributor
System Developer, Karolinska Institutet (http://www.ki.se)
OLogix Consulting (http://www.ologix.com)
"Yields falsehood when quined" yields falsehood when quined.
. . but VMs actually are slow (to start, all else being equal).
There's a trade-off, though, and VMs tend to be faster later on in
execution for extended operations (again, all else being equal). There
are other alternatives than VMs to consider, though, and the specifics
of what one wishes to accomplish should be examined before settling on
the VM (or any other implementation style) as "the answer".
I'm kinda just babbling at this point.
--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]
"The first rule of magic is simple. Don't waste your time waving your
hands and hopping when a rock or a club will do." - McCloctnick the Lucid
That was simple because I couldn't define the array when I declared it
as I did in C.
> - Your variable naming is entirely contrary to every Java coding
> convention
> published (not a benchmark thing, but it sets off any Java devs warning
> flags)
And this affects the performance?
> - Almost all of the time spent running is spent building and printing
> strings
>
> Benchmarking just the algorithm run itself with no gross string
> creation and
> printing, I'm getting in the neighborhood of 370ms per invocation once
> HotSpot has optimized the code. I'll have more detailed numbers shortly.
>
The point Charles made with saying "but it sets off any Java devs
warning flags" is that your Java coding conventions differ so much from
regular conventions that your Java coding capacity is put into doubt. In
plain speak; are you a good enough Java programmer to write an honest
benchmark version for Java?
Does the naming convention affect performance? NO!
So having brought up such a lame point we are now trying to deflect
attention from the code by rubbishing the author. Focus on the code, not
me or whatever fantasy you have about me.
> Charles O Nutter wrote:
>> On 8/1/06, Alex Young <al...@blackkettle.org> wrote:
>>>
>>> While I certainly appreciate the efforts that are going into this, I
>>> can't help feeling it's all completely irrelevant.
>> My only purpose in battling these benchmarks is to help dispel the
>> rumors
>> that "Java is slow," "VMs are slow," and so on. If Ruby does move to a real
>> optimizing VM, it will be a good thing...all those folks who continue to
>> think that VMs are inherently bad need to join the 21st century.
>>
>
> ... Which is extremely funny, since Common Lisp have had wicked fast
> virtual machines for the last 15 years (on par with C in performance).
>
> They should catch up with the 20th century first of all. =)
Sorry, but which CL has a "wicked fast" *virtual machine* and is on
par with C? AFAICS, they either are one or the other...
--
Christian Neukirchen <chneuk...@gmail.com> http://chneukirchen.org
"scheme48" has a virtual machine, IIRC, and it's regarded highly by the
folks who did the "vmgen" and "gforth" virtual machines, which are bred
for speed using some non-ANSI features of GCC. But that's not Common
Lisp; Scheme is in some ways easier to make "wicked fast".
> The source code is available from
> http://peterhi.dyndns.org/write_it_in_c/index.html
Here is another version in Ruby. It uses a more clever algorithm.
It makes use of the fact that permuting the rows or the columns of
a latin square still gives a latin square. It also passes a constraint
on the columns that the given number can be in, to reduce the backtracking
even more.
It runs in 15 seconds on my machine, which I find quite acceptable.
I am sure that this algorithm coded in a compiled functional or logic
language would be about as fast, or faster than your C code, but with the
advantages of a higher level language.
I still think you benchmark is flawed, because
1) the C program works only for squares of size 5
2) It shows totally no relevance to the kind of problems that you
would use Ruby for. If you want raw speed for numerical applications,
then I would suggest to use a functional language (i.e. ocaml).
3) If your application runs slowly, it is probably better to think first
about why it is slow, if you can improve the code with faster
algorithms or removing bottlenecks.
Kristof Bastiaensen
---------------- latin2.rb --------------------
require 'permutation'
$size = (ARGV.shift || 5).to_i
MaxVal = $size-1
RowPermutations = Permutation.new($size).map{|p| p.value}
BaseStr = (2..$size).to_a.join
StrPermutations = Permutation.new($size-1).map{|p| p.project(BaseStr)}
StartColumns = (1..MaxVal).to_a
def init_columns(el)
a = StartColumns.dup
a.delete_at(el-1)
return a
end
def insert(sqr, num, row, columns)
insert(sqr, num, row+1, columns) if (row == num)
columns.each do |col|
next if sqr[row][col] != ?.
sqr[row][col] = num + ?1
if (row == MaxVal)
insert(sqr, num+1, 1, init_columns(num+1))
elsif (num == MaxVal && row == MaxVal - 1)
print_solution(sqr)
else
insert(sqr, num, row+1, columns - [col])
end
sqr[row][col] = ?.
end
end
def print_solution(sqr)
RowPermutations.each do |rp|
StrPermutations.each do |sp|
0.upto(MaxVal) do |r|
print sqr[rp[r]].tr(BaseStr, sp)
print ":"
end
puts
end
end
end
$square = [("1" + BaseStr)] +
Array.new($size-1) { |i| (i+2).to_s + "." * ($size - 1) }
insert($square, 0, 1, StartColumns)
[Latin]$ time ruby latin2.rb > r5
real 0m16.283s
user 0m15.380s
sys 0m0.498s
Which means that your computer is about as crap as mine :) I will add it
to the pages and make it a download. I think I need a table summarising
the timings.
Here's an OCaml version that runs in about 1.5 seconds when
output is redirected to a file on my faster computer (3.2 GHz).
It uses the same algorithm as the C program.
The C version takes 1.961 seconds when writing to /dev/null on
the o.p.'s computer. Since this is my first OCaml program, I'm
certain an expert could improve it.
(* compile with:
ocamlopt -unsafe -inline 100 latin-squares.ml -o latin-squares.exe
*)
(* permutation code by Eric C. Cooper *)
let rec distribute elt = function
| (hd :: tl) as list -> (elt :: list) ::
(List.map (fun x -> hd :: x) (distribute elt tl))
| [] -> [ [elt] ]
let rec permute = function
| x :: rest -> List.flatten (List.map (distribute x) (permute rest))
| [] -> [ [] ] ;;
let list = [ 1; 2; 3; 4; 5 ]
let size = List.length list
let perms = permute list
let n = List.length perms
let incompat = Array.make_matrix n n true ;;
for x = 0 to n - 1 do
for y = 0 to n - 1 do
incompat.(x).(y) <-
List.exists2 ( = ) (List.nth perms x) (List.nth perms y)
done
done;;
let join list = String.concat "" (List.map string_of_int list)
let output_strings = List.map join perms ;;
let board = ( Array.make size 0 ) ;;
let rec add_a_row row =
if row = size then
print_endline
( String.concat ":"
(List.map (fun i -> List.nth output_strings i)
(Array.to_list board)) )
else
for latest = 0 to n - 1 do
let prev_row = ref 0 in
while (!prev_row < row) &&
(not incompat.(latest).(board.(!prev_row))) do
incr prev_row
done;
if !prev_row = row then
( board.(row) <- latest ; add_a_row (row + 1) )
done
;;
add_a_row 0 ;;
1) I made some changes to the jen.pl script you used to generate the
Java program, it now splits apart the array initialization into many
different methods - and that will let you generate a Java program for
6x6. I emailed it to you.
2) By "benchmarking without analysis is bogus" I understand that
without analysis the programs might not be doing what we think they are
doing - what we think is an apples to apples comparison might not be an
apples to apples comparison.
In this case, I understand Charles O Nutter to be saying that in C all
those boolean arrays are initialized at compile time, but in Java all
those boolean arrays are initialized at runtime - so although we're
writing a similar thing in the code, we aren't doing the same thing
when the code is run.
In the same way, we have similar print statements in the C and Java
code, but they don't do the same thing when the code is run - one deals
with bytes the other deals with Unicode double bytes.
So it depends which part of the computation we're interested in - if
we're interested in the addARow recursive calls and loops, that part of
the computation might be swamped by the time taken to to initialize the
arrays at runtime and naively print strings.
3) There seems to be a bigger problem with using this latin squares
algorithm for benchmarking - the computation explodes!
0.496s gcc 5x5 (without print statements)
30055.098s gcc 6x6 (without print statements)
An expert did suggest replacing a couple of lists with arrays.
This should be a bit faster.
(* compile with:
ocamlopt -unsafe -inline 100 latin-squares.ml -o latin-squares.exe *)
(* permutation code by Eric C. Cooper *)
let rec distribute elt = function
| (hd :: tl) as list -> (elt :: list) ::
(List.map (fun x -> hd :: x) (distribute elt tl))
| [] -> [ [elt] ]
let rec permute = function
| x :: rest -> List.flatten (List.map (distribute x) (permute rest))
| [] -> [ [] ] ;;
let list = [ 1; 2; 3; 4; 5 ]
let size = List.length list
let perms = Array.of_list (permute list)
let n = Array.length perms
let incompat = Array.make_matrix n n true ;;
for x = 0 to n - 1 do
for y = 0 to n - 1 do
incompat.(x).(y) <-
List.exists2 ( = ) perms.(x) perms.(y)
done
done;;
let join list = String.concat "" (List.map string_of_int list)
let output_strings = Array.map join perms ;;
let board = ( Array.make size 0 ) ;;
(* recursive function *)
let rec add_a_row row =
if row = size then
print_endline
( String.concat ":"
(List.map (fun i -> output_strings.(i) )
Actually SBCL is a fork from CMUCL (with a focus on maintanability) and
as such uses the same compiler.
Compiler improvements are often ported between the two implementations
and when it comes to actual performance there is not much between
the two.
On my machine (Athlon X2 4400+, Debian Linux, gcc 4.0.4, OCaml 3.09.2, Sun
JDK 1.5):
Compile Run
C 0.137s 0.386s gcc -O3 -Wall
OCaml 0.171s 0.668s ocamlopt
OCaml 0.161s 0.626s ocamlopt -inline 100 -unsafe
OCaml2 0.165s 0.415s ocamlopt
OCaml2 0.165s 0.401s ocamlopt -unsafe
Java 1.565s 1.688s javac
Some niggles:
1. A lot of precalculation has been done in the C and Java, to the extent
that we're breaking Java compilers (always a bad sign).
2. Java is nothing like as slow as the OP claimed.
3. OCaml is only 4% slower than C and only 10% slower with bounds checking.
4. Whether Ruby is fast enough or not, you should be programming in OCaml
and not in Ruby or C. ;-)
Here's my OCaml2 (based upon William's):
let rec fact n = if n=0 then 1 else n*fact(n-1)
let size = 5
let n = fact size
(* Permutation generator *)
let p = Array.make size 0
let xx = Array.init size (fun i -> if i<size-1 then 0 else -1)
let gen_perm() =
let i = ref (size - 1) in
while !i > 0 && xx.(!i) = !i do
xx.(!i) <- 0;
decr i
done;
if !i = 0 then 1 else begin
xx.(!i) <- xx.(!i) + 1;
p.(0) <- 1;
for i=0 to size - 1 do
p.(i) <- p.(i - xx.(i));
p.(i - xx.(i)) <- i+1
done;
0
end
(* Permutations *)
let perms = Array.init n (fun _ -> ignore(gen_perm()); Array.copy p)
let incompat =
let aux x y =
Array.iteri (fun i px -> if px = perms.(y).(i) then raise Exit) perms
(x);
false in
Array.init n (fun x -> Array.init n (fun y -> try aux x y with Exit ->
true))
let join list = String.concat "" (Array.to_list (Array.map string_of_int
list))
let output_strings = Array.map join perms
let board = Array.make size 0
let op = String.make (size*(size+1)-1) ':'
let rec add_a_row row =
if row = size then begin
for i=0 to size-1 do
String.blit output_strings.(board.(i)) 0 op (i*(size+1)) size
done;
print_string op;
print_string "\n"
end else
for latest = 0 to n - 1 do
let prev_row = ref 0 in
let incompat = incompat.(latest) in
while !prev_row < row && not incompat.(board.(!prev_row)) do
incr prev_row
done;
if !prev_row = row then begin
board.(row) <- latest;
add_a_row (row + 1)
end
done
let () = add_a_row 0
--
Dr Jon D Harrop, Flying Frog Consultancy
Objective CAML for Scientists
http://www.ffconsultancy.com/products/ocaml_for_scientists
I will take the revised Ocaml version and time it and check the output
tonight.
[peterhickman]$ ocaml -version
The Objective Caml toplevel, version 3.09.2
Is this good enough for the version you wrote?
Can we just clear something up here. The Latin squares mini project was
not plucked out of the air just to be a poster child for my first post.
It was part of a project that I am working on and as such the program
has to produce some output! It does not make any sense to optimise the
program in any language if we are not going to produce any output.
Otherwise I would go with this heavily optimised C version:
[peterhickman]$ cat latin.c
int
main(int argc, char *argv[])
{
return 0;
}
This is an application, it has to produce output. Cease this willy
waving otherwise you are saying "For performance, remove the output".
Actually, OCaml is one of the languages in which I've been meaning to
get competent at some point in the hopefully-near future (for some
definition of "near"). I find that I pick up a language faster if
there's a good community for it whose means of communication suits the
way I think, though. What sort of online OCaml community resources are
available? Perl's biggest strength, in terms of interactive community,
seems (for my tastes) to be the PerlMonks site, and this list fills the
same need for Ruby. What comparable community interaction mechanism
could I use for OCaml?
I'm often disappointed by the options for a given language. For
instance, I recently (earlier this year) added another language to my
list of loves: the UCBLogo implementation of the Logo language (more a
"Lisp without parentheses" than any other Logo of which I'm aware, as it
provides macro support). It has no online community at all, except for
a bunch of half-baked educators who want to use Turtle Graphics to teach
elementary and middle school kids that computers aren't scary, which is
definitely not the community I want for my own purposes. My issue with
the Python community is far less drastic -- it actually has one of the
better programming language online communities, all things considered,
of course -- but the fact that the cultural rigidity of the Python
community seems so pervasive drives me away (as does that
fork-in-the-eye sensation I get when reading Python source). Et cetera.
Both Ruby and Perl are stand-out examples of excellence for online
communities, and while I don't need the perfection of either
necessarily, I hope OCaml provides something close.
So. Does it?
Wow, that was a lot of rambling for that question. Just to keep it
on-topic:
Thanks for being a good community, guys.
--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]
"A script is what you give the actors. A program
is what you give the audience." - Larry Wall
You probably also have:
$ ocamlopt -v
The Objective Caml native-code compiler, version 3.09.2
Standard library directory: /usr/lib/ocaml/3.09.2
In which case you just do:
$ ocamlopt latin.ml -o latin
$ time ./latin >/dev/null
Yeah. I didn't bother doing that. ;-)
The next stage, which is to stop it reporting rotational and flipped
versions of a grid, is going to be quite a grind to test.
Any views on Practical Ocaml by Joshua B. Smith, does it look like it
might be a good book to learn from?
> Here's how it goes on my box:
>
> [Latin]$ time ruby latin2.rb > r5
>
> real 0m16.283s
> user 0m15.380s
> sys 0m0.498s
>
> Which means that your computer is about as crap as mine :)
Yes. I was really suprised to see that your timings where longer than
mine :) But I am happy with the speed of my computer, though emacs
startup is a bit slow... :)
> I will add it
> to the pages and make it a download. I think I need a table summarising
> the timings.
I just realised that it is better to move the permutations on the
row string (the ones that use String#tr) outside the row permutations,
to avoid recalculating the rows each time. The only method I changed is
print_solution. The program runs almost twice as fast now! (0m7.8s on my
computer).
Regards,
Kristof
----------------- latin.rb, version 2 -----------------------
require 'permutation'
def print_solution(sqr)
StrPermutations.each do |sp|
newsqr = sqr.map { |r| r.tr(BaseStr, sp) }
RowPermutations.each do |rp|
rp.each do |r|
print newsqr[r]
> Jon Harrop wrote:
>> Peter Hickman wrote:
>>
>>> I will take the revised Ocaml version and time it and check the output
>>> tonight.
>>>
>>>
>> Yeah. I didn't bother doing that. ;-)
>>
>>
> Not wanting to name and shame but there was one submission (off list)
> that was faster for completely the wrong reasons. Besides I get a warm
> fuzzy feeling when the results match because the pain of checking which
> program is actually wrong is too great. As my Perl, Java and C versions
> all use the same Perl to pre compute their values they all produced the
> same results, consistent yes but are they correct? I'm pretty confident
> now, especially now that I have independently written programs produce
> the same results.
>
> <snip>
I used the following tests:
"ruby latin.rb | wc -l" to test if the number of squares matches the
number that is written in the wikipedia article.
Then "ruby latin.rb | sort | uniq | wc -l" shows if all the squares that
are output by the program are different.
All that remains to be done is to check that each latin square is
actually a latin square. That wouldn't be so much work, but I just looked
at some of the output, and trusted that the rest of the squares are
correct too :)
Regards,
Kristof
But given that there is no relationship between the Perl, Ruby and Ocaml
versions I am sure that there will be no surprises.
> I find that I pick up a language faster if
> there's a good community for it whose means of communication suits the
> way I think, though. What sort of online OCaml community resources are
> available?
Primarily, the online resources come from one of the two mailing lists,
OCaml-Beginners[1] for introductory to intermediate questions, and the
main OCaml list[2] for intermediate to advanced questions, or questions
on the OCaml internals and whatnot. There are a few other places where
info moves, like Richard Jones' cocan wiki [3], but at the moment
things are primarily handled through the lists (though it might be
about time for something like perlmonks to arrive as the community is
growing).
[1] http://groups.yahoo.com/group/ocaml_beginners/
[2] http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
[3] http://wiki.cocan.org/
> Any views on Practical Ocaml by Joshua B. Smith, does it look like it
> might be a good book to learn from?
Hard to say, as it won't be available for another three months, and I
can't even find a sample table of contents...
However, in the meantime, some of the best resources are freely
available. In addition to parts I, III, and IV of the OCaml manual, I
recommend anyone starting with OCaml to begin with Jason Hickey's book
[1], and then look at the translation of the older French O'Reilly book
[2] (note however, that the book is a bit out of date, but this only
crops up as a real issue a few times -- feel free to hit the
ocaml-beginners mailing list [3] with questions). Richard Jones also
has a pretty nice tutorial available online [4]. Once you get going,
there are more specific resources for tools like OCamllex, ocamlyacc,
and camlp4.
[1] http://files.metaprl.org/doc/ocaml-book.pdf
[2] http://caml.inria.fr/pub/docs/oreilly-book/
[3] http://groups.yahoo.com/group/ocaml_beginners/
[4] http://www.ocaml-tutorial.org/
1) Over the past 3 or 4 days, people have said what's wrong with the
way you're producing output in Java, and suggested what you should do
instead. Is it that you don't know enough Java to follow their
suggestions and fix the program?
A couple of days ago single measurements on my machine gave
0.820s user+sys, gcc -pipe -Wall -O3 -fomit-frame-pointer
-funroll-loops -march=pentium4
4.444s user+sys, sun-jdk-1.5.0.07
Simply writing Java that converts the double-byte Unicode Java
outputStrings to single-byte /once/ instead of everytime there's a
print statement, and following the usual Java pattern of wrapping
output with a BufferedOutputStream, reduces that to
1.396s user+sys, sun-jdk-1.5.0.07
Yesterday I emailed you another version of your Java generating Perl
script that fixes those print problems, the output matches the C
program output.
2) If this is an application, and it has to produce output, and it has
to produce output for something larger than 6x6, then I think you need
a better algorithm.
You've mentioned the 5x5 C program took 2.473s, and the 6x6 took 9900s
- at that rate might we expect 7x7 in 15 months?
So you don't know enough Java?
> That way we can get away from all this
> stupidity like the fact that I didn't use the correct naming convention.
> Which we all know hugely affects performance. But remember that the
> timings will be based on a program that produces output.
> > A couple of days ago single measurements on my machine gave
> > 0.820s user+sys, gcc -pipe -Wall -O3 -fomit-frame-pointer
> > -funroll-loops -march=pentium4
> > 4.444s user+sys, sun-jdk-1.5.0.07
> >
> > Simply writing Java that converts the double-byte Unicode Java
> > outputStrings to single-byte /once/ instead of everytime there's a
> > print statement, and following the usual Java pattern of wrapping
> > output with a BufferedOutputStream, reduces that to
> >
> > 1.396s user+sys, sun-jdk-1.5.0.07
> >
> > Yesterday I emailed you another version of your Java generating Perl
> > script that fixes those print problems, the output matches the C
> > program output.
> >
> If it was the email I replied to yesterday then reason that it got close
> to the speed of the C version was because it didn't do any output. Once
> I put the output back in it was only around 5 seconds faster than my own
> Java version and around 15 seconds slower than the C version. Hardly
> encouragement to drop C and pursue Java. Remember I am running the
> program to get the output.
If it didn't produce any output I would have mentioned that, as I have
before.
Check your email - the subject line reads " For performance use ascii
output"
A small point.
print_endline op
is shorter. Is it faster?
> Here's my OCaml2 (based upon William's):
Just a small note here:
Depending on your goals, you can do even better by stuffing the output
in a buffer then dumping the buffer at the end of computation (see
example code below). Of course this consumes more memory, and isn't
possible on a larger board, but it's yet another tweak to exploit,
since we're benchmarking output in addition to benchmarking work (note
that on my G5 output to stdout is a bit slower than your version, but
this isn't the case on the AMD machines I have access to -- redirecting
to a file is always faster).
let rec add_a_row row =
let buf = Buffer.create 1_000_000 in
let rec addh row =
match row with
| x when x=size ->
for i=0 to size-1 do
String.blit output_strings.(board.(i)) 0 op (i*(size+1)) size
done;
Buffer.add_string buf op; Buffer.add_char buf '\n'
| _ ->
for latest = 0 to n - 1 do
let prev_row = ref 0 in
let incompat = incompat.(latest) in
while !prev_row < row && not incompat.(board.(!prev_row)) do
incr prev_row
done;
if !prev_row = row then (board.(row) <- latest; addh (row + 1))
done in
addh row;
Buffer.output_buffer stdout buf
let () =
add_a_row 0;
Thanks. I'll look into it.
You mention it needs something like PerlMonks -- which is interesting,
considering I've noticed there's nothing like that for Ruby, either.
I've found myself wishing a couple of times that there was something
like a RubyMonks kinda thing (though in a more Rubyish idiom than *Monks
would be), but every time I consider the notion I find myself wishing
that PerlMonks were just a bit more multi-language, focusing primarily
on languages I personally find interesting. Bah.
I've considered putting something vaguely like that together myself,
actually, but I'm lazy and too much of a perfectionist. Since I'd be
working toward my own project spec to create such a website, I'd
probably just end up spending months working on what I want the site to
do and never get around to writing a line of code.
--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]
"The first rule of magic is simple. Don't waste your time waving your
hands and hopping when a rock or a club will do." - McCloctnick the Lucid
> let incompat =
> let aux x y =
> Array.iteri (fun i px -> if px = perms.(y).(i) then raise Exit) perms
> (x);
Perhaps google has mangled you code; at this point ocaml reports
File "latin-squares-H.ml", line 30, characters 4-15:
This function is applied to too many arguments,
maybe you forgot a `;'
> File "latin-squares-H.ml", line 30, characters 4-15:
> This function is applied to too many arguments,
> maybe you forgot a `;'
It lost a . for array access. It should be (slightly reformatted
trying to lose nothing):
let incompat =
let aux x y =
Array.iteri
(fun i px -> if px = perms.(y).(i) then raise Exit)
perms.(x);
Not with the interpreter; it's slower. Haven't tested the compiler yet.
> > A small point.
> > print_endline op
> > is shorter. Is it faster?
>
> Not with the interpreter; it's slower. Haven't tested the compiler yet.
If anything, it'll be slower, as print_endline flushes the buffer every
call, whereas print_string and print_char do not.
It looks very good but not as good as my book. ;-)
http://www.ffconsultancy.com/products/ocaml_for_scientists/chapter1.html
Post your code here.
That was 64-bit. Here are my 32-bit timings (language versions are the
same):
Compile Run
C 0.126s 0.386s gcc -O3 -Wall
OCaml2 0.089s 0.391s ocamlopt
OCaml2 0.010s 10.111s ocamlc
Java 1.615s 12.745s javac
Again, OCaml is basically as fast as C. Note that C and OCaml get slower
moving to 64-bit but Java gets >7x faster. Amazingly, OCaml's interpreted
bytecode is faster than Java on 32-bit, whilst being >160x faster to
compile!
Here's my latest OCaml (I think we could revert to the simpler list-based
permuter because no significant time is spent generating the permutations):
let rec fact n = if n=0 then 1 else n*fact(n-1)
let size = 5
(* Permutation generator *)
let p = Array.make size 0
let xx = Array.init size (fun i -> if i<size-1 then 0 else -1)
let gen_perm() =
let i = ref (size - 1) in
while !i > 0 && xx.(!i) = !i do
xx.(!i) <- 0;
decr i
done;
if !i = 0 then 1 else begin
xx.(!i) <- xx.(!i) + 1;
p.(0) <- 1;
for i=0 to size - 1 do
p.(i) <- p.(i - xx.(i));
p.(i - xx.(i)) <- i+1
done;
0
end
let n = fact size
(* Permutations *)
let perms = Array.init n (fun _ -> ignore(gen_perm()); Array.copy p)
let incompat =
let rec aux (px : int array) py i =
i<size && (px.(i) = py.(i) || aux px py (i+1)) in
Array.init n (fun x -> Array.init n (fun y -> aux perms.(x) perms.(y) 0))
let join list = String.concat "" (Array.to_list (Array.map string_of_int
list))
let output_strings = Array.map join perms
let board = Array.make size 0
let op = String.make (size*(size+1)) ':'
let rec add_a_row row =
if row=size then begin
for i=0 to size-1 do
String.blit output_strings.(board.(i)) 0 op (i*(size+1)) size
done;
print_string op
end else
for latest = 0 to n - 1 do
let prev_row = ref 0 in
let incompat = incompat.(latest) in
while !prev_row < row && not incompat.(board.(!prev_row)) do
incr prev_row;
done;
if !prev_row = row then begin
board.(row) <- latest;
add_a_row (row + 1)
end
done
let () =
op.[size*(size+1)-1] <- '\n';
add_a_row 0
>> 2) If this is an application, and it has to produce output, and it has
>> to produce output for something larger than 6x6, then I think you need
>> a better algorithm.
>>
>> You've mentioned the 5x5 C program took 2.473s, and the 6x6 took 9900s
>> - at that rate might we expect 7x7 in 15 months?
>>
> Actually the real problem is the amount of disk space I need to store
> the output. 6 x 6 resulted in 32Gb of output. The 7 x 7 will probably
> eat my hard disk well before it completes. I'm not sure that the 1Tb I
> have will be enough so I am looking at removing the rotational and
> mirrored versions of the grids from the output.
Why would you want to store all this data? If you want to store the 7x7
version, you will need 3187.2 TB to store all the squares. To store
only the reduced squares you still need 920 MB. To store the reduced
squares for 8x8 you will need 35.6TB. (I used the wikipedia article
on http://en.wikipedia.org/wiki/Latin_square to compute these numbers).
My program can easily show only the reduced squares by removing the
permutations from the solutions method, and it will be a lot faster too.
If you want to do this for large squares it would be better to rewrite my
algorithm in C. But I am interested how this data could ever be useful.
Regards,
Kristof
No.
That version is on Peter Hickman's website, perhaps he'll replace it
with the newer version on Friday.
Here's a faster version of my program.
Eliminated a "not" in a loop by replacing the array of booleans
that shows incompatibility of two rows with an array that shows
compatibility.
Borrowed you idea of creating an output line ahead of time and
modifying it in place.
Timings:
1.11 yours
1.04 mine
(* compile with:
ocamlopt -unsafe -inline 100 latin-squares.ml -o latin-squares.exe
*)
(* permutation code by Eric C. Cooper *)
let rec distribute elt = function
| (hd :: tl) as list -> (elt :: list) ::
(List.map (fun x -> hd :: x) (distribute elt tl))
| [] -> [ [elt] ]
let rec permute = function
| x :: rest -> List.flatten (List.map (distribute x) (permute rest))
| [] -> [ [] ] ;;
let list = [ 1; 2; 3; 4; 5 ]
let size = List.length list
let perms = Array.of_list (permute list)
let n = Array.length perms
(* Used to determine if one row is compatible with another. *)
let compatible = Array.make_matrix n n true ;;
for x = 0 to n - 1 do
for y = 0 to n - 1 do
compatible.(x).(y) <-
List.for_all2 ( <> ) perms.(x) perms.(y)
done
done
let join list = String.concat "" (List.map string_of_int list)
let output_strings = Array.map join perms
(* For speed, create a string that's the length of the lines
that we'll print; the :'s that aren't needed as separators
will later be overwritten. *)
let output_line = String.make (size*(size+1)-1) ':' ^ "\n"
let board = Array.make size 0
let rec add_a_row row =
if row = size then
( for i=0 to size-1 do
String.blit
output_strings.(board.(i)) 0 (* source *)
output_line (i*(size+1)) (* dest *)
size
done;
print_string output_line
)
else
for latest = 0 to n - 1 do
(* Create a changeable thing (variable). *)
let prev_row = ref 0 in
(* The ! below fetches the variable's value. *)
while (!prev_row < row) &&
(compatible.(latest).(board.(!prev_row))) do
incr prev_row
done;
if !prev_row = row then
( board.(row) <- latest ; add_a_row (row + 1) )
done
;;
add_a_row 0
Curiously the two Java implementations on Peter's site perform the same on
my machine in x86 and they are >32x slower than OCaml. I know Java is slow
but that is ridiculous...
You're probably doing something wrong.
Heh, should've thought of that. :-)
> Borrowed you idea of creating an output line ahead of time and
> modifying it in place.
>
> Timings:
> 1.11 yours
> 1.04 mine
I get the opposite order:
0.394s yours
0.390s mine
but your timings are probably more accurate because your computer is so
slow. ;-)
You can make it slightly faster by factoring out the CSE "compatible
(latest)", giving:
0.382s
That's actually faster than the C here. Woohoo! :-)
Using bitvectors may also improve things, depending if skipping a bunch of
comparisons in that loop is significant. However, I think it is time we
searched for a new task to optimise. Ray tracer anyone? ;-)
It's 800MHz. I ran each program 4 times and took the average.
>
> You can make it slightly faster by factoring out the CSE "compatible
> (latest)", giving:
>
> 0.382s
>
> That's actually faster than the C here. Woohoo! :-)
Good point. That cuts my time to 1.01 seconds.
Also cleaned up the code a bit.
Here's the final version.
(* compile with:
ocamlopt -unsafe -inline 100 latin-squares.ml -o latin-squares.exe
*)
(* permutation code by Eric C. Cooper *)
let rec distribute elt = function
(hd :: tl) as list -> (elt :: list) ::
(List.map (fun x -> hd :: x) (distribute elt tl))
| [] -> [ [elt] ]
let rec permute = function
x :: rest -> List.flatten (List.map (distribute x) (permute rest))
| [] -> [ [] ]
let list = [ 1; 2; 3; 4; 5 ]
let size = List.length list
let perms = Array.of_list (permute list)
let n = Array.length perms
(* Boolean array used to determine if one row is
compatible with another. *)
let compatible = Array.make_matrix n n true ;;
Array.iteri (fun x ex ->
Array.iteri (fun y ey ->
compatible.(x).(y) <- List.for_all2 (<>) ex ey) perms ) perms
let join list = String.concat "" (List.map string_of_int list)
let output_strings = Array.map join perms
(* For speed, create a string that's the length of the lines
that we'll print; the :'s that aren't needed as separators
will later be overwritten. *)
let output_line = String.make (size*(size+1)-1) ':' ^ "\n"
let board = Array.make size 0
(* A recursive function. *)
let rec add_a_row row =
if row = size then
( for i=0 to size-1 do
String.blit
output_strings.(board.(i)) 0 (* source *)
output_line (i*(size+1)) (* dest *)
size
done;
print_string output_line
)
else
for latest = 0 to n - 1 do
let compatible_slice = compatible.(latest) in
(* Create a changeable thing (variable). *)
let prev_row = ref 0 in
(* The ! below fetches the variable's value. *)
while !prev_row < row &&
compatible_slice.(board.(!prev_row)) do
incr prev_row
done;
if !prev_row = row then
( board.(row) <- latest ; add_a_row (row + 1) )
done
;;
add_a_row 0
>
Here's a C++ implementation:
C++ 0.652s 0.608s g++ -O3 -Wall
(95% of time is spent in addRow())
#include <iostream>
#include <algorithm>
#include <vector>
using namespace std;
int fact(int n) { return n == 0 ? 1 : n*fact(n-1); }
int size=5;
vector<int> list(size);
int n = fact(size);
vector< vector<int> > perms(n);
vector< vector<bool> > compat(n);
vector<int> board(size);
string out(size*(size+1), ':');
void print() {
for (int i=0; i<size; ++i)
for (int j=0; j<size; ++j)
out[i*(size+1) + j] = '1' + perms[board[i]][j];
cout << out;
}
int next(vector<bool> &compat, int row) {
int prev_row = 0;
while (prev_row < row && compat[board[prev_row]])
++prev_row;
return prev_row;
}
void addRow(int row) {
if (row == size) print(); else
for (int latest=0; latest<n; ++latest) {
if (next(compat[latest], row) == row) {
board[row] = latest;
addRow(row+1);
}
}
}
int main() {
for (int i=0; i<size; ++i) list[i] = i;
for (int i=0; i<n; ++i) {
perms[i] = vector<int>(list.begin(), list.end());
next_permutation(list.begin(), list.end());
}
for (int i=0; i<n; ++i) {
compat[i] = vector<bool>(n);
for (int j=0; j<n; ++j) {
compat[i][j] = true;
for (int k=0; k<size; ++k)
if (perms[i][k] == perms[j][k]) {
compat[i][j] = false;
break;
}
}
}
out[size*(size+1)-1] = '\n';
addRow(0);
Other people have posted not entirely dissimilar timings for Java and I'm
doing exactly the same thing on x86_64 where is runs much faster.
Has anyone managed to get Java close to OCaml or C?
Peter Hickman's measurements were displayed on his website before you
asked for the Java source code - they show the current Java program
taking ~2.15x the time of the current OCaml program, on his machine.
For Java2 from the site and using Sun's Java 1.5 instead of GIJ 1.4.2
(d'oh!), I get:
Java 0.543s 0.711s javac
That's a much more respectable running time although the compile time is
comparatively poor.
I guess that means we should be testing everything on the laptop I'm
currently using as a remote terminal for my server. It's a 366MHz
dinosaur.
. . except I'm doing real work right now, and don't need to be bogging
stuff down.
--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]
print substr("Just another Perl hacker", 0, -2);
Isaac Gouy provided a Java implementation based on mine (ie still pre
computes the tables in Perl) that brought the times down to sub 9 seconds.
real 0m8.966s
user 0m5.815s
sys 0m1.488s
But the big news is that William James' revision of his previous Ocaml
version is now the fastest.
real 0m3.660s
user 0m1.958s
sys 0m1.421s
The source code for both are available on the web site for you to examine.
real 0m2.260s
user 0m1.867s
sys 0m0.221s
I stole code and ideas from Jon Harrop, so this should be called
the James/Harrop version.
I just noticed that your site doesn't have my last version (the 4th).
Here it is again:
(* Thanks to Jon Harrup for code and ideas. *)
I'm beginning to wonder if my computer is slow enough to give meaningful
timings as the Ocaml versions seem to be pushing for sub second timings :)
In order to be fair to gcc, I'd like to point out one thing.
Although I used your algorithm, I tweaked it in an attempt to make
it faster. So I think that you should take my add_a_row function
and translate it back to C.
real 0m1.885s
user 0m1.649s
sys 0m0.168s
I think this is because the top most loop, when processing the top row,
basically just zooms down a given row of the compared array. With four
chars being packed in where one bool would be more of the tests will be
using data that is available in the CPUs cache.
sub 9 seconds?
7.3s
"real" is elapsed time, which includes all those other processes that
grabbed CPU after you gave the time command.
That URL again? -Tim
>
>
>
This late in the proceedings it would only confuse the issue to start
saying that the first Perl version ran in 251 minutes when 473 minutes
has been mentioned several times. Besides now that we are getting into
such short times for the Ocaml and C versions the difference between
real and user + sys is sub second and it is becoming harder to say that
one is significantly faster then the other. One more run of William
James' program and it might make up that 0.029 of a second difference.
Perhaps stepping up to a 6 x 6 grid would allow more meaningful timings?
> Perhaps stepping up to a 6 x 6 grid would allow more meaningful timings?
showing how it performs on a 2, 4, and 8 grid would show how it scaled...
-a
--
happiness is not something ready-made. it comes from your own actions.
- h.h. the 14th dali lama
I do have a very noisy and hot AMD 64 4400 with 2 Gb ram running Ubuntu.
Perhaps I will set this as the new benchmark machine.
I wish that you would dump Perl and do everything in C.
Now do you understand that when you wrote "C is still faster by an
order of magnitude" it had nothing to do with Java and had everything
to do with your "bogus" program.
In other words - Charles O Nutter was right, and you were wrong.
Depending on the machine hardware, we should expect a vanilla Java
implementation to take 1.5x - 2.5x longer than C for this problem.
Only if by "vanilla" you mean optimised Java.
I posted an OCaml implementation that used an imperative, array-based
permuter. We could just port that back to C (or find the site with the C
code that I translated).
> Isaac Gouy wrote:
> > Depending on the machine hardware, we should expect a vanilla Java
> > implementation to take 1.5x - 2.5x longer than C for this problem.
>
> Only if by "vanilla" you mean optimised Java.
Did you look at Isaac's implementation? It is almost exactly the same it
was before, but he is not printing out the latin squares as unicode (which
none of the other impls are doing to my knowledge) and it is writing to
a BufferedReader (which any Java programmer with a couple months of
experience would/should do).
I think of things like manual loop unrolling and other similiar techniques
as optimization techniques. The stuff he did was common sense and I think
closer to the C impl in action (these languages are not the same -- some
consideration should be taken in this).
My only question on this thread about Java is what Java is being used
for the test (used on the webpage -- a good addition to the page)? I ran
Java 1.4, Java 5 and Java 6 (beta but what the hell). The speedup was
surprising and exciting (obviously I do Java programming). I went from
1.1s to 0.84s between the versions.
-Tom
--
+ http://www.tc.umn.edu/~enebo +---- mailto:en...@acm.org ----+
| Thomas E Enebo, Protagonist | "Luck favors the prepared |
| | mind." -Louis Pasteur |
What things in that program do you consider optimizations?
I also provided a Java program with the same inner loop code as the
OCaml program but it hasn't appeared on Peter Hickman's website yet.
Having repeatedly made a mistake in the past is no reason to continue
making the same mistake in the future!
(You're timing for Perl includes 3 or 4 hours of you surfing the web or
installing OCaml or whatever!)
It scales badly
0.496s gcc 5x5 (without print statements)
30055.098s gcc 6x6 (without print statements)
imo If Peter has figured out what he's trying to do then it would be
smart to find a better way of doing it rather than scrabbling for small
percentage improvements.
indeed - that's my point exactly.
with that kind scaling a better ruby version could kill the c one!
At last, something we can agree about - that makes all this silliness
worthwhile :-)
Yes.
> I think of things like manual loop unrolling and other similiar
> techniques
> as optimization techniques.
They are all optimisations.
> The stuff he did was common sense and I think
> closer to the C impl in action (these languages are not the same -- some
> consideration should be taken in this).
These languages are not the same. However, using "ordinary" IO via
System.out.println or cout or print_endline is the "vanilla" way of
outputting in all of these languages. Of these languages, only Java gives
appalling performance unless you switch to buffers.
Even when you do use buffers in Java, it is still slower than unbuffered
C++, OCaml etc. on my machine.
> My only question on this thread about Java is what Java is being used
> for the test (used on the webpage -- a good addition to the page)?
I was using Sun's J2SE 1.5. I also tried GNU's GIJ, which is even slower
than OCaml's bytecode.
I guess you'd fail a Java exam :-)
(I'm not interested enough to find out, but I suspect stdout is at
minimum line-buffered.)
> Even when you do use buffers in Java, it is still slower than unbuffered
> C++, OCaml etc. on my machine.
No where near an order of magnitude slower - iirc you reported 1.85x C
> I was using Sun's J2SE 1.5. I also tried GNU's GIJ, which is even slower
> than OCaml's bytecode.
You failed the Java exam again :-)
For Java without JIT use the -Xint command line option with the Sun JVM
like this
time java -Xint Latin > /dev/null
And if the program ran for any amount of time we of course be using
time java -server Latin > /dev/null
Sure, the naive implementations in other languages are doing silly things as
well (like the OCaml flushing after every line) but they are still faster
than the most optimised Java to date.
>> Even when you do use buffers in Java, it is still slower than unbuffered
>> C++, OCaml etc. on my machine.
>
> No where near an order of magnitude slower - iirc you reported 1.85x C
I get 0.604s unoptimised OCaml vs 1.740s unoptimised Java on x86_64. That's
2.9x slower.
>> I was using Sun's J2SE 1.5. I also tried GNU's GIJ, which is even slower
>> than OCaml's bytecode.
>
> You failed the Java exam again :-)
>
> For Java without JIT use the -Xint command line option with the Sun JVM
> like this
>
> time java -Xint Latin > /dev/null
That's as slow as ocamlc.
> And if the program ran for any amount of time we of course be using
>
> time java -server Latin > /dev/null
I already tried and it improved performance slightly.
Not to get embroiled in a nascent flamewar, or anything, but . . .
I don't think the fact that everyone in one language's community does
something that people in other languages' communities don't so they can
get better performance makes what they're doing any less an
optimization. It just makes it an exceedingly common optimization (and
might indicate that language implementation defaults should be changed).
Then again, maybe not.
--
CCD CopyWrite Chad Perrin [ http://ccd.apotheon.org ]
"A script is what you give the actors. A program
is what you give the audience." - Larry Wall
afaict none of the OCaml implementations work with the data created by
that Perl script - like the C and Java implementations - why is that?
>
> >> Even when you do use buffers in Java, it is still slower than unbuffered
> >> C++, OCaml etc. on my machine.
> >
> > No where near an order of magnitude slower - iirc you reported 1.85x C
>
> I get 0.604s unoptimised OCaml vs 1.740s unoptimised Java on x86_64. That's
> 2.9x slower.
>
> >> I was using Sun's J2SE 1.5. I also tried GNU's GIJ, which is even slower
> >> than OCaml's bytecode.
> >
> > You failed the Java exam again :-)
> >
> > For Java without JIT use the -Xint command line option with the Sun JVM
> > like this
> >
> > time java -Xint Latin > /dev/null
>
> That's as slow as ocamlc.
as slow as? does that mean faster :-)
> On Sat, 5 Aug 2006, Isaac Gouy wrote:
>
>>
>> ara.t....@noaa.gov wrote:
>>> On Sat, 5 Aug 2006, Peter Hickman wrote:
>>>
>>>> Perhaps stepping up to a 6 x 6 grid would allow more meaningful timings?
>>>
>>> showing how it performs on a 2, 4, and 8 grid would show how it scaled...
>>>
>>> -a
>>> --
>>> happiness is not something ready-made. it comes from your own actions.
>>> - h.h. the 14th dali lama
>>
>> It scales badly
>>
>> 0.496s gcc 5x5 (without print statements)
>> 30055.098s gcc 6x6 (without print statements)
>>
>> imo If Peter has figured out what he's trying to do then it would be
>> smart to find a better way of doing it rather than scrabbling for small
>> percentage improvements.
>
> indeed - that's my point exactly.
>
> with that kind scaling a better ruby version could kill the c one!
Did you actually run the 6x6 version? I think my Ruby version might
actually be faster than the C one for large squares, since it only
computes normalised squares. A reasonable estimate indicates that
for 6x6 squares it will take about 23500 seconds on my computer. But I
didn't test it, because that's still longer than I am prepared to wait for
it :)
Kristof
The authors couldn't be bothered because it is easier to generate that data
in OCaml.
>> > For Java without JIT use the -Xint command line option with the Sun JVM
>> > like this
>> >
>> > time java -Xint Latin > /dev/null
>>
>> That's as slow as ocamlc.
>
> as slow as? does that mean faster :-)
No, that means even slower.
I think that's the current best version for the Mac. It does tend to lag
behind the rest of the world when it comes to releases.
I will stick a language versions page up tonight.
There are optimisations that could be applied to the C version but they
are only going to get percentage improvements and are getting much
harder to reliably measure.
As to Ocaml versions, well they were submitted and proved to be faster
than my C version. C is quite a fragile language to develop in and so
any high level language that can deliver that performance is very
interesting. To me at least.
And all this Java stuff, well that comes from the post by Charles O Nutter:
> - Write it in C is as valid as write it in Java (as someone else
mentioned).
> Java is at least as fast as C for most algorithms.
It was the "at least as fast as" bit that I disputed and the whole Java
sub thread comes from that.
The only requirement is that I can get an implementation of the language
in question on my Macintosh.