[ruby-core:18303] Ruby 1.8.6 yields 50%-100% performance gain when compiled at full optimization

kevin nolan

Aug 14, 2008, 1:19:14 PM
to ruby...@ruby-lang.org
After compiling Ruby 1.8.6 with '-O3 -mtune=K8 -march=K8' on an AMD 4800+, I decided to run Antonio Cangiano's benchmark suite to see what performance gain, if any, the new interpreter realized. Needless to say I was impressed with the results. The specifics:

control: ruby 1.8.6 (2007-09-24 patchlevel 111) [x86_64-linux] (apt-get install ruby)
test:      ruby 1.8.6 (2008-08-11 patchlevel 287) [x86_64-linux] (source compiled with '-O3 -mtune=K8 -march=K8')
kernel:   2.6.24-19-server
test-suite: git://github.com/acangiano/ruby-benchmark-suite.git

Notes:

The timeout for each test was left at the default of 30 seconds. Twenty-four tests exceeded the timeout, so their ratio is unknown. In two of the tests, bm_regex_dna.rb and bm_hilbert_matrix.rb, the optimized version of ruby was actually *slower*. The patch levels of the two interpreters differ, so this is not an exact apples-to-apples comparison. Two tests reported a 'stack too deep' error.

                                                Default (s)     Optim -O3 (s)   Optim/Default
/core-features/bm_app_answer.rb:                1.29            0.8             0.62
/core-features/bm_app_factorial.rb:             Error           Error             ?
/core-features/bm_app_factorial2.rb:            Error           Error             ?
/core-features/bm_app_fib.rb:                   T/O              T/O              ?
/core-features/bm_app_raise.rb:                 6.63            6.01            0.91
/core-features/bm_app_tak.rb:                   12.4            8.38            0.68
/core-features/bm_app_tarai.rb:                 9.91            6.88            0.69
/core-features/bm_loop_times.rb:                7.97            3.69            0.46
/core-features/bm_loop_whileloop.rb:            T/O              10.2            ?
/core-features/bm_loop_whileloop2.rb:           T/O              21.77          ?
/core-features/bm_so_ackermann.rb:              T/O              T/O              ?
/core-features/bm_so_nested_loop.rb:            9.31            5.33            0.57
/core-features/bm_so_object.rb:                 11.74           9.26            0.79
/core-features/bm_so_random.rb:                 T/O              2.35            ?
/core-features/bm_startup.rb:                   0               0               0.71
/core-features/bm_vm1_block.rb:                 T/O              23.67           ?
/core-features/bm_vm1_const.rb:                 T/O              T/O              ?
/core-features/bm_vm1_ensure.rb:                27.68           15.82           0.57
/core-features/bm_vm1_length.rb:                22.99           19.91           0.87
/core-features/bm_vm1_rescue.rb:                T/O              12.86           ?
/core-features/bm_vm1_simplereturn.rb:          T/O              18.3            ?
/core-features/bm_vm1_swap.rb:                  T/O              T/O              ?
/core-features/bm_vm2_method.rb:                21.01           11.65           0.55
/core-features/bm_vm2_poly_method.rb:           T/O              15.72           ?
/core-features/bm_vm2_poly_method_ov.rb:        5.59            4.88            0.87
/core-features/bm_vm2_proc.rb:                  8.92            6.2             0.69
/core-features/bm_vm2_send.rb:                  5.69            4.67            0.82
/core-features/bm_vm2_super.rb:                 6.97            4.46            0.64
/core-features/bm_vm2_unif1.rb:                 5.11            3.65            0.71
/core-features/bm_vm2_zsuper.rb:                7.47            4.93            0.66
/core-library/bm_app_strconcat.rb:              1.44            1.13            0.78
/core-library/bm_pathname.rb:                   T/O              T/O              ?
/core-library/bm_so_array.rb:                   9.1             5.6             0.62
/core-library/bm_so_concatenate.rb:             3.42            1.84            0.54
/core-library/bm_so_count_words.rb:             0.03            0.03            ?
/core-library/bm_so_exception.rb:               7.58            5.28            0.7
/core-library/bm_so_lists.rb:                   T/O              T/O              ?
/core-library/bm_so_matrix.rb:                  2.62            1.79            0.68
/core-library/bm_vm2_array.rb:                  9.55            6.18            0.65
/core-library/bm_vm2_regexp.rb:                 4.72            6.4            *1.35
/core-library/bm_vm3_thread_create_join.rb:     0.08            0.03            0.34
/micro-benchmarks/bm_app_pentomino.rb:          T/O              T/O              ?
/micro-benchmarks/bm_binary_trees.rb:           T/O              T/O              ?
/micro-benchmarks/bm_fannkuch.rb:               T/O              T/O              ?
/micro-benchmarks/bm_fasta.rb:                  T/O              T/O              ?
/micro-benchmarks/bm_fractal.rb:                T/O              T/O              ?
/micro-benchmarks/bm_knucleotide.rb:            2.21            1.55            0.70
/micro-benchmarks/bm_lucas_lehmer.rb:           7.32            6.44            0.88
/micro-benchmarks/bm_mandelbrot.rb:             T/O              T/O              ?
/micro-benchmarks/bm_mergesort.rb:              2.91            2.62            0.9
/micro-benchmarks/bm_meteor_contest.rb:         T/O              T/O              ?
/micro-benchmarks/bm_monte_carlo_pi.rb:         24.83           19.52           0.79
/micro-benchmarks/bm_nbody.rb:                  T/O              T/O              ?
/micro-benchmarks/bm_nsieve.rb:                 24.55           21.47           0.87
/micro-benchmarks/bm_nsieve_bits.rb:            T/O              T/O              ?
/micro-benchmarks/bm_partial_sums.rb:           27.83           25.13           0.9
/micro-benchmarks/bm_quicksort.rb:              10.76           6.06            0.56
/micro-benchmarks/bm_recursive.rb:              T/O              28.06            ?
/micro-benchmarks/bm_regex_dna.rb:              1.54            2.05          *1.33
/micro-benchmarks/bm_reverse_compliment.rb:     T/O              T/O              ?
/micro-benchmarks/bm_so_sieve.rb:               T/O              T/O              ?
/micro-benchmarks/bm_spectral_norm.rb:          T/O              T/O              ?
/micro-benchmarks/bm_sum_file.rb:               20.89           16.44           0.79
/micro-benchmarks/bm_thread_ring.rb:            T/O              T/O              ?
/micro-benchmarks/bm_word_anagrams.rb:          12.24           8.79            0.72
/real-world/bm_hilbert_matrix.rb:               24.74           T/O             *?
/standard-library/bm_app_mandelbrot.rb:         0.81            0.61            0.75
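
The Optim/Default column is the optimized time divided by the default time, so lower is better. As a rough sketch of how such a ratio relates to the percentage gains in the subject line (the helper names are hypothetical, and the sample times are copied from the table above):

```ruby
# Ratio as used in the table: optimized time / default time (lower is better).
def ratio(default_s, optim_s)
  (optim_s / default_s.to_f).round(2)
end

# Throughput gain implied by the two times, in percent.
# A ratio of 0.50 corresponds to a 100% gain; 0.67 to roughly 50%.
def gain_percent(default_s, optim_s)
  ((default_s / optim_s.to_f - 1.0) * 100).round
end

samples = {
  "bm_loop_times.rb" => [7.97, 3.69],
  "bm_app_tak.rb"    => [12.4, 8.38],
  "bm_vm2_regexp.rb" => [4.72, 6.4],   # slower when optimized
}

samples.each do |name, (default_s, optim_s)|
  puts format("%-20s ratio %.2f  gain %+d%%", name,
              ratio(default_s, optim_s), gain_percent(default_s, optim_s))
end
```

Rows where either run timed out cannot be fed to these helpers, which is why the table shows "?" for them.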

M. Edward (Ed) Borasky

Aug 14, 2008, 10:45:39 PM
to ruby...@ruby-lang.org
On Fri, 2008-08-15 at 02:19 +0900, kevin nolan wrote:
> After compiling Ruby 1.8.6 with '-O3 -mtune=K8 -march=K8' on an AMD
> 4800+, I decided to run Antonio Cangiano's benchmark suite to see what
> performance gain, if any, the new interpreter realized. Needless to
> say I was impressed with the results. The specifics:
>
> control: ruby 1.8.6 (2007-09-24 patchlevel 111) [x86_64-linux]
> (apt-get install ruby)
> test: ruby 1.8.6 (2008-08-11 patchlevel 287) [x86_64-linux]
> (source compiled with '-O3 -mtune=K8 -march=K8')
> kernel: 2.6.24-19-server
> test-suite: git://github.com/acangiano/ruby-benchmark-suite.git
>
> Notes:
>
> The default timeout for any given test was set at the default of 30
> seconds. Twenty-four tests exceeded the timeout therefore the ratio is
> unknown.

I think that can be changed easily.

> In two of the tests: bm_regex_dna.rb and bm_hilbert_matrix.rb the
> optimized version of ruby was actually *slower*.

Thanks for letting me know -- I wrote "bm_hilbert_matrix", so I think
I'll check this out over the weekend with my "oprofile" setup. BTW, I
usually compile "-O3 -march=athlon64" and I have been using gcc 4.3.1
for a couple of months. Do you expect a fundamental difference between
"-march=athlon64" and "-march=k8 -mtune=k8"?

> The patch level of the two interpreters is different so this is not
> exactly an apples-to-apples comparison. Two tests reported a 'stack
> too deep' error.

Try "ulimit -a" and look at the stack size. Then type "ulimit -s <4x>"
where <4x> is four times the number you got from "ulimit -a". This made
those stack errors go away when I ran these.
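
The same arithmetic can be sketched from inside Ruby. This is only an illustration: `quadrupled` is a hypothetical helper, `Process.getrlimit` reports bytes where `ulimit -s` uses KiB, and the script merely prints the command to run in the shell before starting the benchmarks.

```ruby
# Read the current stack limits in bytes: [soft, hard].
soft, hard = Process.getrlimit(Process::RLIMIT_STACK)

# Hypothetical helper: quadruple a soft limit without exceeding the hard limit.
def quadrupled(soft, hard)
  return soft if soft == Process::RLIM_INFINITY
  wanted = soft * 4
  hard == Process::RLIM_INFINITY ? wanted : [wanted, hard].min
end

new_soft = quadrupled(soft, hard)
if new_soft == Process::RLIM_INFINITY
  puts "stack is already unlimited"
else
  puts "ulimit -s #{new_soft / 1024}"   # ulimit -s takes KiB
end
```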

By the way -- the Ruby Benchmark Suite has its own mailing list --
http://groups.google.com/group/ruby-benchmark-suite to be precise.
--
M. Edward (Ed) Borasky
ruby-perspectives.blogspot.com

"A mathematician is a machine for turning coffee into theorems." --
Alfréd Rényi via Paul Erdős


Shot (Piotr Szotkowski)

Aug 15, 2008, 2:39:55 PM
to ruby...@ruby-lang.org
kevin nolan:

> control: ruby 1.8.6 (2007-09-24 patchlevel 111) [x86_64-linux]
> (apt-get install ruby)
> test: ruby 1.8.6 (2008-08-11 patchlevel 287) [x86_64-linux]
> (source compiled with '-O3 -mtune=K8 -march=K8')

Are you sure the differences are not because of the --with-pthreads
flag in Ubuntu’s build? In my case, the difference between Ubuntu’s
Ruby and `configure; make; make install` 1.8.6.p111 was about 40%
(without touching -O, -mtune and -march).
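
One way to compare two interpreters on this point is to ask each for its recorded build settings. A minimal sketch, assuming the pthread option appears in the configure arguments as a string starting with `--enable-pthread` (`build_summary` is a hypothetical helper, not part of Ruby):

```ruby
require 'rbconfig'

# Hypothetical helper: pull the build settings relevant to this thread
# out of an RbConfig-style hash.
def build_summary(config)
  args = config['configure_args'].to_s
  {
    :cflags   => config['CFLAGS'].to_s,
    :pthreads => args.include?('--enable-pthread'),
  }
end

summary = build_summary(RbConfig::CONFIG)
puts "CFLAGS:           #{summary[:cflags]}"
puts "--enable-pthread: #{summary[:pthreads]}"
```

Running this under both the packaged and the source-built interpreter shows whether they differ in more than the optimization flags.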

-- Shot
--
Perl is designed for people who like oblique voodoo. For people
who have no qualms about using rhaphyrographic and tmesis, or for
that matter oblique, in casual conversation. -- Peter da Silva

M. Edward (Ed) Borasky

Aug 16, 2008, 12:04:51 AM
to ruby...@ruby-lang.org
On Sat, 2008-08-16 at 03:39 +0900, Shot (Piotr Szotkowski) wrote:
> kevin nolan:
>
> > control: ruby 1.8.6 (2007-09-24 patchlevel 111) [x86_64-linux]
> > (apt-get install ruby)
> > test: ruby 1.8.6 (2008-08-11 patchlevel 287) [x86_64-linux]
> > (source compiled with '-O3 -mtune=K8 -march=K8')
>
> Are you sure the differences are not because of the --with-pthreads
> flag in Ubuntu’s build? In my case, the difference between Ubuntu’s
> Ruby and `configure; make; make install` 1.8.6.p111 was about 40%
> (without touching -O, -mtune and -march).

Last time I looked, the difference between no optimization whatsoever
and "-O3 -march=<your chip here>" was about 30 percent. But yes,
pthreads makes a big difference. And I think pthreads is mandatory for
1.9.

BTW ... 64-bit compiled is slower than 32-bit compiled on a 64-bit chip,
too ... cache sizes, alignments, and such, I suspect, though I haven't
taken the time to profile it.

Prashant Srinivasan

Aug 16, 2008, 6:18:17 PM
to ruby...@ruby-lang.org
M. Edward (Ed) Borasky wrote:
> On Sat, 2008-08-16 at 03:39 +0900, Shot (Piotr Szotkowski) wrote:
>
>> kevin nolan:
>>
>>
>>> control: ruby 1.8.6 (2007-09-24 patchlevel 111) [x86_64-linux]
>>> (apt-get install ruby)
>>> test: ruby 1.8.6 (2008-08-11 patchlevel 287) [x86_64-linux]
>>> (source compiled with '-O3 -mtune=K8 -march=K8')
>>>
>> Are you sure the differences are not because of the --with-pthreads
>> flag in Ubuntu’s build? In my case, the difference between Ubuntu’s
>> Ruby and `configure; make; make install` 1.8.6.p111 was about 40%
>> (without touching -O, -mtune and -march).
>>
>
> Last time I looked, the difference between no optimization whatsoever
> and "-O3 -march=<your chip here>" was about 30 percent. But yes,
> pthreads makes a big difference.

I've seen huge performance improvements from not using
"--enable-pthreads" too. I initially began to use it after the Ruby
build admonished me for trying to link to the Tk library (which on
Solaris is built with threading support) without using
--enable-pthreads for Ruby. But it turned out to be a bad idea for
performance, since it makes the interpreter invoke a *lot* of
getcontext calls that pull performance down by about 50% in some cases.

I'm not sure why --enable-pthreads uses the implementation based on
the *context calls. The ruby build messages talk about frequent crashes
if a pthreads-based tcl/tk is linked into a non-pthreaded Ruby. I
thought that perhaps extensions that invoked threads would change the
context from beneath the Ruby interpreter and leave it in an
inconsistent state if Ruby didn't store it away first. So I built an
extension that created threads and did some trivial computations (I did
check to make sure that these weren't optimized away by the compiler).
But that didn't cause ruby 1.8.6 to crash (I haven't tried on 1.9).

What advantages are obtained by using --enable-pthreads in Ruby 1.8?
I'm also curious whether someone has gotten Ruby (built without
--enable-pthreads) to work successfully (without crashes) with tcl/tk
libraries that have threading support built in.

thanks,
-ps

Shot (Piotr Szotkowski)

Oct 19, 2008, 8:42:55 PM
to ruby...@ruby-lang.org
M. Edward (Ed) Borasky:

> Last time I looked, the difference between no optimization
> whatsoever and "-O3 -march=<your chip here>" was about 30 percent.

What benchmarks did you use? In my code’s case, the difference between
empty CFLAGS and CFLAGS='-O3 -march=native' is minimal (Athlon 64 X2).

(gcc’s man page says -march implies the same -mtune, and that ‘native’
is intelligently handled to mean whatever arch is the best in my case.)

> BTW ... 64-bit compiled is slower than 32-bit compiled on a 64-bit
> chip, too ... cache sizes, alignments, and such, I suspect, though
> I haven't taken the time to profile it.

That’s interesting. Can I build 32-bit Ruby and use it inside my x86_64
system? If so, how? (Sorry, I’m a total novice when it comes to this.)

-- Shot
--
summer's essence drowns
in gloomy eve of winter
wind blows bricktext in -- Alan J Rosenthal

Michal Suchanek

Oct 20, 2008, 9:11:55 AM
to ruby...@ruby-lang.org, Shot (Piotr Szotkowski)
On 20/10/2008, Shot (Piotr Szotkowski) <sh...@hot.pl> wrote:
> M. Edward (Ed) Borasky:
>
>
> > Last time I looked, the difference between no optimization
> > whatsoever and "-O3 -march=<your chip here>" was about 30 percent.
>
>
> What benchmarks did you use? In my code's case, the difference between
> empty CFLAGS and CFLAGS='-O3 -march=native' is minimal (Athlon 64 X2).

It also depends on your chip. Recent AMD chips tend to have sane
design wrt balance of number of execution units, cache sizes, decoder,
etc. The parts fit well together so the CPU can handle any code
without much trouble.

On the other hand, Pentium4 chips (before Core2, which are sometimes
also called P4 for some reason) were very poorly designed, with a slow
decoder and an imbalanced number of execution units. The compiler can
reorder instructions so that they reach the execution units faster on
this chip and achieve better saturation of the CPU, hence improving
performance considerably.

>
> (gcc's man page says -march implies the same -mtune, and that 'native'
> is intelligently handled to mean whatever arch is the best in my case.)
>
>
> > BTW ... 64-bit compiled is slower than 32-bit compiled on a 64-bit
> > chip, too ... cache sizes, alignments, and such, I suspect, though
> > I haven't taken the time to profile it.
>
>
> That's interesting. Can I build 32-bit Ruby and use it inside my x86_64
> system? If so, how? (Sorry, I'm a total novice when it comes to this.)

You probably do that by passing some parameter to gcc.

Obviously you would need 32bit versions of all the libraries you use
in your extensions.

And you would not be able to use as much memory. The 32bit address
space is very limited (normally only 1-2GB on Linux).
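
The gcc parameter in question is most likely `-m32`, which is a real gcc option on x86_64; the rest of a 32-bit build recipe is not covered here. Once an interpreter is built, a quick sketch for confirming its word size is to measure a packed native pointer (`pointer_bytes` is a hypothetical helper):

```ruby
# Size in bytes of a native pointer as seen by this interpreter:
# 4 for a 32-bit build, 8 for a 64-bit build.
def pointer_bytes
  ['x'].pack('p').bytesize
end

puts "#{pointer_bytes * 8}-bit ruby"
```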

Thanks

Michal
