Benchmark timing

Monty Williams

Feb 18, 2010, 11:40:31 AM

to ruby-bench...@googlegroups.com

I took a look at the timing mechanism (internal) in the Rubinius "tiered" benchmarks. It is simple enough for pretty much any implementation to use and does seem to provide more accurate times. How would others feel about adopting that mechanism for RBS?

I have noticed that RBS benchmarks that only take a few milliseconds have a pretty large timing variance (20%) run to run on the same impl on the same machine. Since I use RBS to track MagLev performance week to week, it's hard to isolate any real effects. Should we up the minimum so that all benchmarks take at least 100 ms to run on even the fastest impl?
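
One way such a minimum could be enforced, sketched under assumptions: keep doubling the iteration count until a single run takes at least the target time. The helper name `iterations_for_minimum`, its parameters, and the toy workload are hypothetical and not part of RBS; `Process.clock_gettime` needs Ruby 2.1 or later.

```ruby
# Hypothetical helper, not part of RBS. The block receives an iteration
# count and must return the elapsed seconds for one run at that count.
def iterations_for_minimum(min_seconds: 0.1, start: 1, limit: 1 << 24)
  iterations = start
  # Double until one run is long enough, or until the safety limit is hit.
  iterations *= 2 while iterations < limit && yield(iterations) < min_seconds
  iterations
end

# Toy workload timed with a monotonic clock (Ruby 2.1+).
measure = lambda do |n|
  t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  n.times { (1..100).reduce(:+) }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
end

count = iterations_for_minimum(min_seconds: 0.1, &measure)
puts "use #{count} iterations per run"
```

The count found this way depends on the implementation and the machine, so it would need to be fixed per benchmark (using the fastest implementation) rather than recomputed on every run if results are to stay comparable.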

-- Monty

----- Forwarded Message -----
From: "Evan Phoenix" <epho...@engineyard.com>
To: "Monty Williams" <monty.w...@gemstone.com>
Cc: "Wayne E. Seguin" <wayne...@gmail.com>, "Peter Mclain" <peter....@gemstone.com>, "Allen Otis" <allen...@gemstone.com>
Sent: Friday, January 8, 2010 1:51:05 PM GMT -08:00 US/Canada Pacific
Subject: Re: RVM and MagLev

A minor clarification. "timeout" is the wrong word to be using. I assume you guys are talking about timing itself, i.e. how long something takes to complete. "timeout" refers to terminating if it runs past a certain upper threshold.

That being said, external timing introduces far more noise than internal timing. The only variable with internal timing is how long it takes to get the current time, which should be far less problematic than trying to calculate VM startup time, differences in how OSes calculate the time of a process, etc.
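
A minimal sketch of internal timing, for illustration only: the clock is read inside the VM immediately before and after the code under test, so VM startup never enters the measurement. The helper name `time_internally` and the workload are hypothetical, and this is not the Rubinius harness code; it assumes Ruby 2.1 or later for `Process.clock_gettime`.

```ruby
# Hypothetical helper, not taken from the Rubinius or RBS harness.
# Returns the elapsed seconds for `iterations` calls of the block.
def time_internally(iterations)
  # A monotonic clock is unaffected by wall-clock adjustments.
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { yield }
  finish = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  finish - start
end

elapsed = time_internally(1_000) { (1..100).reduce(:+) }
puts format("1000 iterations took %.6f s", elapsed)
```

The only overhead left inside the measured window is the two clock reads and the loop itself, which is why short benchmarks still need enough iterations to swamp it.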

- Evan

On Fri, Jan 8, 2010 at 1:29 PM, Monty Williams <monty.w...@gemstone.com> wrote:

Cool! I'll be glad to help test the MagLev part. There will probably be some unforeseen gotchas.

I'll see if I can come up with some magic to factor out or at least measure the overhead on the level 0 tier measurements. Ilya Grigorik said he was concerned about startup times so from my perspective it would be fine to include them even though MagLev's are probably longer.

-- Monty

----- Forwarded Message -----
From: "Wayne E. Seguin" <wayne...@gmail.com>
To: "Monty Williams" <monty.w...@gemstone.com>
Cc: "Wayne E. Seguin" <wayne...@gmail.com>
Sent: Friday, January 8, 2010 1:16:46 PM GMT -08:00 US/Canada Pacific
Subject: Re: RVM and MagLev

Monty,

After some discussion with Evan we have concluded that the internal timeout will always be more accurate than using an external timeout. When using an external timeout you end up including the VM startup times (which we do not want factored in, we want post-startup times). Additionally you will record any other unknown process interactions like process scheduling interruptions. These items are removed and/or minimized by doing it all internal.

Does that make sense?

  ~Wayne

----- Original Message -----
From: "Wayne E. Seguin" <wayne...@gmail.com>
To: "Monty Williams" <monty.w...@gemstone.com>
Cc: "Wayne E. Seguin" <wayne...@gmail.com>
Sent: Friday, January 8, 2010 1:11:21 PM GMT -08:00 US/Canada Pacific
Subject: Re: RVM and MagLev

Thanks for bringing this to my attention, I am talking with Evan about it right now and will get back to you on it.

Additionally I am currently in the middle of adding maglev to rvm :)

  ~Wayne

On Jan 08, 2010, at 15:46 , Monty Williams wrote:

Do the "Tiers" scripts you run use the Rubinius compare.rb? If so they may not be measuring exactly what you think.

Here is my observation:

The old RBS scripts ran in MRI and spawned a separate process which executed the code to be measured.
cmd = "#{timeout} -t #{limit} #{vm} #{runner} #{name} #{iterations} #{report} #{meter_memory}"

However, compare.rb executes the harness code in the system under test, and it includes an additional component beyond the actual code to be benchmarked in the total time reported. It's most significant in the tier 0 tests.

I found this when a 3x difference between MagLev and RBX (rbx being faster) turned into a 10x difference when run using compare.rb.

It seems to me the prior RBS methodology was more accurate, and something similar should be used, or else just use "time ruby benchmark.rb" and let the OS be the leveler by keeping Ruby out of everything except the code under test.

What do you think? Or maybe you've already accounted for this?

-- Monty

----- Original Message -----
From: "Wayne E. Seguin" <wayne...@gmail.com>
To: "Monty Williams" <monty.w...@gemstone.com>
Cc: "Wayne E. Seguin" <wayne...@gmail.com>
Sent: Thursday, January 7, 2010 6:07:37 PM GMT -08:00 US/Canada Pacific
Subject: Re: RVM and MagLev

I wrote a few scripts to generate those benchmarks and results. I am working on refining it and will be publishing it sometime soon. For now I am intending on running the benchmarks every few days and posting them to that site.

We should have MagLev in rvm before the weekend is out I believe.

  ~Wayne

Shri Borde

Feb 18, 2010, 2:16:27 PM

to ruby-bench...@googlegroups.com

What we do on the IronRuby/IronPython team with our legacy internal perf infrastructure is to calibrate the iteration count for all new benchmarks so that the variance is below some low number. This is similar to your proposal of ensuring a minimum execution time, but more directly tied to reducing variance. Some benchmarks could do with less than 100 ms and some might need more than 100 ms to get the same low variance.
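
A rough sketch of that kind of calibration, under stated assumptions: run the benchmark several times at a given iteration count, compute the coefficient of variation (standard deviation divided by mean) of the timings, and double the count until it drops below a threshold. The names `coefficient_of_variation` and `calibrate_iterations`, the 5% threshold, and the doubling strategy are all hypothetical; this is not the IronRuby/IronPython infrastructure.

```ruby
# Hypothetical sketch; names and thresholds are assumptions.
def coefficient_of_variation(samples)
  mean = samples.sum / samples.size.to_f
  return 0.0 if mean.zero?
  variance = samples.sum { |s| (s - mean)**2 } / samples.size
  Math.sqrt(variance) / mean
end

# The block receives an iteration count and returns the elapsed seconds
# for one run. Returns the first count whose run-to-run variation is low
# enough, or `limit` if no count up to the limit qualifies.
def calibrate_iterations(max_cv: 0.05, runs: 5, start: 1, limit: 1 << 20)
  iterations = start
  while iterations <= limit
    times = Array.new(runs) { yield(iterations) }
    return iterations if coefficient_of_variation(times) <= max_cv
    iterations *= 2
  end
  limit
end

# Toy workload timed with a monotonic clock (Ruby 2.1+).
calibrated = calibrate_iterations(limit: 4096) do |n|
  t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  n.times { (1..100).reduce(:+) }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
end
puts "calibrated iteration count: #{calibrated}"
```

Because the stopping rule is the measured variation rather than a fixed duration, different benchmarks can end up with very different run times.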

Btw, we also throw out the extreme end-points as there do seem to be outliers even if the overall variance is low.
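
Discarding the extreme end-points can be sketched as a trimmed mean: sort the timings, drop the lowest and highest few, and average the rest. The helper name `trimmed_mean`, its `drop` parameter, and the sample timings are made up for illustration.

```ruby
# Hypothetical helper; `drop` is how many samples to discard at each end.
def trimmed_mean(samples, drop: 1)
  raise ArgumentError, "not enough samples" if samples.size <= 2 * drop
  kept = samples.sort[drop...(samples.size - drop)]
  kept.sum / kept.size.to_f
end

# Made-up timings in seconds, with one slow and one fast outlier.
times = [0.101, 0.099, 0.100, 0.250, 0.098, 0.102, 0.020]
puts format("plain mean:   %.4f s", times.sum / times.size)
puts format("trimmed mean: %.4f s", trimmed_mean(times))
```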

--
The GitHub project is located at http://github.com/acangiano/ruby-benchmark-suite

Monty Williams

Feb 19, 2010, 7:39:01 PM

to ruby-bench...@googlegroups.com

I understand the numbers pretty well for my internal use.

But since Antonio said he's planning another shootout in March, external users should also be able to extract something accurate and meaningful. Hopefully, we can prevent "Lies, Damn Lies, and Statistics" from being the primary conversation. Variance between implementations where the mean time hovers below 10ms is so down in the noise as to be meaningless. Unfortunately, end users are likely to try to reduce comparisons between implementations to a single number. As Evan pointed out in his RubyConf talk, that's not meaningful either.

Any ideas on what we can do to add value for end users would be great. It might bring more external attention to the RBS group and turn it into something better for us all.

I'd expect the Rubinius team is quite busy. Unless we hear otherwise perhaps we should focus on improving the current RBS. Or maybe migrate some of the current benchmarks to the Rubinius "tiers" harness. Thoughts?

Shri Borde

Feb 21, 2010, 6:07:33 PM

to ruby-bench...@googlegroups.com

RBS has macro-benchmarks for RDoc and Rails. We could add more benchmarks using other real-world libraries, apps or gems. If we all add just a couple of benchmarks each, we can get a good collection. The next shootout could then also request library authors to contribute benchmarks for their gems, which will create an even larger suite of real-world code (though you would want to restrict it to the most popular gems for the suite to be considered relevant).

I may not be able to get to this for a couple of weeks, but can try to add a couple of macro-benchmarks after that.

Roger Pack

Feb 22, 2010, 10:55:30 AM

to ruby-bench...@googlegroups.com

> I may not be able to get to this for a couple of weeks, but can try to add a
> couple of macro-benchmarks after that.

Yeah asking the community for more would be good.

My next thought for a macro benchmark would be a Sinatra benchmark
somehow (since Sinatra seems like a common benchmark and runs on
more Ruby VMs than Rails does).

-rp
