How to fairly measure performance of programs written in different languages


Andrey Mokhov

Mar 18, 2011, 7:14:41 PM
to Tuura project
To make the project attractive for people programming in different
languages, we need a way to compare performance of two programs in a
language-independent way.

Measuring total program runtime (as is usually done) often becomes a
fun-killer for programmers in Java, Haskell, and other 'thoughtful'
languages ;-) And we don't want to force everyone to code in C++.

Note that measuring runtime is perfectly fine for comparing two
programs written in the same language. Any ideas on how to fairly rank
all the programs?

One way to mitigate the problem is to use pre-written wrappers for
input/output and pass the parsed data as parameters to a function,
instead of making a solution do all the parsing itself. This is done
at TopCoder and helps a lot, because I/O takes a large share of the
runtime in some languages.

Rotsor

Mar 19, 2011, 12:19:50 AM
to Tuura project
I think that measuring the total program runtime is okay.

First of all, nobody will use a "slow" language if their only concern
is performance. There are many other metrics, measurable or not, that
attract users to higher-level languages.

Secondly, if reading an input file is a problem, then there is a
problem with the input file. I believe the input files should never be
too large, i.e. a linear-time parser, however slow, should always pass
the time limit.

However, to make the performance competition fun for all languages,
even the really slow ones, it would be great to have an additional
per-language ranking of solutions (e.g. top 10 Haskell solutions).

Now, on the "TopCoder" approach. Yes, it eliminates the common I/O
overhead, but the complexity of implementing correct time measurement
and cheat protection grows considerably!

One problem I can see with (naive implementation of) I/O wrappers in
Haskell: a slow program will often instantly return its output as a
lazy list, thus passing the time limit, but the output printer/
verifier will have a hard time reading this list.

Another, security-related, problem is that under the TopCoder model
the timing decision is made by the process containing the user code.
Thus there is no easy way to prevent the user from forging the timing
information, and this problem would need to be solved separately for
every supported language.

Now, if we look at the TopCoder model from another angle, programmer
convenience, it becomes a great idea! It eliminates the tedious I/O
code in every submission at the cost of defining a simple input format
specification once per problem (which should be useful for the problem
setter himself!) plus a parser generator for each language. If we
forget about "fair time measurement with pre-parsed input", it becomes
even more convenient, because then we can make the generated parsers
optional (like the Google AI Challenge starter packages), and we won't
have to worry about writing a parser generator for every language,
because the users will be able to write their own.

Another advantage I see is that it will greatly simplify the input
verifier! It will only have to check that the numbers/sizes lie within
the proper bounds; all the format checks will be automated.

Summary of what I would like to see:
* Total run time measurement with a single process-runner (as before:
robust, fair, and secure)
* Per-language ranking of solutions in addition to a cross-language one
* A simple TopCoder-style input specification for problems (every
problem, or should we allow exceptions?)
* An optional input parser for every language, generated from the
input specification for the given problem

Andrey Mokhov

Mar 19, 2011, 7:59:42 AM
to Tuura project
> Secondly, if reading an input file is a problem, then there is a
> problem with the input file. I believe the input files should never be
> too large, i.e. linear-time parser, however slow, should always pass
> the time limit.

Well, the problem is that by the time the limit is large enough for a
Haskell program to parse an input of, say, 10000 integers, C++ can
already afford a quadratic solution that beats Haskell's linear one.
This happened on the KRSU server several times, and I, as a problem
setter, could do nothing about it.

And I don't think that an input file with 10000 elements is too large.
It is difficult to distinguish linear and quadratic algorithms with
smaller inputs. (And using some kind of random input generator
significantly complicates the addition of tests by participants.)

> One problem I can see with (naive implementation of) I/O wrappers in
> Haskell: a slow program will often instantly return its output as a
> lazy list, thus passing the time limit, but the output printer/
> verifier will have a hard time reading this list.

Interesting, I haven't thought about that :-)

> Summary of what I would like to see:
> * Total run time measurement with a single process-runner (as before:
> robust, fair, and secure)

By the way, in principle, we can allow the use of parallelism. This is
not the primary target, but it could be interesting for ranking
marathon-style algorithms (I mean TopCoder marathons).

> * Per-language ranking of solutions in addition to a cross-language one

Sure, that goes without saying. Still, if there is just a single
participant coding in Java, it will not be much fun for him to see
himself alone in the Java ranking.

> * A simple TopCoder-style input specification for problems (every
> problem, or should we allow exceptions?)
> * An optional input parser for every language, generated from the
> input specification for the given problem

Yes, these are good points. We can automatically generate bulky but
efficient parsing procedures for every language from formal input
specifications. I can do C++, Ivan -- Java, while you and Ulan can
help with Haskell, I hope :-)

Ulan

Mar 19, 2011, 8:12:41 AM
to Tuura project
A fair way would be to make a separate time limit for each language
and to set the limit relative to the reference solution's running time
in that language (say, about 150%-200%).

Andrey Mokhov

Mar 19, 2011, 8:23:13 AM
to Tuura project
> A fair way would be to make separate time limit for each language and
> to set the limit value relative to the reference solution running time
> in that language (say about 150%-200%).

This puts too much work on the problem setter. He has to write
reference solutions in all languages; moreover, he has to write them
efficiently :-) We thought that could be made optional, i.e. the
author can provide a simple, easy-to-check reference solution which is
slow. It can even be made public. The participants (including the
author) will then have to write optimal solutions under a time limit
set independently of the reference solution.

Ulan

Mar 19, 2011, 8:29:32 AM
to Tuura project
Crazy idea: make the TL dynamic:
- if there is no submission for a language, set a conservative TL,
like 5 sec;
- otherwise, set it relative to the fastest solution (about 200%-300%).

This way, a fast submission is also a "challenge" for previous slow
submissions, i.e. it can break them. :)

Ivan V. Polyakov

Mar 19, 2011, 8:29:40 AM
to tuurap...@googlegroups.com
Just ban C++ and there you have it, everything's fair :D

Andrey Mokhov

Mar 19, 2011, 8:34:26 AM
to Tuura project
Ulan,

Yes, dynamic TL should be fun :-) It's similar to failing other
solutions by adding a new test.

Ivan,

Noooo T_T

Ivan V. Polyakov

Mar 19, 2011, 8:40:36 AM
to tuurap...@googlegroups.com
Dynamic TL is a great idea, but can you apply it during a contest? :)

Ulan Degenbaev

Mar 19, 2011, 8:54:44 AM
to tuurap...@googlegroups.com
Ivan, that shouldn't be a problem if the test-runner processes can connect to the submission db.
This also means that problem statements are no longer static (or we factor out the TL info).

Rotsor

Mar 19, 2011, 12:37:37 PM
to Tuura project
I like the idea of dynamic TL!

However, a percentage of the absolute best time may sometimes be too
rough. Imagine one guy optimising his program in some ugly and crazy
way (like inline assembler), breaking the problem for the majority of
contestants who write normal programs.

A fix may look like max(1, min(5, best*300%)), but ideally it should
be something involving statistics. :)
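Rotsor's clamped formula could be sketched as follows (a sketch only: the 1 s floor, 5 s cap, and 300% factor come from the messages above, while the function name and the `None` convention for "no submissions yet" are my own):

```python
def dynamic_time_limit(best_time_sec):
    """Dynamic TL derived from the fastest accepted solution in a language.

    Implements max(1, min(5, best * 300%)), falling back to the
    conservative 5-second default when no submission exists yet.
    """
    if best_time_sec is None:  # no submission for this language so far
        return 5.0
    return max(1.0, min(5.0, best_time_sec * 3.0))

print(dynamic_time_limit(None))  # no submissions: conservative default
print(dynamic_time_limit(0.2))   # very fast best time: clamped up to 1 s
print(dynamic_time_limit(4.0))   # slow best time: capped at 5 s
```

A fast new submission then simply lowers the value returned here for everyone, which is exactly the "challenge" effect Ulan described.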

> Just ban C++
This won't help Python, will it? :)


Ivan V. Polyakov

Mar 19, 2011, 12:53:31 PM
to tuurap...@googlegroups.com
> This won't help Python, will it? :)

Ban that too! :D

Splash

Mar 20, 2011, 3:52:46 AM
to Tuura project
I have two questions:
1. What will happen to previously accepted submissions if the TL is
reduced?
2. There is a class of problems with a very limited set of input
values and, as a result, a limited set of correct answers. A single
submission with a chain of ifs would then break all other submissions
containing a "fair" solution. How do we prevent this situation?

Splash

Mar 20, 2011, 4:17:41 AM
to Tuura project
As I understand, you want to compare not the implementation but the
efficiency of a solution?
If so, to make the comparison truly fair you'd probably need to parse
a solution into an AST and apply some weights to its nodes. The
weights should take into account the overhead introduced by the target
language of the solution.
This idea is even crazier than Ulan's.

So the best way, I suppose, is to measure the total program runtime
minus the I/O calls.
I think I/O wrappers and pre-parsed inputs are the same in terms of
"fair" measurement.

Splash

Mar 20, 2011, 5:00:18 AM
to Tuura project
Statistically "fair" dynamic TL.

Suppose we have a distribution of run-time measurements for solutions
in one particular language. A histogram of the distribution might look
like this:
t00 (0)
t01 (2) --
t02 (4) ----
t03 (5) -----
t04 (4) ----
t05 (4) ----
t06 (6) ------
t07 (9) ---------
t08 (6) ------
t09 (5) -----
t10 (4) ----
TL (0)
49 solutions in total.

The first peak is at t03. The mean of the measurements is about t06.
It looks OK to set the TL three sigmas above t03; in the given example
that is around t06.
This way, tuned but non-efficient solutions remain accepted.
Of course, we can only estimate the distribution when:
- there are enough correct submissions;
- the author marks test cases with the flag "performance test".

Ulan, could distributions be collected from KRSU server to prove the
concept?
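The 3-sigma rule above might be sketched like this (the 0.1-second histogram buckets, the local-peak test, and the hard cap are my assumptions; only the "first peak + 3 sigma" rule itself comes from the message):

```python
import statistics

def statistical_time_limit(times, hard_cap=5.0):
    """Set the TL three standard deviations above the first histogram peak."""
    counts = {}
    for t in times:  # build a histogram with 0.1-second buckets
        bucket = round(t, 1)
        counts[bucket] = counts.get(bucket, 0) + 1
    buckets = sorted(counts)
    first_peak = buckets[0]
    # the first bucket at least as tall as both of its neighbours
    for i, b in enumerate(buckets):
        left = counts[buckets[i - 1]] if i > 0 else 0
        right = counts[buckets[i + 1]] if i + 1 < len(buckets) else 0
        if counts[b] >= left and counts[b] >= right:
            first_peak = b
            break
    return min(hard_cap, first_peak + 3 * statistics.pstdev(times))
```

With run times clustered around 0.4 s plus a few 0.9 s outliers, the limit lands near 1 s, so tuned but non-optimal solutions stay accepted.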


Ulan Degenbaev

Mar 20, 2011, 9:25:12 AM
to tuurap...@googlegroups.com
> 1. What will happen with previous accepted submissions in case of
> reducing TL?
The status of the previous slow solutions changes from AC to TL.
Now I see the problem with this during a contest (Ivan, did you mean this?):
imagine someone who heavily optimizes his solution and waits until the last minute before submitting.
This submission breaks all previously accepted submissions, and the other contestants have no time to fix theirs.
Maybe freezing the time limits in the last hour could help.

> 2. There is a class of problems with a very limited set of input ...
Time limit does not make sense for such problems, does it? :)
So I guess we should avoid giving such problems.

> If so, to make comparison truly fair you'd probably need to parse a
> solution into an AST and apply some weights to nodes.
If I got you correctly, the halting problem can be reduced to this. %)

>I think I/O Wrappers and pre-parsed inputs are same in terms of "fair"
>measurement.
As Rotsor said, it is not generally secure, because the wrapper is in the same address space as the submission.
We can use wrappers because they are convenient for users, but we cannot rely on their measurements (at least in low-level languages, like C).

>Statistically "fair" dynamic TL.
What I don't quite like about the statistical approach is that it provides an incentive for cheating via multiple accounts:
imagine someone who can't solve the problem fast enough; he registers 100 accounts, submits very slow solutions from them,
and then submits his own slow solution. On the other hand, the guy with the fastest solution can decrease the TL by
registering another 1000 accounts. Now this becomes a "who-can-manipulate-the-TL" contest. :)

I think we should consider the fastest time only, and find some formula like Rotsor suggested.

Andrey Mokhov

Mar 20, 2011, 9:49:35 AM
to Tuura project
Speaking of dynamic TL, we can keep all previously accepted solutions
but demote them to lower 'leagues' or 'tiers' when someone presents a
significant improvement over them. Say, the top league contains
solutions within 300% of the best one, and so on.

Then a participant whose program was demoted due to weak tests can
just add a new test (slowing down the current leader) and get back to
the top without resubmission (and complete retesting) of his code.
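The league demotion could be sketched like this (the function name and league numbering are my own; only the "within 300% of the best" threshold comes from the message):

```python
def assign_leagues(times_by_user, factor=3.0):
    """Assign each solution to a league by its distance from the best time.

    League 0 (the top league) holds solutions within `factor` (300%) of
    the best time; each further league widens the bound by another factor.
    """
    best = min(times_by_user.values())
    leagues = {}
    for user, t in times_by_user.items():
        league, bound = 0, best * factor
        while t > bound:  # demote until the solution fits the league's bound
            league += 1
            bound *= factor
        leagues[user] = league
    return leagues

print(assign_leagues({"a": 1.0, "b": 2.5, "c": 4.0, "d": 10.0}))
# -> {'a': 0, 'b': 0, 'c': 1, 'd': 2}
```

A new fastest time then only shifts league boundaries; no accepted solution is ever thrown out, which is the point of the proposal.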

Arseniy

Mar 20, 2011, 1:21:14 PM
to Tuura project
> > If so, to make comparison truly fair you'd probably need to parse a
> > solution into an AST and apply some weights to nodes.
> If I got you correctly, the halting problem can be reduced to this. %)
Ulan, this is not correct.
Ensuring that a program takes no more than X steps to execute is not
equivalent to the halting problem; it only requires a simple program
simulation that counts the instructions executed.
Implementing a system using this technique would require defining a
cross-language computation model (an abstract machine) and compiling
all the code into this machine's instructions. Such a machine would be
much better than a real one in that the performance measurement would
be exact (the number of instructions executed) and reproducible on any
machine, including the contestants' machines. Additionally, the
security of executing its code would be perfect.

However, this leads to a problem of instruction-set fairness and
instruction weighting (e.g. if we include an instruction doing a beta
reduction, it may give an unfair advantage to Haskell solutions, and
vice versa). The performance of individual compilers can be a problem too.
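To make the point concrete, here is a toy sketch of such an abstract machine (the four-instruction counter machine is entirely invented for illustration): execution is just simulation with an instruction budget, so "time" is an exact, machine-independent count.

```python
def run_with_budget(program, budget):
    """Run a tiny counter-machine program, executing at most `budget`
    instructions; returns (registers, instructions_executed), or
    (None, budget) if the instruction budget was exceeded."""
    regs = {}
    pc = 0
    steps = 0
    while pc < len(program):
        if steps >= budget:
            return None, steps  # "time limit" exceeded, deterministically
        op, *args = program[pc]
        steps += 1
        if op == "SET":            # SET reg value
            regs[args[0]] = args[1]
        elif op == "ADD":          # ADD dst src  (dst += src)
            regs[args[0]] = regs.get(args[0], 0) + regs.get(args[1], 0)
        elif op == "JNZ":          # JNZ reg target: jump if reg is non-zero
            if regs.get(args[0], 0) != 0:
                pc = args[1]
                continue
        elif op == "HALT":
            break
        pc += 1
    return regs, steps

# a tiny program: increment acc three times via a count-down loop
program = [
    ("SET", "n", 3),
    ("SET", "neg", -1),
    ("SET", "acc", 0),
    ("SET", "one", 1),
    ("ADD", "acc", "one"),   # loop body (pc 4)
    ("ADD", "n", "neg"),
    ("JNZ", "n", 4),
    ("HALT",),
]
regs, steps = run_with_budget(program, budget=100)
print(regs["acc"], steps)  # prints "3 14": exact, reproducible on any machine
```

Weighting is then just a matter of charging each opcode a different cost, which is exactly where the fairness question moves.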

Ulan Degenbaev

Mar 20, 2011, 4:23:22 PM
to tuurap...@googlegroups.com
Ah, sure, we don't have to wait till the program finishes. :)

Arseniy Alekseyev

Mar 21, 2011, 9:19:11 PM
to tuurap...@googlegroups.com, Ulan Degenbaev
Ulan makes a good point about statistics!

>... it provides incentive for cheating by making multiple accounts

I think the same argument applies to the program performance ranking.
The initial idea was based on statistics: calculate the sum of run
times over all the tests.
I'll show that it can be easily manipulated.
Suppose we have a solution A working in N*N time, where N is the size
of a test, and a solution B whose execution time is 1000*N.
Obviously, the larger the test, the better it is for solution B and vice versa.

So, the authors of solutions A and B are forced to compete in who
submits more tests favoring their respective solutions.

A possible solution to this problem is to consider only the worst
time of the program over all the passed tests. However, this makes
programs with good average performance worthless in the ranking.

Now, an idea for a possible solution to this can be taken from the
Google Code Jam competition, where the tests are separated into "easy"
and "hard" categories. We too can create several test categories
(light-weight, mid-weight, heavy-weight) and rank the solutions
separately in those categories. We can even calculate an "average
performance" rating, which would be the product of the worst times in
each test category.
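The proposed rating could be sketched like this (the function name is my own and the category times are made-up; the "product of worst times per category" rule, with lower meaning better, is from the message above):

```python
def performance_rating(times_by_category):
    """Multiply the worst (largest) run time from each test category."""
    rating = 1.0
    for times in times_by_category.values():
        rating *= max(times)
    return rating

print(performance_rating({
    "light-weight": [0.1, 0.2],
    "mid-weight": [0.5, 1.0],
    "heavy-weight": [2.0, 4.0],
}))  # 0.2 * 1.0 * 4.0 = 0.8
```

Using the product rather than the sum means no single category can dominate the rating by its absolute scale, which fits the goal of balancing light and heavy tests.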

Andrey Mokhov

Mar 21, 2011, 9:45:13 PM
to Tuura project
That's unexpected :-)

I was also thinking about different test categories in the context of
running GCJ-like tournaments. A problem statement can specify separate
limits on the input data for every category, and a participant can
attempt only some of them. This is beginner-friendly and useful for
educational purposes. Also, easier limits often allow for a much
simpler solution (and sometimes I'm just too bored to implement the
optimal solution if it is difficult and too technical).

This leads to another interesting sort of ranking: length of code,
which may also be difficult to measure fairly :-) Perhaps counting
words may be a good starting point here. Then one could brag that his
solution is the shortest one to code even if it does not pass the
hardest tests. This is probably a topic for a separate thread -- so
if you find the idea interesting, let's start it.
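The word-counting starting point is almost a one-liner (a naive sketch; treating identifier-like tokens as "words" is my assumption):

```python
import re

def code_length_in_words(source):
    """Count identifier-like tokens as a naive 'length of code' metric."""
    return len(re.findall(r"[A-Za-z0-9_]+", source))

print(code_length_in_words("int main() { return 0; }"))  # -> 4
```

Punctuation-heavy golfing would score well under this metric, which is probably why measuring it fairly deserves its own thread.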