Parameters and Scoring for the Final Competition


Scott Sanner

Apr 11, 2011, 6:44:02 AM
to IPPC 2011
Dear IPPC Competitors,

Based on test competition analysis, here are the parameters and scoring changes for the final competition (boolean MDP and POMDP tracks):

(1) Parameters:

- 8 domains
- 10 instances per domain (like the test competition)
- 30 trials per instance
- Horizon = 40 (*all* instances will use the *same* fixed horizon)

(2) Normalized scoring change:

Scores are currently normalized to the interval [0,1].  The lower score used for normalization is currently min(RandomPolicy, NoopPolicy).  This will be changed to max(RandomPolicy, NoopPolicy), the main reason being that a planner should only get a non-zero normalized score if it can beat both trivial policies (see the sketch after point (3) below).  The IPPC web page has been updated to reflect these changes:


(3) Planners will be ranked purely on the average of their normalized scores over all instances... confidence intervals (CIs) will be reported but will not be used to declare ties between competitors.  (For various reasons, we would need to increase the number of instances -- not trials -- drastically in order to get small CIs, and this is simply not possible in a 24-hour competition format.)
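
To make the normalization and ranking concrete, here is a minimal Python sketch; it is only an illustration, and the upper bound of the normalization interval (assumed below to be the best average raw reward achieved on the instance) and all function names are assumptions rather than official competition code:

def normalized_score(planner_reward, random_reward, noop_reward, best_reward):
    # Lower bound is max(random, noop): a planner scores above 0 only if it
    # beats *both* trivial policies, as described in point (2) above.
    lower = max(random_reward, noop_reward)
    if best_reward <= lower:
        return 0.0  # degenerate case: nothing beat the trivial policies
    score = (planner_reward - lower) / (best_reward - lower)
    return min(max(score, 0.0), 1.0)  # clip into [0, 1]

def final_rank_score(planner_rewards, baselines):
    # Point (3): rank planners purely by the average normalized score over all
    # instances.  baselines holds one (random, noop, best) tuple per instance.
    scores = [normalized_score(r, rnd, noop, best)
              for r, (rnd, noop, best) in zip(planner_rewards, baselines)]
    return sum(scores) / len(scores)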

If you have any questions regarding these changes, please post them to this list for public discussion.

Cheers,
Scott & Sungwook

Andrey Kolobov

Apr 12, 2011, 3:07:07 AM
to ippc...@googlegroups.com, Scott Sanner
Hi Scott,

I have 2 questions:

> Scores are currently normalized to the interval [0,1].  The lower score for
> normalization purposes is currently min(RandomPolicy, NoopPolicy).  This
> will be changed to max(RandomPolicy, NoopPolicy) for the main reason that
> planners should only get a non-zero normalized score if they can beat both
> trivial policies.

Will you somehow make sure that neither of these policies is optimal
on the competition problems? On quite a few currently published
problems, the best policy is actually one of these two, and some work
is required on the part of the planner to "realize" that. Giving
planners a score of 0 in such cases wouldn't be entirely fair.

Also, will you count only the last or only the first 30 trials on a
given problem when evaluating performance?

Thanks,


Andrey

Scott Sanner

Apr 12, 2011, 3:35:21 AM
to Andrey Kolobov, ippc...@googlegroups.com
Hi Andrey,

> On quite a few currently published problems, the best 
> policy is actually one of these two

This seems true for 3 of the 10 Game of Life instances, but Game of Life will not be used in the final competition.

Do you believe this holds for other domains / instances?  Which ones?

> will you count only the last or only the first 30 trials on a
> given problem

Last 30... all details for final competition scoring should be here:


Cheers,
Scott

Ashwin NR

Apr 12, 2011, 4:11:28 AM
to ippc...@googlegroups.com
Hi Scott,

Will you be allowing (3) variants to be run?

Thanks
Ashwin NR

Scott Sanner

Apr 12, 2011, 7:26:03 AM
to ippc...@googlegroups.com, Ashwin NR
> Will you be allowing (3) variants to be run?

Good question; let's take a vote:


One vote per team.  Majority vote wins -- voting closes in 48 hours.

Cheers,
Scott

Daniel Bryce

Apr 12, 2011, 12:06:26 PM
to ippc...@googlegroups.com
Will each variant be scored by itself, or are we talking about the max performance across the variants for each instance?

Andrey Kolobov

Apr 12, 2011, 1:45:39 PM
to Scott Sanner, ippc...@googlegroups.com
Hi Scott,


> This seems true for 3 of the 10 Game of Life instances, but Game of Life
> will not be used in the final competition.
> Do you believe this holds for other domains / instances?  Which ones?

I'm not entirely sure if there are more (the largest sysadmin problems
are suspect, but I can't say for certain), and this is exactly the
issue I'm slightly worried about -- it often seems impossible to know
if there is a "nontrivial" policy that beats best{noop, random} unless
the given problem can be solved optimally. In fact, a slight change of
parameters can turn a problem with a nontrivial best policy into one
with a trivial best policy. For large problems, optimal solutions are
clearly infeasible, which means that the trouble with game_of_life
could surface in the new domains you will be introducing. So, I was
wondering whether you have somehow verified, for the problems we (the
competitors) haven't seen yet, that they all have a nontrivial optimal
policy. If this can't be verified, then defining the minimum score to be
max{noop, random} may be dangerous, since one of these two may
actually be the best one.

The other reason why this may be dangerous is that the best policy may
be only slightly better than best{noop, random}, and due to variance
(in turn, due to the small number of rounds) the difference may be
hardly noticeable. Again though, if you managed to verify that this is
not the case for the competition problems, then this is not an issue.

Cheers,


Andrey

Scott Sanner

Apr 12, 2011, 9:09:29 PM
to Andrey Kolobov, ippc...@googlegroups.com
Hi Andrey,

Thanks for the comments; I do understand your concerns.  Obviously we (Tom Walsh, Sungwook, and I) will do our best to produce final competition instances that have non-trivial (i.e., non-random, non-noop) optimal solutions.

Regarding non-trivial policies for the existing domains, here's a quick run-down of what I've seen in the competition for both MDPs and POMDPs:

* Game of Life: no planner did well on the three instances with high randomness, but some MDP planners and one POMDP planner did well on the instances with low randomness.  Again, this domain will not be used in the final competition.

* SysAdmin: the best MDP planners did significantly better than the random/noop policies.  POMDP planners struggled more due to observation aliasing, so I will likely decrease the failure rate in the POMDP versions to compensate, and possibly in the MDP versions as well.

* Elevators: there are known policies here that far outperform the random/noop policies; while both MDP and POMDP planners struggled on some instances, I know there is much room for improvement over the two trivial policies.

* Traffic: all but one MDP planner struggled here; the main problem with traffic is that a planner has to perform an extremely deep lookahead, since the reward for letting a vehicle through a light does not register until the vehicle exits 10-20 steps later in the larger instances.  As a reward-shaping aid to planners, I'm going to switch the final competition reward to this one:

  reward = sum_{?c : cell} -[occupied(?c) ^ exists_{?c2 : cell} (FLOWS-INTO-CELL(?c2, ?c) ^ occupied(?c2))];

This reward penalizes *stopped* traffic (moving traffic always has a one-car gap), so the advantages of good traffic signal control can be seen more immediately.
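
For readers less familiar with RDDL, here is a hypothetical plain-Python reading of the expression above; the cells, occupied, and flows_into structures are assumptions for illustration, not the actual competition representation:

def traffic_reward(cells, occupied, flows_into):
    # -1 for every occupied cell that also has an occupied upstream cell,
    # i.e. a penalty only for stopped (blocked) traffic.
    reward = 0
    for c in cells:
        if occupied[c] and any(occupied[c2] for c2 in cells if flows_into(c2, c)):
            reward -= 1
    return reward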

This reward is currently commented out in the traffic_mdp and traffic_pomdp files... if you want to play with it, just comment out the current reward in traffic_(po)mdp.rddl, uncomment this one, and run one of the following translators (if needed) to regenerate the translations (this may take a little while):

./run rddl.translate.RDDL2Prefix files/test_comp/rddl files/test_comp/rddl_prefix
./run rddl.translate.RDDL2Format files/test_comp/rddl files/test_comp/spudd_sperseus spudd_sperseus
./run rddl.translate.RDDL2Format files/test_comp/rddl files/test_comp/ppddl ppddl

===

Why is min(noop, random) a bad lower bound for normalization?  Because any competitor can just quickly run both policies in simulation, pick the better one, and submit that policy for any instance where it does not have time to compute a better policy.  But this is trivial -- everyone could do this -- and non-zero scores should arise from a genuine attempt to plan.  Of course, to use max(noop, random) as a lower bound, I fully agree that the domains need good non-noop, non-random policies that are statistically separated from max(noop, random).  We're trying our best to ensure that in the final competition.
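
For concreteness, here is a minimal sketch of the trivial strategy described above; the policy and simulator interfaces are hypothetical and not part of the actual competition client/server protocol:

def average_reward(policy, simulate_trial, num_trials=30):
    # Estimate a policy's average total reward over a number of simulated trials.
    return sum(simulate_trial(policy) for _ in range(num_trials)) / num_trials

def better_trivial_policy(random_policy, noop_policy, simulate_trial):
    # Run both trivial policies in simulation and submit whichever does better.
    # Under a min(noop, random) lower bound this already earns a non-zero
    # normalized score in expectation; under max(noop, random) it normalizes to 0.
    if average_reward(random_policy, simulate_trial) >= average_reward(noop_policy, simulate_trial):
        return random_policy
    return noop_policy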

Cheers,
Scott

Scott Sanner

Apr 12, 2011, 9:17:21 PM
to ippc...@googlegroups.com
> Will each variant be scored by itself

Each variant would be treated as a separate planner entry (and listed separately in the results).

ALL COMPETITORS: The current poll indicates some IPPC competitors are against allowing multiple entries per team -- if your team has an opinion, let your vote be known:


Cheers,
Scott
