TLDR: This email proposes adding a new method of Dixonary skill rating to be published as a “leader board” alongside the scorekeeper’s statistics posts every 25 rounds, and solicits volunteers to help evaluate and tune the rating system. If you don’t care about Dixonary statistics, scores, or ratings, feel free to stop reading now.
If you’re still reading, you’re probably one of several players with at least a passing interest in different measures of Dixonary performance. Mike’s statistics posts scratch that itch for many of us, especially those like me who don’t typically score that high and instead seek validation elsewhere, such as being a tricky dealer or maintaining back-to-back streaks of wins.
One long-standing challenge is that most existing metrics reward longevity more than current performance. Total scores accumulate over thousands of rounds, which Efrem illustrated very clearly in a graph after the Round 3575 stats report. The only reason I’m currently in third place overall is that Paul took a long break from the game. I’m sure he’ll overtake me again eventually, but it will take a while!
Average scores improve on total scores, but they have their own limitations. In the early days of the game there were more active players and more points available, so averages are biased toward those early decades and don’t move much over time. Tim Lodge once wondered whether his performance was declining, and Paul responded after Round 3600 with evidence to the contrary, but it’s still difficult for long-term averages to reflect recent performance.
Counting total wins (outright or tied), as Paul also did after Round 3600, addresses the “points available” issue, but it still rewards longevity. In that same post, Paul observed that most games and sports instead use *ratings* to address exactly these problems. I’ll quote him here:
> The statistics that Mike has been carefully maintaining since Round 1000 are now such an expected fixture that it has only lately begun to occur to me that they are quite unlike most of the leader boards you will find for other games or sports.
>
> Chess and Scrabble players are accorded an elaborately computed rating after every match or tournament. Our equivalent to that is (I suppose) the 5-round rolling scores report. But there are other, simpler league tables, such as you might find at your local squash or tennis club, that list wins, draws and losses.
That “elaborately computed rating” is what most people know as an Elo rating. Very roughly, Elo treats your rating as a measure of skill: beating a lower-rated player changes your rating only slightly, while beating a higher-rated player results in a larger gain. Over time, ratings settle into a reasonably stable reflection of current ability.
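To make that concrete, here is a minimal sketch of the textbook two-player Elo update. It is only an illustration of the idea, not the system proposed in this email; the function names, the K-factor of 32, and the example ratings are all assumptions chosen for the demonstration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like score Elo expects player A to earn against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return both players' new ratings after one game.

    score_a is 1.0 if A won, 0.5 for a draw, 0.0 if A lost.
    k (assumed to be 32 here) controls how fast ratings move.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the total rating pool is conserved.
    return rating_a + delta, rating_b - delta


# Hypothetical ratings: an underdog win moves ratings more than a favorite win.
underdog_after, _ = elo_update(1400, 1600, 1.0)
favorite_after, _ = elo_update(1600, 1400, 1.0)
print(f"underdog gains {underdog_after - 1400:.1f} points")
print(f"favorite gains {favorite_after - 1600:.1f} points")
```

With these assumed values, the underdog's win is worth roughly 24 points and the favorite's win roughly 8, which is the "larger gain for beating a higher-rated player" behavior described above.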
Classic Elo works only for two-player games, but there are modern extensions designed for multiplayer games. One of the best-known is Microsoft’s TrueSkill system, used to match Xbox players; an open-source equivalent called OpenSkill makes similar ideas available more broadly.
For Dixonary, this kind of rating has some attractive properties:
- It is based on relative performance against other players, rewarding consistently ranking ahead of (or tying) strong opponents.
- It incorporates uncertainty, rewarding consistent performance rather than a few lucky results.
- It allows uncertainty to grow over time, so new players, or players returning after a long gap (Hi Theresa!), can become competitive in the rankings quickly.
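
The last two properties can be sketched in a few lines. This is a simplified illustration in the style of TrueSkill-like systems, not the actual calculation used here and not the OpenSkill library's API: each player has an estimated skill `mu` and an uncertainty `sigma`, a leader board can rank by a conservative score such as `mu - 3 * sigma`, and `sigma` is allowed to grow while a player is away. The `Rating` class, both function names, and the `tau` value are hypothetical.

```python
import math
from dataclasses import dataclass

DEFAULT_MU = 25.0
DEFAULT_SIGMA = 25.0 / 3.0  # starting uncertainty for a brand-new player


@dataclass
class Rating:
    mu: float = DEFAULT_MU        # estimated skill
    sigma: float = DEFAULT_SIGMA  # uncertainty about that estimate


def leaderboard_score(r: Rating, z: float = 3.0) -> float:
    """Conservative score: a player ranks high only once sigma has shrunk."""
    return r.mu - z * r.sigma


def inflate_uncertainty(r: Rating, rounds_missed: int, tau: float = 0.1) -> Rating:
    """Grow sigma for each round missed (tau is an assumed tuning value).

    sigma is capped at the new-player default, so a long absence makes a
    player at most as uncertain as a newcomer, while mu is left unchanged.
    """
    new_sigma = math.sqrt(r.sigma ** 2 + rounds_missed * tau ** 2)
    return Rating(mu=r.mu, sigma=min(new_sigma, DEFAULT_SIGMA))


# Hypothetical established player who then misses 200 rounds.
regular = Rating(mu=30.0, sigma=2.0)
returning = inflate_uncertainty(regular, rounds_missed=200)
print(leaderboard_score(regular), leaderboard_score(returning))
```

A larger `sigma` lowers the conservative score at first, but it also means each new result moves the rating further, which is why a returning player can climb back quickly.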
I’ve been experimenting with this approach, and it produces some interesting results (including confirming that Tim Lodge is still doing quite well).
One particularly nice feature is that it also supports *separate* ratings for different roles. In our case, that means dealer versus guesser. While we don’t award points for a D0, it’s clearly an accomplishment. Under this system, fooling strong guessers improves your dealer rating: a D2 achieved by fooling less consistent guessers may not be as impressive as a D3 achieved by fooling expert guessers. Similarly, guessers are rewarded more for correctly identifying words dealt by tricky dealers.
Long story short: I’ve already built a system to calculate these ratings, but before proposing anything official I’d like to invite interested players to an off-group email thread to discuss, evaluate, and tune it, as well as to help decide what a 25-round report should look like if we choose to publish one.
If you’d like to take part in that discussion, let me know!