Hi all,
Just joined the group after finding out about it a couple days ago.
Thought I'd share what I have done so far. I think it is quite similar
(and possibly identical) to the PMF method described by Danny but I am
not exactly sure. Forgive me if I am just rambling on about things
everyone here is already aware of.
In any case, I have put the problem in a compact matrix form that
makes it much clearer, to me at least. Also, the problem can be solved
as a sequence of convex subproblems, so it doesn't depend on machine
learning, which I have next to zero experience with.
Problem description:
For each season (e.g. 2006-2007) I have enumerated the teams and
compiled the scores of the games into a matrix S. For example, if team
1 beat team 2 with a score of 82-72 then S12=82 and S21=72. Ideally,
each team would play every other team at least once, but this is
obviously not the case so the matrix S is sparse. Using the method
proposed by George Dahl, I define vectors o and d which correspond to
each team's offensive and defensive ability. The approximation to the
matrix S is then just the outer product od' (for example,
(od')_12 = o1*d2 = S12_est). This is a simple rank-one approximation of the
matrix. If each team played each other at least once then the matrix S
would be dense and the vectors o and d could be found by finding the
SVD of S (see
http://www.stanford.edu/~boyd/ee263/notes/low_rank_approx.pdf).
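To make the dense case concrete, here is a rough NumPy sketch of the SVD route (I am paraphrasing the idea in Python rather than MATLAB; the 4-team score matrix below is made up purely for illustration):

```python
import numpy as np

# Made-up dense score matrix for 4 teams: S[i, j] holds the points
# team i+1 scored against team j+1 (numbers are for illustration only).
S = np.array([[ 0., 82., 75., 90.],
              [72.,  0., 68., 80.],
              [70., 77.,  0., 85.],
              [65., 74., 66.,  0.]])

# By the Eckart-Young theorem, the best rank-one approximation in the
# Frobenius norm keeps only the largest singular value and its vectors.
U, s, Vt = np.linalg.svd(S)
o = U[:, 0] * s[0]   # scaled left singular vector  -> "offensive" vector
d = Vt[0, :]         # right singular vector        -> "defensive" vector

S_est = np.outer(o, d)  # rank-one estimate: (od')_ij = o_i * d_j
```

The Frobenius error of this truncation is exactly the root sum of squares of the discarded singular values, which is why the SVD gives the optimal rank-one fit in the dense case.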
Because this is not the case, we instead define a matrix P that
represents which teams played that season. For example, P12=P21=1 if
teams 1 and 2 played a game. Now the problem stated by George can be
expressed compactly as "minimize ||P.*(od') - S||_F". Here, '.*'
represents the Hadamard product and ||.||_F is the Frobenius norm. In
this form, it is easy to see that, for constant vector o and variable
vector d, this is a convex problem. Also, for constant vector d and
variable vector o this is a convex problem. Therefore, by solving a
series of convex problems, alternating the vector variable between o
and d, the problem converges rapidly in about 5 to 10 steps (see
"Nonnegative Matrix Factorizations" code here
http://cvxr.com/cvx/examples/).
From this point the problem is easily expanded to handle higher rank
approximations.
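For what it's worth, in the rank-one case each alternating subproblem is an ordinary least-squares problem with a closed-form solution, so the scheme can be sketched without CVX at all. Here is a NumPy illustration using the normal-equations updates (not the actual CVX code I used; all numbers are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example (all numbers made up): 8 teams with an exact
# rank-one score structure, and a random mask P of which pairs played.
n = 8
o_true = rng.uniform(60.0, 90.0, n)   # hypothetical offensive strengths
d_true = rng.uniform(0.8, 1.2, n)     # hypothetical defensive factors
S = np.outer(o_true, d_true)          # scores on the observed entries
P = (rng.random((n, n)) < 0.6).astype(float)
np.fill_diagonal(P, 0.0)              # a team never plays itself

# Alternating least squares for  minimize ||P.*(od') - S||_F :
# with o held fixed, each d_j minimizes a scalar quadratic, giving the
# closed-form update below (and symmetrically for o). Each subproblem
# is convex, matching the alternating scheme described above.
o = np.ones(n)
d = np.ones(n)
for _ in range(100):
    d = ((P * S).T @ o) / (P.T @ o**2 + 1e-12)
    o = ((P * S) @ d) / (P @ d**2 + 1e-12)

err = np.linalg.norm(P * (np.outer(o, d) - S), 'fro')
```

The rank-r extension is the same idea with n-by-r factor matrices in place of o and d; each alternating step then becomes a small masked least-squares solve per row instead of a scalar update.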
Some notes:
1) Higher rank approximations seem to overfit the data and make less
accurate predictions than rank one fits. I have only tried rank 5 and
rank 10, though. It is possible rank 2 or so is the sweet spot.
2) My current results seem to agree very closely with those found
here:
http://thepowerrank.com/visual/NCAA_Tournament_Predictions . Any
other prediction websites that people like?
3) Predictions from my code here:
http://dl.dropbox.com/u/6998157/Predictions2012.xlsx
4) If anyone is interested in the MATLAB/CVX code I used I can share
it, but it is very hacked together. The CVX code is only a few lines -
most of the effort is forming the matrix S (which isn't really much
effort).
5) The algorithm correctly predicted the national champion for the
last 6 years. This year it is calling for Ohio St to squeak out a win
over Kentucky. It will be fun to watch and see how that goes. In any
case, I think it is fantastic that a simple rank one approximation can
give such accurate predictions.
Thanks for sharing the starter code, Danny. It made it effortless to
jump into this fun problem. I'm looking forward to next year when I
will know about this ahead of time and be able to put more effort into
it.
- Ryan