Hi,
Out of personal interest I wanted to come up with a better way to schedule match-ups between sports teams so that each team plays opponents of similar strength. The scheduling should take into account the results of the games and the strength of each team's schedule. Ideally a team wouldn't have to play every other team to establish its ranking, since that would create a lot of blow-outs, which wouldn't be fun.
After a bit of research I found that some people use a linear regression in which rating_team1 - rating_team2 = expected_point_spread. A paper about it is here:
http://masseyratings.com/theory/massey97.pdf
The paper gives an example set of games:
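Each row of X is one game (+1 for the first team, -1 for the second) and y holds the point spreads. The ratings r are the least-squares solution of X * r = y, i.e. r = X \ y:
# 1=Beast Squares, 2=Gaussian Eliminators, 3=Likelihood Loggers, 4=Linear Aggressors: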
julia> X = [1 -1 0 0; 0 0 1 -1; 0 -1 0 1; 1 0 0 -1; 0 1 -1 0]
5x4 Array{Int64,2}:
1 -1 0 0
0 0 1 -1
0 -1 0 1
1 0 0 -1
0 1 -1 0
# Point spreads:
julia> y = [4, 0, 7, 2, 1]
5-element Array{Int64,1}:
4
0
7
2
1
julia> X \ y
4-element Array{Float64,1}:
2.375
-2.5
-1.125
1.25
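For reference, here is a minimal sketch of how the same X and y could be built from a plain list of games. The name massey_ratings and the (team_a, team_b, spread) tuple layout are made up for illustration, and the syntax assumes a recent Julia (1.x) rather than the version in the transcript above. Since every row of X sums to zero, X is rank-deficient and ratings are only defined up to an additive constant; the minimum-norm solution used here is the one whose ratings sum to zero.

```julia
using LinearAlgebra

# Hypothetical helper: each game is (team_a, team_b, spread), where spread is
# team_a's score minus team_b's score.
function massey_ratings(games, nteams)
    X = zeros(length(games), nteams)
    y = zeros(length(games))
    for (k, (a, b, spread)) in enumerate(games)
        X[k, a] = 1.0    # first team in the game
        X[k, b] = -1.0   # second team in the game
        y[k] = spread
    end
    # pinv gives the minimum-norm least-squares solution of X * r = y
    return pinv(X) * y
end

# The five games encoded by the X and y shown above
games = [(1, 2, 4), (3, 4, 0), (4, 2, 7), (1, 4, 2), (2, 3, 1)]
r = massey_ratings(games, 4)
println(r)             # ratings for teams 1..4
println(sum(r))        # approximately zero
println(r[1] - r[2])   # expected spread when team 1 plays team 2
```

This should reproduce the ratings from X \ y up to floating-point rounding.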
(It is amazing what Julia can do in 1 line of code!) So if 1 plays 2 the expected point spread is 4.875. But when I use my own dataset and code posted in this gist:
https://gist.github.com/GlenHertz/6360352
I get:
julia> reload("calc_spreads.jl")
julia> standings_by_points
6x11 DataFrame:
Team GP W L T PTS GF GA DIFF PCT Rating
[1,] "F" 15 0 14 1 1 27 77 -50 0.0333333 -1.48166
[2,] "E" 14 5 9 0 10 42 58 -16 0.357143 0.456697
[3,] "D" 16 6 9 1 13 44 69 -25 0.40625 -0.239425
[4,] "C" 15 9 5 1 19 52 34 18 0.633333 0.377248
[5,] "B" 17 11 5 1 23 73 45 28 0.676471 0.309839
[6,] "A" 17 14 3 0 28 85 40 45 0.823529 0.5773
PTS is 2 points for a win, 1 for a tie and 0 for a loss. GF and GA are goals for and against, DIFF = GF - GA, and PCT is the winning percentage (ties count as half a win).
julia> standings_by_rating
6x12 DataFrame:
Team GP W L T OT PTS GF GA DIFF PCT Rating
[1,] "F" 15 0 14 1 0 1 27 77 -50 0.0333333 -1.48166
[2,] "D" 16 6 9 1 0 13 44 69 -25 0.40625 -0.239425
[3,] "B" 17 11 5 1 0 23 73 45 28 0.676471 0.309839
[4,] "C" 15 9 5 1 0 19 52 34 18 0.633333 0.377248
[5,] "E" 14 5 9 0 0 10 42 58 -16 0.357143 0.456697
[6,] "A" 17 14 3 0 0 28 85 40 45 0.823529 0.5773
(As an aside, I feel a bit stupid that I couldn't figure out how to sort the DataFrame in reverse order. DataFrames doesn't follow the usual Julia conventions here and the right call wasn't obvious to me. Is it planned to bring it in line with Julia? I also found the API hard to pick up again after being away from DataFrames for a while; for example, I was looking for "append!" instead of "rbind".)
The ratings seem hard to believe. For example, how can team E (5 wins, 9 losses, DIFF of -16) have a higher rating than team B (11 wins, 5 losses, DIFF of +28)?
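One sanity check I can think of, assuming each game is a single row of X with +1/-1 entries and y is the plain goal differential (my gist may not do exactly that): the normal equations X'X * r = X'y imply that each team's rating equals its average goal differential per game plus the average rating of the opponents it played. For team E that is -16/14 plus the average opponent rating, so a rating of 0.457 would need opponents averaging about 1.60, which is higher than any rating in the table. If that reasoning holds, it would point to a problem in how X or y is built rather than in the method itself. Below is a minimal sketch of the check, run on the paper's example; implied_ratings is a made-up name and the syntax assumes a recent Julia (1.x).

```julia
# Hypothetical check: for a least-squares fit, each rating should equal
# (average spread per game) + (average rating of the opponents faced).
function implied_ratings(games, r)
    n = length(r)
    total_diff = zeros(n)   # summed spread from each team's point of view
    opp_sum = zeros(n)      # summed ratings of the opponents faced
    gp = zeros(Int, n)      # games played
    for (a, b, spread) in games
        total_diff[a] += spread
        total_diff[b] -= spread
        opp_sum[a] += r[b]
        opp_sum[b] += r[a]
        gp[a] += 1
        gp[b] += 1
    end
    return total_diff ./ gp .+ opp_sum ./ gp
end

# Games and ratings from the paper's example: (team_a, team_b, spread)
games = [(1, 2, 4), (3, 4, 0), (4, 2, 7), (1, 4, 2), (2, 3, 1)]
r = [2.375, -2.5, -1.125, 1.25]

implied = implied_ratings(games, r)
println(maximum(abs.(implied .- r)))   # should be close to zero
```

Running the same check on the real game list and the ratings it produces should show a gap for any team whose rows in X or entries in y are inconsistent.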
Aside from explaining the odd results from the regression, what technique do people recommend for this? It looks like the GLM package would be helpful, and some researchers use a Bayesian approach, but I'm not sure how to structure the data so that it is usable with the GLM package. The true dataset is probably not going to be larger than 15 teams and 250 games. Any recommendations for a model and the packages to use would be greatly appreciated!
Thanks again!
Glen