I've decided to break up my study into two parts: this post is an in-depth
explanation of the method I've used to come up with a model to predict
player performance based on draft position; a post to follow will apply
that model to the drafts over the years, to see if we can spot variations
and deviations from the model. This post is going to be math-intensive and
very, very dull -- feel free to skip it. The next post should have a little
more meat.
I've decided to present my numbers in terms of Net Wins Above Replacement
instead of using more traditional box-score stats. Even though nWAR is
slightly esoteric, it can be translated back into the real world simply:
One nWAR = one win added to the team by that player, over and above what
his replacement would've done given the same amount of touches. Got that?
One nWAR = One Win. In this post, I may use nWAR and Win interchangeably.
To give you a feel for what a nWAR is worth, I present here the top 10 for
2003-04, in conjunction with the usual per-game numbers. (GR is "Games
Responsible" -- an estimate of how many team possessions over the season
were used up by that player.)
Player Year GP GR PPG APG RPG nWAR
1 garnett,kevin 2004 82 16.3 24.2 5.0 13.9 13.4
2 duncan,tim 2004 69 12.7 22.3 3.1 12.4 10.3
3 stojakovic,peja 2004 81 13.9 24.2 2.1 6.3 9.9
4 cassell,sam 2004 81 13.8 19.8 7.3 3.3 9.2
5 kirilenko,andri 2004 78 12.6 16.5 3.1 8.1 9.1
6 ming,yao 2004 82 12.2 17.5 1.5 9.0 8.8
7 nowitzki,dirk 2004 77 12.8 21.8 2.7 8.7 8.7
8 billups,chaunce 2004 78 12.3 16.9 5.7 3.5 8.7
9 jefferson,richa 2004 82 13.6 18.5 3.8 5.7 8.6
10 o'neal,jermaine 2004 78 13.9 20.1 2.1 10.0 8.0
Yes, I too was surprised to see Jeff.
Here are the top 10 nWAR seasons of all time (actually, since 1977-78, when
the numbers became available for the first time):
Player Year GP GR PPG APG RPG nWAR
1 jordan,michael 1988 82 17.7 35.0 5.9 5.5 14.4
2 robinson,david 1994 80 17.1 29.8 4.8 10.7 14.0
3 jordan,michael 1989 81 17.3 32.5 8.0 8.0 13.7
4 jordan,michael 1996 82 16.3 30.4 4.3 6.6 13.6
5 garnett,kevin 2004 82 16.3 24.2 5.0 13.9 13.4
6 o'neal,shaq 2000 79 16.3 29.7 3.8 13.6 13.3
7 jordan,michael 1990 82 17.1 33.6 6.3 6.9 13.2
8 jordan,michael 1991 82 15.9 31.5 5.5 6.0 13.2
9 jordan,michael 1987 82 18.7 37.1 4.6 5.2 13.2
10 jordan,michael 1997 82 16.2 29.6 4.3 5.9 12.9
(One thing I'll take from this stat is just how good MJ really was.
Incredible, isn't it?)
So just keep in mind: 1 nWAR over a full season isn't very good -- we're
talking Samaki Walker. 3 nWAR is pretty decent: Joe Barry Carroll, say. 5
nWAR is good second banana territory: Derrick Coleman, say, or Otis
Birdsong. 7 nWAR is getting into the really good category: Brad Daugherty,
Larry Johnson, Alonzo Mourning. 10 is team-leader-having-a-career-year
territory: Ray Allen 2001, Mookie 1997, Terrell Brandon 1996, Kidd 2003.
12 is the type of season only HOFers have: Moses, DRob, Shaq, MJ, Dirk,
Karl, Duncan, Hill, Bird, Barkley. (Yes, Dirk will be a HOFer.) MJ leads
with ten 12-nWAR seasons, Karl next with 7. Kareem, who played with better
teammates for many of his years, had to share the ball too much to rack up
many 10+ nWAR seasons -- which highlights one aspect of nWAR as a
performance metric: it only measures production, not ability. Keep that in
mind.
Got that out of the way. So how much is a draft pick worth? I ran into
problems right away -- what the hell does that question mean? Worth to
whom? To the team that has the pick, obviously, but for how long? A team
can't be expected to hold on to a drafted player forever. What does San
Antonio get out of picking Tim Duncan anyway? They get his services (8-10
nWAR/year), but for how long?
Say a team gets a draft pick's services for 3 years -- the three years
following the year of the pick. Here are the average wins per year a draft
pick contributes to his team.
Level N Mean StDev
1 25 4.607 2.775 (*)
2 25 3.217 1.886 (*)
3 25 4.388 2.738 (*)
4 25 2.366 1.592 (*)
5 25 3.366 2.381 (*)
6 25 2.023 1.801 (*)
7 25 2.030 1.523 (*)
8 25 2.096 2.076 (*)
9 25 2.456 1.985 (*)
10 25 2.255 1.880 (*)
MID 225 1.483 1.499 (*)
LATE 250 0.896 1.337 (*)
2ndRound 556 0.322 0.758 (*)
Pooled StDev = 1.356
What does all that mean? "Level" is the draft pick position: 1-10; "MID" is
a mid-level first-rounder, 11-20; "LATE" is a late first-rounder, 21-29;
and then there are the second-round picks. "N" is the number of picks in my
sample. "Mean" is the average nWAR per year by each pick. "StDev" is
standard deviation, a measure of variation. The variation is also displayed
by the parentheses enclosing the asterisk, like this --> (*). The
wider the parentheses are set apart, the more variation in the wins
produced by the draft picks.
Clearly, then, the top five draft picks produce wins roughly in accordance
with their draft positioning. Draft picks 6-10 are indistinguishable from
each other. The other picks decrease in value, fading to an average of just
over zero wins for second-round picks.
That's for the first 3 years. But what about the first 5 years following
the draft, what can we expect?
Level N Mean StDev
1 23 5.217 2.672 (*)
2 23 3.637 1.983 (*)
3 23 4.693 2.682 (*)
4 23 3.088 1.591 (*)
5 23 3.751 2.502 (*)
6 23 2.132 1.872 (*)
7 23 2.529 1.716 (*)
8 23 2.409 2.112 (*)
9 23 3.044 2.221 (*)
10 23 2.476 2.062 (*)
MID 207 1.701 1.754 (*)
LATE 230 0.991 1.467 (*)
2ndRound 500 0.379 0.866 (*)
Pooled StDev = 1.483
The same pattern. How about 7 seasons?
Level N Mean StDev
1 21 5.556 2.646 (*)
2 21 3.640 1.950 (*)
3 21 4.699 2.786 (*)
4 21 3.262 1.831 (*)
5 21 3.757 2.632 (*)
6 21 2.108 2.126 (*)
7 21 2.667 1.878 (*)
8 21 2.326 2.113 (*)
9 21 2.669 1.957 (*)
10 21 2.240 1.859 (*)
MID 189 1.798 1.939 (*)
LATE 210 0.990 1.496 (*)
2ndRound 442 0.389 0.892 (*)
Pooled StDev = 1.552
These all display the same pattern. What I will do is use the average nWAR
production over the first five seasons following the draft as the measure
under study.
It's clear then that the average draft pick produces wins roughly in
accordance with his draft position. This isn't a huge surprise, but what we
want to know is the amount of variation about the mean -- the amount of
certainty we can have that a draft pick will produce the expected number of
wins. What we need is a model of win production which takes draft position
into account, along with any other factors that may affect his performance.
I'll use a typical multiple regression model for this.
The multiple regression equation takes the form
y = b1*x1 + b2*x2 + ... + bn*xn + c. The b's are the regression
coefficients, representing the amount the dependent variable y changes when
the independent variable changes by 1 unit. The c is the constant, where
the regression line intercepts the y axis, representing the value the
dependent y takes when all the independent variables are 0.
In English: the model I've used will predict a player's nWAR by using the
following equation: nWAR = b1*x1 + b2*x2 + ... + bn*xn + c, where the x's
are factors used to predict player performance (draft position, season,
height, etc.) and the b's are coefficients used to weight the x-factors
properly (because one inch of height has less effect than one draft
position). C is just a constant added to the equation to make it look nice.
These are the factors I used to try to predict player performance:
Year -- that is, the season the player was drafted
AGE -- player age at draft
Ht -- height
OvAlPk -- overall pick in the draft
Teams -- # of teams in the league
AllPs -- total # of picks taken in the draft
HHI5 -- a measure of team equality, averaged over the 5 years
HHI5_1 -- team equality from the season before
W_L5 -- team win/loss record
W_L5_1 -- team win/loss record from the season before
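A rough sketch of what an HHI-style equality measure could look like -- I'm
assuming here that it's the standard Herfindahl-Hirschman concentration
index computed on league win totals, which the list above doesn't spell
out, so treat the construction as a guess:

```python
def hhi(team_wins):
    # Sum of squared win shares across the league -- higher values mean
    # wins are concentrated in fewer teams (less equality).
    total = sum(team_wins)
    return sum((w / total) ** 2 for w in team_wins)

# A perfectly equal 30-team league gives exactly 1/30; a lopsided
# league (ten 60-win teams, twenty 21-win teams) gives a higher value:
print(hhi([41] * 30))
print(hhi([60] * 10 + [21] * 20))
```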
Additionally, I've included squared and cubed versions of each of these
variables in the regression, denoted by "^2" and "^3" respectively -- e.g.
AGE^2 = the player's age, squared. The reason for this is that some
variables' effect is nonlinear (for example, the effect of rest on team
performance is nonlinear: 1 day's rest is twice as good as 0 days, but 2
days' rest is four times as good as 1). Squared and cubed terms sometimes
capture the nonlinearity.
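To make the setup concrete, here's a minimal sketch of fitting this kind of
polynomial multiple regression with numpy. The data are invented and only
two of the ten factors are included, so this shows the mechanics rather
than reproducing the actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
pick = rng.integers(1, 59, n).astype(float)   # overall draft pick (made up)
height = rng.normal(79, 3, n)                 # height in inches (made up)

# Made-up "true" relationship plus noise, just to have something to fit:
y = (5.0 - 0.19 * pick + 0.0023 * pick ** 2 + 0.01 * height
     + rng.normal(0, 1, n))

# Design matrix: constant term plus linear/squared/cubed versions of each
# factor, mirroring the "^2" and "^3" variables described above.
X = np.column_stack([
    np.ones(n),
    pick, pick ** 2, pick ** 3,
    height, height ** 2, height ** 3,
])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares
pred = X @ coefs                               # fitted predictions
```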
Okay, that's done. The regression equation is
1st5 = - 421363 + 644*Year - 0.33*Year^2 + 0.000056*Year^3
+ 0.070*AGE - 0.00455*AGE^2 + 0.000040*AGE^3
- 5.61*Ht + 0.0732*Ht^2 - 0.000318*Ht^3
- 0.290*OvAlPk + 0.00719*OvAlPk^2 - 0.000060*OvAlPk^3
+ 8.6*Teams - 0.299*Teams^2 + 0.0034*Teams^3
+ 0.0708*AllPs - 0.000485*AllPs^2 + 0.000001*AllPs^3
- 139*HHI5 + 570*HHI5^2 - 732*HHI5^3
- 23.6*HHI5_1 + 64.3*HHI5_1^2 - 52*HHI5_1^3
+ 22.1*W_L5 - 66.6*W_L5^2 + 70.5*W_L5^3
- 15.6*W_L5_1 + 43.2*W_L5_1^2 - 41.6*W_L5_1^3
The next step is to remove the variables which are not statistically
significant. All the variables and their coefficients are listed in the
following table:
Predictor Coef SE Coef T P
Constant -421363 3258105 -0.13 0.897
Year 644 4917 0.13 0.896
Year^2 -0.328 2.473 -0.13 0.895
Year^3 0.0000556 0.0004147 0.13 0.893
AGE 0.0703 0.1599 0.44 0.660
AGE^2 -0.004551 0.004870 -0.93 0.350
AGE^3 0.00003986 0.00003607 1.10 0.269
Ht -5.609 5.471 -1.03 0.306
Ht^2 0.07317 0.06984 1.05 0.295
Ht^3 -0.0003185 0.0002969 -1.07 0.284
OvAlPk -0.28996 0.02819 -10.29 0.000
OvAlPk^2 0.007192 0.001164 6.18 0.000
OvAlPk^3 -0.00005979 0.00001380 -4.33 0.000
Teams 8.63 19.89 0.43 0.664
Teams^2 -0.2992 0.7769 -0.39 0.700
Teams^3 0.00342 0.01007 0.34 0.734
AllPs 0.07076 0.06993 1.01 0.312
AllPs^2 -0.0004846 0.0005099 -0.95 0.342
AllPs^3 0.00000101 0.00000112 0.90 0.366
HHI5 -139.17 44.58 -3.12 0.002
HHI5^2 569.5 184.0 3.10 0.002
HHI5^3 -732.1 240.2 -3.05 0.002
HHI5_1 -23.62 25.07 -0.94 0.346
HHI5_1^2 64.25 91.30 0.70 0.482
HHI5_1^3 -51.9 100.5 -0.52 0.606
W_L5 22.07 36.03 0.61 0.540
W_L5^2 -66.62 75.52 -0.88 0.378
W_L5^3 70.53 51.85 1.36 0.174
W_L5_1 -15.62 34.40 -0.45 0.650
W_L5_1^2 43.20 73.26 0.59 0.556
W_L5_1^3 -41.59 51.30 -0.81 0.418
S = 1.423 R-Sq = 47.5% R-Sq(adj) = 46.0% <-- that reflects a
pretty good fit!
The column labeled "P" shows the statistical significance. We are looking
for variables which have low p-values, below 0.05. Once I remove the
variables that aren't significant, we end up with this:
1st5 = 2.69
+ 0.00668*Ht
- 0.189*OvAlPk + 0.00228*OvAlPk^2
- 15.5*HHI5 + 32.7*HHI5^2
+ 8.12*W_L5
- 2.45*W_L5_1
Predictor Coef SE Coef T P
Constant 2.6878 0.8788 3.06 0.002
Ht 0.006684 0.001792 3.73 0.000
OvAlPk -0.18878 0.01091 -17.31 0.000
OvAlPk^2 0.0022751 0.0001927 11.80 0.000
HHI5 -15.514 6.740 -2.30 0.022
HHI5^2 32.69 14.04 2.33 0.020
W_L5 8.116 1.098 7.39 0.000
W_L5_1 -2.449 1.131 -2.16 0.031
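Written out as a function, the reduced equation looks like this. The
coefficients come from the table above, but the input units (height in
inches, win/loss as a fraction, HHI on a 0-1 scale) are my assumptions, so
the example values are illustrative only:

```python
def predicted_nwar(ht, pick, hhi5, w_l5, w_l5_prev):
    # Reduced regression equation from above. Units of the inputs are
    # assumed, not confirmed by the post.
    return (2.69
            + 0.00668 * ht
            - 0.189 * pick + 0.00228 * pick ** 2
            - 15.5 * hhi5 + 32.7 * hhi5 ** 2
            + 8.12 * w_l5
            - 2.45 * w_l5_prev)

# A hypothetical #1 overall pick, 83 inches tall, drafted by a .500 team:
print(round(predicted_nwar(83, 1, 0.04, 0.5, 0.5), 2))
```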
So how does that work? Take a look at the picks from the '97 draft, for
example: I'll show picks at intervals of 5, beginning with the #1 pick.
actual predicted
Name Pick nWAR nWAR Error
Tim Duncan 1 11.2 5.5 +5.7
Ron Mercer 6 0.8 2.5 -1.7
Tariq Abdul-Wahad 11 0.2 2.1 -1.9
Brevin Knight 16 2.9 1.1 +1.8
Anthony Parker 21 0.1 1.4 -1.3
Charles C. Smith 26 0.0 0.3 -0.3
Charles O'Bannon 31 0.1 0.6 -0.5
James Collins 36 0.0 0.0 -0.1
Jason Lawson 41 0.0 0.4 -0.4
Eric Washington 46 0.0 0.3 -0.3
DeJuan Wheat 51 0.0 0.2 -0.2
Nate Erdmann 56 0.0 0.8 -0.8
Except for Duncan, our regression equation does a pretty good job of
predicting the number of wins these players will contribute. In fact, if we
look at all the picks from every season, we'll see that about two-thirds of
the predictions are off by less than 2 nWAR. We'll call this (2 nWAR) the
Error term of the equation -- the amount of uncertainty inherent in the
equation.
In my next post I will apply this model to drafts over the years to see if
there are consistent deviations from the model.


 best,  Sticking it to 
 ed  The Man since 1971 

Watch the spam trap -- the domain is rogers
So the million dollar question is: if you took the Clippers out of this
equation....how would it change the numbers? Really. They have wasted a
LOT of decent picks.
Big Chris
<snip>
Excellent. Really.
Just curious: how did you calculate this? I presume some software package,
it seems to me that with all the variables, even with software, that would
take forever to compute. That's really really cool though, I can see how
that would come in useful in the general "how much does X correlate with Y"
type questions.
<snip>
> The column labeled "P" shows the statistical significance. We are looking
> for variables which have low p-values, below 0.05. Once I remove the
> variables that aren't significant, we end up with this:
>
> 1st5 = 2.69
> + 0.00668*Ht
> - 0.189*OvAlPk + 0.00228*OvAlPk^2
> - 15.5*HHI5 + 32.7*HHI5^2
> + 8.12*W_L5
> - 2.45*W_L5_1
Interesting how age isn't in there.
That's really pretty damn good. But, you need to apply the model for a
longer period of time to see how good it really is. It would be
interesting to see how relevant the increase in high school players drafted
really is (my guess: not that much) -- just look at the average nWAR over
the first X years prior to, say, the draft with KG, and after. It'd also be
interesting to see, in general, if high school players end up being better
than college players on average (for whatever definition of "average").
> In my next post I will apply this model to drafts over the years to see if
> there are consistent deviations from the model.
Well, there you go.
Also, unrelated, but I remember you mentioning something (how's that for
vague?) which analyzed game logs to pull out interesting stats -- what was
that? I was thinking of making something like that in my "free time." I
was also thinking of yanking the box score stats from every game and making
them freely available in an RSS feed so other people could parse them
easily (or, more accurately, so I could mess with them later in the year).
Anyway, that's really interesting stuff. You should have one of those "blog
thingys" man.

Ron Coscorrosa
http://coscorrosa.com
On Sun, 10 Oct 2004 01:03:40 0500, "Big Chris" <mr...@yahoo.com> wrote in
<2ss1ltF...@uniberlin.de>:
The Clippers are as good a place to start as any. Between 1977 and 1999
(the years used in my sample), they have had 27 1st round picks. They break
down like this (each "X" represents one pick):
Pick #of picks
---------------
1-3 XXXXXX
4-6 XXXX
7-9 XXXXXX
10-12
13-15 XXXXX
16-18 X
19-21 X
22-24 XX
25-27 XX
How good have the Clippers picks been? Have they underachieved or surpassed
expectations? Take a look at the following table, containing every 1st
round Clipper pick:
Actual Expected
TEAM Year Round Pick PLAYER nWAR nWAR DIFF SIGNIFICANCE
SDC 1980 1 9 Mike Brooks +2.2 +3.1 -0.9 -
SDC 1981 1 8 Tom Chambers +3.3 +1.8 +1.5 ++
SDC 1982 1 2 Terry Cummings +6.9 +4.3 +2.6 +++
SDC 1983 1 4 Byron Scott +5.3 +4.8 +0.5 +
LAC 1984 1 8 Lancaster Gordon -0.2 +1.7 -1.9 --
LAC 1984 1 14 Michael Cage +4.1 +1.0 +3.1 ++++
LAC 1985 1 3 Benoit Benjamin +2.4 +2.4 -0.1
LAC 1987 1 4 Reggie Williams +1.2 +2.3 -1.1 -
LAC 1987 1 13 Joe Wolf -0.2 +0.8 -1.1 -
LAC 1987 1 19 Ken Norman +2.0 +0.7 +1.3 +
LAC 1988 1 1 Danny Manning +4.5 +3.7 +0.7 +
LAC 1988 1 6 Hersey Hawkins +5.8 +3.0 +2.8 +++
LAC 1989 1 2 Danny Ferry +1.0 +4.1 -3.1 ----
LAC 1990 1 8 Bo Kimble -0.0 +3.2 -3.2 ----
LAC 1990 1 13 Loy Vaught +3.2 +1.6 +1.7 ++
LAC 1991 1 22 LeRon Ellis +0.3 +1.5 -1.2 -
LAC 1992 1 16 Randy Woods +0.0 +1.0 -1.0 -
LAC 1992 1 25 Elmore Spencer -0.1 +0.6 -0.6 -
LAC 1993 1 13 Terry Dehere +0.9 +1.1 -0.2
LAC 1994 1 7 Lamond Murray +0.8 +1.6 -0.8 -
LAC 1994 1 25 Greg Minor +1.2 +0.1 +1.1 +
LAC 1995 1 2 Antonio McDyess +4.0 +3.6 +0.4
LAC 1996 1 7 Lorenzen Wright +2.1 +1.6 +0.5 +
LAC 1997 1 14 Maurice Taylor +0.4 +0.9 -0.5 -
LAC 1998 1 1 Michael Olowokandi -0.4 +3.0 -3.3 ----
LAC 1998 1 22 Brian Skinner +1.2 +0.3 +0.9 +
LAC 1999 1 4 Lamar Odom +2.8 +3.0 -0.3
The DIFF column is the difference between actual wins and expected wins --
a positive result denotes a player who exceeded expectations. The
SIGNIFICANCE column shows how many ERRORs away from expectations that
player's performance was. For example, Mike Brooks averaged 2.2 wins over
his first 5 seasons. Based on his draft position and other factors, he was
expected to average 3.1 wins, for a DIFFerence of -0.9 -- about one ERROR
(0.87) below expectations, shown here as "-". Tom Chambers averaged 3.3
wins, but was expected to average only 1.8. He exceeded expectations by
+1.5 wins, which is about two ERRORs over, shown here as "++".
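That scheme can be reproduced, approximately, by dividing DIFF by the 0.87
ERROR term and rounding to the nearest whole number of ERRORs -- this is my
reading of the column, not necessarily how it was actually computed:

```python
def significance_marks(actual, expected, error=0.87):
    # Number of ERRORs away from expectations, rounded to the nearest
    # whole ERROR; positive counts become "+" marks, negative "-" marks.
    n = round((actual - expected) / error)
    return "+" * n if n > 0 else "-" * (-n)

print(significance_marks(3.3, 1.8))   # Tom Chambers
print(significance_marks(2.2, 3.1))   # Mike Brooks
```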
Now we can look at the Clippers' picks in terms of disappointments and
pleasant surprises. Those players with "-" in the SIGNIFICANCE column are
the disappointments and those with "+" are the pleasant surprises. Those
with nothing in that column are those who performed exactly to
expectations. The following graph shows how many ERRORs the Clippers picks
deviated from expectations.
-4 XXX
-3
-2 X
-1 XXXXXXXX
 0 XXXX
+1 XXXXXX
+2 XX
+3 XX
+4 X
Four picks (15%) exactly met expectations, and eleven more (41%) exceeded
their expected win totals. Twelve picks (44%) were disappointments. That
seems to me like a pretty average draft record.
Let's compare that to the Sonics' draft record:
Actual Expected
TEAM Year Round Pick PLAYER nWAR nWAR DIFF SIGNIFICANCE
SEA 1977 1 8 Jack Sikma +7.1 +3.4 +3.7 ++++
SEA 1979 1 6 James Bailey +1.3 +2.6 -1.4 --
SEA 1979 1 7 Vinnie Johnson +3.1 +3.2 -0.1
SEA 1980 1 20 Bill Hanzlik +1.6 +1.5 +0.0
SEA 1981 1 5 Danny Vranes +1.9 +3.2 -1.3 --
SEA 1983 1 16 Jon Sundvold +0.8 +1.1 -0.3
SEA 1985 1 4 Xavier McDaniel +4.0 +3.6 +0.5 +
SEA 1987 1 5 Scottie Pippen +5.5 +4.5 +1.0 +
SEA 1987 1 9 Derrick McKey +4.5 +3.0 +1.5 ++
SEA 1988 1 15 Gary Grant +0.6 +1.5 -0.9 -
SEA 1989 1 16 Dana Barros +2.4 +1.8 +0.6 +
SEA 1989 1 17 Shawn Kemp +5.8 +2.4 +3.4 ++++
SEA 1990 1 2 Gary Payton +5.7 +4.8 +0.9 +
SEA 1991 1 14 Rich King +0.0 +3.0 -3.0 ---
SEA 1992 1 17 Doug Christie +1.2 +1.4 -0.1
SEA 1993 1 23 Ervin Johnson +2.7 +1.8 +1.0 +
SEA 1994 1 11 Carlos Rogers +0.9 +1.7 -0.9 -
SEA 1995 1 26 Sherell Ford +0.1 +1.2 -1.2 -
SEA 1997 1 23 Bobby Jackson +1.3 +1.3 +0.1
SEA 1998 1 27 Vladimir Stepania +0.6 +0.0 +0.6 +
SEA 1999 1 13 Corey Maggette +3.2 +1.6 +1.7 ++
Fifteen of Seattle's picks (56%) met or exceeded expectations, about the
same as the Clippers.
-4
-3 X
-2 XX
-1 XXX
+0 XXXXX
+1 XXXXXX
+2 XX
+3
+4 XX
My next post will explore the deviations from expectations over time, which
I believe was the original topic under discussion.
>Fifteen of Seattle's picks (56%) met or exceeded expectations, about the
>same as the Clippers.
That should be 71%, way better than the Clippers' 56%.
I see this type of application as interesting, in that it is a measurable
way to verify if your management team is doing a good job over time. One
could plot the Jerry West Lakers years to see if he had a big impact, or if
it was only partly him, and other parts xxx.
Big Chris
But, in the Sonics sample, it covers how many GMs, and I think 3 owners. To
isolate the draft in terms of management, it would need to be broken down
by GM and owner. You could actually add in a variable for coach and see
which coaches have impacted the draft for the Sonics.
>
> Big Chris
>
>
Before we do that, I want to show how well 1st round players met
expectations over the years. If we subtract Expected nWAR from Actual nWAR,
and divide the difference by 0.87, we get a measure of how much that player
under- or overachieved in terms of nWAR ERRORs (the ERROR term was
described in the previous post). The graph below shows the standard
deviation of nWAR ERRORs for each season between 1977 and 1999. A standard
deviation is a statistical measure of variability -- the more variation in
a sample, the higher the standard deviation. If my nWAR Expectation measure
could perfectly predict win production, the standard deviations would be
zero. However, if players' production were utterly uncorrelated with draft
position, and draft day were ultimately a crap shoot, the standard
deviations would be enormous. Of course, reality is somewhere between the
two extremes.
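As a quick refresher, the standard deviation for a single draft class would
be computed like this (the ERROR values below are invented for
illustration):

```python
import statistics

# Hypothetical nWAR ERRORs for one draft class:
class_errors = [2.1, -3.0, 0.5, 1.8, -2.4]
print(statistics.stdev(class_errors))   # sample standard deviation
```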
[ASCII scatter plot: standard deviation of nWAR ERRORs (y-axis, 0.0 to
3.0+) by draft year (x-axis, 1980-2000). Most seasons cluster around 2.0;
the 1984 class peaks above 3.0.]
No real pattern emerges from this measure. The highest amount of deviation
from expectations came from the draft class of '84, the second highest in
'85 and '99. The draft classes that came closest to meeting expectations
were '88, '92, and '96.
But this isn't the only way of looking at this. The plot above looked at
all deviations from the expectations of my model, the better-than-expected
and the worse. But imagine if we were only interested in avoiding wasting a
draft pick. We'd want to know what percentage of draft picks underachieve
their expectations, how many become busts.
Let us define "bust" as a 1st round pick who averages 2 fewer wins over his
first 5 seasons than our regression model predicts for a player of his
draft position. Here, then, is the percentage of picks per season who
became busts:
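Under that definition, classifying a pick is a one-liner. The 2-win
threshold comes from the definition above; the example numbers are from the
Clippers table upthread:

```python
def is_bust(actual_nwar, expected_nwar, threshold=2.0):
    # A pick who falls `threshold` or more wins short of expectations.
    return expected_nwar - actual_nwar >= threshold

print(is_bust(-0.4, 3.0))   # Michael Olowokandi -> True
print(is_bust(3.3, 1.8))    # Tom Chambers -> False
```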


[ASCII scatter plot: Bust% (y-axis, 0% to 30%) by draft year (x-axis,
1980-2000). The points drift slightly downward over time, with a spike for
the '99 class.]
Although the data are pretty noisy, one can see a definite trend: the
percentage of 1st round picks who become busts has decreased slightly over
time, although the '99 draft class (the latest one in my sample) has gone
back to the high pre-'90s levels. Overall, about 20% of all picks in the
70s and 80s became busts. That number dropped to 13% in the 90s.
Now imagine that we are only interested in "steals," in picks that vastly
exceed their expectations. Someone in this position may take the "lottery"
picture of the draft fairly literally, and see that most picks never amount
to much. This person would wonder how many times the winning ticket has
come up.
[ASCII scatter plot: Steal% (y-axis, 0% to 30%+) by draft year (x-axis,
1980-2000). The points trend downward over time, with the highest values
before 1985.]
These data are even noisier than the bust data, but a similar trend, I
think, is apparent: getting a steal in the draft was much likelier in the
past than it has become -- even if '99 harkened back to pre-1985 levels. In
the 70s, 27% of all picks became steals. That number dropped to 19% in the
80s, and dropped further to 17% in the 90s.
More to come. In a future post I will look at variations within the first
round picks, and also include some analysis of second-round picks and
non-drafted players.
[...]
> These data are even noisier than the bust data, but a similar trend, I
> think, is apparent: getting a steal in the draft was much likelier in the
> past than it has become -- even if '99 harkened back to pre-1985 levels.
> In the 70s, 27% of all picks became steals. That number dropped to 19% in
> the 80s, and dropped further to 17% in the 90s.
Sorry that I have to rely on you all my math for me  I was learning about
postmodern literary theory when people with futures were taking stat courses 
but how does bustiness in a draft correlate to boominess? Eyeballing the
graphs, it looks like some, which would make sense (if the talent pool remains
relatively constant, a bad player getting drafted earlier means that a good
player will be available to be drafted later), but that's just eyeballing, and
it's eyeballing with a preconceived notion, to boot.
I can clarify, if that didn't make sense.

Jeremey
> I can clarify, if that didn't make sense.
Rely on you to do all my math for me. Goddammit.

Jeremey
The correlation is weak, and statistically insignificant (the latter due to
small sample sizes, likely). Below I plot the Bust% (on the left axis)
against the Steal% (on the bottom axis). A perfect correlation would be
displayed as a straight diagonal line from the bottom left to top right.
Zero correlation would look something like a ball of marks centered in the
plot. You can see that the actual data shows little relationship between
Bust% and Steal%.
[ASCII scatter plot: Bust% (y-axis, 0.00 to 0.20+) against Steal% (x-axis,
0.00 to 0.30). The points form a loose cloud with no clear diagonal trend.]
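The correlation itself is a one-call check -- here with made-up per-season
percentages standing in for the real Bust% and Steal% series:

```python
import numpy as np

# Invented per-season Bust% and Steal% values, for illustration only:
bust_pct = np.array([0.22, 0.25, 0.18, 0.20, 0.15, 0.13, 0.24, 0.12])
steal_pct = np.array([0.30, 0.27, 0.20, 0.14, 0.19, 0.17, 0.35, 0.12])

# Pearson correlation coefficient between the two series:
r = np.corrcoef(bust_pct, steal_pct)[0, 1]
print(r)   # near 0 means weak, near +/-1 means strong
```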
However, I think there is a relationship, but the data are much too noisy
to register it. I attempted to correct for this by lumping steals and busts
together, labeling them all "deviations from expectations," and showing the
results in the Standard Deviation graph upthread. What I was trying to
capture is the "chanciness" of the draft, and how it has slightly declined
over time.
Have you examined this on a position-by-position basis at all? Now that I
think of it, that has a double (and doubly interesting) meaning. By
position in draft (does #4 more consistently perform at or above standard
than #3, for instance) as well as do SGs regularly exceed expectations
where PGs underperform? I assume this could be extrapolated anyhow, though
perhaps it is not of interest to anyone else. I was just thinking how "the
best available player" drafting theory might be affirmed or cut down
through this, if say SGs were shown to consistently outperform all other
positions, and on the draft board you were picking between an SG and a PF
with all other things being equal (not that they ever are).
Well, if nothing else, I'm glad you're smart enough to pull this all
together and make a cohesive presentation of it all.
Big Chris
Thanks Chris. I'll be looking at your question about draft position in the
next couple of days. My guess is that there is a draft position effect,
i.e. that some draft positions deviate more from expectations than others.
The other question WRT floor position is a little more difficult, but worth
studying. I attempted to include it in a half-assed way in my original
regression equation: I included a "height" variable, which turned out to be
statistically significant in predicting production. Height is, of course,
strongly correlated with position.
<talking to myself>
See if you can remove the height variable, and rerun the regression. Check
the error against the original error. If not substantially different, group
positions and check for trends. A bust/steal plot would probably be a good
place to start.
I'm looking forward to your findings. Certainly more interesting than
preseason games.
Big Chris
Damn right!
Cheers,
Chris Hafner
<snip>
Both this and your followups are absolutely brilliant, Ed. This is the kind
of thing only you can add to a newsgroup.
Great work. Whatever you do for a living, I'm convinced you need to stop
that immediately and find a way to make your aptitude for statistics work
for you (if it doesn't already).
Cheers,
Chris Hafner
Keep in mind that the draft classes Ed has studied end at 1999 -- the major
influx of high school players that was at issue in our conversation
happened after that, which means that his findings don't necessarily
completely address our conversation, especially since my point was that
more busts *higher* in the first round push better players lower, not that
there are more busts in the whole first round (which is what Ed's data
show).
> [snip]
>
> > These data are even noisier than the bust data, but a similar trend, I
> > think, is apparent: getting a steal in the draft was much likelier in
> > the past than it has become -- even if '99 harkened back to pre-1985
> > levels. In the 70s, 27% of all picks became steals. That number dropped
> > to 19% in the 80s, and dropped further to 17% in the 90s.
>
> More steals -- check.
He's saying that there are fewer steals, right?
"... getting a steal in the draft was much likelier in the past than it has
become ..."
And since we both agreed there were more steals now (though we disagreed on
the reasons), I guess we're both wrong here.
:(
We both have an out here again, though, because again the years most
fiercely under debate are the ones with the highest percentage of
high-school players, which are the ones not included in the study (for
legit reasons).
If we assume that either one of us is right, perhaps the effect was weak
enough up to 1999 (because the draft hadn't changed as dramatically yet)
that increased scouting sophistication takes some of the uncertainty out?
> See Chris, Igor agrees with me! As does his calculator!
Ed's calculator is nonsentient. I'm hoping so, anyway.
Cheers,
Chris Hafner
Hey! That's more like what I would've expected to see.
What's the reasoning behind the smoothing process?
Cheers,
Chris Hafner
>"igor eduardo küpfer" <edku...@example.com> wrote in message
>news:5trqm0poffqfmdf9l...@4ax.com...
>> On Wed, 13 Oct 2004 10:50:27 0700, "Chris Hafner" <haf...@peoplepc.com>
>> wrote in <416d...@news.usenetzone.com>:
>>
>> >It's amazing how orderly the downward progression of win shares
>> >is as you descend draft order, especially after the big gaps in the first
>> >five picks.
>>
>> Not so amazing: I smoothed the data out to produce the orderly
>progression.
>> The reality, like all facets of life, is messier:
>>
>> Smoothed Actual
>> #1 20.1 20.3
...
>> #19 3.2 2.0
>> #20 2.9 3.1
>> NonTop20 2.6 0.9
>> Undrafted 2.3 1.2
>
>Hey! That's more like what I would've expected to see.
>
>What's the reasoning behind the smoothing process?
The smoothing was done using Excel's Trendline chart function. Essentially,
it's a linear-log regression, used when the drop-off starts quickly, and
then fades to almost zero, like we see above.
If you're asking about mathematical justification, well, I have none. You
aren't supposed to use this sort of regression on ordinal data. There are
linear-log regression models for non-interval/ratio data, but I don't know
exactly how to use them. I was hoping that my linear-log model was robust
enough to handle the non-standard data, and that since we weren't doing
open heart surgery, any mistakes wouldn't make all that much difference.
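For reference, Excel's logarithmic trendline fits y = a + b*ln(x), which
can be reproduced with a one-degree polyfit on the logged pick numbers.
The values below are invented, not the real table:

```python
import numpy as np

picks = np.arange(1, 21, dtype=float)     # draft picks 1-20
raw = 20.0 / np.sqrt(picks)               # fake raw values: fast early drop

# Fit y = a + b*ln(x); polyfit returns the slope first, then the intercept.
b, a = np.polyfit(np.log(picks), raw, 1)
smoothed = a + b * np.log(picks)          # smoothed values for each pick
print(smoothed[0] > smoothed[-1])         # smoothed curve still declines
```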
>               1st-5    2nd-5   season season season season season
>  Draft Pick  seasons  seasons     1      2      3      4      5
>  ---------------------------------------------------------------
> #1  20.1 16.5  3.4 3.9 4.5 4.2 4.0
> #2  16.1 13.2  2.7 3.1 3.7 3.4 3.2
> #3  13.8 11.3  2.3 2.6 3.1 3.0 2.8
> #4  12.1 9.9  1.9 2.3 2.8 2.6 2.5
> #5  10.8 8.9  1.7 2.0 2.5 2.4 2.2
>  
> #6  9.8 8.0  1.5 1.8 2.3 2.2 2.1
> #7  8.9 7.3  1.4 1.6 2.1 2.0 1.9
> #8  8.1 6.6  1.2 1.4 1.9 1.8 1.7
> #9  7.5 6.1  1.1 1.3 1.7 1.7 1.6
> #10  6.9 5.6  1.0 1.2 1.6 1.6 1.5
>  
> #11  6.3 5.1  0.9 1.1 1.5 1.5 1.4
> #12  5.8 4.7  0.8 1.0 1.4 1.4 1.3
> #13  5.3 4.3  0.7 0.9 1.3 1.3 1.2
> #14  4.9 3.9  0.6 0.8 1.2 1.2 1.1
> #15  4.5 3.6  0.6 0.7 1.1 1.1 1.1
>  
> #16  4.2 3.3  0.5 0.6 1.0 1.0 1.0
> #17  3.8 3.0  0.4 0.5 0.9 1.0 0.9
> #18  3.5 2.7  0.4 0.5 0.9 0.9 0.9
> #19  3.2 2.5  0.3 0.4 0.8 0.8 0.8
> #20  2.9 2.2  0.2 0.3 0.7 0.8 0.8
>  
> NonTop20  2.6 2.0  0.2 0.3 0.7 0.7 0.7
> pick  
>
> Undrafted 2.3 1.8  0.1 0.2 0.6 0.7 0.7
Any idea on why the second 5 seasons are worse than the first 5, and production
seems to peak in year 3? Is that busts leaving the league? Just not what I was
expecting.

Jeremey
>> Undrafted 2.3 1.8  0.1 0.2 0.6 0.7 0.7
>
>Any idea on why the second 5 seasons are worse than the first 5, and production
>seems to peak in year 3? Is that busts leaving the league? Just not what I was
>expecting.
Elementary. Most players flame out after reaching their 5th season. The
teams that chose those players are getting zero value.
Average number of minutes played by years in the league, with a little
graph to show the steep drop-off after year 5:
Years MIN
1 637 XXXXXXXXXXXXXXXXXXXXXXXXX
2 786 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
3 761 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
4 749 XXXXXXXXXXXXXXXXXXXXXXXXXXXXX
5 707 XXXXXXXXXXXXXXXXXXXXXXXXXXXX
6 631 XXXXXXXXXXXXXXXXXXXXXXXXX
7 576 XXXXXXXXXXXXXXXXXXXXXXX
8 496 XXXXXXXXXXXXXXXXXXX
9 404 XXXXXXXXXXXXXXXX
10 319 XXXXXXXXXXXX
11 231 XXXXXXXXX
12 178 XXXXXXX
13 128 XXXXX
14 84 XXX
15 48 X
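An ascii graph like that takes only a few lines to generate; here's a quick Python sketch that rebuilds the bars from the table above (one X per 25 minutes, rounded down).

```python
# Rebuild the ascii bar chart of average minutes by years in the league.
# Data copied from the table above; each X represents 25 minutes.
minutes_by_year = {1: 637, 2: 786, 3: 761, 4: 749, 5: 707,
                   6: 631, 7: 576, 8: 496, 9: 404, 10: 319,
                   11: 231, 12: 178, 13: 128, 14: 84, 15: 48}

def ascii_bar(minutes, scale=25):
    """One X per `scale` minutes, truncated (floor division)."""
    return "X" * (minutes // scale)

for year, mins in minutes_by_year.items():
    print(f"{year:>2} {mins:>4} {ascii_bar(mins)}")
```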
Ah -- gotcha.
> If you're asking about mathematical justification, well, I have none. You
> aren't supposed to use this sort of regression on ordinal data. There are
> linear-log regression models for non-interval/ratio data, but I don't know
> exactly how to use them. I was hoping that my linear-log model was robust
> enough to handle the non-standard data, and that since we weren't doing
> open-heart surgery, any mistakes wouldn't make all that much difference.
Since I followed roughly seven percent of what you said above, I can offer
no objection. I'm not even sure that I'd want to -- the cleaner data is
easier to absorb and tells us what we really want to know.
And, as you say, this isn't exactly life-or-death stuff.
Cheers,
Chris Hafner
> Ed's calculator is non-sentient. I'm hoping so, anyway.
Hell, Ed was non-sentient for a while there.
(Ed, your sister will laugh at that one).
> By the way, sorry about the Seahawks game. First Hawks game I've seen in
> about 3 years, and they blew it in overtime. Look forward to watching the
> Pats game.
Same here, it should be a good one.
Oh whoops, I forgot to put on my Seahawks fan hat. The game will
probably suck, the Seahawks blew their chances at making the playoffs,
Holmgren should be fired, Koren Robinson is the antichrist, and it's the
end of the world.
> ....
>>> Okay, that done. The regression equation is
>>>
>>> 1st5 = -421363 + 644*Year - 0.33*Year^2 + 0.000056*Year^3
>>> + 0.070*AGE - 0.00455*AGE^2 + 0.000040*AGE^3
>>> - 5.61*Ht + 0.0732*Ht^2 - 0.000318*Ht^3
>>> - 0.290*OvAlPk + 0.00719*OvAlPk^2 - 0.000060*OvAlPk^3
>>> + 8.6*Teams - 0.299*Teams^2 + 0.0034*Teams^3
>>> + 0.0708*AllPs - 0.000485*AllPs^2 + 0.000001*AllPs^3
>>> - 139*HHI5 + 570*HHI5^2 - 732*HHI5^3
>>> - 23.6*HHI5_1 + 64.3*HHI5_1^2 - 52*HHI5_1^3
>>> + 22.1*W_L5 - 66.6*W_L5^2 + 70.5*W_L5^3
>>> - 15.6*W_L5_1 + 43.2*W_L5_1^2 - 41.6*W_L5_1^3
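For the curious, fitting a regression like this is a one-liner once the design matrix is built. Below is a numpy sketch with a single predictor (overall pick) entered as x, x^2, x^3 on synthetic data; the real model does the same with each of its ~10 variables, and none of the numbers below are the study's.

```python
import numpy as np

# OLS sketch in the spirit of the regression above: cubic terms in one
# predictor (overall pick), fit to synthetic "first-5-seasons nWAR" data.
rng = np.random.default_rng(0)
pick = rng.uniform(1, 60, size=200)
first5 = 20 - 5.5 * np.log(pick) + rng.normal(0, 1.5, size=200)  # fake data

# Design matrix: intercept, OvAlPk, OvAlPk^2, OvAlPk^3
X = np.column_stack([np.ones_like(pick), pick, pick**2, pick**3])
beta, *_ = np.linalg.lstsq(X, first5, rcond=None)
print("intercept + cubic coefficients:", beta)
```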
>>
>>Just curious: how did you calculate this? I presume some software package,
>>but it seems to me that with all the variables, even with software, it would
>>take forever to compute. That's really, really cool though; I can see how
>>it would come in useful for general "how much does X correlate with Y"
>>type questions.
>>
>
> The method is known as Ordinary Least Squares. It is among the oldest
> statistical techniques -- I'm sure the algorithm for minimizing errors (the
> "least squares" part) goes back a century. The method for using multiple
> variables like I've done above probably goes back almost as long, but it
> had to wait for computers to make it really accessible.
That figures.
> It doesn't take
> any time at all for any stats package to work out the optimal values for
> each variable.
Well, maybe I'm being pedantic, or ignorant, but are you sure about that?
Are you sure it's not just giving you a *good* value for each variable,
and not the optimal? There's a lot of algorithms that can generate a
"good" result but not guarantee that that result is the best. It just
seems like with 20 dependent variables it's asking a lot to get the
optimal solution (although it's not asking a lot to get a solution that's
"close enough."). Anyway, it's interesting either way.
>><snip>
>>
>>> The column labeled "P" shows the statistical significance. We are
>>> looking for variables which have low p-values, below 0.05. Once I
>>> remove the variables that aren't significant, we end up with this:
>>>
>>> 1st5 = 2.69
>>> + 0.00668*Ht
>>> - 0.189*OvAlPk + 0.00228*OvAlPk^2
>>> - 15.5*HHI5 + 32.7*HHI5^2
>>> + 8.12*W_L5
>>> - 2.45*W_L5_1
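The p-value screening described above can be sketched directly: each coefficient gets a two-sided t-test against zero, using standard errors from (X'X)^-1. The data and variable names below are synthetic, purely for illustration.

```python
import numpy as np
from scipy import stats

# Two-sided t-tests on OLS coefficients. Synthetic data: x1 genuinely
# predicts y, x2 is pure noise and should usually get a large p-value.
rng = np.random.default_rng(1)
n = 300
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 2.0 + 1.5 * x1 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
dof = n - X.shape[1]
sigma2 = resid @ resid / dof                     # residual variance
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t = beta / se
p = 2 * stats.t.sf(np.abs(t), dof)               # two-sided p-values
print("p-values:", p)
```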
>>
>>Interesting how age isn't in there.
>>
>>
> Got weeded out as statistically insignificant. I *think* it's because so
> many players in the sample are the same 2 ages -- almost 75% of the
> players are drafted at 21 and 22 -- that my sample isn't large enough to
> pick up the subtle effect of age.
Or that the percentage of old flameouts is the same as for young flameouts.
I would have intuited that older players would contribute sooner, but maybe
that's not the case, since many of them never pan out.
> ....
>>
>>That's really pretty damn good. But, you need to apply the model for a
>>longer period of time to see how good it really is.
>
> Well, I did it for every draft from 1977. I only showed a sample above
> to give a feel for the data.
Oh. For some reason I thought you'd run it over less than a 10-year period.
>> It would be
>>interesting to see how relevant the increase in high school players
>>drafted really is (my guess: not that much) -- just look at the average
>>nWAR over the first X years for drafts prior to, say, the one with KG,
>>versus the KG draft and after. It'd also be interesting to see, in general,
>>whether high school players end up being better than college players on
>>average (for whatever definition of "average").
>>
>>
> I can't imagine that HS players affect this very much. The sample simply
> isn't large enough to draw any firm conclusions -- remember, I'm looking
> at data from players' first five seasons following the draft, which
> means I can't use player data post-1999, and that's when most of the HS
> players have come into the league.
I remember you did a study correlating a player's first 30 games with their
career -- and that the correlation was actually pretty strong (or stronger
than one would think). Maybe you could cheat and just use first-year
contributions.
>>> In my next post I will apply this model to drafts over the years to
>>> see if there are consistent deviations from the model.
>>
>>Well, there you go.
>>
>>Also, unrelated, but I remember you mentioning something (how's that for
>>vague?) which analyzed game logs to pull out interesting stats -- what
>>was that? I was thinking of making something like that in my "free
>>time." I was also thinking of yanking the box score stats from every
>>game and making them freely available in an RSS feed so other people could
>>parse them easily (or, more accurately, so I could mess with them later
>>in the year).
>>
>>
> I have all the raw data, box scores and game logs. I'd love to provide
> the former to anyone who wants to maintain a public DB or something. I
> can do limited work on the play-by-play logs -- limited by my lame
> programming abilities. What is needed is someone to program a parser for
> these logs. I'm asking around to see if I can get someone interested in
> that project.
I could write a parser for them, but I don't know if I'll have enough time.
Send me an email with a few game logs attached and what info you would
like extracted. I'm not making any promises.
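To give a flavor of what such a parser involves: match each log line against a pattern and pull out named fields. The line format below is entirely invented for illustration; the real logs would dictate the actual regexes.

```python
import re

# Toy play-by-play parser. This line format is HYPOTHETICAL -- it is not
# the format of the actual game logs discussed in the thread.
LINE = re.compile(
    r"(?P<clock>\d{1,2}:\d{2})\s+"
    r"(?P<player>[\w.' -]+?)\s+"
    r"(?P<event>makes|misses)\s+"
    r"(?P<points>\d)-pt shot"
)

def parse_line(line):
    """Return a dict of fields for a recognized line, else None."""
    m = LINE.match(line)
    return m.groupdict() if m else None

print(parse_line("11:42 Lewis makes 3-pt shot"))
```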
>>Anyway, that's really interesting stuff. You should have one of those
>>"blog thingys" man.
>
> Yeah, if someone wants to donate the software and technical knowledge,
> I'll get right on that.
Blogs are like driver's licenses. They must obviously be easy to use
considering how many shitty blogs (or drivers) there are in existence.
The key part would be to find one that a) was free and b) didn't suck ass.
I have no experience with this, unfortunately. If you run into technical
snafus I could help, but as far as choosing one system or another -- I
have no idea. If I were to have a blog, I'd probably write the software
myself (spend 10 minutes learning blog software or a week writing it? The
latter, obviously!).
Hmm.
<snip>
I love ascii graphs. *love*
> These data are even noisier than the bust data
I love taking quotes out of context. *love*
His stats and his writing both. He's a talented dude, that Ed.
Fire Wally Holmgren!!!!1!!11!
>On Sun, 10 Oct 2004 21:34:47 -0400, igor eduardo küpfer wrote:
>
>> By the way, sorry about the Seahawks game. First Hawks game I've seen in
>> about 3 years, and they blew it in overtime. Look forward to watching the
>> Pats game.
>
>Same here, it should be a good one.
>
>Oh whoops, I forgot to put on my Seahawks fan hat. The game will
>probably suck, the Seahawks blew their chances at making the playoffs,
>Holmgren should be fired, Koren Robinson is the antichrist, and it's the
>end of the world.
>
The game did suck (and by suck, I mean it was pretty good but the Hawks
lost).
...
>
>> It doesn't take
>> any time at all for any stats package to work out the optimal values for
>> each variable.
>
>Well, maybe I'm being pedantic, or ignorant, but are you sure about that?
>
>Are you sure it's not just giving you a *good* value for each variable,
>and not the optimal? There's a lot of algorithms that can generate a
>"good" result but not guarantee that that result is the best. It just
>seems like with 20 dependent variables it's asking a lot to get the
>optimal solution (although it's not asking a lot to get a solution that's
>"close enough."). Anyway, it's interesting either way.
Of course I'm not sure. My stats pack is giving me 4 significant digits --
I'm going to hazard the guess that it optimizes to the point where these
digits no longer change. The computing time for calculating the least
squares for ~30 variables was about 4 or 5 seconds.
...
>> I can't imagine that HS players affect this very much. The sample simply
>> isn't large enough to draw any firm conclusions -- remember, I'm looking
>> at data from players' first five seasons following the draft, which
>> means I can't use player data post-1999, and that's when most of the HS
>> players have come into the league.
>
>I remember you did a study correlating a player's first 30 games with their
>career -- and that the correlation was actually pretty strong (or stronger
>than one would think). Maybe you could cheat and just use first-year
>contributions.
That was for counting stats -- e.g. points per game, assists per game, etc.
The stats I used for this study were more ability-based. These are highly
variable from year to year. Punching them in would screw up the results, I
think.
>>>
>> I have all the raw data, box scores and game logs. I'd love to provide
>> the former to anyone who wants to maintain a public DB or something. I
> can do limited work on the play-by-play logs -- limited by my lame
>> programming abilities. What is needed is someone to program a parser for
>> these logs. I'm asking around to see if I can get someone interested in
>> that project.
>
>I could write a parser for them, I don't know if I'll have enough time.
>Send me an email with a few game logs attached and what info you would
>like extracted. I'm not making any promises.
I forbid you to spend any time whatsoever on this. Do not devote time that
would be better spent on improving your life to even considering the idea.
This is not reverse psychology.
I'll email you this week.
...
> The game did suck (and by suck, I mean it was pretty good but the Hawks
> lost).
Yeah, another game where they outplayed the other team for 3 quarters and
still lost.
>>> It doesn't take
>>> any time at all for any stats package to work out the optimal values for
>>> each variable.
>>
>>Well, maybe I'm being pedantic, or ignorant, but are you sure about that?
>>
>>Are you sure it's not just giving you a *good* value for each variable,
>>and not the optimal? There's a lot of algorithms that can generate a
>>"good" result but not guarantee that that result is the best. It just
>>seems like with 20 dependent variables it's asking a lot to get the
>>optimal solution (although it's not asking a lot to get a solution that's
>>"close enough."). Anyway, it's interesting either way.
>
> Of course I'm not sure.
You seem pretty goddamn certain of your uncertainty there pal.
> My stats pack is giving me 4 significant digits --
> I'm going to hazard the guess that it optimizes to the point where these
> digits no longer change. The computing time for calculating the least
> squares for ~30 variables was about 4 or 5 seconds.
I guess it's not too complicated; it's a lot like solving systems of
equations, which can be done pretty fast using matrices. But still, 30
variables in 4-5 seconds is pretty amazing to me.
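It's fast because OLS has a closed-form solution: solve the normal equations (X'X)b = X'y, which is only a 30x30 linear system no matter how many player-rows are in the sample. A quick sketch with synthetic data:

```python
import time
import numpy as np

# Solving a 30-variable OLS via the normal equations (X'X) b = (X'y).
# The system to solve is only 30x30, so thousands of rows stay cheap.
rng = np.random.default_rng(42)
X = rng.normal(size=(5000, 30))
y = X @ rng.normal(size=30) + rng.normal(size=5000)

start = time.perf_counter()
beta = np.linalg.solve(X.T @ X, X.T @ y)
elapsed = time.perf_counter() - start
print(f"solved 30-variable OLS in {elapsed * 1000:.2f} ms")
```

(In practice `lstsq` or a QR decomposition is numerically safer than forming X'X directly, but the point about problem size stands.)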
> ...
>>> I can't imagine that HS players affect this very much. The sample simply
>>> isn't large enough to draw any firm conclusions -- remember, I'm looking
>>> at data from players' first five seasons following the draft, which
>>> means I can't use player data post-1999, and that's when most of the HS
>>> players have come into the league.
>>
>>I remember you did a study correlating a player's first 30 games with their
>>career -- and that the correlation was actually pretty strong (or stronger
>>than one would think). Maybe you could cheat and just use first-year
>>contributions.
>
> That was for counting stats -- e.g. points per game, assists per game, etc.
> The stats I used for this study were more ability-based. These are highly
> variable from year to year. Punching them in would screw up the results, I
> think.
Well, screw it then (not the results, the idea).
>>>>
>>> I have all the raw data, box scores and game logs. I'd love to provide
>>> the former to anyone who wants to maintain a public DB or something. I
>>> can do limited work on the play-by-play logs -- limited by my lame
>>> programming abilities. What is needed is someone to program a parser for
>>> these logs. I'm asking around to see if I can get someone interested in
>>> that project.
>>
>>I could write a parser for them, I don't know if I'll have enough time.
>>Send me an email with a few game logs attached and what info you would
>>like extracted. I'm not making any promises.
>
> I forbid you to spend any time whatsoever on this. Do not devote time that
> would be better spent on improving your life to even considering the idea.
> This is not reverse psychology.
>
> I'll email you this week.
Liar. Unless you email me in the second half of the week.