On Thu, 23 Jul 2015 18:09:27 -0700 (PDT), Tinniam V Ganesh wrote:
>Mike,
> Some of the charts, e.g. mean runs/wickets against opposition or at a venue, provide a lot of information about how well a batsman/bowler performs at home or away. Since this averages over all of a player's innings, it is a good indicator of whether he performs only at home or also overseas. I have functions that compute and plot these which I did not include in this analysis.
>
>Unfortunately some of the suggestions you make regarding spin are qualitative and are not really captured as data. Cricinfo does not include data on how Cook played Warne versus Kumble or Muralitharan. If there were such data then an analysis would be possible. Similarly, whether a batsman is suspect against genuine pace on different pitches cannot be inferred from the data available at Cricinfo. For this we would need details on the speed and bounce of different pitches.
>
>But these charts are based on what is available.
I'm only too aware of the fact that the information which is
recorded is incomplete. It is blindingly obvious to anyone who
follows cricket that there are a lot of things which have a large
effect on a game but aren't recorded in the abridged scorecards
which are routinely available.
Big data techniques could filter commentary (e.g. Cricinfo's text stuff)
if anyone had a mind to do it, so it's theoretically possible to
enhance the information we have about past matches. Weather data is
probably available in some meteorological archives. It would be
laborious because the commentary is of indeterminate accuracy - one
commentator might adjudge something a bad missed catch while another
says it was a good effort (so one would count as an extra life for the
batsman and the other wouldn't, for instance) - so to be authoritative
you might need to compile info from more than one source. But it's
conceptually possible to capture an awful lot more these days than was
done in the golden 1950s. If we knew what all the right indicators
were, it's not impossible to imagine being able to create a
statistical model of incredible power.
The rating system originally designed by Deloitte, which is the
ICC's official player ranking system, attempts to deal with the
quality of the bowling and the overall conditions by making
inferences from the other numbers it has. Since they can calculate
the current ratings of all the bowlers at the beginning of a match,
and they know how many overs each bowler bowled, they can come up
with a factor for how good the bowling ought to be; they draw
inferences about the overall conditions from the overall level and
rate of scoring during the match.
They deal with fluctuations in form by using a weighted average - the
latest match counts as 1 match, and they discount each previous match
by 3%, so the numbers for match -1 are multiplied by 0.97, the ones
for match -2 by 0.97*0.97, etc.
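As a rough sketch (in Python; the function name and exact weighting
convention are my own illustration, not the ICC's actual
implementation), that decay scheme amounts to this:

```python
def decayed_average(match_values):
    """Weighted average where each older match is discounted by 3%.

    match_values is ordered oldest first. The most recent match gets
    weight 1.0, the one before it 0.97, the one before that 0.97**2,
    and so on, as described above.
    """
    decay = 0.97
    weighted_sum = 0.0
    weight_total = 0.0
    # Walk from most recent to oldest, compounding the discount.
    for age, value in enumerate(reversed(match_values)):
        w = decay ** age
        weighted_sum += w * value
        weight_total += w
    return weighted_sum / weight_total
```

So a recent purple patch moves the figure more than the same scores
would have done years ago, which is the whole point of the scheme.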
They incorporate a slug factor for new players: the raw rating figure
is scaled down by a varying amount until someone has batted a certain
number of times or has reached some bowling milestone - what they are
effectively saying is that until a player's record is substantial, the
figures are pretty dodgy.
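The slug factor might look something like this (the qualification
threshold of 20 innings and the linear ramp are illustrative guesses
of mine; the real schedule is more elaborate and I don't know its
details):

```python
def damped_rating(raw_rating, innings_played, qualification=20):
    """Scale a raw rating down until a player's record is substantial.

    Below the qualification threshold the raw figure is only partly
    trusted; at or above it, the rating is used as-is.
    """
    if innings_played >= qualification:
        return raw_rating
    # Linear ramp: a near-debutant keeps only a fraction of the
    # raw figure, reflecting how dodgy a thin record is.
    return raw_rating * innings_played / qualification
```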
All their fudge factors, therefore, are derived from available
numerical data.
But they can only be approximations. You might have the top four
bowlers in the world in an attack, but because they all had the same
stomach bug, they're all 15 mph down on pace on a given day, so the
batsmen's easy runs in that innings will get way over-valued, and the
bowlers' updated records will be slightly affected for the worse. And
that rogue data point will have a tiny ripple effect, so if there is
some universal algorithm for deriving an accurate model of games from
all the relevant recorded facts as conceived above, this calculated
model won't be quite right.
My overarching point is that statistics should be presented hedged
about with some idea of their limitations. Partly that's at the
confidence interval level, as in the previous discussion, but I think
it is also important, when presenting a new tool, to give an idea of
what factors are included and to point out things which the author
realises are inadequacies.
I devised a rating I called the bowlers' Power Index, which you
calculate by taking sqrt(average*strikerate). To my amazement when I
bunged a couple of hundred Test bowlers through it, the top two were
SF Barnes and MD Marshall, both of whom many, many people rate as
the best of all time, and no other statistical analysis I've
seen does that so precisely. I find that a lot of the other results it
comes up with seem very sane, which has made the entries in the list
which surprise me worth looking into. What I contend the measure does
is evaluate bowlers' effectiveness when viewed as strike bowlers.
These are the guys whose wicket-taking is most likely to win you a
Test match. But that's not what's always uppermost in a captain's
mind. There are times when what he wants is a bowler who concedes 0.3
runs an over to tie one end down, with any wickets being a bonus. For
that purpose, he'd need another indicator (economy rate, pretty obviously).
Having got interesting results from career figures, I played around
with it further and concluded that if the result falls outside the
range of just under 20 to just over 70, it's beginning to break
down, and by the time you get to 10 or 100, what you're really
getting is garbage; the same can be said if you try to analyse fewer
than about five matches with it. What I conclude from that is
that it's a moderately good approximation in normal circumstances.
Much as Newtonian physics is fine for working out how cricket balls
behave here on Earth but not much use when everything is travelling at
0.999c.
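Putting the index and those reliability bands together (the function
names, the exact cut-offs, and the band labels are my reading of the
discussion above, not anything official; lower index values mean a
more effective strike bowler, just as with a bowling average):

```python
import math

def power_index(average, strike_rate):
    """Bowlers' Power Index: sqrt(bowling average * strike rate)."""
    return math.sqrt(average * strike_rate)

def power_index_reliability(pi, matches):
    """Classify a Power Index value using the rough bands above."""
    if matches < 5:
        return "garbage"          # too few matches to mean anything
    if pi <= 10 or pi >= 100:
        return "garbage"          # outside any sensible range
    if pi < 20 or pi > 70:
        return "breaking down"    # approximation starting to fail
    return "reasonable"

# Illustrative call with figures in the region of SF Barnes's career
# numbers (average ~16.4, strike rate ~41.6, quoted from memory):
# power_index(16.4, 41.6) comes out at roughly 26.
```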
I've seen a lot of people claim that their new analysis is the bees'
knees. I'd like to see a lot more honest admission that, since
relevant data is missing, any numerical analysis can only be an
approximation, and more highlighting of weaknesses in the model as
well as strengths.
Cheers,
Mike
--