My analysis of top batsmen & top 2 bowlers in the current Ashes series using my R package 'cricketr'

Tinniam V Ganesh

Jul 23, 2015, 8:28:04 AM

Hi,
Take a look at my post where I analyze the top batsmen & top 2 bowlers from England and Australia using my R package 'cricketr'

https://gigadom.wordpress.com/2015/07/20/cricketr-digs-the-ashes/

Regards
Ganesh

hamis...@gmail.com

Jul 23, 2015, 8:54:34 AM

Um, I'm not really sure that Siddle qualifies as the #2 bowler for Australia currently...

Tinniam V Ganesh

Jul 23, 2015, 9:22:38 AM

Hamis,
Siddle's rating is higher than Starc's on ESPNcricinfo, and he has also played more Tests.

Ganesh

Mike Holmans

Jul 23, 2015, 12:49:08 PM

On Thu, 23 Jul 2015 05:28:02 -0700 (PDT), Tinniam V Ganesh
<tvgan...@gmail.com> tapped the keyboard and brought forth:

>Hi,
> Take a look at my post where I analyze the top batsmen & top 2 bowlers from England and Australia using my R package 'cricketr'
>
>https://gigadom.wordpress.com/2015/07/20/cricketr-digs-the-ashes/

Blimey. You do go through an awful lot of calculations and graphs to
come up with stunningly dull conclusions. Having reached the end and
found what conclusions you'd drawn from this wealth of data, I find
statements which I could have trotted out off the top of my head.

Other than allowing one to make pretty obvious statements backed by a
spurious degree of precision (eg taking probabilities to four
significant figures for no obviously justified reason), what does this
tool allow one to do?

Cheers,

Mike

--

Brian Lawrence

Jul 23, 2015, 1:10:06 PM

But none in 2015.

Dikran Marsupial

Jul 23, 2015, 1:12:15 PM

There's nothing like constructive criticism

<groucho>and that's nothing like constructive criticism</groucho>

Tinniam V Ganesh

Jul 23, 2015, 1:26:53 PM

On Thursday, July 23, 2015 at 10:19:08 PM UTC+5:30, Mike Holmans wrote:

>
> >Hi,
> > Take a look at my post where I analyze the top batsmen & top 2 bowlers from England and Australia using my R package 'cricketr'
> >
> >https://gigadom.wordpress.com/2015/07/20/cricketr-digs-the-ashes/
>
> Blimey. You do go through an awful lot of calculations and graphs to
> come up with stunningly dull conclusions. Having reached the end and
> found what conclusions you'd drawn from this wealth of data, I find
> statements which I could have trotted out off the top of my head.
>
> Other than allowing one to make pretty obvious statements backed by a
> spurious degree of precision (eg taking probabilities to four
> significant figures for no obviously justified reason), what does this
> tool allow one to do?
>
> Cheers,
>
> Mike
>
> --

Mike,
Just looking at some data and claiming that Warner has a better strike rate is very crude. If you look carefully at the strike rate graph, you can see that Smith has the better strike rate for scores below a hundred, while Root does better above a hundred.

The runs frequency graph does show that Smith has a higher frequency percentage in several of the 10-run buckets.

Hopefully the moving average does show how they have progressed through their careers. I don't see how you can trot this out from the figures or off the top of your head.

Anyway, too bad you did not find it useful.

Regards
Ganesh

Dikran Marsupial

Jul 23, 2015, 1:37:42 PM

On Thursday, July 23, 2015 at 6:26:53 PM UTC+1, Tinniam V Ganesh wrote:
> On Thursday, July 23, 2015 at 10:19:08 PM UTC+5:30, Mike Holmans wrote:
>
> >
> > >Hi,
> > > Take a look at my post where I analyze the top batsmen & top 2 bowlers from England and Australia using my R package 'cricketr'
> > >
> > >https://gigadom.wordpress.com/2015/07/20/cricketr-digs-the-ashes/
> >
> > Blimey. You do go through an awful lot of calculations and graphs to
> > come up with stunningly dull conclusions. Having reached the end and
> > found what conclusions you'd drawn from this wealth of data, I find
> > statements which I could have trotted out off the top of my head.
> >
> > Other than allowing one to make pretty obvious statements backed by a
> > spurious degree of precision (eg taking probabilities to four
> > significant figures for no obviously justified reason), what does this
> > tool allow one to do?
> >
> > Cheers,
> >
> > Mike
> >
> > --
>
> Mike,
> Just looking at some data and claiming that Warner has a better strike rate is very crude. If you look carefully at the strike rate graph, you can see that Smith has the better strike rate for scores below a hundred, while Root does better above a hundred.

The problem is that the more you split hypotheses in this way, the more likely you are to generate a spurious result. This is often called "data dredging" or p-hacking in data science and statistics. I doubt the difference is practically significant, even if it is statistically significant.

> The runs frequency graph does show that Smith has a higher frequency percentage in several of the 10-run buckets.
>
> Hopefully the moving average does show how they have progressed through their careers. I don't see how you can trot this out from the figures or off the top of your head.
>
> Anyway, too bad you did not find it useful.
>
> Regards
> Ganesh

If you are looking for ideas for statistical analysis of cricket, it would be very useful to have a program that could test whether some statement made on a cricket newsgroup is statistically unusual or not, by Monte Carlo permutation testing. For instance, I suspect that the claim that Ian Bell had never scored a century unless someone higher in the order had done so already (which is no longer true) was one of those statements that is true, but not actually as surprising as you might think. This could be tested by Monte Carlo simulation: constructing a large sample of synthetic batting histories (with the same number of innings as Bell in each position) and seeing for what proportion of the synthetic careers the statement would be true. That would be a really useful program if you could specify the hypothesis to be tested in a generic manner.
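
A minimal sketch of that kind of simulation, in Python rather than R or MATLAB. Everything in it is an assumption made for illustration: the innings counts per batting position, the single century probability shared by every batsman, and the simplification that centuries are independent and that "higher in the order" just means a lower batting position in the same innings. It is not based on Bell's actual record.

```python
import random


def claim_holds(innings_by_position, p_century, rng):
    """Simulate one synthetic career and report whether the claim holds.

    The claim: every century by the player came in an innings where at
    least one batsman higher in the order also scored a century.
    innings_by_position maps batting position -> number of innings there.
    """
    for position, n_innings in innings_by_position.items():
        for _ in range(n_innings):
            if rng.random() >= p_century:
                continue  # no century for the player in this innings
            # Positions 1 .. position-1 bat higher in the order.
            higher_century = any(
                rng.random() < p_century for _ in range(position - 1)
            )
            if not higher_century:
                return False  # one counterexample breaks the claim
    return True


def proportion_true(innings_by_position, p_century, n_careers=2000, seed=1):
    """Fraction of synthetic careers for which the claim is true."""
    rng = random.Random(seed)
    hits = sum(
        claim_holds(innings_by_position, p_century, rng)
        for _ in range(n_careers)
    )
    return hits / n_careers


if __name__ == "__main__":
    # Hypothetical career: innings played at positions 3, 5 and 6.
    career = {3: 20, 5: 100, 6: 40}
    print(proportion_true(career, p_century=0.1))
```

If the printed proportion is not small, the claim is the sort of thing that turns up by chance in many careers. A more generic version would take the hypothesis as a function of the synthetic career instead of hard-coding it.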

I mention this as you are interested in cricket and R (MATLAB >> R ;o).

John Hall

Jul 23, 2015, 1:39:27 PM

In message <eed17291-5dc5-444c...@googlegroups.com>,
Tinniam V Ganesh <tvgan...@gmail.com> writes

That the bowlers include Siddle is more than a little surprising.
Certainly the Australian selectors don't seem to be under the impression
that he's one of their best two bowlers.
--
I'm not paid to implement the recognition of irony.
(Taken, with the author's permission, from a LiveJournal post)

Tinniam V Ganesh

Jul 23, 2015, 2:12:07 PM

Dikran,
I need to read up on data dredging and p-hacking. I am not sure why the difference would not be practically significant when it is statistically significant. While there is no certainty in the claims, the plots do show the data with a little more clarity, in my opinion, and make more sense than tables of data.

Will check on it though.

Ganesh

Dikran Marsupial

Jul 23, 2015, 2:59:31 PM

This is a classic problem with frequentist hypothesis testing. The larger your sample, the smaller the effect size that can be detected with "statistical significance". If you play enough games, there may be a statistically significant difference in strike rate of 0.0001, but would that really be a rational reason to pick one player over another? No, not really, the difference is too small to be really meaningful.
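
To make the point concrete, here is a small Python sketch using a two-sample z-test. The numbers are invented for illustration: a fixed difference of 0.5 between two players' mean strike rates and a common standard deviation of 30, both assumed known. Only the sample size changes between rows.

```python
import math


def two_sided_p_value(diff, sd, n):
    """p-value of a two-sample z-test for a difference in means.

    Assumes both samples have n observations and share a known
    standard deviation sd (a simplification for illustration).
    """
    standard_error = sd * math.sqrt(2.0 / n)
    z = abs(diff) / standard_error
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    # Same tiny difference every time; only the sample size grows.
    for n in (100, 10_000, 1_000_000, 100_000_000):
        print(n, two_sided_p_value(diff=0.5, sd=30.0, n=n))
```

The difference never changes, yet the p-value shrinks as n grows, so "significant" here says more about the amount of data than about whether the difference matters.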

Do read up on p-hacking and data dredging, a lot of that goes on unwittingly in analysis of cricket statistics.

I'd recommend looking at http://stats.stackexchange.com/ which is a good place for finding out about the subtleties of statistics (as you can ask questions).

Mike Holmans

Jul 23, 2015, 3:56:48 PM

On Thu, 23 Jul 2015 10:26:51 -0700 (PDT), Tinniam V Ganesh

<tvgan...@gmail.com> tapped the keyboard and brought forth:

>On Thursday, July 23, 2015 at 10:19:08 PM UTC+5:30, Mike Holmans wrote:
>
>>
>> >Hi,
>> > Take a look at my post where I analyze the top batsmen & top 2 bowlers from England and Australia using my R package 'cricketr'
>> >
>> >https://gigadom.wordpress.com/2015/07/20/cricketr-digs-the-ashes/
>>
>> Blimey. You do go through an awful lot of calculations and graphs to
>> come up with stunningly dull conclusions. Having reached the end and
>> found what conclusions you'd drawn from this wealth of data, I find
>> statements which I could have trotted out off the top of my head.
>>
>> Other than allowing one to make pretty obvious statements backed by a
>> spurious degree of precision (eg taking probabilities to four
>> significant figures for no obviously justified reason), what does this
>> tool allow one to do?
>>
>> Cheers,
>>
>> Mike
>>
>> --
>
>Mike,
> Just looking at some data and claiming that Warner has a better strike rate is very crude. If you look carefully at the strike rate graph, you can see that Smith has the better strike rate for scores below a hundred, while Root does better above a hundred.

And what does that mean?

>The runs frequency graph does show that Smith has a higher frequency percentage in several of the 10-run buckets.

And what does that mean?

That's the question I'm driving at here. You can produce all the
calculations and graphs you like, but unless it's to some purpose,
then it's just burning CPU cycles.

I think that the usual statistics compiled such as career averages are
useless - your moving average is much more useful than those - but
I've seen very little which shows any promise at being generalisable
or gives sensible answers to interesting questions.

Why should a particular number or graph that you generate interest me,
or a selector, or a cricket historian or anyone else? Why does the
analysis that you are performing demonstrate what you say it
demonstrates? For instance, the figures you give where a batsman has a
60% chance of scoring 20 runs off 35 balls might be useful to someone
who wanted to bet on what score a batsman is going to get in his next
innings because if the bookie's odds offer that as a 30% chance, it
would make a sensible bet.
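
The arithmetic behind that betting example can be sketched in a few lines of Python. The function name is made up for illustration, and the sketch assumes the bookie pays out at fair odds for the implied probability, ignoring any margin.

```python
def expected_profit_per_unit(true_probability, implied_probability):
    """Expected profit on a 1-unit stake.

    The bet pays out at fair odds for implied_probability but is
    assumed to win with true_probability.
    """
    # Total returned per unit staked if the bet wins.
    decimal_odds = 1.0 / implied_probability
    return true_probability * decimal_odds - 1.0


if __name__ == "__main__":
    # Model says 60%, bookie's odds imply 30%.
    print(expected_profit_per_unit(0.6, 0.3))
```

A positive result means the bet is worth taking on average, but only if the 60% figure itself can be trusted.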

How would you construct figures which showed how good a batsman is at
playing spin? I'd point out that I'm not at all sure that it would be
good enough simply to take figures generated from how many balls of
spin a batsman faced in each innings: Pietersen, for instance, was
notoriously befuddled by slow left arm finger spin, which means you'd
have to break spin bowling down into types. But then, surely it also
matters whether the spin bowlers are high or low quality - there's a
huge difference between facing Shane Warne's legspin and Scott
Borthwick's. And it surely also matters whether the pitch is taking a
lot of spin or is completely useless for spinners - which is not
recorded in a scorecard.

By the time you've separated out the different types of bowlers and
the different types of conditions, you find that unless someone's had
a 10+ year career, you're trying to build a case based on six data
points spread over two series five years apart, which are almost
certainly going to have been at very different points in the player's
performance curve, which means that any conclusion you draw is going
to have a 90% confidence interval of +/- 70%, which is just about
useless. (OK, Drs Cawley and Walker, so it's not 70% with six data
points, but it's certainly wider with 6 than with 60 or 600.)

Data mining projects which do not have some well-defined goals almost
always end up being a waste of time and resources. Making data mining
work is a matter of identifying the kind of questions which you think
you might be able to derive answers for from a pool of data and then
working out how you'd go about deciding which dimensions of your fact
table to use for slicing and dicing.

Where you could make a very serious contribution would be to calculate
the confidence intervals your eventual figures have. It's all very
well seeing

Tendulkar 143.55
Rahane 136.81
Dravid 125.88

but it would mean a lot more if it were

Tendulkar 143.55 139.80-146.12
Rahane 136.81 102.93-170.34
Dravid 125.88 123.98-127.03

I have no idea what those numbers represent, by the way - I've just
made them up. But because we can see the confidence intervals, we can
be pretty sure that Tendulkar's number really is about 12% higher than
Dravid's, and while Rahane splits them at the moment, almost anything
could happen until we've got enough data points to reduce the
confidence interval.

Few statisticians bother to tell us how confident we can be that the
statistics are worth betting the farm on, and so you get people busily
arguing over whether x is better than y because of a difference of 0.1
in their averages, which is probably not a valid conclusion if the
confidence intervals are respectively +/- 0.8 and +/- 0.5. If stats
routinely got published with estimates of how fuzzy they are, people
might use them a lot more sensibly.
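
One way to attach such an interval to a mean is a percentile bootstrap, sketched here in Python. This is only an illustration under stated assumptions: the scores are invented, the statistic is a plain mean of innings scores (not a true batting average, which divides by dismissals), and innings are treated as independent.

```python
import random
import statistics


def bootstrap_ci(values, level=0.95, n_resamples=2000, seed=1):
    """Percentile bootstrap confidence interval for the mean of values."""
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n))
        for _ in range(n_resamples)
    )
    lower_index = round((1 - level) / 2 * n_resamples)
    upper_index = round((1 + level) / 2 * n_resamples) - 1
    return means[lower_index], means[upper_index]


if __name__ == "__main__":
    few = [0, 5, 20, 45, 80, 130]   # six hypothetical innings
    many = few * 100                # same spread, 600 innings
    print("6 innings:  ", bootstrap_ci(few))
    print("600 innings:", bootstrap_ci(many))
```

Both samples have the same mean, but the interval from six innings is far wider than the one from 600, which is the fuzziness that rarely gets published alongside the headline figure.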

Cheers,

Mike

--

Tinniam V Ganesh

Jul 23, 2015, 9:01:13 PM

Dikran,
I will definitely read the posts on Stack Exchange. But I think the p-hacking you are referring to concerns the p-value, which is itself considered debatable. There was an article in Scientific American:
http://www.scientificamerican.com/article/scientists-perturbed-by-loss-of-stat-tools-to-sift-research-fudge-from-fact/

Not all of my charts or analyses use p-values; many use the regular mean, which gives a lot of information: mean runs against opposition/venues, mean wickets against opposition/venues, or mean economy rate. These mean (pun unintended) a lot.

As I said before, I will look up the articles.

Ganesh

Tinniam V Ganesh

Jul 23, 2015, 9:09:28 PM

Mike,
Some of the charts, e.g. mean runs/wickets against opposition or at venue, provide a lot of information about how well the batsman/bowler performs at home or away. Since this is averaged over all his innings, it is a good indicator of whether the player performs only at home or overseas as well. I have a function that computes and plots these, which I did not include in this analysis.

Unfortunately, some of the suggestions that you make regarding spin are qualitative and are not really captured as data. Cricinfo does not include data on how Cook played Warne versus Kumble or Muralitharan. If there were such data, then an analysis would be possible. Similarly, whether a batsman is suspect against genuine pace on different pitches cannot be inferred from the data available at Cricinfo. For this we would need details on the speed and bounce at different pitches.

But these charts are based on what is available.

Your point about providing confidence intervals is well taken, and I will incorporate this into the functions.

Ganesh

hamis...@gmail.com

Jul 23, 2015, 10:08:24 PM

But he's unlikely to play a test in the Ashes series unless somebody gets injured.
He hasn't played a Test this year; last year he played 6 Tests and took 12 wickets @ 53.83.
He's not going to displace any of Johnson, Starc or Hazlewood.

Dikran Marsupial

Jul 24, 2015, 2:58:14 AM

The idea is basically the same: if you look for correlations or statistical anomalies, you will almost always be able to find them. The harder you look, the more likely they will be spurious. The more statistics you have to look at, and the more ways you have to make the hypothesis more complicated (cf. Occam's razor), the greater the chance of a spurious result.

The p-hacking thing isn't debatable. As soon as you design the experiment to reduce the p-value you have immediately invalidated it (unless you take steps to compensate, e.g. the Bonferroni adjustment, but almost nobody does that). The fact that people do this on a regular basis does not mean that it is statistically acceptable practice.
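
For reference, the Bonferroni adjustment is simple to apply: divide the significance level by the number of hypotheses tested. A minimal Python sketch follows, with made-up p-values standing in for a set of comparisons run on the same data.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which p-values remain significant after Bonferroni correction.

    Each test is compared against alpha divided by the number of tests,
    which keeps the chance of any false positive at or below alpha.
    """
    if not p_values:
        return []
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    # Hypothetical p-values from three comparisons on the same data.
    print(bonferroni_significant([0.001, 0.02, 0.04]))
```

All three hypothetical values would pass an uncorrected 0.05 threshold, but only the first survives once the threshold is divided by three.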

David North

Jul 24, 2015, 2:38:31 PM

"Tinniam V Ganesh" <tvgan...@gmail.com> wrote in message
news:94a8e830-0472-45cc...@googlegroups.com...

"The Cricinfo does not include data on how Cook played Warne versus Kumble
or Muralitharan."

Their Statsguru doesn't have that data, but Cricinfo does have player v
player data for each match since about 2001 - follow the link from each
scorecard. You would have to gather the data yourself match by match,
though, which would be rather time-consuming. They also have ball-by-ball
commentary going back several years beyond that, so you could compile player
v player data from that if you were really keen.
--
David North

Mike Holmans

Jul 24, 2015, 2:54:02 PM

On Thu, 23 Jul 2015 18:09:27 -0700 (PDT), Tinniam V Ganesh

<tvgan...@gmail.com> tapped the keyboard and brought forth:

>Mike,
> Some of the charts, e.g. mean runs/wickets against opposition or at venue, provide a lot of information about how well the batsman/bowler performs at home or away. Since this is averaged over all his innings, it is a good indicator of whether the player performs only at home or overseas as well. I have a function that computes and plots these, which I did not include in this analysis.
>
>Unfortunately, some of the suggestions that you make regarding spin are qualitative and are not really captured as data. Cricinfo does not include data on how Cook played Warne versus Kumble or Muralitharan. If there were such data, then an analysis would be possible. Similarly, whether a batsman is suspect against genuine pace on different pitches cannot be inferred from the data available at Cricinfo. For this we would need details on the speed and bounce at different pitches.
>
>But these charts are based on what is available.

I'm only too aware of the fact that the information which is recorded
is incomplete. It is blindingly obvious to anyone who follows cricket
that there are a lot of things which have a lot of effect on a game
but aren't recorded in the abridged scorecards which are routinely
available.

Big data techniques could filter commentary (eg Cricinfo's text stuff)
if anyone had a mind to do it, so it's theoretically possible to
enhance the information we have about past matches. Weather data is
probably available in some meteorological archives. It would be
laborious because the commentary is of indeterminate accuracy - one
commentator might adjudge something a bad missed catch while another
says it was a good effort (so one would count as an extra life for the
batsman and the other wouldn't, for instance) - so to be authoritative
you might need to compile info from more than one source. But it's
conceptually possible to capture an awful lot more these days than was
done in the golden 1950s. If we knew what all the right indicators
were, it's not impossible to imagine being able to create a
statistical model of incredible power.

The rating system originally designed by Deloittes which is the ICC's
official player ranking system attempts to deal with the quality of
bowling and overall conditions ideas by making inferences from their
other numbers. Since they can calculate the current ratings of all the
bowlers at the beginning of a match, and they know how many overs each
bowler bowled, they can come up with a factor for how good the bowling
ought to be; they draw inferences about the overall conditions from the
overall level and rate of scoring during a match.

They deal with fluctuations in form by using a weighted average - the
latest match counts as 1 match, and they discount each previous match
by 3%, so the numbers for match -1 are multiplied by 0.97, the ones
for match -2 by 0.97*0.97, etc.
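
The discounting described above amounts to a geometrically weighted average. A rough Python sketch, using only the 3% figure from the description; the function name and the per-match scores are invented, and this is not the actual ICC calculation.

```python
def decayed_average(scores_latest_first, decay=0.97):
    """Weighted average in which each older match counts 3% less.

    scores_latest_first[0] is the latest match (weight 1), the match
    before it gets weight 0.97, the one before that 0.97 * 0.97, etc.
    """
    if not scores_latest_first:
        raise ValueError("need at least one match")
    weights = [decay ** age for age in range(len(scores_latest_first))]
    weighted_total = sum(
        w * s for w, s in zip(weights, scores_latest_first)
    )
    return weighted_total / sum(weights)


if __name__ == "__main__":
    # Hypothetical per-match figures, most recent first.
    print(decayed_average([80.0, 20.0, 45.0, 10.0]))
```

With a decay of 0.97, a match from about 23 matches ago still carries roughly half the weight of the latest one, so form changes show up gradually.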

They incorporate a slug factor for new players: the raw rating figure
is scaled down by a varying amount until someone has batted a certain
number of times or has reached some bowling milestone - what they are
effectively saying is that until a player's record is substantial, the
figures are pretty dodgy.

All their fudge factors, therefore, are derived from available
numerical data.

But they can only be approximations. You might have the top four
bowlers in the world in an attack, but because they all had the same
stomach bug, they're all 15 mph down on pace on a given day, so the
batsmen's easy runs in that innings will get way over-valued, and the
bowlers' updated records will be slightly affected for the worse. And
that rogue data point will have a tiny ripple effect, so if there is
some universal algorithm for deriving an accurate model of games from
all the relevant recorded facts as conceived above, this calculated
model won't be quite right.

My overarching point is that statistics should be presented hedged
about with some idea of their limitations. Partly that's at the
confidence interval level, as in the previous discussion, but I think
it also important when presenting a new tool to give an idea of what
factors are included and point out things which the author realises
are inadequacies.

I devised a rating I called the bowlers' Power Index, which you
calculate by taking sqrt(average*strikerate). To my amazement when I
bunged a couple of hundred Test bowlers through it, the top two were
SF Barnes and MD Marshall, both of whom have many, many people who say
they are the best of all time, and no other statistical analysis I've
seen does that so precisely. I find that a lot of the other results it
comes up with seem very sane, which has made the entries in the list
which surprise me worth looking into. What I contend the measure does
is evaluate bowlers' effectiveness when viewed as strike bowlers.
These are the guys whose wicket-taking is most likely to win you a
Test match. But that's not what's always uppermost in a captain's
mind. There are times when what he wants is a bowler who concedes 0.3
runs an over to tie one end down, with any wickets being a bonus. For
that purpose, he'd need another indicator (ER, pretty obviously).
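
The Power Index formula is easy to compute. A minimal Python sketch, with placeholder bowlers and figures rather than real career records. It assumes a lower index is better, since both bowling average and strike rate are lower-is-better numbers.

```python
import math


def power_index(bowling_average, strike_rate):
    """Geometric mean of bowling average and strike rate."""
    return math.sqrt(bowling_average * strike_rate)


def rank_bowlers(records):
    """Sort bowler names by Power Index, lowest (assumed best) first.

    records maps name -> (bowling_average, strike_rate).
    """
    return sorted(records, key=lambda name: power_index(*records[name]))


if __name__ == "__main__":
    # Hypothetical bowlers; the figures are made up for illustration.
    bowlers = {
        "Bowler A": (25.0, 49.0),
        "Bowler B": (30.0, 60.0),
        "Bowler C": (22.0, 55.0),
    }
    for name in rank_bowlers(bowlers):
        print(name, round(power_index(*bowlers[name]), 2))
```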

Having got interesting results from career figures, further playing
around with it led me to the conclusion that if the result falls
outside the range of just under 20 to just over 70, it's beginning to
break down, and by the time you get to 10 or 100, what you're really
getting is garbage, and the same can be said if you try and analyse
less than about five matches with it. What I conclude from that is
that it's a moderately good approximation in normal circumstances.
Much as Newtonian physics is fine for working out how cricket balls
behave here on Earth but not much use when everything is travelling at
0.999c.

I've seen a lot of people claim that their new analysis is the bee's
knees. I'd like to see a lot more honest admission that since relevant
data is missing, the numerical analysis can only be an approximation,
and highlighting of weaknesses in the model as well as strengths.

Cheers,

Mike

--

Tinniam V Ganesh

Jul 24, 2015, 10:57:11 PM

David,
Thanks for the pointer. Looks like I will have to merge data across multiple tables for the analysis. Will take a look.

Thanks
Ganesh