
Statistics 101


paul

Dec 17, 2009, 7:06:55 PM
For anyone who wants to use simulations to attempt to learn something
about bridge, may I suggest you familiarize yourself with some basic
concepts of statistics? Here's a link to a fine web tutorial:

http://stattrek.com/Lesson4/Estimation.aspx?Tutorial=stat

The topic is "Estimation". When you run a double dummy simulation, the
usual reason is to estimate the number of tricks a particular hand or
combination of hands will produce. Key points:
(1) The average of multiple results is an estimator of the true value.
(2) The more hands you simulate, the better your estimate will be
(more info = better estimate).
(3) However, there is no magic number of hands that allows you to say,
"this is the true answer!" In other words, the sample mean is a point
estimate, and point estimators are basically worthless, ***no matter
how large the sample***.
(4) Instead, the appropriate method is to create an "interval
estimate", a range of plausible values for the answer you're looking
for.
(5) Interval estimates can be constructed from quite small sample
sizes (30 is a typical rule-of-thumb minimum), but the resulting
intervals are apt to be too wide to be of any use.
(6) An interval estimate consists of a midpoint -- the sample mean or
proportion -- and a Margin of Error (MOE) -- a "plus or minus"
statement. A common alternative is to just subtract and add the MOE to
the sample mean and give the upper and lower bounds. This avoids
drawing attention inappropriately to a single value which is almost
certainly NOT "true".
(7) To construct an interval estimate you need four things:
a) x-bar, the sample mean (the average result).
b) s, the sample standard deviation (a measure of the variability of
results).
c) n, the sample size (the number of hands simulated).
d) t, a critical value from Student's t-distribution, which depends on
the sample size and the desired level of confidence. 95% is typical
and for n=1000, t = 1.962.

Got all that? Then your MOE is t*s/sqrt(n) .

Example: Mean # tricks at notrump = 8.2, n = 1000, s = 2.1, 95%
confidence so t = 1.962.
MOE = 1.962*2.1/sqrt(1000) = .13. So the 95% confidence interval is
8.07 to 8.33 tricks.
Now, that wasn't too difficult, was it?
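A minimal Python sketch of the same arithmetic (the numbers are those of the example above; nothing here is specific to bridge):

```python
import math

def mean_ci(xbar, s, n, t_crit):
    """Interval estimate for a mean: xbar +/- t * s / sqrt(n)."""
    moe = t_crit * s / math.sqrt(n)
    return xbar - moe, xbar + moe

# Example from the post: 1000 simulated deals, mean 8.2 tricks,
# s = 2.1, and t = 1.962 for 95% confidence.
low, high = mean_ci(8.2, 2.1, 1000, 1.962)
print(round(low, 2), round(high, 2))  # 8.07 8.33
```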

For other sample sizes, you can get the proper t value from Excel's
TINV function. It needs two parameters. The first is a probability
"alpha", which you get by subtracting your desired confidence level
from 1.00: for 95%, or .95, alpha = .05; for 99%, alpha = .01. The
second is "degrees of freedom", which is just n-1. So, to get t for
95% and a sample size of 100, I would enter
=TINV(.05,99) and Excel tells me 1.984.
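For anyone outside Excel: assuming SciPy is available, its `t.ppf` gives the same critical value. Note that TINV is two-tailed, so the equivalent quantile is 1 - alpha/2:

```python
from scipy.stats import t

# Excel's =TINV(alpha, df) is the two-tailed critical value: the point
# with alpha/2 in the upper tail of Student's t with df degrees of
# freedom.
alpha, df = 0.05, 99
t_crit = t.ppf(1 - alpha / 2, df)
print(round(t_crit, 3))  # 1.984
```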

A slightly different procedure applies when estimating a proportion,
e.g., "How often will 3NT succeed?" Terms:
p-hat: the sample proportion
n: sample size
s: not a distinct value, but calculated directly from p-hat as part
of "se"
se: standard error, calculated as sqrt(p-hat*(1-p-hat)/n)
z: a critical value from the normal distribution. For 95%, z = 1.96;
99%, z= 2.58.

MOE = z*se

Example: p-hat = .60, n = 100, 95% so z = 1.96; MOE = 1.96*sqrt(.6(.4)/
100) = .096
So, with the contract succeeding 60% in 100 tries, we can estimate
with 95% confidence it will succeed between 50.4% and 69.6% in
unlimited tries.
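The same calculation as a short Python sketch (figures from the example above):

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """95% interval for a proportion: p_hat +/- z * sqrt(p(1-p)/n)."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

# Example from the post: 3NT made in 60 of 100 tries.
low, high = proportion_ci(0.60, 100)
print(round(low, 3), round(high, 3))  # 0.504 0.696
```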

A quick estimate of a 95% CI for a proportion is MOE = 1/sqrt(n); so
it's no surprise that with n = 100, our result was good to within
about 10%. This quick estimator is most accurate for proportions close
to 50%. Also, you must have a sufficiently large sample size so that
you can expect at least 5 successes and 5 failures; if p were 97%, for
example, we would expect only 3 failures in 100 tries and so the
sample would be too small.

Finally, how do we know all this works for bridge? The answer lies in
the Central Limit Theorem: no matter what the underlying source of the
data, no matter how it may behave, sample means and sample proportions
will be approximately normally distributed for sufficiently large
sample size. We don't have to calibrate or validate or investigate the
applicability of these methods to any particular field of study,
bridge included. When you start aggregating things, they begin to
behave "normally".

paul

Dec 17, 2009, 8:37:56 PM
On Dec 17, 7:06 pm, paul <paulh...@infi.net> wrote:
Bonus question for the statisticians in the group:

Several posters have identified themselves as statisticians. Me, I
merely tutor the subject, along with math, economics and computer
science. So:

Suppose I'm testing a hypothesis on a proportion. My null hypothesis
is, say, p = .6. I compute my test statistic and let's say it equals
1.8 . At a significance level of .05 (critical value 1.645) that's
sufficient evidence to reject the null and conclude p > .6. Right?
However, it's not sufficient to prove p is not equal to .6, for which
the critical value would be 1.96.

So, the same evidence can be enough to prove something is greater than
something else, but can't prove it's not equal? I know all about one-
tailed and two-tailed tests and how to compute p-values and why we do
it that way, but doesn't this seem illogical? Anyone care to explain
away this apparent paradox?
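A quick check of paul's numbers using the standard normal CDF from Python's standard library (a sketch, treating 1.8 as a z statistic as in the post):

```python
from statistics import NormalDist

z = 1.8  # the observed test statistic

# One-tailed alternative (p > .6): all of alpha in the upper tail.
p_one_tailed = 1 - NormalDist().cdf(z)
# Two-tailed alternative (p != .6): alpha split between both tails.
p_two_tailed = 2 * (1 - NormalDist().cdf(z))

print(round(p_one_tailed, 3))  # 0.036 -> reject at the .05 level
print(round(p_two_tailed, 3))  # 0.072 -> cannot reject at the .05 level
```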

Frisbieinstein

Dec 17, 2009, 9:19:07 PM

I can try to explain it.

The usual procedure is to decide the question to be answered before
collecting the data. This is what the procedures in statistics texts
are oriented towards. If one looks at the data and then makes a
hypothesis then things are done differently.

If I ruled the world this is the one thing every student of statistics
would learn. It is seldom made clear and seldom understood. (I
personally think the way statistics is usually taught is a ridiculous
waste of time.)

So yes, if you look at the data and then make a hypothesis, the
procedures in the basic texts don't apply. If you want to do that, go
ahead: form a hypothesis from the data, then start over from scratch
and collect new data to test that hypothesis.

KS has gone about this correctly. He constructed hands intended to be
better or worse than average. So he had his hypothesis in hand before
collecting the data.

Frisbieinstein

Dec 17, 2009, 9:29:05 PM
On Dec 18, 8:06 am, paul <paulh...@infi.net> wrote:
>
> Finally, how do we know all this works for bridge? The answer lies in
> the Central Limit Theorem: no matter what the underlying source of the
> data, no matter how it may behave, sample means and sample proportions
> will be approximately normally distributed for sufficiently large
> sample size. We don't have to calibrate or validate or investigate the
> applicability of these methods to any particular field of study,
> bridge included. When you start aggregating things, they begin to
> behave "normally".

This is not true. There are conditions to the Central Limit Theorem.
Some theoretical distributions never converge. Averages of bridge
scores will converge, but are badly behaved and converge slowly.

Common errors include non-random data, biased data, and incorrectly
assuming convergence to normality.

paul

Dec 17, 2009, 9:58:19 PM

Of course, and I meant to discuss bias in particular: statistical
methods assume the data is random and unbiased.

Do you agree that Kurt's simulations meet the conditions required by
the Central Limit Theorem, at least if we consider the population of
interest to be all possible DD results? Jog and I question the
applicability to table results, but Travis, for example, wasn't
convinced the method made sense at all. And if the conditions of the
Central Limit Theorem apply, aren't my calculations of confidence
intervals the correct way to assess the accuracy of the results? And
that Kurt's claim that you can't assess reliability from one sim is
pure nonsense?

paul

Dec 17, 2009, 10:05:53 PM

I did not mean to imply that I was choosing what hypothesis to test
after looking at the data; I call that result-shopping and try to
emphasize to my students that changing the experiment after viewing
the results invalidates all the probabilities. (I see this a lot with
multiple regression, tossing variables in and out and then reporting
the results using the same data.)

So, consider two experimenters working independently with the same
data. One wants to test p > .6, the other p not equal to .6. They both
submit their results to be published in the same scientific journal.
Would the journal actually report that p has been shown to be more
than .6, but has not been proven to be different than .6 ?

Frisbieinstein

Dec 17, 2009, 10:47:23 PM
On Dec 18, 10:58 am, paul <paulh...@infi.net> wrote:
> On Dec 17, 9:29 pm, Frisbieinstein <patmpow...@gmail.com> wrote:
>
>
>
> > On Dec 18, 8:06 am, paul <paulh...@infi.net> wrote:
>
> > > Finally, how do we know all this works for bridge? The answer lies in
> > > the Central Limit Theorem: no matter what the underlying source of the
> > > data, no matter how it may behave, sample means and sample proportions
> > > will be approximately normally distributed for sufficiently large
> > > sample size. We don't have to calibrate or validate or investigate the
> > > applicability of these methods to any particular field of study,
> > > bridge included. When you start aggregating things, they begin to
> > > behave "normally".
>
> > This is not true.  There are conditions to the Central Limit Theorem.
> > Some theoretical distributions never converge.  Averages of bridge
> > scores will converge, but are badly behaved and converge slowly.
>
> > Common errors include non-random data, biased data, and incorrectly
> > assuming convergence to normality.
>
> Of course, and I meant to discuss bias in particular: statistical
> methods assume the data is random and unbiased.
>
> Do you agree that Kurt's simulations meet the conditions required by
> the Central Limit Theorem, at least if we consider the population of
> interest to be all possible DD results?

Tricks taken should converge to normality reasonably quickly. 1000
hands is surely enough.

> Jog and I question the
> applicability to table results, but Travis, for example, wasn't
> convinced the method made sense at all. And if the conditions of the
> Central Limit Theorem apply, aren't my calculations of confidence
> intervals the correct way to assess the accuracy of the results?

I don't care about the number of tricks, I want to know whether this
hand is worth a point more than average. I want a yes-no answer. If
you prefer a confidence interval, that's your affair. It is true that
these two answers are closely related and the distinction is somewhat
subtle and not terribly important.

> And
> that Kurt's claim that you can't assess reliability from one sim pure
> nonsense?

I'm not willing to go back over who said what, so I take no view on
this vital issue.

Frisbieinstein

Dec 17, 2009, 10:56:43 PM

Yes, that would be correct. The key thing is that they decide what
they want to test before looking at the data. By making this
commitment one may apply a more powerful test.

One must consider the mind of the reader. Should he open the journal
wishing to know whether p > .6, he may use the evidence that this is
so. Why should he concern himself with an experiment that asks a
different question? If the reader does not care about p then he may
decide that no conclusion has been reached, but since he does not care
this is no great loss.

Bill Jacobs

Dec 18, 2009, 12:21:06 AM
paul <paul...@infi.net> wrote in news:a68f9f59-6cc5-40e5-ab3a-
4e029f...@u7g2000yqm.googlegroups.com:

> On Dec 17, 7:06 pm, paul <paulh...@infi.net> wrote:
> Bonus question for the statisticians in the group:
>
> Several posters have identified themselves as statisticians. Me, I
> merely tutor the subject. along with math, economics and computer
> science. So:

Hmmph. I certainly have a degree in (maths and) statistics, and I don't
have the foggiest idea what you are talking about in your original post,
let alone your bonus question. It's all Greek to me, and I never studied
Greek.

The passage of time will do that to you ...

Cheers ... Bill

(MY bonus question is: Was "Student" of "Student's t-test" fame actually a
student, or was that his real name, or is it simply named that way because
all students should learn it? I've often wondered about that - in fact
it's the only part of my statistics background that I actually recall.)

Charles Brenner

Dec 18, 2009, 1:50:39 AM
On Dec 17, 9:21 pm, Bill Jacobs <bill.jac...@quest.com> wrote:
> paul <paulh...@infi.net> wrote in news:a68f9f59-6cc5-40e5-ab3a-
> 4e029fdff...@u7g2000yqm.googlegroups.com:

I'm going to try to do this without Google, so in grading my answer I
should get a small handicap.

He worked for Guinness and "Student" was his self-chosen nom de plume.
I think I heard that he had a business reason to disguise his
identity; perhaps his employer wouldn't have approved of his
publishing state secrets. He was certainly no amateur at statistics;
he was sophisticated enough to make a fool of Karl Pearson.

Charles

Eddie Grove

Dec 18, 2009, 1:17:31 AM
You skip an important point. If you want 95% confidence, you do not get that
when you have 10 separate values calculated to 95% confidence each.
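Eddie's point, in one line (assuming the ten estimates are independent):

```python
# Probability that all ten independent 95% intervals simultaneously
# cover their true values -- well short of 95%.
joint_coverage = 0.95 ** 10
print(round(joint_coverage, 3))  # 0.599
```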


Eddie

Frisbieinstein

Dec 18, 2009, 2:47:47 AM

That is correct. He had realized that the statistical tests performed
at Guinness were invalid. Management was unreceptive to his
suggestion. He published under a pseudonym to avoid being fired as a
troublemaker.

Frisbieinstein

Dec 18, 2009, 2:55:28 AM

Yes and no.

If you are on a data fishing expedition, messing around looking for
something, perform ten tests at random and find that two are 95%
confidence then you are kidding yourself.

But if you have 10 hypotheses, test them, and find that 8 show a 95%-
confidence result, then one might have great confidence in those,
especially if they are related in some way. This was the case with
Kurt's 9 constructed hands.

If you have 10 independent hypotheses and one pans out, then it is
time for a second test.

Dave Flower

Dec 18, 2009, 4:57:38 AM

There is an alternative approach, entirely practical in these days of
freely available massive computing power. The total number of
different lies of the two defensive hands is 10,400,600. These are a
priori equally likely.
Obviously, the bidding, the opening lead, and possible subsequent
defence eliminate some of them; it may also be necessary to
partially eliminate some. (For example, if the bidding goes 1NT 3NT,
and opening leader has two identical 4-card majors.)
However, this approach has the virtue that sampling errors are
removed.
An example of such an analysis is in a recent thread 'I was surprised
at the odds'.
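Dave's figure is the number of ways to split the 26 unseen cards between the two defenders; a one-line check (Python 3.8+ for `math.comb`):

```python
import math

# Ways to deal 13 of the 26 unseen cards to one defender (the other
# defender gets the rest): C(26, 13).
lies = math.comb(26, 13)
print(lies)  # 10400600
```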

Dave Flower

eleaticus

Dec 18, 2009, 6:26:21 AM

"paul" <paul...@infi.net> wrote in message
news:a68f9f59-6cc5-40e5...@u7g2000yqm.googlegroups.com...

On Dec 17, 7:06 pm, paul <paulh...@infi.net> wrote:


The posing of your question didn't make it clear to me exactly what
you were questioning, but I think I have some necessary info for you.
I hope it is in the right ball park.

First, though, statistical hypothesis testing never "proves" anything.
It merely sets and uses arbitrary standards by which one may say,
"Well, I guess it is/isn't so." Those standards are conventionally
either a little loose (the 5% level) or fairly tight (the 1% level).

OK, about the two critical test statistics in roughly the same
situation.

When you have a thesis/hypothesis that two somethings are not equal,
but have no idea or concern about the relative sizes of the two, you
have a two-tailed test, where your standard level (5%, 1%) is split
evenly between the high and low tails of the test statistic's
distribution.

When your thesis is that A>B or A<B (B a previously "established"
value), only one tail is a valid basis for rejecting the null
hypothesis, and the whole 5% or 1% will be assigned to the relevant
tail.

If it is A>B in question, the null hypothesis is that A<=B, and it
requires a sizeable excess of A over B to reject the null hypothesis.

But not as sizeable as that required in the two-tailed test of A<>B,
because there you have two shots at A not equalling B: A showing up
much lower than B, and A showing up much greater.

In the two-tailed test, either -1.96 or +1.96 will do the job at the
5% level.

In the one-tailed test, A>B requires +1.645 and A<B requires -1.645
(the one-tailed value of the same significance as the two-tailed
1.96).

To summarize, a two-tailed test (A<>B) at 5% puts 2.5% in each tail,
and a one-tailed test (A>B or A<B) puts the whole 5% in just one
tail.

oren


jogs

Dec 18, 2009, 8:59:03 AM
DD has a known bias favoring the defending side.
It's somewhere between 0.3 and 0.5 tricks, depending on
the hand. We are clearly not dealing with unbiased
data. Therefore all results are suspect.

On the notrump hands the bias can be calculated.
Just rerun the entire sample with RHO as declarer.
The opening leader bias should be approximately
1/2 the difference of the two sample means.

KWSchneider

Dec 18, 2009, 9:44:14 AM

Would this not be a good idea for all of the runs? Not just the
notrump ones? It would eliminate any lead bias??

Kurt

paul

Dec 18, 2009, 9:59:57 AM
On Dec 18, 8:59 am, jogs <vspo...@hotmail.com> wrote:

Good methodology, and worth doing. If the results are constant across
a variety of hands such as Kurt studied, then we could reasonably
ignore it in the future. My guess, and I assume yours, is that it
would substantially alter his results.

I would go further: why not test each deal 4 times with each hand
having the opening lead? This might require a technique such as ANOVA
to analyze properly, but Excel provides a variety of methods in their
Data Analysis Toolpack.

Still, until someone validates such results against real-world
declarer play with a single dummy analyzer, I remain a skeptic of DD
analysis. Where DD reinforces common sense (Aces, tens and nines are
undervalued by the 4321 count), fine. Where it contradicts common
sense ("touching honors are death") I maintain such a result is apt to
be due to the difference between perfect and partial information and
not worth altering my evaluation of hands.

Thomas Andrews found that 4333 was the best shape for notrump (among
that and 5332 or 4432); Peter Cheung found a slight advantage for
having a 5 card suit, about an extra 2% chance of making IIRC; someone
posted here recently that the French Bridge Federation studied actual
hands and determined a five card suit was worth .4 points at notrump
(roughly the same as a ten.) So DD was enough to convince me that the
often reported advice to add 1 point for a 5 card suit might be wrong,
but I was skeptical that 4333 was better than 5332. It makes sense
that having a doubleton creates a weakness which the defense might be
able to exploit; DD they always do, real-life they often don't, giving
declarer time to set up his source of tricks.

jogs

Dec 18, 2009, 10:00:50 AM

In trump suits there's a problem. N-S and W-E will choose
different suits as trumps. The mean differences will be more
due to choice of trump suit than the opening lead.

Charles Brenner

Dec 18, 2009, 12:33:23 PM

That's a clever idea and it corrects for something -- which we might
call the "first strike" advantage. But I think what it corrects for
has little or nothing to do with the bias between single dummy
performance and double dummy.

Charles

OldPalooka

Dec 18, 2009, 2:01:42 PM

I agree, but the little voice tells me that defender's dd advantage is
likely to be reasonably close to the inverse of declarer's
advantage. There is a clear relationship in the opening lead for a
start.

-- Bill

Andrew

Dec 18, 2009, 2:29:21 PM

Thomas may wish to weigh in on this statement. I suspect he would say
it is a misinterpretation of his results. Thomas found that 4-3-3-3 is
slightly better than 4-4-3-2 and 5-3-3-2 for NT purposes when tricks
are measured assuming *any shape* for partner. In practice, when
partner holds a distributional hand, the final contract is rarely
played in NT. So Thomas's trick valuations for shapes are misleading,
since many of the hands where 4-3-3-3 performs well in NT are ones
that in practice will not play in NT. When he constrained responder to
holding balanced patterns, 4-3-3-3 became a net negative for play in
NT. Drawing conclusions from double dummy analysis is not easy.

On a separate thread, I truly appreciate Kurt's efforts. Whether or
not one agrees with Kurt's conclusions, the topic is worth discussing
and I have learned from reading all sides of the debate.


Andrew

Fred.

Dec 18, 2009, 3:28:54 PM
On Dec 17, 7:06 pm, paul <paulh...@infi.net> wrote:

Paul,

While there are statistical methods which are independent of the
underlying distribution, you really do have to have an underlying
statistical model to apply the Central Limit Theorem. Otherwise, you
have no assurance that the mean, or whatever other statistic you're
attempting to estimate, actually exists. And there are some fairly
simple experiments fitting models with no means. My best guess is
that most experiments posted on rgb meet the requirements of the
theorem, but who checks?

I'd also like to point out that the pseudo-random number generators
supporting the simulations are mildly suspect. The actual
verification of a generator using a spectral test is a very complex
task, requiring difficult programming and a great deal of processing.
I once looked at it in Knuth's _The Art of Computer Programming_ and
my reaction was "wow!". And I'm not easy to intimidate as a
programmer. It would be easier to achieve the very large precision
arithmetic required using today's object oriented languages, but it
would still go through lots of processor time. (Yes, I know what
gigahertz are.)

While I'm sure it must have happened, I've never actually heard of
anyone running one. I suspect that what most people do to meet
schedule is to take a generally accepted algorithm for the generator
and run some simple tests to make sure they haven't skewed it in some
obvious way.

But I don't worry about it too much. I doubt that anyone on rgb is
going to take any one simulation result for the absolute truth.

Fred.


jogs

Dec 18, 2009, 4:33:26 PM
Charles Brenner wrote:

>
> That's a clever idea and it corrects for something -- which we might
> call the "first strike" advantage. But I think what it corrects for
> has little or nothing to with the bias between single dummy
> performance and double dummy.
>
> Charles

You're right. It only estimates 'first strike' advantage.
I worded it badly.

KWSchneider

Dec 18, 2009, 7:10:58 PM

I've figured out how to run the GIB engine in SD mode from DOS - i.e.
GIB will play all of the hands. I can set it up so the auction goes 1N
AP and the lead will automatically be "selected" by the GIB robot for
West [South declares]. However, I haven't been able to figure out how
to:

a) batch the deals [I have to run 1 at a time - I'm trying to rework
DEAL to use the SD switches and output]. Although I can create a batch
file, I need to first create 100 txt files to run.
b) capture the output to a file - note the output is an MP score which
I have to convert to tricks [easy enough, but a pain].

I plan on "fixing" south's hand with a 10pt count and something like
AQxx KJx xxx xxx but I need to put some constraints on the other hands
since I'm not running 1000 SIMS - probably 100 or so. The DD will be
run in notrump only.

Then we can compare the SD result [mean + Sdev] with the DD result
[mean + Sdev]

Ideas?

Cheers,
Kurt

castigamatti

Dec 18, 2009, 7:28:23 PM

Only quarks tonight in the sky.
:-)

BR

Charles Brenner

Dec 18, 2009, 7:43:46 PM

"inverse"? "relationship in the opening lead"?? Usually I agree with
what you say. This time I don't even have any idea what you mean. For
sure, it's difficult to even think through clearly, let alone put into
words, the exact relationship between SD vs. DD bias and Jog's
experiment. We might need a diagram.

Charles

paul

Dec 18, 2009, 10:23:59 PM

Excellent! I wonder if the designer of GIB could help -- doesn't seem
like it would be difficult to wrap an SD solver around the GIB engine.
Anyone know how to contact him?

KWSchneider

Dec 18, 2009, 11:53:22 PM

I've talked to Fred Gitelman - he's said that they cannot provide
support and that I'm on my own. As I said, I can generate results but
it doesn't appear easy to automate the process for multiple SIMS. The
problem is I need to capture the results to a file [since they provide
the play of each card], parse it, extract the result and convert to
tricks - then move on to the next deal.

Something for the holidays I expect...

Cheers,
Kurt

KWSchneider

Dec 19, 2009, 10:22:59 AM
On Dec 18, 9:59 am, paul <paulh...@infi.net> wrote:

I would like to point out that the "lead" bias has already been
determined, at least for the "no constraint" situation. For example,
if I deal 100,000 hands at random and play them all at notrump,
declarer will make 6.02 tricks [+/-.03]. Since ALL of the hands are
equally likely to have the same cards at some point, the declarer
location is meaningless - hence the result shows that for 10pts in all
hands the DD "lead bias" is exactly 0.5 tricks for notrump. I noted
this in my blog last year.

Or stated another way - we toss away 1/2 a trick defending 1N on
average when each pair's combined assets are 20 pts.

Whether this is the same bias for a specifically constrained hand such
as Kxx xxx Axxx Axx, I can't say - although one would expect with this
shape, the lead bias would be less, since it is harder to get the lead
wrong.

Cheers,
Kurt

jogs

Dec 19, 2009, 10:36:58 AM

Just do 10 boards. Post the boards. Then we can see
if we agree that the methodology is valid.
Also no xxx. Replace the xx's with the lowest available
card. xxx xxx Axxx AKx becomes 432 432 A432 AK2.
QJx QJx Qxxx QJx becomes QJ2 QJ2 Q432 QJ2.
This will reduce the unknown cards to 39.

Make sure we are comparing apples with apples. Mean
and Sdev can wait. Just post the number of tricks
expected by us and RHO.

Nick France

Dec 19, 2009, 10:37:34 AM
> Kurt

Your assumption that we throw away half a trick when defending
notrump is not supported by the results. Did you actually find that
real-life play made on average 6.5 tricks? All you have shown is that
the advantage of being on lead in this situation is worth half a
trick.

Nick France

KWSchneider

Dec 19, 2009, 1:42:49 PM

In DD SIMS, if you have the lead you make 7 tricks and if you don't
you make 6 tricks. This follows the discussion earlier in this or the
other post - and corroborates some anecdotal data from Peter Cheung,
who found a 0.3-0.5 trick bias. Since the expected result with exactly
1/2 of the points is 1/2 of the tricks, there is definitely a lead
bias in DD notrump. Whether there is an equivalent one in real life
remains to be seen. I hope to clarify this in the next week.

Kurt

paul

Dec 19, 2009, 7:52:45 PM

I found www.GIBware.com online, with sales and tech support phone
numbers and email addresses. Since the "G" stands for Ginsberg, not
Gitelman, is Ginsberg still affiliated with the software? I'm too poor
to purchase anything at the moment so I'll let others try contacting
them.

jogs

Dec 19, 2009, 8:39:48 PM

The amount of the bias probably isn't constant across
all boards. When HCP are nearly equal the bias is
probably(?) greater. With slam-range hands the bias
is probably smaller. Still, probably many more slams
are beaten double dummy than in real play. I still think
boards imported from an online site will give better mean
results than any Monte Carlo using DD. By better I
mean closer to reality.

Frisbieinstein

Dec 19, 2009, 10:50:14 PM

I'm not surprised. I think you might have more luck with a commercial
product like Jack. Maybe they already have such a feature. If you can
get enough people to promise to pay enough money for it, you will get
it.

It would be an excellent tool for designing bidding systems, which is
the ultimate goal. Four players all with exactly the same
capabilities that play at the speed of light all day and night.
What's not to like?

By the way, there is a computer bridge championship every year.
Moscito is the winning system so far.

Frisbieinstein

Dec 19, 2009, 10:52:39 PM

That is quite an interesting result. I think it has put me in the
"bid NT quick" camp as opposed to the "ask for stoppers" faction. The
most important thing is to give the defenders little to go on.

Frisbieinstein

Dec 19, 2009, 10:56:51 PM
On Dec 20, 9:39 am, jogs <vspo...@hotmail.com> wrote:
> I still think
> boards imported from an online site will give better mean
> results than any monte carlo using DD.  By better I
> mean closer to reality.

A data base of a million or more online games would be a very good
tool, if anyone wishes to go to the trouble. I would take that over
double dummy.

If an analysis of that data says AQ KJ is better than AK QJ then I
would be very inclined to believe that. Whether there is a materially
significant difference is another matter.

KWSchneider

Dec 19, 2009, 11:33:11 PM

Be careful with the "speed of light" comment. I've run a couple of SD
SIMS - for each play of each player at the table, GIB evaluates up to
250 different possibilities. Bottom line: it takes as long as 10
seconds for an individual SIM. 100 SIMS won't push the envelope but
1000 SIMS will start to be a pain, especially for multiple hand
configurations.

Kurt

KWSchneider

Dec 19, 2009, 11:33:38 PM
> I found www.GIBware.com online, with sales and tech support phone

> numbers and email addresses. Since the "G" stands for Ginsberg, not
> Gitelman, is Ginsberg still affiliated with the software? I'm too poor
> to purchase anything at the moment so I'll let others try contacting
> them.

Fred contacted me - after I sent an email to their tech support. I'm
on my own, but I think I'll have it solved in a few days...

Kurt

Travis Crump

Dec 20, 2009, 12:51:59 AM

Shouldn't this take all of 5-10 lines of perl? What language are you
trying to do it in? Presumably one you are familiar with, I can't
imagine it's that hard...

Jürgen R.

Dec 20, 2009, 8:33:29 AM
William Gosset was the name of the person using Student
as a pseudonym.

Bill Jacobs wrote:
> paul <paul...@infi.net> wrote in news:a68f9f59-6cc5-40e5-ab3a-
> 4e029f...@u7g2000yqm.googlegroups.com:


>
>> On Dec 17, 7:06 pm, paul <paulh...@infi.net> wrote:

>> Bonus question for the statisticians in the group:
>>
>> Several posters have identified themselves as statisticians. Me, I
>> merely tutor the subject. along with math, economics and computer
>> science. So:
>
> Hmmph. I certainly have a degree in (maths and) statistics, and I
> don't have the foggiest idea what you are talking about in your
> original post, let alone your bonus question. It's all Greek to me,
> and I never studied Greek.
>
> The passage of time will do that to you ...
>
> Cheers ... Bill
>
> (MY bonus question is: Was "Student" of "Student's t-test" fame
> actually a student, or was that his real name, or is it simply named
> that way because all students should learn it? I've often wondered
> about that - in fact it's the only part of my statistics background
> that I actually recall.)

Jürgen R.

unread,
Dec 20, 2009, 8:45:04 AM12/20/09
to
paul wrote:
[...]
> So, the same evidence can be enough to prove something is greater than
> something else, but can't prove it's not equal? I know all about one-
> tailed and two-tailed tests and how to compute p-values and why we do
> it that way, but doesn't this seem illogical? Anyone care to explain
> away this apparent paradox?

This is the kind of nonsense usenet was made for. If anybody actually is
paying you to 'teach statistics' you had best hide your real name.

If you think that statistical methods are used to prove equalities
and inequalities you have misunderstood very basic ideas.

paul

unread,
Dec 20, 2009, 9:20:58 AM12/20/09
to

Bah, pointless nitpicking. Our textbooks call the methodology "proof
by contradiction." The same data is sufficient to reject the (null)
hypothesis that p >= .6 with 95% confidence, but insufficient to
reject the hypothesis that p = .6 . Still sounds illogical.
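The arithmetic behind this example can be checked with Python's standard library — a sketch only, using `NormalDist` for the large-sample normal approximation (the z of 1.8 and alpha of .05 are taken from the example above):

```python
from statistics import NormalDist

z = 1.8                                # the example test statistic
alpha = 0.05
upper_tail = 1 - NormalDist().cdf(z)   # P(Z > 1.8)

p_one_tailed = upper_tail              # H1: p > .6
p_two_tailed = 2 * upper_tail          # H1: p != .6

print(f"one-tailed p = {p_one_tailed:.3f}, reject H0: {p_one_tailed < alpha}")
print(f"two-tailed p = {p_two_tailed:.3f}, reject H0: {p_two_tailed < alpha}")
```

The same data rejects one-tailed (p about .036) but not two-tailed (p about .072), which is exactly the apparent paradox.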

Jürgen R.

unread,
Dec 20, 2009, 10:04:27 AM12/20/09
to

In this context, where the distributions are bounded and discrete,
the first two moments always exist; and if successive trials
are independent the central limit theorem is applicable.
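As a concrete sketch of what the CLT buys you here (hypothetical trick counts, stdlib only; with n in the hundreds, the normal critical value 1.96 is a close stand-in for Student's t):

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(1)
# Hypothetical double-dummy results: tricks taken on 100 simulated deals.
tricks = [random.choice([8, 9, 9, 10, 10, 10, 11]) for _ in range(100)]

n = len(tricks)
xbar = mean(tricks)
s = stdev(tricks)
z = NormalDist().inv_cdf(0.975)   # ~1.96; close to Student's t for n = 100
moe = z * s / n ** 0.5            # margin of error

print(f"estimated tricks: {xbar:.2f} +/- {moe:.2f} (95% interval)")
```

The interval, not the bare mean, is the estimate worth reporting.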

>
> I'd also like to point out that the pseudo-random number generators
> supporting the simulations are mildly suspect. The actual
> verification of a generator using a spectral test is a very complex
> task, requiring difficult programming and a great deal of processing.
> I once looked at it in Knuth's _The Art of Computer Programming_ and my
> reaction was "wow!". And, I'm not easy to intimidate as a programmer. It
> would be easier to achieve the very large precision arithmetic required
> using today's object oriented languages, but it would still go through
> lots of processor. (Yes, I know what gigahertz are.)

You are introducing another pseudo-problem. Not only does Knuth give
good advice on how to avoid problems but for the specific case of
bridge dealing programs van Staveren's code is available.
In case of doubts about a specific dealing program, it is quite easy to test
for correctness.
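Such a test needn't be elaborate. Here is a minimal sketch (not van Staveren's code, just stdlib Python) of one sanity check: deal many times and confirm that a fixed card lands in each seat about equally often.

```python
import random
from collections import Counter

random.seed(7)
deck = list(range(52))          # card 0 stands in for the spade ace
n_deals = 10_000
seat_counts = Counter()

for _ in range(n_deals):
    random.shuffle(deck)
    seat_counts[deck.index(0) // 13] += 1   # which seat (0..3) got card 0

expected = n_deals / 4
chi2 = sum((seat_counts[s] - expected) ** 2 / expected for s in range(4))
print(f"chi-square = {chi2:.2f} (df=3; 5% critical value 7.815)")
```

A fair dealer keeps the statistic below the critical value most of the time; a real verification would also test pairs of cards and full suit patterns.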

Jürgen R.

unread,
Dec 20, 2009, 10:14:03 AM12/20/09
to
jogs wrote:

> On Dec 18, 6:44 am, KWSchneider <questionofbala...@yahoo.com> wrote:
>> On Dec 18, 8:59 am, jogs <vspo...@hotmail.com> wrote:
>>
>>> DD has a known bias favoring the defending side.
>>> It's somewhere between 0.3 to 0.5 depending on
>>> the hand. We are clearly not dealing with unbiased
>>> data. Therefore all results are suspicious.
>>
>>> On the notrump hands the bias can be calculated.
>>> Just rerun the entire sample with RHO as declarer.
>>> The opening leader bias should be approximately
>>> 1/2 the difference of the two sample means.
>>
>> Would this not be a good idea for all of the runs? Not just the
>> notrump ones? It would eliminate any lead bias??
>>
>> Kurt
>
> In trump suits there's a problem. N-S and W-E will choose
> different suits as trumps. The mean differences will be more
> due to choice of trump suit than the opening lead.

Since you are only interested in the difference in the number of
tricks, you can force the simulator to play an unreasonable
contract, i.e. a contract in the opponents' suit.
However, it doesn't seem to be true that 1/2 this difference
is at all the same as the dd-vs-sd lead advantage.

Jürgen R.

unread,
Dec 20, 2009, 10:36:16 AM12/20/09
to
paul wrote:

Previously you were talking about p > 0.6, now p >= 0.6. That's
an essential difference for your question.
What textbook are you referring to, just out of curiosity?

Thomas Andrews

unread,
Dec 20, 2009, 12:53:18 PM12/20/09
to
On Dec 18, 2:29 pm, Andrew <agump...@gmail.com> wrote:

> On Dec 18, 6:59 am, paul <paulh...@infi.net> wrote:
>
>
>
> > On Dec 18, 8:59 am, jogs <vspo...@hotmail.com> wrote:
>
> > > DD has a known bias favoring the defending side.
> > > It's somewhere between 0.3 to 0.5 depending on
> > > the hand. We are clearly not dealing with unbiased
> > > data. Therefore all results are suspicious.
>
> > > On the notrump hands the bias can be calculated.
> > > Just rerun the entire sample with RHO as declarer.
> > > The opening leader bias should be approximately
> > > 1/2 the difference of the two sample means.
>
> > Good methodology, and worth doing. If the results are constant across
> > a variety of hands such as Kurt studied, then we could reasonably
> > ignore it in the future. My guess and I assume yours is that it would
> > substantially alter his results.
>
> > I would go further: why not test each deal 4 times with each hand
> > having the opening lead? This might require a technique such as ANOVA
> > to analyze properly, but Excel provides a variety of methods in their
> > Data Analysis Toolpack.
>
> > Still, until someone validates such results against real-world
> > declarer play with a single dummy analyzer, I remain a skeptic of DD
> > analysis. Where DD reinforces common sense (Aces, tens and nines are
> > undervalued by the 4321 count), fine. Where it contradicts common
> > sense ("touching honors are death") I maintain such a result is apt to
> > be due to the difference between perfect and partial information and
> > not worth altering my evaluation of hands.
>
> >Thomas Andrewsfound that 4333 was the best shape for notrump (among

> > that and 5332 or 4432);
>
> Thomas may wish to weigh in on this statement. I suspect he would say
> it is a misinterpretation of his results. Thomas found that 4-3-3-3 is
> slightly better than 4-4-3-2 and 5-3-3-2 for NT purposes when tricks
> are measured assuming *any shape* for partner. In practice, when
> partner holds a distributional hand, the final contract is rarely
> played in NT.

And the main reason (possibly the entire reason) for this was that the
average doubleton includes Qx and Jx, which makes the average 4432
weaker. This is also probably the reason that the 5440 pays better
than 4441 in notrump, on average - because stiff honors are included
in the averages for 4441.

And the rest of what Andrew says is true, too. Basically, you want to
know how well a shape plays in notrump *when you are likely to play in
notrump.* I did some research when this came up, and it was pretty
clear that 4333 is a negative when you want to play in notrump.

I actually added an article to my site about this a while back:

http://bridge.thomasoandrews.com/valuations/experts4333.html

The results are far from conclusive, and I doubt that 4333 is worth
deducting a whole point from a hand, but the expert practice of treating
4333 as a weakness when playing notrump is certainly vindicated in
this experiment.

eleaticus

unread,
Dec 20, 2009, 12:58:12 PM12/20/09
to

"paul" <paul...@infi.net> wrote in message
news:6ac33b6f-44be-4c97...@t12g2000vbk.googlegroups.com...

On Dec 20, 8:45 am, Jürgen R. <jurg...@web.de> wrote:
> paul wrote:
>
> [...]
>
> > So, the same evidence can be enough to prove something is greater than
> > something else, but can't prove it's not equal? I know all about one-
> > tailed and two-tailed tests and how to compute p-values and why we do
> > it that way, but doesn't this seem illogical? Anyone care to explain
> > away this apparent paradox?
>
> This is the kind of nonsense usenet was made for. If anybody actually is
> paying you to 'teach statistics' you had best hide your real name.
>
> If you think that statistical methods are used to prove equalities
> and inequalities you have misunderstood very basic ideas.

PAUL:


Bah, pointless nitpicking. Our textbooks call the methodology "proof
by contradiction." The same data is sufficient to reject the (null)
hypothesis that p >= .6 with 95% confidence, but insufficient to
reject the hypothesis that p = .6 . Still sounds illogical.

oren:
I am surprised your textbook names the basis of the methodology at all.

But statistical hypothesis testing is a "weak" case of reductio ad absurdum.
Weak because it doesn't prove, it suggests.

And I thoroughly explained before in this thread why the numerical standards
in the two cases seem different. They are. One-tailed vs two-tailed.

oren


jogs

unread,
Dec 20, 2009, 1:17:08 PM12/20/09
to

<
< A data base of a million or more online games would be a very good
< tool, if anyone wishes to go to the trouble.  I would take that over
< double dummy.
<

Not as good as I thought.

xxx xxx Axxx AKx

where the x's are between 2 and 9.

From one million boards we would be lucky to find 35
occurrences which meet the criteria. Each board would
probably be played about 100 times.
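jogs' back-of-envelope figure can be checked with stdlib combinatorics. A sketch, which assumes the ten also counts as an "x" (nine spot cards per suit) — an assumption, but one the figure of roughly 35 appears to make:

```python
from math import comb

# Hands of exact shape  S: xxx  H: xxx  D: Axxx  C: AKx,
# treating any of the nine spot cards 2-T as an "x" (an assumption).
spots = 9
matching = comb(spots, 3) ** 3 * comb(spots, 1)  # S, H, D (A fixed), C (AK fixed)
p_hand = matching / comb(52, 13)
per_million = p_hand * 4 * 1_000_000   # any of the four seats may hold it

print(f"~{per_million:.0f} matching boards per million")   # ~34
```

Restricting the x's strictly to 2-9 (eight spot cards) shrinks the count further, to under ten per million boards.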

jogs

unread,
Dec 20, 2009, 1:19:45 PM12/20/09
to
On Dec 19, 8:33 pm, KWSchneider <questionofbala...@yahoo.com> wrote:

< Be careful with the "speed of light" comment. I've run a couple of SD
< SIMS - for each play of each player at the table, GIB evaluates up to
< 250 different possibilities. Bottom line it takes as long as 10
< seconds for an individual SIM. 100 SIMS won't push the envelope but
< 1000 SIMS will start to be a pain, especially for multiple hand
< configurations.
<
< Kurt

This isn't poker, where one can have one billion iterations
in under ten seconds.

castigamatti

unread,
Dec 20, 2009, 2:17:00 PM12/20/09
to

It isn't chess either.

BR

Frisbieinstein

unread,
Dec 20, 2009, 7:52:19 PM12/20/09
to

That is true. Practically all real-world distributions are bounded.
But in practically all real-world experiments sample size is also
bounded, so one may not simply invoke the central limit theorem and be
assured that everything works out.

Frisbieinstein

unread,
Dec 20, 2009, 8:44:47 PM12/20/09
to

According to my calculations about one in every fifty hands has xxx
xxx Axxx AKx. At four hands per board that is one in 13 boards.
That makes about 75,000 boards out of one million. That ought to be
enough.

Frisbieinstein

unread,
Dec 20, 2009, 9:15:47 PM12/20/09
to
On Dec 21, 2:19 am, jogs <vspo...@hotmail.com> wrote:
> On Dec 19, 8:33 pm, KWSchneider <questionofbala...@yahoo.com> wrote:
> .
>
>  Be careful with the "speed of light" comment.
>
> This isn't poker, where one can have one billion iterations
> in under ten seconds.

Hey, it beats keeping a thousand LOLs chained up in the barn.

KWSchneider

unread,
Dec 20, 2009, 9:50:38 PM12/20/09
to

Thank you for vindicating my position - SIM is the only way. I'm close
to getting the SD up and running.

Kurt

paul

unread,
Dec 20, 2009, 11:04:50 PM12/20/09
to
On Dec 20, 10:36 am, Jürgen R. <jurg...@web.de> wrote:
> paul wrote:

"p>.6" was, of course, one of the alternate hypotheses. I recast in
terms of rejecting or not rejecting the null hypothesis, since you
objected to "proving" anything. The point I was making is unaffected
by the change. The textbook I've used most is at work, it's by Keller.

I'm going to drop the whole thing. I mentioned this issue to several
of my fellow statistics tutors; they had no trouble grasping the
apparent contradiction. It was intended as a mildly puzzling and
amusing aside.

paul

unread,
Dec 20, 2009, 11:09:16 PM12/20/09
to
On Dec 18, 6:26 am, "eleaticus" <eleati...@bellsouth.net> wrote:
> "paul" <paulh...@infi.net> wrote in message
>
> news:a68f9f59-6cc5-40e5...@u7g2000yqm.googlegroups.com...

> On Dec 17, 7:06 pm, paul <paulh...@infi.net> wrote:
>
> paul:
> ===========

> Bonus question for the statisticians in the group:
>
> Several posters have identified themselves as statisticians. Me, I
> merely tutor the subject, along with math, economics and computer
> science. So:
>
> Suppose I'm testing a hypothesis on a proportion. My null hypothesis
> is, say, p = .6. I compute my test statistic and let's say it equals
> 1.8 . At a significance level of .05 (critical value 1.645) that's
> sufficient evidence to reject the null and conclude p > .6. Right?
> However, it's not sufficient to prove p is not equal to .6, for which
> the critical value would be 1.96.

>
> So, the same evidence can be enough to prove something is greater than
> something else, but can't prove it's not equal? I know all about one-
> tailed and two-tailed tests and how to compute p-values and why we do
> it that way, but doesn't this seem illogical? Anyone care to explain
> away this apparent paradox?
> ============
>
> The posing of your question didn't make it clear to me exactly what you were
> questioning, but I think I have some necessary info for you. I hope it is in
> the right ball park.
>
> First, though, Statistical Hypothesis Testing never "proves" anything.  It
> merely sets/uses arbitrary standards by which one may say "Well, I guess it
> is/isn't so."  Those standards standardly are a little loose (5% level) or
> fairly tight (1% level).
>
> OK, about two critical test statistics in roughly the same situation.
>
> When you have a thesis/hypothesis that two somethings are not equal, but
> have no idea or concern about the relative sizes of the two, you have a
> two-tailed test statistic, where your standard level (5%, 1%) is split evenly
> between the high and low tails of the test statistic distribution.
>
> When your thesis is that A>B or A<B (B a previously  "established" value),
> only one tail is a valid basis for rejecting the hypothesis of the
> difference, and the whole 5% or 1% will be assigned to the relevant tail.
>
> If it is A>B in question, the null hypothesis is that A<=B and it requires a
> sizeable A>B to decide to reject the null hypothesis.
>
> But not as sizeable as required to reject A<>B because you have in this case
> two shots at A not equalling B.  A showing up much lower than B, and A
> showing much greater.
>
> In the two-tailed test (if I remember what 1.96 represents) either -1.96 or
> 1.96 will do the job.
>
> In the one-tailed test A>B a +1.80 is required (if that's the right number
> of the same "significance" as 1.96;  I have no table) and A<B requires -1.80
> .
>
> To summarize, a two-tailed test (A<>B) at 5% is 2.5% at both tails, and a
> one-tailed test (A>B, A<B) requires the whole 5% at just one tail.
>
> oren

Thanks for the effort, I'll resist the temptation to be snide and just
say forget it.

eleaticus

unread,
Dec 20, 2009, 11:11:31 PM12/20/09
to

"paul" <paul...@infi.net> wrote in message
news:9f162593-7b77-4fce...@t19g2000vbc.googlegroups.com...

Paul:


Thanks for the effort, I'll resist the temptation to be snide and just
say forget it.

oren:
You have been told by others also.

eleaticus

unread,
Dec 20, 2009, 11:21:56 PM12/20/09
to

"paul" <paul...@infi.net> wrote in message
news:c0644b49-d0f1-452e...@r26g2000vbi.googlegroups.com...
On Dec 20, 10:36 am, Jürgen R. <jurg...@web.de> wrote:


Jurgen:


> Previously you were talking about p > 0.6, now p >= 0.6. That's
> an essential difference for your question.
> What textbook are you referring to, just out of curiosity?

Paul:


"p>.6" was, of course, one of the alternate hypotheses. I recast in
terms of rejecting or not rejecting the null hypothesis, since you
objected to "proving" anything. The point I was making is unaffected
by the change. The textbook I've used most is at work, it's by Keller.

I'm going to drop the whole thing. I mentioned this issue to several
of my fellow statistics tutors; they had no trouble grasping the
apparent contradiction. It was intended as a mildly puzzling and
amusing aside.

oren:
Jurgen had missed your original postings apparently.

Your fellow statistics!!!??? tutors are incredibly ignorant, but I do
wonder what textbook it was that mentioned reductio ad absurdum in the
context of statistical hypothesis testing.

I am not a collector, nor aficionado of stat texts, but I have never seen
one do so.

Possibly because of the conflict with Student's one-sample t usage.

(I leave it to the student to figure that out.)

oren


Charles Brenner

unread,
Dec 20, 2009, 11:33:42 PM12/20/09
to

Paul's paradox, as you know and have explained, derives from one-
tailed versus two-tailed significance tests. The problem is that Paul
stipulated in advance that he understands such things. In other words,
he basically admitted that he knows the answer but is not satisfied
with it. Therefore there is nothing mathematical that will satisfy
him. Possibly there exists a really clear exposition using just the
right "magic words" that would, or maybe not.

Charles

eleaticus

unread,
Dec 21, 2009, 2:28:14 AM12/21/09
to

"Charles Brenner" <cbre...@berkeley.edu> wrote in message
news:556f3124-8952-4765...@2g2000prl.googlegroups.com...

On Dec 20, 8:11 pm, "eleaticus" <eleati...@bellsouth.net> wrote:
> "paul" <paulh...@infi.net> wrote in message
>
> Paul:
> Thanks for the effort, I'll resist the temptation to be snide and just
> say forget it.
>
> oren:
> You have been told by others also.

Charles:


Paul's paradox, as you know and have explained, derives from one-
tailed versus two-tailed significance tests. The problem is that Paul
stipulated in advance that he understands such things. In other words,
he basically admitted that he knows the answer but is not satisfied
with it. Therefore there is nothing mathematical that will satisfy
him. Possibly there exists a really clear exposition using just the
right "magic words" that would, or maybe not.

oren:
Which country commands your efforts in its diplomatic corps?


paul

unread,
Dec 21, 2009, 10:34:27 AM12/21/09
to

Thanks, at least one person gets it. Everyone I've explained this to
in person gets it. Zeno's paradox required, I believe, the concept of
limits to resolve. Perhaps this is similar, something about the fuzzy
nature of probabilistic evidence. At the limit, sufficient evidence to
conclude with 100% confidence that p>.6 would also be sufficient to
conclude p is not equal to .6 . That this is not true for lower levels
of confidence is counter-intuitive. I thought perhaps there was a well-
known resolution beyond one vs. two-tailed.

paul

unread,
Dec 21, 2009, 10:50:04 AM12/21/09
to
On Dec 20, 11:21 pm, "eleaticus" <eleati...@bellsouth.net> wrote:

>
> Your fellow  statistics!!!??? tutors are incredibly ignorant

You, sir, are an ass.

Fred.

unread,
Dec 21, 2009, 11:44:09 AM12/21/09
to
> > Fred.

I wasn't making any accusations. I was just suggesting that
validation might be limited by its difficulty. And, there is a case
for having an economically viable tool with limited testing rather
than no tool at all.

But, just because a solution to a problem exists doesn't mean that it
is a pseudo-problem, and certainly doesn't mean that the solution has
been applied.

Fred.

Charles Brenner

unread,
Dec 21, 2009, 1:16:05 PM12/21/09
to

While I agree that there's something confusing going on, I don't think
there's anything profound about it. Rather, I think the "paradox" is
superficial and at root it's merely semantic, resting on the
misleading (i.e. arbitrary and contrived) way that the classical
hypothesis-testing branch of statistics uses words. I come to this
conclusion by privately translating the statistical terms into
mathematics (i.e. Mathematics), whereupon the source of the confusion
becomes clear to me and the paradox is therefore resolved in my mind.

I won't waste breath spelling out my private translation. My
experience with this type of thing is that each person needs to find a
resolution in their own terms, and what is a suitably comforting
explanation to you might seem to me not particularly different from
another explanation that doesn't satisfy you. (Hence my term "magic
words.")

Charles

Frisbieinstein

unread,
Dec 21, 2009, 10:03:13 PM12/21/09
to

The one sided test is always more powerful than the two sided. There
is no such thing as 100% confidence in statistics. So the limit
argument makes no sense.

Frisbieinstein

unread,
Dec 21, 2009, 10:06:38 PM12/21/09
to

I think this is correct. He's got to find his own intuition. To me
the problem is a false analogy between hypothesis testing and
mathematics. In mathematics it is not possible for a number to be
both greater than x and equal to x. But this isn't
mathematics, it's asking questions about something partially known.
Ask two different questions, get two different answers.
