Doubt in the formula

RAMS

Jun 15, 2006, 9:08:11 AM

to MedStats

Hi all,

I have entered information for 1500 subjects. We don't have time
for double entry. I want to check the quality of data entry through a
random sample. I want a formula for calculating n with the help of
N=1500.

One of my friends said

n=sqrt(N).

Is it correct?

Anyone please help me.

Thanks in advance.

John Uebersax

Jun 16, 2006, 3:03:23 AM

to MedStats

Perhaps someone can give a better answer, but I seem to recall people
sticking with round numbers here--like 5% or 10% audits.

Hope this helps.

--
John Uebersax PhD

Martin P. Holt

Jun 17, 2006, 5:55:32 AM

to MedS...@googlegroups.com

Hi,

I Google searched on 'survey' and 'sample size' and looked at the resulting
links: this one looks useful:

http://www.ryerson.ca/~mjoppe/ResearchProcess/SurveySampleSize.htm

HTH

Martin

Simon, Steve, PhD

Jun 20, 2006, 11:37:27 AM

to MedS...@googlegroups.com

ramsa...@gmail.com writes:

> I have entered information for 1500 subjects. We don't have time
> for double entry. I want to check the quality of data entry through a
> random sample. I want a formula for calculating n with the help of
> N=1500.

I think a good general rule is to expend roughly 5 to 10% of your
resources on quality control, so that mirrors a previous comment made.

If you wanted a more rigorous justification for the sample size needed
in your audit, specify the maximum error rate (P) that you are willing
to tolerate. Then sample 3/P records. If you find any errors, check all
of your data. If you find no errors then you are 95% confident that the
error rate is less than P.

So for example, if you are willing to tolerate up to a 2% error rate,
then sample 3/0.02=150 records. If you notice one or more errors, then
you may have an unacceptably high error rate and you need to check the
entire data set.

This method is based on the famous "rule of three".

http://www.childrensmercy.org/stats/size/zeroevents.asp
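
A minimal sketch of the rule in R, assuming the records are drawn independently (which is only approximately true for a finite data set); the variable names P and n are illustrative:

```r
# Rule-of-three audit size (sketch; assumes independent draws)
P <- 0.02                 # largest error rate you are willing to tolerate
n <- ceiling(3 / P)       # number of records to audit
n                         # 150 for P = 0.02

# Probability of finding no errors in n records if the true rate were exactly P
(1 - P)^n                 # roughly 0.048, i.e. just under 0.05
```

If the second value is below 0.05, an error-free audit of n records would be unlikely when the true error rate is P or worse.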

Steve Simon, ssi...@cmh.edu, Standard Disclaimer.
Look for my book "Statistical Evidence in Medical Trials"
newly published by OUP. For more details, see
http://www.childrens-mercy.org/stats/evidence.asp

Ted Harding

Jun 20, 2006, 1:16:11 PM

to MedS...@googlegroups.com

On 20-Jun-06 Simon, Steve, PhD wrote:
>
> ramsa...@gmail.com writes:
>
>> I have entered information for 1500 subjects. We don't have time
>> for double entry. I want to check the quality of data entry through a
>> random sample. I want a formula for calculating n with the help of
>> N=1500.
>
> I think a good general rule is to expend roughly 5 to 10% of your
> resources on quality control, so that mirrors a previous comment made.
>
> If you wanted a more rigorous justification for the sample size needed
> in your audit, specify the maximum error rate (P) that you are willing
> to tolerate. Then sample 3/P records. If you find any errors, check all
> of your data. If you find no errors then you are 95% confident that the
> error rate is less than P.
>
> So for example, if you are willing to tolerate up to a 2% error rate,
> then sample 3/0.02=150 records. If you notice one or more errors, then
> you may have an unacceptably high error rate and you need to check the
> entire data set.
>
> This method is based on the famous "rule of three".
>
> http://www.childrensmercy.org/stats/size/zeroevents.asp

Perhaps this could do with being expanded a bit. The following
is based on exact calculations. The left-hand column is the
number of records in the 1500 with an error. For each sample
size, the value is the probability that at least 1 record in
the sample will have an error (and ring the alarm).

                             Sample Size
 Rate  Errors     50   100   150   200   250   300   400   600   750
 0.2%       3  0.097  0.19  0.27  0.35  0.42  0.49  0.61  0.78  0.88
 0.4%       6  0.184  0.34  0.47  0.58  0.67  0.74  0.85  0.95  0.98
   1%      15  0.400  0.65  0.80  0.88  0.94  0.97  0.99  1.00  1.00
   2%      30  0.642  0.88  0.96  0.99  1.00  1.00  1.00  1.00  1.00
   3%      45  0.787  0.96  0.99  1.00  1.00  1.00  1.00  1.00  1.00
   4%      60  0.875  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00
   5%      75  0.926  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00

(These are rounded to 2 significant decimal places per
column, hence the "1.00" values).
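
A table of this kind can probably be reproduced along the following lines. This is a sketch using R's hypergeometric function phyper; the object names M, k and detect are illustrative, and it is not necessarily how the figures above were produced:

```r
N <- 1500                                            # records in the database
M <- c(3, 6, 15, 30, 45, 60, 75)                     # records containing an error
k <- c(50, 100, 150, 200, 250, 300, 400, 600, 750)   # audit sample sizes

# P(at least one erroneous record in the sample) = 1 - P(none),
# sampling k records without replacement from N, of which m are wrong
detect <- outer(M, k, function(m, kk) 1 - phyper(0, m, N - m, kk))
dimnames(detect) <- list(errors = as.character(M), sample_size = as.character(k))

round(detect, 2)
```

Rows are indexed by the number of erroneous records and columns by the sample size, so detect["30", "150"] is the 2% case with a sample of 150.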

To compare with Simon's "rule of 3" on his "2% error rate"
case, a sample of 150 has a 96% chance of getting a positive
compared with his 95%, so pretty close.

For 0.4%, we have 3/p = 3/0.004 = 750 for 95% confidence
by the "rule of 3", whereas the exact calculation gives
0.98 confidence (sample size of 600 for 95% confidence).
So at least the "rule of 3" is conservative here, which
is (I hope) what you want anyway -- you are more likely to
detect errors with the "rule of 3" than you think you are!

But as you can see it's not particularly accurate for these
small percentages -- e.g. again for p = 0.002 (0.2% = 3),
the exact calculation gives 947 for the minimum sample
size (95.01% confidence), while the "rule of 3" suggests
a sample size of 3/0.002 = 1500 -- i.e. you have to sample
the lot! (Might as well go for the full double entry in
that case).

The "rule of 3" is of course targeted at a 95% confidence
in the result. What if you want to be more confident (e.g.
99%)?

For Simon's example of p = 0.02 (2%), by exact calculation
the probability of detection first exceeds 0.99 (0.9901438)
when the sample size is 212. Since 0.02*212 = 4.24, perhaps
one could suggest a "rule of 4.25" for 99% confidence?

Trying the "rule of 4.25 = 17/4" out on other error proportions:

p = 0.002 (0.2% = 3): Exact SS = 1177, (17/4)/p = 2125
p = 0.004 (0.4% = 6): Exact SS = 803, (17/4)/p = 1063
p = 0.01 (1% = 15): Exact SS = 395, (17/4)/p = 425

so the "rule of 4.25" is a good rule of thumb for 1% error
rate or higher, but is not good, or even useless, at lower
rates.
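
The comparison can be sketched in R as follows, assuming N = 1500 and a 99% target. The column names are illustrative, and the exact sizes come from a brute-force search over every possible sample size:

```r
N <- 1500
M <- c(3, 6, 15, 30)            # erroneous records: 0.2%, 0.4%, 1%, 2%
p <- M / N

# Smallest k whose exact (hypergeometric) detection probability reaches 0.99
exact <- sapply(M, function(m) {
  k <- 0:N
  k[which(1 - phyper(0, m, N - m, k) >= 0.99)[1]]
})

data.frame(p = p, exact = exact, rule_4.25 = ceiling(4.25 / p))

-log(0.01)   # about 4.6: the constant the independence argument gives for 99%
```

The gap between the exact size and the rule-of-thumb size widens as p shrinks, because the required sample becomes a large fraction of N.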

Clearly it depends very much on

a) What is the largest error rate you are prepared to accept
if you do not detect it?
b) How confident do you want to be of detecting an error
rate at least this big?

If you are being stringent, and the answer to (a) is "very small,
e.g. less than 1 in 200", and to (b) "at least 99% confident",
then I think you will need to go down the exact route, or simply
examine the lot (to be frank).

Best wishes to all,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.H...@nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 20-Jun-06 Time: 18:16:08
------------------------------ XFMail ------------------------------

RAMS

Jun 20, 2006, 10:46:25 PM

to MedStats

Simon, Steve, PhD wrote:
> ramsa...@gmail.com writes:
>

> So for example, if you are willing to tolerate up to a 2% error rate,
> then sample 3/0.02=150 records. If you notice one or more errors, then
> you may have an unacceptably high error rate and you need to check the
> entire data set.

What does the 3 mean?

Ted Harding

Jun 21, 2006, 3:47:53 AM

to MedS...@googlegroups.com

If you look at the website

http://www.childrensmercy.org/stats/size/zeroevents.asp

which Steve Simon cited, you will see that the "3" in 3/p
arises because it is (very nearly) the natural log of 1/0.05,
where 0.05 is the probability with which you will get a negative
result (no errors in the sample) when you should get a positive
one (proportion p of errors in the population being sampled).

ln(1/0.05) = 2.9957

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.H...@nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861

Date: 21-Jun-06 Time: 08:47:48
------------------------------ XFMail ------------------------------

Ted Harding

Jun 21, 2006, 4:39:02 AM

to MedS...@googlegroups.com

On 21-Jun-06 Ted Harding wrote:
> On 21-Jun-06 RAMS wrote:
>> Simon, Steve, PhD wrote:
>>> So for example, if you are willing to tolerate up to a 2%
>>> error rate, then sample 3/0.02=150 records. If you notice
>>> one or more errors, then you may have an unacceptably high
>>> error rate and you need to check the entire data set.
>>
>> 3 means what?
>
> If you look at the website
>
> http://www.childrensmercy.org/stats/size/zeroevents.asp
>
> which Steve Simon cited, you will see that the "3" in 3/p
> arises because it is (very nearly) the natural log of 1/0.05,
> where 0.05 is the probability with which you will get a negative
> result (no errors in the sample) when you should get a positive
> one (proportion p of errors in the population being sampled).
>
> ln(1/0.05) = 2.9957
>
> Best wishes,
> Ted.

For the sake of clarity (and especially to avoid confusion in
relation to my previous post illustrating exact calculations)
this too needs expanding!

The argument underlying the "rule of 3" is, in full:

Suppose an event may occur with probability p (where p is small)
on each attempt to observe it, suppose you make k attempts

*** where occurrences of the event are independent over attempts

then the probability that no occurrences will be observed in
the k attempts is

[1] (1 - p)^k

Observing zero events in k attempts gives you 95% confidence
that the event rate does not exceed p provided [1] is 0.05, so

[2] k*log(1 - p) = log(0.05) = -2.9957 ~= -3

and, provided p is small, it is adequate to replace log(1-p)
by (-p), from which it follows that

[3] k*p = 3 or k = 3/p

It is, however, essential in RAMS's application to note that
there is a finite set (N = 1500) of records and that the checking
process will involve sampling without replacement.

Hence the assumption at *** above is not exactly true.

This would not matter if the number being sampled was itself
a very small proportion of the whole 1500 population, since
the degree of resulting non-independence would be small. But,
in order to have acceptable confidence (say 95%) that a sample
from a population with a small proportion (p) of errors will
contain at least 1 error, it is necessary to sample quite
a large *number* (k) of the population which, for a population
as small as 1500, is not a small proportion. Hence the
assumption at (***) will not hold even approximately. (It would
be different in the case of say N = 1500000 records).

Hence the necessity to consider the results of exact calculations,
as I did previously.

When you have N records of which M have errors, and you sample
k records without replacement, there are

[ (N-M) ; k ]

different possible samples which give k records without errors
where [ (N-M) ; k ] is used to denote the number of different
ways of choosing k out of (N-M). The number of all possible
different samples of k records out of N without replacement is

[ N ; k ]

and so the probability of getting a sample of k out of N with
no errors is the ratio of these two numbers.

When k is more than a small fraction of N, this result is
substantially different from the (1-p)^k which applies to
sampling with replacement, and so the "rule of 3" which is
based on (1-p)^k will be inaccurate (as illustrated in my
previous response), or even useless.
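
To make the difference concrete, here is a small R sketch that evaluates both expressions for the 2% case (N = 1500, M = 30, k = 150). The names p0_exact and p0_binom are illustrative, and the ratio of binomial coefficients is computed on the log scale with lchoose because the coefficients themselves can overflow for large k:

```r
N <- 1500; M <- 30; k <- 150
p <- M / N

# P(no errors in the sample), sampling WITHOUT replacement:
# [ (N-M) ; k ] / [ N ; k ], evaluated via logs to avoid overflow
p0_exact <- exp(lchoose(N - M, k) - lchoose(N, k))

# P(no errors in the sample), sampling WITH replacement (independence)
p0_binom <- (1 - p)^k

# Corresponding detection probabilities
c(exact = 1 - p0_exact, binomial = 1 - p0_binom)

# The ratio agrees with R's built-in hypergeometric density at x = 0
all.equal(p0_exact, dhyper(0, M, N - M, k))
```

Here the two detection probabilities are close (about 0.959 against about 0.952), since 150 is only a tenth of 1500; they drift further apart as k/N grows.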

Hoping this helps!
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.H...@nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861

Date: 21-Jun-06 Time: 09:38:59
------------------------------ XFMail ------------------------------

Doug Altman

Jun 21, 2006, 5:09:09 AM

to MedS...@googlegroups.com

It is worth noting that the often quoted value of 3 is based on the upper limit of a one-sided confidence interval for the observed proportion, i.e. 3/150 = 0.02. The value of 150 is thus arguably too small.

For comparability with the standard two-sided intervals it is necessary to take a one-sided 97.5% confidence interval, for which the upper limit is 3.7/n, and so the required sample size is thus n=3.7/0.02 = 185.

Further, there is a better approximation. In Newcombe & Altman (2000) we wrote

"If a easily memorable formula is wanted we suggest a new ‘rule of four’ in which the upper limit of the 95% confidence interval is 4/(n+4), which is a close approximation to the correct value."

So for the example, if p=4/(n+4), we have n = (4/0.02)-4 = 196.

This formula is a slight simplification of the exact (Clopper-Pearson) method as promoted by Robert Newcombe for obtaining CI for any proportion.
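
The three sample sizes mentioned above can be checked with a few lines of R. This is only an arithmetic sketch; the vector names are illustrative:

```r
p <- 0.02   # target upper limit for the error rate

# Constants behind the "3/n" and "3.7/n" limits when zero errors are observed
-log(0.05)    # 2.9957, rounded to 3   (one-sided 95%)
-log(0.025)   # 3.6889, rounded to 3.7 (one-sided 97.5%)

n <- c(rule_of_three = 3 / p,
       rule_3.7      = 3.7 / p,
       rule_of_four  = 4 / p - 4)   # solves p = 4/(n+4) for n
round(n)      # 150, 185, 196
```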

Of course, this approach does not address the issue of finite population. I don't know what the impact of that would be.

Best wishes
Doug

Ref:
Newcombe RG, Altman DG. Proportions and their differences. In: Altman DG, Machin D, Bryant TN, Gardner MJ (eds).
Statistics with confidence. 2nd edn. London: BMJ Books, 2000: 45–56.

_____________________________________________________

Doug Altman
Professor of Statistics in Medicine
Centre for Statistics in Medicine
Wolfson College Annexe
Linton Road
Oxford OX2 6UD

email:  doug....@cancer.org.uk
Tel:    01865 284400 (direct line 01865 284401)
Fax:    01865 284424

Web:     http://www.csm-oxford.org.uk/

Ted Harding

Jun 21, 2006, 5:40:07 AM

to MedS...@googlegroups.com

On 21-Jun-06 Doug Altman wrote:
> It is worth noting that the often quoted value of
> 3 is based on the upper limit of a one-sided
> confidence interval for the observed proportion,
> i.e. 3/150 = 0.02. The value of 150 is thus arguably too small.

In the case of this particular application, surely it
is the one-sided case which is of interest? The check
is being carried out lest there be too many errors in
the database.

> For comparability with the standard two-sided
> intervals it is necessary to take a one-sided
> 97.5% confidence interval, for which the upper
> limit is 3.7/n, and so the required sample size
> is thus n=3.7/0.02 = 185.
>
> Further, there is a better approximation. In
> Newcombe & Altman (2000) we wrote
>
> "If a easily memorable formula is wanted we
> suggest a new ‘rule of four’ in which the upper
> limit of the 95% confidence interval is 4/(n+4),
> which is a close approximation to the correct value."
>
> So for the example, if p=4/(n+4), we have n = (4/0.02)-4 = 196.
>
> This formula is a slight simplification of the
> exact (Clopper-Pearson) method as promoted by
> Robert Newcombe for obtaining CI for any proportion.
>
> Of course, this approach does not address the
> issue of finite population. I don't know what the impact
> of that would be.

Well, when the finite population is small enough that
its size has a perceptible effect on the difference
between the exact probability and the "rule of thumb"
probability, then you need to go into the exact calculation
which is, essentially, based on the hypergeometric
distribution.

There is no particular difficulty, with the right method
of calculation, in obtaining an exact confidence interval
by this approach.

But it does not correspond to a simple and easy rule of thumb.
A rule of thumb is all very well, "If you have any strength
in your thumb", but when that fails you need to resort to a
digital method ...

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.H...@nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861

Date: 21-Jun-06 Time: 10:40:04
------------------------------ XFMail ------------------------------

RAMS

Jun 22, 2006, 10:46:00 PM

to MedStats

None of the procedures takes my population size of 1500 into account. The
above procedures are the same for every population size.

Consider an example where my population size is 100. I have entered the
data for all 100 subjects. Now I wish to take some subjects at
random to test the validity of the data entry. According to the
formulas given above by various gentlemen, I would have to choose a
sample of size 150, 185 and so on from a population of size 100. How is
that possible?

I need a formula that includes my population size.

Ted Harding

Jun 23, 2006, 5:49:03 AM

to MedS...@googlegroups.com

RAMS,

The discussion considered more than one aspect of the problem.

One aspect described the rule of thumb approach exemplified
by the "rule of 3" according to which, to achieve a 95% confidence
to detect an error rate of 2 per cent, one may sample a number

3/p = 3/0.02 = 150.

Another aspect pointed out that this rule depends on assuming that
the items are sampled independently of each other, which is not
exactly possible for a finite population of any size, and can
be seriously inaccurate (or useless, leading to sample sizes
greater than the population being sampled) for small populations.

The way to proceed for finite populations, when the inaccuracy
of the "independence" assumption becomes important, is to use
methods which are correct for samples from finite populations.
This depends on making the calculations on the basis of the
hypergeometric probability distribution, and not the binomial
probability distribution on which the "rule of 3" and its
variants are based.

Even when you do it exactly, you may find that the number of
items you need to sample in order to achieve your desired
level of confidence (for example 95%) is so large a fraction
of the population that you may as well look at them all, and
make absolutely sure of it.

You will need software which can calculate the probability of
sampling x=0 items with error, when a number k are sampled
without replacement from a population of size N which includes
M items with error. The level of confidence you can have in
detecting the presence of errors is then the probability that
you get x=1 or more error items in the sample, which is 1 minus
the probability that x=0. The following uses the software
called "R" -- see http://www.r-project.org -- which has the
function 'phyper' (probability function of the hypergeometric
distribution). Other statistical software packages will also
surely have a function for this probability distribution.

Here is an example which takes your instance of N = 100 records,
with a hypothetical 2 per cent (M = 2) of items with errors,
and tries different sample sizes k, calculating the probability
of observing x=0 and subtracting this from 1; varying the sample
size k until the result just exceeds 0.95:

> N<-100; M<-2; x<-0;
> k<-20; 1-phyper(x,M,N-M,k)
[1] 0.3616162
> k<-30; 1-phyper(x,M,N-M,k)
[1] 0.5121212
> k<-50; 1-phyper(x,M,N-M,k)
[1] 0.7525253
> k<-70; 1-phyper(x,M,N-M,k)
[1] 0.9121212
> k<-75; 1-phyper(x,M,N-M,k)
[1] 0.939394
> k<-77; 1-phyper(x,M,N-M,k)
[1] 0.9488889
> k<-78; 1-phyper(x,M,N-M,k)
[1] 0.9533333

So, to be 95% sure of detecting the presence of errors when there
are 2 or more errors in the population, you need to sample 78 out
of the 100 items. This is a case where you might as well examine
all 100 records for the sake of being sure about it -- it is only
an extra 22 compared with the 78 you need to examine anyway, for
which you will only get 95% confidence.

And, of course, the "rule of 3" gives 3/0.02 = 150, which is
nonsense.

Now compare this with what happens with N=1500 records (as in your
original query), with again 2% (M=30) items with error:

> N<-1500; M<-30; x<-0;
> k<-50; 1-phyper(x,M,N-M,k)
[1] 0.6419856
> k<-100; 1-phyper(x,M,N-M,k)
[1] 0.87641
> k<-120; 1-phyper(x,M,N-M,k)
[1] 0.920103
> k<-140; 1-phyper(x,M,N-M,k)
[1] 0.9486804
> k<-160; 1-phyper(x,M,N-M,k)
[1] 0.9672542
> k<-150; 1-phyper(x,M,N-M,k)
[1] 0.9589716
> k<-145; 1-phyper(x,M,N-M,k)
[1] 0.954104
> k<-143; 1-phyper(x,M,N-M,k)
[1] 0.9520046
> k<-141; 1-phyper(x,M,N-M,k)
[1] 0.9498125
> k<-142; 1-phyper(x,M,N-M,k)
[1] 0.9509204

So in this case you need to sample 142 out of 1500. And the "rule
of 3" still gives, of course, the same answer as before, namely
150 -- and now this is clearly (a) not nonsense, (b) not a bad
approximation to the exact answer 142.

Now consider a 10-times larger database of 15000 records, still
with 2% (M=300) of items with error. Now we get

> N<-15000; M<-300; x<-0;
> k<-100; 1-phyper(x,M,N-M,k)
[1] 0.8682746
> k<-120; 1-phyper(x,M,N-M,k)
[1] 0.9123226
> k<-140; 1-phyper(x,M,N-M,k)
[1] 0.9416736
> k<-160; 1-phyper(x,M,N-M,k)
[1] 0.9612205
> k<-150; 1-phyper(x,M,N-M,k)
[1] 0.9524376
> k<-149; 1-phyper(x,M,N-M,k)
[1] 0.951457
> k<-148; 1-phyper(x,M,N-M,k)
[1] 0.9504563
> k<-147; 1-phyper(x,M,N-M,k)
[1] 0.949435

so the smallest value of k which just gives 95% confidence is
now 148, and this is close to the "rule of 3" value of 150.

And so on!
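
The trial-and-error search above can be wrapped in a small function. This is a sketch: the name min_audit_size and its arguments are hypothetical, not part of R, and it relies only on phyper being vectorised over the sample size:

```r
# Smallest sample size k such that, when M of the N records contain errors,
# a random sample of k (without replacement) includes at least one of them
# with probability >= conf. Returns NA if no k qualifies (e.g. M = 0).
min_audit_size <- function(N, M, conf = 0.95) {
  k <- 0:N
  detect <- 1 - phyper(0, M, N - M, k)   # detection probability for each k
  k[which(detect >= conf)[1]]
}

min_audit_size(100, 2)       # 78
min_audit_size(1500, 30)     # 142
min_audit_size(15000, 300)   # 148
```

Passing conf = 0.99 gives the corresponding sizes for 99% confidence.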

Hoping this helps,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.H...@nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861

Date: 23-Jun-06 Time: 10:48:53
------------------------------ XFMail ------------------------------
