How does one calculate a sample size when the intraclass correlation coefficient (ICC) is being used to assess agreement between raters?

William Niven

Jun 11, 2014, 4:42:13 PM
to meds...@googlegroups.com
Scenario:
A risk stratification tool called the HEART score has been developed to assess patients presenting to the ED with chest pain.
It is a score out of 10 derived from five variables: history, ECG findings, age, risk factors and troponin level, with each variable scored 0, 1 or 2.
It is a well-validated tool in the ED, but I want to show that the scores are reproducible irrespective of the grade of doctor or assessment nurse calculating the score.

The idea would be that a convenience sample of consecutive eligible patients presenting to the ED would be recruited, consented and then assessed by each of the following grades of doctor and nurse: triage-trained nurse, foundation year or intern doctor, registrar and consultant. Each of these grades would then calculate the score and enter it into an online spreadsheet. The correct score would be calculated at the same time by the researchers and also entered as a gold standard.

My question is: How do I calculate the sample size in this context?

Kind Regards
Will Niven 
London

Tzippy Shochat

Jun 12, 2014, 1:57:54 AM
to meds...@googlegroups.com
I would treat this as an ANOVA problem.
PASS software has a module for it.

Tzippy Shochat


--
To post a new thread to MedStats, send email to MedS...@googlegroups.com .
MedStats' home page is http://groups.google.com/group/MedStats .
Rules: http://groups.google.com/group/MedStats/web/medstats-rules


MarkP

Jun 13, 2014, 4:09:11 AM
to meds...@googlegroups.com

Hi Will,

If you have a number of raters but also a gold-standard, then perhaps you could view this as assessing the accuracy of individual raters. But some more detail about the situation would be helpful.

Is each patient to be assessed by every grade of professional? Will there be a number of different professionals at each grade (e.g. a number of different doctors)?

Also, does the magnitude of misclassification matter (e.g. is a difference from the gold standard score of 2 worse than a difference of 1, or is any misclassification equally bad)?

Best wishes,

Mark

BXC (Bendix Carstensen)

Jun 13, 2014, 8:45:30 AM
to meds...@googlegroups.com
As always, the best way to do sample size calculations is to simulate data repeatedly under different scenarios (say 1,000 replicates per scenario), analyse each simulated dataset, and see what precision you get.

That ensures that you have taken all relevant factors into account and that you know exactly what you intend to do and where your simplifying assumptions are.

If you have trouble finding out how to generate data under a given scenario, any other sample size calculation you do will most likely rely on assumptions that you are unaware of.

Indeed, it is normally a troublesome process, because in most cases you will have to specify exactly which pieces of thin air you drew your assumptions from.
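
The simulation recipe can be sketched in a few lines. This is a hypothetical illustration, assuming a one-way random-effects model for the scores; the patient and rater-error standard deviations (`sd_patient`, `sd_error`), the sample sizes, and the choice of ICC(1,1) are all assumptions made for the example, not values from the thread.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_scores(n_patients, n_raters, sd_patient=2.0, sd_error=1.0):
    """One simulated dataset: each patient's 'true' score plus independent
    rater error, rounded and clipped to the 0-10 HEART score range."""
    true = rng.normal(5.0, sd_patient, size=(n_patients, 1))     # patient effect
    noise = rng.normal(0.0, sd_error, size=(n_patients, n_raters))
    return np.clip(np.rint(true + noise), 0, 10)

def icc_oneway(x):
    """ICC(1,1) from a one-way ANOVA on an n_patients x n_raters array."""
    n, k = x.shape
    msb = k * ((x.mean(axis=1) - x.mean()) ** 2).sum() / (n - 1)            # between patients
    msw = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))  # within patients
    return (msb - msw) / (msb + (k - 1) * msw)

# Repeat under one scenario and inspect the spread of the estimate.
iccs = np.array([icc_oneway(simulate_scores(60, 4)) for _ in range(1000)])
print(f"mean ICC {iccs.mean():.2f}, central 95% range "
      f"{np.percentile(iccs, 2.5):.2f} to {np.percentile(iccs, 97.5):.2f}")
```

Varying `n_patients`, the number of raters and the variance components then shows directly how the precision of the estimate responds to each assumption.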

Regards,
Bendix Carstensen
______________________________________________

Bendix Carstensen
Senior Statistician
Clinical Epidemiology
Steno Diabetes Center A/S
Niels Steensens Vej 2-4
DK-2820 Gentofte
Denmark
+45 44 43 87 38 (direct)
+45 30 75 87 38 (mobile)
b...@steno.dk http://BendixCarstensen.com
www.steno.dk

Steve Simon, P.Mean Consulting

Jun 16, 2014, 2:26:32 PM
to meds...@googlegroups.com
There is no formal hypothesis in this setting, so you can't really do a
power calculation. Well, maybe you could, but it would be rather forced
and artificial.

What you want here is a confidence interval for the intraclass
correlation coefficient (ICC). And you want that confidence interval to
be reasonably narrow. An ICC with a confidence interval that goes from
0.06 to 0.91 is pretty worthless.

So dig out the formula for the confidence interval for the ICC and find
a sample size that makes your interval reasonably narrow. Make sure that
you plug in a plausible value for the ICC and not zero.

The formula for this confidence interval is very messy, so you will
almost certainly be better off with Bendix Carstensen's suggestion of
using a simulation approach. Set up a dozen or so plausible scenarios
for your research that include both weak and strong measures of
association and also include a range of marginal distributions. Run
these simulations and show that at your proposed sample size, all the
95% confidence intervals under all the scenarios are reasonably narrow.
Then pick one of the scenarios to present to the IRB or anyone else
reviewing your research.

Now the actual research will probably compute some statistics that are
far more sophisticated than a simple confidence interval for a single ICC.
You might want to compare one ICC to another ICC. Or you might want to
estimate the source of disagreement if the ICC is too small (e.g., the
nurses rate more harshly than the doctors). But don't worry too much
about this. If you get a nice narrow interval for a simple ICC, then
everything else will probably also have reasonably good precision and/or
power.
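
A hedged sketch of this approach (assumed numbers throughout): simulate data at a plausible ICC for a candidate number of patients, compute the exact F-based confidence interval for a one-way ICC on each dataset, and look at the typical interval width. The variance components (true ICC of 0.8 here) and sample sizes below are illustrative assumptions, not values from the thread.

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(1)

def icc1_ci(x, alpha=0.05):
    """ICC(1,1) with its exact F-based 95% confidence interval."""
    n, k = x.shape
    msb = k * ((x.mean(axis=1) - x.mean()) ** 2).sum() / (n - 1)
    msw = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))
    F = msb / msw
    fl = F / f_dist.ppf(1 - alpha / 2, n - 1, n * (k - 1))
    fu = F * f_dist.ppf(1 - alpha / 2, n * (k - 1), n - 1)
    icc = (msb - msw) / (msb + (k - 1) * msw)
    return icc, (fl - 1) / (fl + k - 1), (fu - 1) / (fu + k - 1)

def mean_ci_width(n_patients, n_raters=4, reps=500, sd_patient=2.0, sd_error=1.0):
    """Average 95% CI width across simulated datasets under one scenario."""
    widths = []
    for _ in range(reps):
        x = (rng.normal(5.0, sd_patient, (n_patients, 1))
             + rng.normal(0.0, sd_error, (n_patients, n_raters)))
        _, lo, hi = icc1_ci(x)
        widths.append(hi - lo)
    return float(np.mean(widths))

for n in (30, 60, 120):
    print(f"n = {n:3d} patients: mean 95% CI width = {mean_ci_width(n):.2f}")
```

Increasing the number of patients until the mean width drops below whatever you consider "reasonably narrow" gives the sample size directly.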

Steve Simon, n...@pmean.com, Standard Disclaimer.
Sign up for the Monthly Mean, the newsletter that
dares to call itself average at www.pmean.com/news

Doug Altman

Jun 17, 2014, 5:04:54 AM
to meds...@googlegroups.com
Steve's reply is based on the assumption that the intraclass correlation coefficient is what you need. The observed value of the ICC will depend partly on who is in the sample: a convenience sample often makes sense, but it won't necessarily be representative of all patients, even if you can say what population you wish to represent. Also, although the ICC is widely used in such circumstances, it is of questionable relevance to the main question, which, as I see it, is how much difference there might be between assessments of the same patient. That can be assessed directly from the distribution of the within-individual differences between assessors, using limits of agreement. Martin Bland discusses sample size for this approach here:
http://www-users.york.ac.uk/~mb55/meas/sizerep.htm

The distinction between the approaches is, in essence, between considering differences in scores relative to the overall variation in scores or as absolute differences. The same issue arises when comparing different methods of measurement, for which the ICC can be considered appropriate or inappropriate depending on perspective. See

Bland JM, Altman DG. A note on the use of the intraclass correlation coefficient in the evaluation of agreement between two methods of measurement. Computers in Biology and Medicine 1990;20:337-340.

Some people label the two approaches as reliability and agreement. I strongly favour assessing agreement for several reasons. A key reason is that the results can be seen to be directly relevant to assessing an individual patient.
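
For the limits-of-agreement route, Bland's note rests on a simple precision argument: with n paired differences of standard deviation s, the 95% limits are the mean difference ± 1.96 s, and the standard error of each limit is approximately sqrt(3 s² / n). A small illustrative sketch (the SD of 1 score point is an assumption, not a value from the thread):

```python
import math

def loa_precision(s, n, z=1.96):
    """95% limits-of-agreement half-width and the approximate 95% CI
    half-width of each limit, for n differences with standard deviation s."""
    limit_half_width = z * s              # limits are d_bar +/- 1.96 s
    se_limit = math.sqrt(3 * s ** 2 / n)  # Bland's approximate SE of a limit
    return limit_half_width, z * se_limit

s = 1.0  # assumed SD of within-patient differences between two raters
for n in (50, 100, 200):
    limit, ci_half = loa_precision(s, n)
    print(f"n = {n:3d}: limits of agreement ±{limit:.2f}, "
          f"each limit estimated to within ±{ci_half:.2f}")
```

One then chooses n large enough that ±1.96·s·sqrt(3/n) is small relative to a clinically important difference in the score.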

Best wishes
Doug



Doug Altman
Centre for Statistics in Medicine
University of Oxford
Botnar Research Centre
Windmill Road
Oxford OX3 7LD
Email: doug....@csm.ox.ac.uk

phone: +44 (0)1865 223444
PA (Jacqueline Wright): (jacqueli...@csm.ox.ac.uk) +44 (0)1865 223447
www:    http://www.csm-oxford.org.uk/

CONSORT 2010 Statement: www.consort-statement.org
EQUATOR Network - resources for reporting research: www.equator-network.org/
Trials journal: www.trialsjournal.com