Implications of using a portion of a random sample


colin morse

Mar 4, 2014, 12:13:10 PM
to piface-d...@googlegroups.com
Imagine a study that attempts to estimate the time it takes to receive a disability placard. Suppose that Analyst A has a dataset that represents a complete population of all disability placard applications. However, many records are missing outcome data (e.g., start date or completion date).
 
In order to estimate the time to completion, Analyst A elects to draw a random sample of 75 records from the database and pull the original paper documents stored at an off-site archive. He will then manually enter the correct values into a spreadsheet. Analyst A's colleague, Analyst B, points out that some records may still be in process (and therefore will be missing a completion date) and suggests that Analyst A assume an incomplete rate of 25% and draw a sample of 100 records to compensate.
 
Further suppose that Analyst A generates a random sample using statistical software and then travels to the archive and begins pulling records from the list of 100 randomly selected records. To his surprise, the incomplete rate is much lower - closer to 6%. After pulling 80 records, Analyst A has collected data on 75 complete records and 5 incomplete. Because he has reached his original goal of 75 complete records (which he believes is a sufficient sample size to make inferences at his preferred confidence level and margin of error), Analyst B suggests that it is unnecessary to pull the final 20 records in the random sample. Analyst A is concerned, however, that he has drawn a convenience sub-sample from the original random sample because he systematically moved down the list of randomly selected records. Analyst B argues that it doesn't matter because the original sample was drawn randomly.
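For reference, the arithmetic behind the two sample sizes can be sketched as follows. This is only a back-of-the-envelope check; the helper name `records_to_draw` is hypothetical, and it assumes the incomplete rate applies evenly across the records drawn.

```python
import math

def records_to_draw(target_complete, incomplete_rate):
    """Smallest draw whose expected count of complete records meets the target.

    Hypothetical helper: assumes each drawn record is incomplete with the
    same probability, independent of its position on the list.
    """
    return math.ceil(target_complete / (1 - incomplete_rate))

# Analyst B's planning assumption: 25% of records incomplete.
planned_draw = records_to_draw(75, 0.25)

# What was actually seen at the archive: 5 incomplete out of 80 pulled.
observed_rate = 5 / 80
draw_at_observed_rate = records_to_draw(75, observed_rate)

print(planned_draw, observed_rate, draw_at_observed_rate)
```

Under the 25% assumption the draw is 100 records; at the observed rate, 80 records are enough to expect 75 complete ones, which matches the point at which Analyst A reached his goal.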
 
Can anyone articulate more clearly why Analyst B is wrong?
 
Thank You,
 
Colin
 
 
 

Lenth, Russell V

Mar 4, 2014, 12:26:05 PM
to morse...@gmail.com, piface-d...@googlegroups.com

It depends on the order in which the records were drawn. For example, suppose that our sample of record IDs is

 

  569 69202 25250 83021 46353  . . .  21002 32701 28426 78790

 

You are OK stopping early with this one. However, suppose the sample is

 

  302   569  2543  2826  4059  . . .  98634 98966 99487 99886

 

This is the same sample as the first one, but with the IDs sorted in increasing order. You are NOT OK stopping early with this sample. Why? Because if you do, you keep only the lowest IDs on the list, which biases the sample towards records with low ID numbers. As long as you work through your sample in random order, you are OK.
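A small simulation can illustrate the difference. This is a sketch under stated assumptions, not a description of the actual database: it assumes record IDs run from 1 to 100,000 (a hypothetical population size), draws 100 IDs without replacement, and compares the average ID among the first 80 when the list is left in its random draw order versus sorted in increasing order. The function names are made up for the example.

```python
import random

def mean_of_first_k(ids, k):
    """Average ID among the first k entries of a list."""
    return sum(ids[:k]) / k

def simulate(population_size=100_000, sample_size=100, k=80,
             trials=2000, seed=0):
    """Compare stopping early on a randomly ordered list vs. a sorted list.

    population_size is an assumed value chosen for illustration only.
    Returns the average (over trials) of the mean ID of the first k records
    for each ordering.
    """
    rng = random.Random(seed)
    random_order_means = []
    sorted_order_means = []
    for _ in range(trials):
        # random.sample returns the IDs in random selection order.
        sample = rng.sample(range(1, population_size + 1), sample_size)
        random_order_means.append(mean_of_first_k(sample, k))
        # Same sample, but worked through from lowest ID to highest.
        sorted_order_means.append(mean_of_first_k(sorted(sample), k))
    average = lambda xs: sum(xs) / len(xs)
    return average(random_order_means), average(sorted_order_means)

population_mean = (1 + 100_000) / 2
random_avg, sorted_avg = simulate()
print("population mean ID:        ", population_mean)
print("first 80, random order:    ", round(random_avg))
print("first 80, sorted by ID:    ", round(sorted_avg))
```

In the random-order case the first 80 records should average close to the population mean ID, because any prefix of a randomly ordered simple random sample is itself a simple random sample. In the sorted case the average should come out noticeably lower, since the 20 highest IDs are always the ones dropped. If ID number is related to anything that affects time to completion (application date, for instance), that shift carries over to the estimate.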

 

Russ

 

Russell V. Lenth  -  Professor Emeritus

Department of Statistics and Actuarial Science  

The University of Iowa  -  Iowa City, IA 52242  USA  

Voice (319)335-0712 (Dept. office)  -  FAX (319)335-3017
