Stratified sampling with 'proc surveyselect'
From: cassell.da...@EPAMAIL.EPA.GOV (David L. Cassell)
Newsgroups: comp.soft-sys.sas
Subject: Re: Stratified sampling with 'proc surveyselect'
Message-ID: <OF06D056F0.4CC57324-ON88256DC7.0073F602@epamail.epa.gov>
Date: 22 Oct 03 21:20:37 GMT
Sender: sasl...@LISTSERV.UGA.EDU
Will Dwinnell <predi...@BELLATLANTIC.NET> replied:
> What I really need to do is draw a sample from a given population, so
> that distributions of particular variables are "similar" to those of
> the source population. I figured that numeric variables could be
> stratified, say, at deciles, and run through proc surveyselect.
Okay, that's a lot clearer. First, let me say that *any* equal-probability
sample (one that needs no weights) is going to look 'similar' to your population.
And any sample with weights will still yield results that look 'similar'
to your population parameters with proper analysis. So you may not
need to do this sort of 'stratifying'.
> I had in mind something like this:
>
> proc surveyselect data=LMN method=sys seed=1077 rate=0.1
> out=LMNExtract;
> strata
> VarADecile
> VarBDecile;
> run;
>
> ...where 'VarADecile' is the decile of numeric variable A, whose
> distribution we wish to match (from sample to population), and so
> forth.
I still maintain that you could probably avoid the round-off problems
you're seeing by using N= instead of RATE= here. But I also think
you're going about this the wrong way. Instead of trying this complex
stratification-by-quantiles approach, why don't you try using more of
the features of PROC SURVEYSELECT?
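If you do stay with the decile strata, the N= version of your step might look like the sketch below. The per-stratum size of 20 is a made-up number for illustration, and every stratum needs to hold at least that many records. It also assumes LMN already carries VarADecile and VarBDecile.

```sas
/* STRATA expects the input sorted by the stratification variables */
proc sort data=LMN;
  by VarADecile VarBDecile;
run;

/* n=20 requests 20 records from every stratum (hypothetical size) */
proc surveyselect data=LMN method=sys seed=1077 n=20
                  out=LMNExtract;
  strata VarADecile VarBDecile;
run;
```

A single N= value applies to each stratum separately, so the total sample size is that value times the number of non-empty strata.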
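Suppose instead that we do a serpentine sort by your VarA and VarB. For
two variables, that means sorting in ascending order by the first
variable, then sorting the second variable in ascending order within the
first level of VarA, in descending order within the next level, and so
on, reversing direction at each new level. For more variables, you just
keep switching back and forth between ascending and descending order at
each level of nesting.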
Now that the data are all sorted by your characteristics, you can select
sequentially, using something like Chromy's sequential sampling approach,
which provides an implicit stratification, much as you seem to be
after. (Okay, I'm using my crystal ball here a bit. :-)
Now, do we need to do some complex sorting first? No.
Do we need to build deciles, or any manner of quantiles? No.
Do we need to do a lot of fiddling in order to get the PROC to generate
what we're after? No.
Here's how we do it, without any sorting or quantile-building first:
proc surveyselect data=LMN method=seq seed=1077 rate=0.1
                  out=LMNExtract stats;
   control VarA VarB;
run;
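The same step with the defaults spelled out, offered only as a sketch. SORT=SERP is already what CONTROL uses by default, so naming it changes nothing. The n=500 is a hypothetical size standing in for whatever 10% of LMN comes to, and LMNSorted is a made-up data set name. As far as I recall, without OUTSORT= the PROC replaces LMN with its sorted version, so check that against your release.

```sas
/* sort=serp : serpentine ordering of the CONTROL variables (the default) */
/* outsort=  : keep the sorted copy separate instead of overwriting LMN   */
/* n=500     : hypothetical fixed sample size in place of rate=0.1        */
proc surveyselect data=LMN method=seq sort=serp seed=1077
                  n=500 out=LMNExtract outsort=LMNSorted stats;
  control VarA VarB;
run;
```

Swapping in sort=nest would give an ordinary nested ascending sort instead, if you want to compare the two orderings.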
I added the STATS option so you'll have the selection probabilities and
sampling weights in your output data set.
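A quick way to see what came back and to check the 'similar' claim yourself, again only as a sketch. I am assuming the two added variables are named SelectionProb and SamplingWeight; run a PROC CONTENTS on LMNExtract if your release names them differently.

```sas
/* look at the first few sampled records and their weights */
proc print data=LMNExtract(obs=10);
  var VarA VarB SelectionProb SamplingWeight;
run;

/* population summary */
proc means data=LMN mean std min max;
  var VarA VarB;
run;

/* weighted sample summary, for comparison with the population */
proc means data=LMNExtract mean std min max;
  var VarA VarB;
  weight SamplingWeight;
run;
```

With an equal-probability design like this one, every weight is the same, so the weighted and unweighted means agree. The WEIGHT statement starts to matter once the selection probabilities differ.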
I usually like to choose a random seed between 1 and (2**31)-1 , but
if you leave that part out, the proc will generate its own random seed
and tell you what it used.
HTH,
David
--
David Cassell, CSC
Cassell.Da...@epa.gov
Senior computing specialist
mathematical statistician