Differences between respondents surveyed in two phases.

Gouri Shankar Mishra

Jun 16, 2016, 4:59:57 PM

to Davis R Users' Group

We conducted a survey twice - once in Spring and then again in Fall. I am combining the two data sets to undertake a rather straightforward parametric analysis: E(y) = Xb

where y is the outcome variable (which happens to be binary) and X is a vector of various confounders. I also included an indicator s, which takes the value 1 if the data is from Spring and 0 otherwise.

The Y-standardized coefficient of s is large and statistically "very" significant. For example, the coefficient is ~1.5 and the test statistic is about 30.

The above seems to be an indication that the samples for the two surveys are substantially different from each other - but I am not sure how I should go about identifying the differences. Looking at the means / SDs of the X covariates, I do not see much difference.

Further, how should I adjust my parametric analysis to accommodate these differences? Can you share any best practices?

Thanks

Gouri

--

Best Regards,
- GSM

Institute of Transportation Studies
University of California, Davis


Michael Koontz

Jun 17, 2016, 7:03:29 PM

to davi...@googlegroups.com

Hi Gouri,

My first instinct for analyzing binary data like you describe would be to use logistic regression (which you can do with the glm() function in R with the family argument set to binomial, i.e. family = binomial). This lets you ask how your covariates (including whether each survey was conducted in the spring or fall) affect the probability of getting a response value of 1. Is this what you are doing? Can you describe your data a little further, or provide a subset of it?
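A minimal sketch of the kind of model described above, using simulated data because the actual survey data are not shown in the thread. The data frame `survey` and the columns `y`, `x1`, and `s` are hypothetical names, and the simulated effect sizes are arbitrary; this is not the attached workflow script.

```r
# Simulated stand-in for the pooled Spring + Fall survey data (hypothetical).
set.seed(1)
n <- 2000
survey <- data.frame(
  x1 = rnorm(n),                        # one continuous covariate
  s  = rbinom(n, size = 1, prob = 0.5)  # 1 = Spring, 0 = Fall
)

# Generate a binary outcome whose true Spring effect is 1.5 on the log-odds scale.
eta <- -0.5 + 0.8 * survey$x1 + 1.5 * survey$s
survey$y <- rbinom(n, size = 1, prob = plogis(eta))

# Logistic regression: binomial family with the default logit link.
fit <- glm(y ~ x1 + s, family = binomial, data = survey)

# Estimates, standard errors, z statistics and p-values (log-odds scale).
print(summary(fit)$coefficients)
```

With real data, the simulation step would be replaced by reading in the pooled survey responses, and `x1` by the actual covariates.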

The interpretation of the coefficient on your 's' variable would be that, holding the other covariates fixed, a survey in the spring increases the log-odds of your response variable being true by 1.5. If I'm understanding your analysis correctly, that is the difference you are seeking to identify. 1.5 is the effect size; it is on the log-odds, aka logit, scale. (Log-odds is calculated as log(p / (1 - p)), where p is the probability of your response being true.) I attached a basic logistic regression workflow script, in case that's useful.
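To make the log-odds scale more concrete, here is a small sketch that converts a coefficient of 1.5 into an odds ratio and into a change in probability. It assumes the 1.5 is an ordinary (unstandardized) logit coefficient, which may not hold for a Y-standardized estimate, and the Fall baseline probability `p_fall` is a made-up value for illustration only.

```r
b_s <- 1.5              # coefficient on s, log-odds scale (value quoted in the thread)

# Odds ratio: multiplicative change in the odds, Spring vs. Fall (about 4.5).
odds_ratio <- exp(b_s)

# Effect on the probability scale depends on the baseline.
p_fall   <- 0.30                          # hypothetical Fall probability
p_spring <- plogis(qlogis(p_fall) + b_s)  # qlogis() is log(p / (1 - p)); plogis() inverts it

print(c(odds_ratio = odds_ratio, p_fall = p_fall, p_spring = p_spring))
```

Because the logit link is nonlinear, the same 1.5 shift in log-odds produces a different change in probability at a different baseline.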

Analysis of binary response data can feel non-intuitive, so for me the best practices have been:

1) to remember that the estimates for the regression coefficients are on a transformed scale (the log-odds scale)

2) to remember that we are ultimately asking how different covariates affect the probability that the response is true (for a binary response that probability is the mean, but it is modeled through the logit link rather than directly, as in linear regression)

Hope this helps, but feel free to reply back and flesh out some more details of your approach!

Mike

logistic-regression-analysis.R

Matt Espe

Jun 27, 2016, 1:39:27 PM

to Davis R Users' Group

Hi Gouri,

I would add one note to Mike's excellent advice: If you surveyed the same individuals at the two different time points, those two data points are not independent conditional on time of survey, and you might need to add a within-individual adjustment to account for this (i.e., a random effect for individual). You would use glmer() from the lme4 package instead of glm() to accomplish this.
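A rough sketch of what that could look like, assuming the lme4 package is installed and that each individual was surveyed once in Spring and once in Fall. The data are simulated, and the names `panel`, `id`, `x1`, `s`, and `y` are hypothetical.

```r
library(lme4)  # provides glmer()

set.seed(42)
n_id   <- 300
id_num <- rep(seq_len(n_id), each = 2)  # two rows per individual

# Hypothetical panel: first row per person is Spring (s = 1), second is Fall (s = 0).
panel <- data.frame(
  id = factor(id_num),
  s  = rep(c(1, 0), times = n_id),
  x1 = rnorm(2 * n_id)
)

# Individual-level random intercepts induce within-person correlation.
u   <- rnorm(n_id, sd = 1)
eta <- -0.5 + 0.8 * panel$x1 + 1.5 * panel$s + u[id_num]
panel$y <- rbinom(nrow(panel), size = 1, prob = plogis(eta))

# Logistic regression with a random intercept for each individual.
fit_re <- glmer(y ~ x1 + s + (1 | id), family = binomial, data = panel)

print(fixef(fit_re))    # fixed effects, log-odds scale
print(VarCorr(fit_re))  # estimated between-individual variance
```

This only applies when the same respondents appear in both waves; with independent samples in each wave, the plain glm() fit is the relevant model.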

Matt

Gouri Shankar Mishra

Jun 27, 2016, 1:52:05 PM

to davis-rug

Thanks Mike and Matt.

Actually, the survey respondents were not the same in both surveys - the only similarities were the zip codes and most of the survey questions.

I found the following note in Kenneth Train's 2003 book on discrete choice modeling (Chapter 2 of here) which seems to be exactly the situation I am facing.

A similar issue of interpretation arises when the same model is estimated on different data sets. The relative scale of the estimates from the two data sets reflects the relative variance of unobserved factors in the data sets. Suppose mode choice models were estimated in Chicago and Boston. For Chicago, the estimated cost coefficient is −0.55 and the estimated coefficient of time is −1.78. For Boston, the estimates are −0.81 and −2.69. The ratio of the cost coefficient to the time coefficient is very similar in the two cities: 0.309 in Chicago and 0.301 in Boston. However, the scale of the coefficients is about fifty percent higher for Boston than for Chicago. This scale difference means that the unobserved portion of utility has less variance in Boston than in Chicago: since the coefficients are divided by the standard deviation of the unobserved portion of utility, lower coefficients mean higher standard deviation and hence variance. The models are revealing that factors other than time and cost have less effect on people in Boston than in Chicago. Stated more intuitively, time and cost have more importance, relative to unobserved factors, in Boston than in Chicago, which is consistent with the larger scale of the coefficients for Boston.
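The arithmetic in the quoted passage can be checked in a few lines of R. The coefficient values are the ones given in the quote; the object names are just for illustration.

```r
# Coefficients as quoted from Train's Chicago / Boston example.
chicago <- c(cost = -0.55, time = -1.78)
boston  <- c(cost = -0.81, time = -2.69)

# Within-city ratio of cost to time coefficient: the scale cancels out.
ratio_chicago <- unname(chicago["cost"] / chicago["time"])  # about 0.309
ratio_boston  <- unname(boston["cost"]  / boston["time"])   # about 0.301

# Between-city ratio of each coefficient: the relative scale (roughly 1.5).
scale_ratio <- boston / chicago

print(round(c(chicago = ratio_chicago, boston = ratio_boston), 3))
print(round(scale_ratio, 2))
```

The within-city ratios nearly match while every Boston coefficient is about 1.5 times its Chicago counterpart, which is the scale difference the passage attributes to the variance of the unobserved factors.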

Have you faced a situation like this before?
