Logistic regression in R


Ben

Mar 17, 2007, 12:39:07 AM3/17/07
to CorpLing with R
Dear all,

here's a probably pretty basic question about logistic regression in
R. I've read Dalgaard (2002: ch. 11) on the topic, but I find this
chapter not too informative ... and 'help(glm)' I find confusing ;-)

Anyways, what I would like to do is fit a logistic regression model
with one binary dependent ("TOKEN") and three independents
("PORANI", "PORLTH", "FINALSIB"); FINALSIB is binary as well, PORANI is
4-way categorical.

Thus, I imported an Excel spreadsheet into a table, which looks as
follows:

> genitives
    TOKEN PORANI PORLTH FINALSIB
1       0      4      3        1
2       1      3      1        0
3       0      3      3        1
4       0      4      4        0
5       1      3      1        0
.
.
.
999     0      4      2        0

So, every line represents one observation -- and I would like to know
how the independents predict group membership of TOKEN (0 or 1).

This is what I've tried:

> model = glm(TOKEN~PORANI+PORLTH+FINALSIB,family=binomial("logit"))

This yields the following output (inter alia):

[snip]
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  4.59119    0.31358  14.641  < 2e-16 ***
PORANI      -0.82485    0.07386 -11.168  < 2e-16 ***
PORLTH      -0.95424    0.09277 -10.286  < 2e-16 ***
FINALSIB    -1.16648    0.21606  -5.399 6.71e-08 ***
[/snip]

So far so good. Here are my questions though:

- where and how can I specify that PORANI and FINALSIB are categorical/
binary variables? The above estimates seem to assume they are scalar
variables.
- how can I get R to report odds ratios, instead of merely
unstandardized coefficients?
- is there a way, by any chance, to obtain a Pseudo-R sq. statistic
(approximately how much of the overall variance in the dependent is
accounted for by the independents in the model) ... ?

thanks a million
-Ben

Stefan Th. Gries

Mar 17, 2007, 2:32:55 AM3/17/07
to CorpLing with R
Binary logistic regression in R, that's not one of the easiest things
to do, especially when coming from a different software package such
as, now what would be an unlikely candidate, maybe SPSS ... (And I
don't have the faintest idea about multinomial logistic regression.) I
had to struggle with binary log regr quite a lot myself and my
understanding of all of this is still somewhat sketchy, but here's
what I came up with.

> - where and how can I specify that PORANI and FINALSIB are categorical/binary variables? The above estimates seem to assume they are scalar variables.
That is correct. The problem arises because when you loaded the data
you used read.table's default settings. By default, read.table treats
all columns that contain only numbers as numeric vectors and all
columns that contain character strings as factors. The functions lm
and glm (as well as many others), in turn, assume that vectors are
numerical/scalar variables whereas factors represent categorical
variables. This is what you can do:

- recode the categorical variables using character strings: instead of
0 and 1 write "bilabial" and "not_bilabial" or whatever 0 and 1 stand
for;
- change the way you read in the data: the function read.table can
take an argument colClasses, which specifies the data structure you'd
like each column to represent. Thus:

data <- read.table(file.choose(), header=T, sep="\t",
                   colClasses=c("numeric", "factor", "factor", "factor"))
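The first option can also be applied after loading: here is a minimal sketch (with a few invented rows standing in for Ben's genitives table, and made-up labels) that converts numerically coded columns to factors:

```r
# Invented stand-in for the genitives data (the labels are made up)
genitives <- data.frame(TOKEN    = c(0, 1, 0, 0, 1),
                        PORANI   = c(4, 3, 3, 4, 3),
                        PORLTH   = c(3, 1, 3, 4, 1),
                        FINALSIB = c(1, 0, 1, 0, 0))
genitives$PORANI   <- factor(genitives$PORANI)
genitives$FINALSIB <- factor(genitives$FINALSIB, levels = c(0, 1),
                             labels = c("no_sibilant", "sibilant"))
str(genitives)  # PORANI and FINALSIB should now show up as factors
```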

> - how can I get R to report odds ratios, instead of merely unstandardized coefficients?
> - is there a way, by any chance, to obtain a Pseudo-R sq. statistic (approximately how much of the overall variance in the dependent is accounted for by the independents in the model) ... ?


To explain this, let me walk you through a quick example, which also
explains a few things you might already know. Let's assume you have a
data set like this:


set.seed(1) # set the random number generator to ensure replicability across users
iv.int.1 <- runif(100, min=1, max=4) # a ratio-scaled independent variable
iv.int.2 <- runif(100, min=1, max=5) # a ratio-scaled independent variable
iv.cat.1 <- factor(sample(letters[1:2], 100, replace=T)) # a categorical independent variable, note the factor!
iv.cat.2 <- factor(sample(letters[3:5], 100, replace=T)) # a categorical independent variable, note the factor!
dv <- sample(0:1, 100, replace=T) # a binary dependent variable
data <- data.frame(iv.int.1, iv.int.2, iv.cat.1, iv.cat.2, dv) # a data frame with all the stuff

# write.table(data, file=file.choose(), sep="\t", row.names=F) # uncomment if you want to save this into a file


It is then best to (i) load a few packages that are useful and (ii)
configure R's way of computing differences between means and sums of
squares:


library(car) # for Anova
library(Design) # for lrm
options(contrasts=c("contr.sum", "contr.poly")) # for type III sums of squares and the right contrasts


There are now two ways of going about this, which are pretty
similar though.


# This line provides you with an answer regarding the overall
# R-squared value because it provides a variety of summary statistics:
# the chi-square (as L.R.), the p-value (as P), the R-square (as R2),
# etc. You will find these values when you do the same thing in SPSS
# (but cf. below!)
(model.lrm <- lrm(dv ~ iv.int.1+iv.int.2+iv.cat.1+iv.cat.2, tol=1E-300))

# And now this line, I guess, is the answer to your question about
# odds ratios: it gives you the strengths of the effects, same as the
# rightmost column in SPSS (but cf. below!)
(effect.sizes <- exp(model.lrm$coefficients))

# These lines give you nearly the same, but sometimes in a slightly
# different format. The residual deviance corresponds to what SPSS
# reports as -2 log-likelihood. This output does not provide an
# R-squared value but uses AIC as an indicator of model quality.
(model.glm <- glm(dv ~ iv.int.1+iv.int.2+iv.cat.1+iv.cat.2, family=binomial))
summary(model.glm)
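(Not part of the original post: to tie this back to the two questions, here is a minimal sketch with freshly simulated data showing odds ratios taken straight from a glm fit, plus a McFadden-style pseudo-R-squared computed from the deviances that summary(model.glm) reports.)

```r
set.seed(1)
iv.int.1 <- runif(100, min = 1, max = 4)
iv.cat.1 <- factor(sample(letters[1:2], 100, replace = TRUE))
dv <- sample(0:1, 100, replace = TRUE)
model.glm <- glm(dv ~ iv.int.1 + iv.cat.1, family = binomial)

# odds ratios: exponentiate the log-odds coefficients
odds.ratios <- exp(coef(model.glm))

# McFadden's pseudo-R-squared: 1 - residual deviance / null deviance
pseudo.r2 <- 1 - model.glm$deviance / model.glm$null.deviance
```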

# And this gives you coefficients for each variable, as in SPSS (but
# cf. below!)
Anova(model.glm, type="III", test.statistic="Wald")


Now, what does this "SPSS (but cf below!)" mean all the time? It
means that there are several different ways of setting the contrasts.
The results of SPSS and R match up perfectly if

- in R, you set the contrast as above;
- in SPSS, you set the contrasts for the categorical covariates to
Deviation (), reference category: last.

Then you'll get exactly the same results. The additional prediction
statistics that SPSS generates can be obtained by using

predictions <- predict.lrm(model.lrm)

and then proceeding with a cross-tabulation.
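That cross-tabulation step might look like this (a sketch with simulated data, assuming a 0.5 cut-off for classifying predictions):

```r
set.seed(1)
x <- runif(100)
dv <- rbinom(100, 1, plogis(2 * x - 1))  # simulated binary outcome
m <- glm(dv ~ x, family = binomial)

# predicted probabilities, then observed vs. predicted class
p <- predict(m, type = "response")
classification <- table(observed = dv, predicted = as.numeric(p > 0.5))
accuracy <- mean((p > 0.5) == (dv == 1))  # proportion classified correctly
```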
(All of the above is true for R for Windows 2.4.1, svn rev 40228,
of 12/18/2006.) This is certainly not all one can say about this
topic, but that's it off the top of my head this morning. Excellent
books on this issue and related ones are the following. Faraway (2006)
is a whole book on stuff like this, and the books by Crawley are about
the best I have ever come across in terms of general statistics books
(they deal with log regr., too):

Crawley, Michael J. 2002. /Statistical computing: an introduction to
data analysis using S-Plus/. Chichester: John Wiley and Sons.
Crawley, Michael J. 2005. /Statistics: An introduction using R/.
Chichester: John Wiley and Sons.
Faraway, Julian J. 2005. /Linear models with R/. Boca Raton et al.:
Chapman and Hall.
Faraway, Julian J. 2006. /Extending the linear model with R:
generalized linear, mixed effects and nonparametric regression
models/. Boca Raton et al.: Chapman and Hall.

HTH,
STG
--
Stefan Th. Gries
-----------------------------------------------
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries
-----------------------------------------------

Ben

Mar 17, 2007, 5:29:21 AM3/17/07
to CorpLing with R
thank you so much for the elaborate instructions -- much appreciated!

> - recode the categorical variables using character strings: instead of
> 0 and 1 write "bilabial" and "not_bilabial" or whatever 0 and 1 stand
> for;
> - change the way you read in the data: the function read.table can
> take an argument colClasses, which specifies the data structure you'd
> like each column to represent. Thus:
> data<-read.table(file.choose(), header=T, sep="\t",
> colClasses=c("double", "factor", "factor", "factor"))

all right -- easy enough.


>
> > - how can I get R to report odds ratios, instead of merely unstandardized coefficients?
> > - is there a way, by any chance, to obtain a Pseudo-R sq. statistic (approximately how much of the overall variance in the dependent is accounted for by the independents in the model) ... ?
>
> To explain this, let me walk you through a quick example,

[snip]

oh wow, not entirely trivial ... so I guess this is what I'm going to
do this weekend ;-)
as soon as I'm done, I might get back to you guys with a meticulous
comparison of R's output vis-à-vis the output of "some different"
statistical package, let's say SPSS ;-)

Oh, as an aside one other thing since you did mention multinomial
logistic regression: is anybody aware of *any* study in the realm of
linguistics where this procedure has been used ...?

best wishes
-Ben

Ben

Mar 17, 2007, 12:44:16 PM3/17/07
to CorpLing with R
> I might get back to you guys with a meticulous
> comparison of R's output vis-à-vis the output of "some different"
> statistical package, let's say SPSS

well, what am I to say -- the two outputs are a PERFECT match.
Awesome!! and actually pretty easy if you know the tricks, or know
someone who knows them :-)

-Ben

Stefan Th. Gries

Mar 17, 2007, 1:28:18 PM3/17/07
to CorpLing with R
Glad I could help! :-)

As to multinomial logistic regression: I know one of my colleagues -
Pat Clancy - here at UCSB used it once, but apart from that I have
never seen anyone use it. Unfortunately, I may actually end up having
to do this some time soon to validate results from a cluster analysis
of near synonyms, but that's still a little down the road and, from
what I've heard, the method is not going to be easy anyway. I have a
hunch of having read that, to make things match with SPSS, the
contrasts in SPSS might again have to be changed, because the SPSS
guys use a different contrast setting in their own two logistic
regression modules. But that's really all I know about this and I will
have to turn to Faraway (2006) for this myself.

jan.s...@googlemail.com

Mar 22, 2007, 9:31:09 PM3/22/07
to CorpLing with R
Hello,

I have actually used the multinom function from the package nnet in R
to model the choice between 3 (or 4) possessive constructions in Low
Saxon and it worked quite well. I had some additional problems because
I had to use a stratified random sample rather than a simple random
sample.
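For anyone who wants to try this, here is a minimal sketch of multinom on invented data (the construction labels and predictor are made up, not Jan's Low Saxon data):

```r
library(nnet)  # multinom ships with R as a recommended package
set.seed(1)
predictor <- runif(200)
construction <- factor(sample(c("s.genitive", "of.genitive", "dative"),
                              200, replace = TRUE))
m.multi <- multinom(construction ~ predictor, trace = FALSE)
summary(m.multi)  # one coefficient row per non-reference outcome level
```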

Jan Strunk
str...@linguistics.rub.de

Stefan Th. Gries

Mar 24, 2007, 5:45:38 AM3/24/07
to CorpLing with R
Ok, that's good to know, thanks a lot. I had read about that function
in Faraway (2006) and either Maindonald/Braun or Everitt/Hothorn, but
I haven't tried it yet, so I was not sure how easy it was going to be.
Is the output provided by multinom compatible with SPSS or other
software, or would one have to tweak the contrasts etc. to make them
the same?

DD

Mar 30, 2007, 11:18:06 AM3/30/07
to CorpLing with R
Hi,

I am the proud owner of a dataset that I could analyze beautifully
using ordinal logistic regression if it weren't for the fact that I
have taken multiple observations within subjects, so the error terms
are not independent across observations. Apparently, random effects
models are used in this situation. Does anyone have experience with
these, or pointers to relevant literature?

Thanks so much,
Dagmar

Stefan Th. Gries

Mar 30, 2007, 11:33:28 AM3/30/07
to CorpLing with R
Hi

The library Design allows you to compute ordinal logistic regression
with the function lrm, but I don't know whether it can handle the
repeated-measures issue you mention. As to random effects and repeated
measures, see Faraway (2006: Ch. 8 and 9); Ch. 10 in the same book
talks about generalized linear mixed models. Maindonald and Braun
(2003) have a section on ordinal regression models (Section 8.5) and
use polr (in the library MASS), and Chapters 8 and 9 of M&B deal with
a variety of models that may be useful in this connection.
That's it off the top of my head, but I HTH,
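For completeness, a minimal polr sketch on invented data (an ordered three-level outcome modelled from one numeric predictor; nothing here is from Dagmar's dataset):

```r
library(MASS)  # polr
set.seed(1)
x <- runif(200)
rating <- cut(x + rnorm(200, sd = 0.3), breaks = 3,
              labels = c("low", "mid", "high"), ordered_result = TRUE)
m.polr <- polr(rating ~ x, Hess = TRUE)  # Hess = TRUE so summary() works
summary(m.polr)  # slope for x plus the two cutpoints (intercepts)
```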

kath...@gmail.com

Apr 1, 2007, 1:45:33 PM4/1/07
to CorpLing with R

I don't have an answer, but I, too, would love help on this. I've
been using the package nlme for mixed effects models, using the
function lme for doing linear regression with random effects. However,
I also would like to be able to do binomial (or more) logistic
regression with random effects. I've been given to understand that
the nlme function(s) will do this, but I haven't figured out how yet.

Kathryn

--
Kathryn Campbell-Kibler
University of Michigan
http://www.umich.edu/~kbck

Stefan Th. Gries

Apr 1, 2007, 2:01:17 PM4/1/07
to CorpLing with R
Kathryn, according to
<http://tolstoy.newcastle.edu.au/R/help/06/01/18381.html>, the
function lmer might be useful, but I haven't played around with this
myself.
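One route that needs only packages shipped with R is MASS::glmmPQL, which fits a binomial model with random effects on top of nlme (a sketch on invented data; in current lme4, glmer(..., family = binomial) would be the analogous call):

```r
library(MASS)  # glmmPQL; uses nlme for the random-effects machinery
set.seed(1)
d <- data.frame(subj = factor(rep(1:20, each = 10)),  # 20 subjects, 10 obs each
                x    = runif(200))
subj.eff <- rnorm(20)  # true per-subject random intercepts
d$y <- rbinom(200, 1, plogis(-0.5 + d$x + subj.eff[as.integer(d$subj)]))

# binomial logistic regression with a random intercept per subject
m.pql <- glmmPQL(y ~ x, random = ~ 1 | subj, family = binomial,
                 data = d, verbose = FALSE)
summary(m.pql)
```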

Austin F. Frank

Apr 5, 2007, 8:35:46 PM4/5/07
to CorpLing with R
On Mar 17, 12:39 am, "Ben" <ben...@googlemail.com> wrote:
> Anyways, what I would like to do is fit a logistic regression model
> with one binary dependent ("TOKEN") and three independents
> ("PORANI,PORLTH,FINALSIB"); FINALSIB is binary as well, PORANI is 4-
> way categorical.
>
> So, every line represents one observation -- and I would like to know
> how the independents predict group membership of TOKEN (0 or 1).
>
> - where and how can I specify that PORANI and FINALSIB are categorical/
> binary variables? The above estimates seem to assume they are scalar
> variables.

If you have the data in a dataframe called d, you would encode the
independent variables as categorical factors as follows:

d$PORANI <- as.factor(d$PORANI)
d$PORLTH <- as.factor(d$PORLTH)
d$FINALSIB <- as.factor(d$FINALSIB)
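If a particular reference level matters for the contrasts discussed earlier in the thread, it can be set explicitly after conversion (a sketch with invented values; the choice of "4" is arbitrary):

```r
d <- data.frame(PORANI = c(4, 3, 3, 4, 3))  # invented values
d$PORANI <- as.factor(d$PORANI)             # levels "3", "4" in sorted order
d$PORANI <- relevel(d$PORANI, ref = "4")    # make "4" the reference level
levels(d$PORANI)                            # now "4" "3"
```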

HTH,
/au

Benedikt Szmrecsanyi

Apr 6, 2007, 4:03:30 AM4/6/07
to corplin...@googlegroups.com
Austin F. Frank wrote:
> If you have the data in a dataframe called d, you would encode the
> independent variables as categorical factors as follows:
>
> d$PORANI <- as.factor(d$PORANI)
> d$PORLTH <- as.factor(d$PORLTH)
> d$FINALSIB <- as.factor(d$FINALSIB)
>
thanks -- yet another way to do it ;-)

best wishes
-Ben

jan.s...@googlemail.com

Apr 20, 2007, 10:27:08 AM4/20/07
to CorpLing with R
Sorry for answering after such a long time...

I have no real experience with SPSS, besides reading the excellent
book Discovering Statistics Using SPSS by Andy Field. So I can't
really compare the output of multinom with the SPSS methods. I used
contrast coding (which is the default, I think).

Those of you who are interested can look at my R notes
here:

http://www.linguistics.rub.de/~strunk/rnotes_forum.txt

(If you happen to see any errors, please tell me...)

Best,

Jan Strunk
str...@linguistics.rub.de
