[R] Separation issue in binary response models - glm, brglm, logistf


Xochitl CORMON

Feb 27, 2013, 12:04:42 PM
to r-h...@r-project.org, Xochitl CORMON
Dear all,

I am encountering some issues with my data and need some help.
I am trying to run a glm analysis with a presence/absence variable as
the response variable and several explanatory variables (time, location,
presence/absence data, abundance data).

First I tried the glm() function; however, I got two warnings from
glm.fit():
# 1: glm.fit: algorithm did not converge
# 2: glm.fit: fitted probabilities numerically 0 or 1 occurred
After some investigation I found out that the problem was most probably
quasi-complete separation, and I therefore decided to use brglm and/or
logistf.
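In case it is useful, here is a minimal sketch of one way to check for
this (the two-predictor formula is illustrative only; the full formula
is given below):

## refit with glm() and look for fitted probabilities pinned at 0 or 1,
## which is the symptom of (quasi-)separation
fit <- glm(Presence.S ~ Latitude + Longitude, family = binomial,
           data = CPUE_table)
range(fitted(fit))
sum(fitted(fit) < 1e-8 | fitted(fit) > 1 - 1e-8)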

* logistf : analysis does not run
When running logistf() I get an error message saying:
# error in chol.default(x) :
# leading minor 39 is not positive definite
I looked into the logistf package manual, on the Internet, and in the
theoretical and technical papers of Heinze and Ploner, and cannot find
where this function is used or whether the error can be fixed by some
setting.

* brglm : analysis runs
However I get a warning message saying:
# In fit.proc(x = X, y = Y, weights = weights, start = start,
#   etastart = etastart, ... : Iteration limit reached
As before, I cannot find where and why this function is used while
running the package, or whether the warning can be fixed by adjusting
some settings.

More generally, I was wondering what the fundamental differences
between these packages are.

I hope this makes enough sense, and I apologize if the answer is
statistically obvious and I am simply not aware of it.

It is my first time asking a question here, so I apologize if it is not
in the proper form; please do not hesitate to let me know.

Thank you for your help

Xochitl C.

-----------------------------------------------------------------------

Here is an extract of my table and the formulas I run:

> head(CPUE_table)
  Year Quarter Subarea Latitude Longitude Presence.S CPUE.S Presence.H CPUE.H Presence.NP CPUE.NP
1 2000       1    31F1    51.25       1.5          0      0          0      0           0       0
2 2000       1    31F2    51.25       2.5          0      0          0      0           0       0
3 2000       1    32F1    51.75       1.5          0      0          0      0           0       0
4 2000       1    32F2    51.75       2.5          0      0          0      0           0       0
5 2000       1    32F3    51.75       3.5          0      0          0      0           0       0
6 2000       1    33F1    52.25       1.5          0      0          0      0           0       0
  Presence.BW CPUE.BW Presence.C CPUE.C Presence.P CPUE.P Presence.W   CPUE.W
1           0       0          1 76.002          0      0          1 3358.667
2           0       0          1 12.500          0      0          1 3028.500
3           0       0          1  5.500          0      0          1 2256.750
4           0       0          1 10.000          0      0          1  808.000
5           0       0          1 19.000          0      0          1  277.000
6           0       0          0  0.000          0      0          1    2.000

> tail(CPUE_table)
     Year Quarter Subarea Latitude Longitude Presence.S   CPUE.S Presence.H  CPUE.H Presence.NP  CPUE.NP
4435 2012       3    50F3    60.75       3.5          1  103.000          1 110.000           1  2379.00
4436 2012       3    51E8    61.25      -1.5          1 1311.600          1  12.000           1  4194.78
4437 2012       3    51E9    61.25      -0.5          1   34.336          1  46.671           1 11031.56
4438 2012       3    51F0    61.25       0.5          1  430.500          1 148.000           1  1212.22
4439 2012       3    51F1    61.25       1.5          1  115.000          1  85.000           1  2089.50
4440 2012       3    51F2    61.25       2.5          1   72.500          1  35.500           1   270.48
     Presence.BW  CPUE.BW Presence.C CPUE.C Presence.P CPUE.P Presence.W  CPUE.W
4435           1   20.000          1  6.000          0      0          1  22.000
4436           0    0.000          1 18.000          0      0          0   0.000
4437           1    2.668          1  3.335          0      0          1   3.333
4438           1 3279.200          1  2.000          0      0          1   2.000
4439           1    1.000          1 22.000          1      2          1  40.000
4440           1  516.300          1 11.500          1      1          1  16.000

logistf_binomPres <- logistf(Presence.S ~ (Presence.BW + Presence.W +
    Presence.C + Presence.NP + Presence.P + Presence.H + CPUE.BW + CPUE.H +
    CPUE.P + CPUE.NP + CPUE.W + CPUE.C + Year + Quarter + Latitude +
    Longitude)^2, data = CPUE_table)

Brglm_binomPres <- brglm(Presence.S ~ (Presence.BW + Presence.W +
    Presence.C + Presence.NP + Presence.P + Presence.H + CPUE.BW + CPUE.H +
    CPUE.P + CPUE.NP + CPUE.W + CPUE.C + Year + Quarter + Latitude +
    Longitude)^2, family = binomial, data = CPUE_table)

-----------------------------------------------------------------------



--

<>< <>< <>< <><

Xochitl CORMON
+33 (0)3 21 99 56 84

Doctorante en sciences halieutiques
PhD student in fishery sciences

<>< <>< <>< <><

IFREMER
Centre Manche Mer du Nord
150 quai Gambetta
62200 Boulogne-sur-Mer

<>< <>< <>< <><

______________________________________________
R-h...@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Ben Bolker

Feb 28, 2013, 11:22:17 AM
to r-h...@stat.math.ethz.ch
Xochitl CORMON <Xochitl.Cormon <at> ifremer.fr> writes:

> Dear all,
>
> I am encountering some issues with my data and need some help.
> I am trying to run a glm analysis with a presence/absence variable as
> the response variable and several explanatory variables (time, location,
> presence/absence data, abundance data).
>
> First I tried the glm() function; however, I got two warnings from
> glm.fit():
> # 1: glm.fit: algorithm did not converge
> # 2: glm.fit: fitted probabilities numerically 0 or 1 occurred
> After some investigation I found out that the problem was most probably
> quasi-complete separation, and I therefore decided to use brglm and/or
> logistf.
>
> * logistf : analysis does not run
> When running logistf() I get an error message saying:
> # error in chol.default(x) :
> # leading minor 39 is not positive definite
> I looked into the logistf package manual, on the Internet, and in the
> theoretical and technical papers of Heinze and Ploner, and cannot find
> where this function is used or whether the error can be fixed by some
> setting.

chol.default is a function for Cholesky decomposition, which is
going to be embedded fairly deeply in the code ...
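If you want to see exactly where it is being called, running
traceback() immediately after the error prints the call stack (a quick
sketch):

## run this in the same session, right after logistf() fails:
traceback()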

> * brglm : analysis runs
> However I get a warning message saying:
> # In fit.proc(x = X, y = Y, weights = weights, start = start,
> #   etastart = etastart, ... : Iteration limit reached
> As before, I cannot find where and why this function is used while
> running the package, or whether the warning can be fixed by adjusting
> some settings.
>
> More generally, I was wondering what the fundamental differences
> between these packages are.

You might also take a crack with bayesglm() in the arm package,
which should (?) be able to overcome the separation problem by
specifying a not-completely-uninformative prior.
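A minimal sketch of what that might look like (the short formula is
illustrative only; bayesglm()'s default weakly informative Cauchy prior
is what keeps the coefficients finite under separation):

library(arm)
## drop-in replacement for glm(); substitute the full formula
fit_bayes <- bayesglm(Presence.S ~ Latitude + Longitude + Year,
                      family = binomial, data = CPUE_table)
summary(fit_bayes)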

> I hope this makes enough sense, and I apologize if the answer is
> statistically obvious and I am simply not aware of it.
>
> -----------------------------------------------------------------------
>
> Here is an extract of my table and the formulas I run:
>
> > head(CPUE_table)

[snip]

> logistf_binomPres <- logistf (Presence.S ~ (Presence.BW + Presence.W +
> Presence.C + Presence.NP +Presence.P + Presence.H +CPUE.BW + CPUE.H +
> CPUE.P + CPUE.NP + CPUE.W + CPUE.C + Year + Quarter + Latitude +
> Longitude)^2, data = CPUE_table)
>
> Brglm_binomPres <- brglm (Presence.S ~ (Presence.BW + Presence.W +
> Presence.C + Presence.NP +Presence.P + Presence.H +CPUE.BW + CPUE.H +
> CPUE.P + CPUE.NP + CPUE.W + CPUE.C + Year + Quarter + Latitude +
> Longitude)^2, family = binomial, data = CPUE_table)

It's not much to go on, but:

* are you overfitting your data? That is, do you have at least 20 times
as many 1's or 0's (whichever is rarer) as the number of parameters you
are trying to estimate? (see the sketch after this list)
* have you examined your data graphically and looked for any strong
outliers that might be throwing off the fit?
* do you have some strongly correlated/multicollinear predictors?
* for what it's worth it looks like a variety of your variables might
be dummy variables, which you can often express more compactly by
using a factor variable and letting R construct the design matrix
(i.e. generating the dummy variables on the fly), although that shouldn't
change your results
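For the first bullet, a sketch of the arithmetic, assuming your
response is Presence.S and using model.matrix() to count the parameters
your formula implies:

## columns of the design matrix = number of parameters to estimate
p <- ncol(model.matrix(~ (Presence.BW + Presence.W + Presence.C +
        Presence.NP + Presence.P + Presence.H + CPUE.BW + CPUE.H +
        CPUE.P + CPUE.NP + CPUE.W + CPUE.C + Year + Quarter +
        Latitude + Longitude)^2, data = CPUE_table))
## rarer outcome count per parameter; roughly >= 20 is the rule of thumb
min(table(CPUE_table$Presence.S)) / p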

Xochitl CORMON

Feb 28, 2013, 11:56:59 AM
to r-h...@r-project.org


On 28/02/2013 17:22, Ben Bolker wrote:

Thank you for your help!

> Xochitl CORMON <Xochitl.Cormon <at> ifremer.fr> writes:
>
>> Dear all,
>>
>> I am encountering some issues with my data and need some help. I am
>> trying to run a glm analysis with a presence/absence variable as
>> the response variable and several explanatory variables (time,
>> location, presence/absence data, abundance data).
>>
>> First I tried the glm() function; however, I got two warnings from
>> glm.fit() : # 1: glm.fit: algorithm did not converge # 2: glm.fit:
>> fitted probabilities numerically 0 or 1 occurred. After some
>> investigation I found out that the problem was most probably
>> quasi-complete separation, and I therefore decided to use brglm and/or
>> logistf.
>>
>> * logistf : analysis does not run. When running logistf() I get an
>> error message saying : # error in chol.default(x) : # leading minor
>> 39 is not positive definite. I looked into the logistf package manual,
>> on the Internet, and in the theoretical and technical papers of Heinze
>> and Ploner, and cannot find where this function is used or whether the
>> error can be fixed by some setting.
>
> chol.default is a function for Cholesky decomposition, which is going
> to be embedded fairly deeply in the code ...

If I understand correctly, I should just not use this package, as this
error is not easily fixable?

>

>> * brglm : analysis runs. However I get a warning message saying : #
>> In fit.proc(x = X, y = Y, weights = weights, start = start,
>> etastart = etastart, ... : # Iteration limit reached. As before, I
>> cannot find where and why this function is used while running the
>> package, or whether the warning can be fixed by adjusting some
>> settings.
>>
>> More generally, I was wondering what the fundamental differences
>> between these packages are.
>
> You might also take a crack with bayesglm() in the arm package, which
> should (?) be able to overcome the separation problem by specifying a
> not-completely-uninformative prior.

Thank you for the tip; I will have a look at this package and its
documentation tomorrow. Do you have any idea what this fit.proc
function is?

>

>> I hope this makes enough sense, and I apologize if the answer is
>> statistically obvious and I am simply not aware of it.
>>
>> -----------------------------------------------------------------------
>>
>> Here is an extract of my table and the formulas I run:
>>
>>> head(CPUE_table)
>
> [snip]
>
>> logistf_binomPres<- logistf (Presence.S ~ (Presence.BW + Presence.W
>> + Presence.C + Presence.NP +Presence.P + Presence.H +CPUE.BW +
>> CPUE.H + CPUE.P + CPUE.NP + CPUE.W + CPUE.C + Year + Quarter +
>> Latitude + Longitude)^2, data = CPUE_table)
>>
>> Brglm_binomPres<- brglm (Presence.S ~ (Presence.BW + Presence.W +
>> Presence.C + Presence.NP +Presence.P + Presence.H +CPUE.BW + CPUE.H
>> + CPUE.P + CPUE.NP + CPUE.W + CPUE.C + Year + Quarter + Latitude +
>> Longitude)^2, family = binomial, data = CPUE_table)
>
> It's not much to go on, but:

Yeah, sorry, my table header came out really badly in the email :s

>

> * are you overfitting your data? That is, do you have at least 20
> times as many 1's or 0's (whichever is rarer) as the number of
> parameters you are trying to estimate?

I have 16 explanatory variables, and with the interactions we get to
136 parameters.

> length(which(CPUE_table[,] == 0))
[1] 33466

> length(which(CPUE_table[,] == 1))
[1] 17552

So I assume overfitting is not a problem here, is it?


> * have you examined your data graphically and looked for any strong
> outliers that might be throwing off the fit?

I did check my data graphically in a lot of different ways. However, if
you have any particular suggestions, please let me know. Concerning
strong outliers, I do not really understand what you mean. I have
outliers here and there, but how can I know whether they are strong
enough to throw off the fit? Most of the time they are really high
abundances, which comes from the fact that I am using survey data; they
are probably cases where the boat fished over a fish school.

> * do you have some strongly correlated/multicollinear predictors?

It is survey data, so the observations are indeed correlated in time
and space. However, I checked the scatterplot matrix and did not notice
any linear relation between variables.
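(Roughly what I looked at, as a sketch; the column selection here is
illustrative:)

## pairwise correlations among the CPUE predictors
round(cor(CPUE_table[, c("CPUE.BW", "CPUE.H", "CPUE.P", "CPUE.NP",
                         "CPUE.W", "CPUE.C")]), 2)
## scatterplot matrix
pairs(CPUE_table[, c("CPUE.BW", "CPUE.H", "CPUE.W",
                     "Latitude", "Longitude")])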

> * for what it's worth it looks like a variety of your variables
> might be dummy variables, which you can often express more compactly
> by using a factor variable and letting R construct the design matrix
> (i.e. generating the dummy variables on the fly), although that
> shouldn't change your results

I will read up on the dummy-variable concept, as to be honest I do not
really understand what it means...

Thank you again for your time and help

Ben Bolker

Feb 28, 2013, 12:56:20 PM
to r-h...@stat.math.ethz.ch
Xochitl CORMON <Xochitl.Cormon <at> ifremer.fr> writes:

> Le 28/02/2013 17:22, Ben Bolker a écrit :
> > Xochitl CORMON <Xochitl.Cormon <at> ifremer.fr> writes:

> >> I am encountering some issues with my data and need some help. I am
> >> trying to run a glm analysis with a presence/absence variable as
> >> the response variable and several explanatory variables (time,
> >> location, presence/absence data, abundance data).

[snip]

> >> * logistf : analysis does not run. When running logistf() I get an
> >> error message saying : # error in chol.default(x) : # leading minor
> >> 39 is not positive definite. I looked into the logistf package
> >> manual, on the Internet, and in the theoretical and technical papers
> >> of Heinze and Ploner, and cannot find where this function is used or
> >> whether the error can be fixed by some setting.
> >
> > chol.default is a function for Cholesky decomposition, which is going
> > to be embedded fairly deeply in the code ...
>
> If I understand correctly, I should just not use this package, as this
> error is not easily fixable?

Yes.

> >> * brglm : analysis runs. However I get a warning message saying : #
> >> In fit.proc(x = X, y = Y, weights = weights, start = start,
> >> etastart = etastart, ... : # Iteration limit reached. As before, I
> >> cannot find where and why this function is used while running the
> >> package, or whether the warning can be fixed by adjusting some
> >> settings.
> >>
> >> More generally, I was wondering what the fundamental differences
> >> between these packages are.
> >
> > You might also take a crack with bayesglm() in the arm package, which
> > should (?) be able to overcome the separation problem by specifying a
> > not-completely-uninformative prior.
>
> Thank you for the tip; I will have a look at this package and its
> documentation tomorrow. Do you have any idea what this fit.proc
> function is?

It is again deep inside brglm. You can use debug() to try to
follow the process, but it will probably not help much. **However**
you should definitely see ?brglm and especially ?brglm.control ...
adding

control.brglm=brglm.control(br.maxit=1000)

to your function call might help (the default is 100)
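i.e., something like this (a sketch; the formula is abbreviated, use
your full one):

library(brglm)
## raise the Firth-iteration limit from the default of 100
Brglm_binomPres <- brglm(Presence.S ~ Latitude + Longitude,
                         family = binomial, data = CPUE_table,
                         control.brglm = brglm.control(br.maxit = 1000))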

> >> Here is an extract of my table and the formulas I run:
> >>
> >>> head(CPUE_table)
> >
> > [snip]
> >
> >> logistf_binomPres<- logistf (Presence.S ~ (Presence.BW + Presence.W
> >> + Presence.C + Presence.NP +Presence.P + Presence.H +CPUE.BW +
> >> CPUE.H + CPUE.P + CPUE.NP + CPUE.W + CPUE.C + Year + Quarter +
> >> Latitude + Longitude)^2, data = CPUE_table)
> >>
> >> Brglm_binomPres<- brglm (Presence.S ~ (Presence.BW + Presence.W +
> >> Presence.C + Presence.NP +Presence.P + Presence.H +CPUE.BW + CPUE.H
> >> + CPUE.P + CPUE.NP + CPUE.W + CPUE.C + Year + Quarter + Latitude +
> >> Longitude)^2, family = binomial, data = CPUE_table)
> >
> > It's not much to go on, but:
>

> > * are you overfitting your data? That is, do you have at least 20
> > times as many 1's or 0's (whichever is rarer) as the number of
> > parameters you are trying to estimate?
>
> I have 16 explanatory variables, and with the interactions we get to
> 136 parameters.
>
> > length(which(CPUE_table[,] == 0))
> [1] 33466
>
> > length(which(CPUE_table[,] == 1))
> [1] 17552
>
> So I assume overfitting is not a problem here, is it?

No, overfitting is *bad*. But you do have a large data set, so you
might get away with fitting a model that's this complex. I would
certainly start by seeing if you can successfully fit your model with
main effects only (i.e. temporarily get rid of the ^2).
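For example (a sketch with plain glm(); same predictors, main effects
only):

glm_main <- glm(Presence.S ~ Presence.BW + Presence.W + Presence.C +
                    Presence.NP + Presence.P + Presence.H + CPUE.BW +
                    CPUE.H + CPUE.P + CPUE.NP + CPUE.W + CPUE.C + Year +
                    Quarter + Latitude + Longitude,
                family = binomial, data = CPUE_table)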

I don't think that your statements above really count the
zeros and ones in the _response_ variable -- I think you need

table(CPUE_table$Presence.S)

> > * have you examined your data graphically and looked for any strong
> > outliers that might be throwing off the fit?
>
> I did check my data graphically in a lot of different ways. However, if
> you have any particular suggestions, please let me know. Concerning
> strong outliers, I do not really understand what you mean. I have
> outliers here and there, but how can I know whether they are strong
> enough to throw off the fit? Most of the time they are really high
> abundances, which comes from the fact that I am using survey data; they
> are probably cases where the boat fished over a fish school.
>
> > * do you have some strongly correlated/multicollinear predictors?
>
> It is survey data, so the observations are indeed correlated in time
> and space. However, I checked the scatterplot matrix and did not
> notice any linear relation between variables.
>
> > * for what it's worth it looks like a variety of your variables
> > might be dummy variables, which you can often express more compactly
> > by using a factor variable and letting R construct the design matrix
> > (i.e. generating the dummy variables on the fly), although that
> > shouldn't change your results
>
> I will read up on the dummy-variable concept, as to be honest I do
> not really understand what it means...
>

What it means is that if you have a categorical factor with
several levels (e.g. A, B, C) the way to fit a linear model
is to construct 'dummy variables' or 'contrasts' so that instead
of

var = A B C C B B A

you have three variables:

intercept B C
1 0 0
1 1 0
1 0 1
1 0 1
1 1 0
1 1 0
1 0 0
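In R you can see this coding directly with model.matrix() (a toy
example reproducing the table above):

var <- factor(c("A", "B", "C", "C", "B", "B", "A"))
model.matrix(~ var)   ## columns: (Intercept), varB, varC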
