Regression analysis in contrastive studies?


cbechet90

Sep 30, 2016, 7:51:48 AM
to StatForLing with R
Hello everyone,

I would like to know if, conceptually speaking, Multinomial Logistic Regression can be applied to find areas of convergence/divergence among constructions and their cognates in other languages.

To give an example, I would like to determine how strongly in place of, in lieu of, and instead of are related to their French cognates à la place de and au lieu de, but also what differentiates them, on the basis of several syntactic and semantic features as predictors. I know that logistic regression can be used in studies of synonymy to predict a speaker's choice between near-synonyms or variants language-internally, but what about cross-linguistically attested constructions? Does it make sense to fit logistic models with constructions from different languages?

Thanking you in advance for your advice and suggestions,

C. Béchet.

Matías Guzmán Naranjo

Sep 30, 2016, 7:58:44 AM
to statforli...@googlegroups.com
IIRC Bresnan did that for varieties of English.

 Something you could try would be to fit two models, one for English, one for French, and then predict French with the English model and the other way around, and compare accuracy scores across models and to the baseline.


cbechet90

Sep 30, 2016, 8:45:19 AM
to StatForLing with R
> IIRC Bresnan did that for varieties of English.

Thanks for pointing it out to me! 

> Something you could try would be to fit two models, one for English, one for French, and then predict French with the English model and the other way around, and compare accuracy scores across models and to the baseline.

I'm not sure I follow. Do you mean comparing the two models with anova() to see whether one differs significantly from the other?

Matías Guzmán Naranjo

Sep 30, 2016, 11:15:07 AM
to statforli...@googlegroups.com
I mean the following:

You assign each level of your dependent variable a letter ('a', 'b', 'c', etc.) so that the English constructions are comparable with their French cognates (in lieu of and au lieu de get the same letter). You then split your data into training and testing sets, train one model on English and one on French, and calculate their respective accuracies on the corresponding testing sets. Finally, use the model trained on English data to predict the French data, and vice versa. For this to work, all levels and factors must be comparable. You can then compare how the models behave.
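To make that concrete, here is a minimal sketch in R (all names are hypothetical: data frames english and french with a shared, letter-recoded response CONSTRUCTION and identically named predictors SEMTYPE and POSITION):

library(nnet)  # multinom() for multinomial logistic regression

set.seed(1)
train.en <- sample(nrow(english), round(0.8 * nrow(english)))
train.fr <- sample(nrow(french),  round(0.8 * nrow(french)))

# one model per language, trained on 80% of that language's data
m.en <- multinom(CONSTRUCTION ~ SEMTYPE + POSITION, data=english[train.en, ])
m.fr <- multinom(CONSTRUCTION ~ SEMTYPE + POSITION, data=french[train.fr, ])

# within-language accuracy on the held-out 20%
acc.en <- mean(predict(m.en, english[-train.en, ]) == english$CONSTRUCTION[-train.en])
acc.fr <- mean(predict(m.fr, french[-train.fr, ])  == french$CONSTRUCTION[-train.fr])

# cross-language accuracy: English model on French data, and vice versa
acc.en.on.fr <- mean(predict(m.en, french)  == french$CONSTRUCTION)
acc.fr.on.en <- mean(predict(m.fr, english) == english$CONSTRUCTION)

# baseline: always guessing the most frequent construction
baseline.fr <- max(table(french$CONSTRUCTION)) / nrow(french)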

Christophe Bechet

Sep 30, 2016, 12:06:32 PM
to statforli...@googlegroups.com

Okay! Now I see. Thank you very much for the brilliant idea!



Stefan Th. Gries

Sep 30, 2016, 1:51:32 PM
to StatForLing with R
At the risk of (too much) promoting work I have been involved in ...

MGN> Something you could try would be to fit two models, one for
MGN> English, one for French, and then predict French with the English
MGN> model and the other way around, and compare accuracy scores across
MGN> models and to the baseline.
Yes, something like that could work and, with co-authors, I have done some recent work of a similar kind, namely work that proposes and exemplifies an approach called MuPDAR, which you could read up on in 2014j, 2014k, 2015k, 2015q, 2016a, 2016b, and "to appear e" on my website.

Sorry for self-advertising ...,
STG
--
Stefan Th. Gries
----------------------------------
Univ. of California, Santa Barbara
http://tinyurl.com/stgries
----------------------------------

Christophe Bechet

Sep 30, 2016, 2:13:56 PM
to statforli...@googlegroups.com

No worries about the self-promotion. Thank you for pointing me to interesting literature ☺


cbechet90

Mar 15, 2017, 1:57:30 PM
to StatForLing with R
Hello there,

Just a few methodological questions which I think are essential here.

I have never used the MuPDAR approach or mixed-effects models, and I wonder whether a mixed model would be the best option for what I plan to do.

As expected in studies of this kind, there might be subject-specific effects (in my case, endogeneity bias within each French source text if the construction occurs many times in a text, i.e. within-text variation).

Since the data for the other language (say, English) come from a different corpus, and hence from different texts, does it matter at all if I use the source text as a random effect and then try to predict the outcome in the other language?

Stefan Th. Gries

Mar 15, 2017, 2:14:54 PM
to StatForLing with R
What you want to do is probably best done like this:

- do R1 on the reference data, using random effects as required and possible;
- apply R1's results to the target data with either
  - no random effects (e.g., if your only random effects were speakers, who are different from the speakers in the reference data), or
  - only the random effects you want (e.g., random effects for words); check the syntax: ?predict.merMod
- check accuracy / compute deviations;
- do R2 on the accuracies/deviations, using random effects as required and possible.

If you have used neither mixed-effects models nor MuPDAR before, make
sure you know exactly what you're doing - MuPDAR isn't complex or
anything, but if you're trying to learn mixed-effects modeling at the
same time, it adds up ...

HTH,

Christophe Bechet

Mar 15, 2017, 4:45:12 PM
to statforli...@googlegroups.com
Thank you for the advice! I basically wanted to fit a binary logistic model to the data, but I was aware of the individual effects within some independent variables and of the likelihood of multicollinearity. An alternative would have been to 1) run a Multiple Correspondence Analysis and 2) fit a binary logistic regression on the resulting dimensions of variability. However, this would make the integration of a Time variable rather complex. Mixed models therefore seemed the best option to me. I just need to learn mixed-effects models first (Gries 2013: Chap. 5), then re-read the appropriate literature on the MuPDAR approach, and try to apply the method to cross-linguistic data.
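Concretely, that alternative would have looked something like this (a sketch only; french.data and the column names are hypothetical placeholders):

library(FactoMineR)

# 1) MCA on the categorical predictors
mca <- MCA(french.data[, c("COMPSYNT", "POSITION", "COMPTYPE")], graph=FALSE)

# 2) binary logistic regression on the first few MCA dimensions
french.data$dim1 <- mca$ind$coord[, 1]
french.data$dim2 <- mca$ind$coord[, 2]
french.data$dim3 <- mca$ind$coord[, 3]
m <- glm(VARIANT ~ dim1 + dim2 + dim3 + YEAR, data=french.data, family=binomial)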

Stefan Th. Gries

Mar 16, 2017, 1:15:08 AM
to StatForLing with R
> I just need to learn mixed-effects models first (Gries 2013: Chap. 5), then re-read the appropriate literature on the MuPDAR approach, and try to apply the method to cross-linguistic data.
Hm, my chapter 5 will not be enough to learn m-e modeling because, for a variety of reasons, I hardly discuss them there. Better read that chapter to get a good background understanding of regression modeling, then maybe my 2015 paper in Corpora and Gelman/Hill (or Field, Miles, & Field), and then you might be good to go.

Just my $0.02,

cbechet90

Mar 20, 2017, 7:34:47 PM
to StatForLing with R
Hello again,

I have fitted different mixed-effects models to my French data, and the model always fails to converge. The reason is that some values of several (combinations of) predictors predict the outcome perfectly, i.e. cross-tabulating the predictors with the outcome yields (quasi-)complete separation. So instead of using the predictors as is, the alternative would be to conflate some predictors into a few predictive dimensions using dimensionality-reduction techniques (VNC for the variable "Years", MCA for the categorical variables), except for the variables "Text genre" and "Author". For the former I think I can conflate the levels, but for the latter that is inconceivable.

1) Do you think I can remove the "Author" variable without losing too much of the random-effect structure?

2) How about using Random Forests instead of mixed-effects models? Would that solve the problem of (quasi-)complete separation? I attach a snippet with the structure of my data.

> summary(french.table.3)
 ARTICLE    COMPSYNT2     POSITION     COMPTYPE    SUBSTYPE2  VARIANT2      YEAR          DECADE            GENRE2
 No :815   DetPoss :332       :  1   Hum    :430   Comp :  2   1:487   Min.   :1550   Min.   :1550   Treat_Ess:293
 Yes:172   NP      :332   Fin :529   SoA    :228   Conc : 12   2:500   1st Qu.:1623   1st Qu.:1620   Novel    :216
           NFClau  :235   Init:275   Conc   :154   Contr:416           Median :1656   Median :1650   Drama    :143
           SubClau : 30   Med :182   Abstr  :101   Dep  : 45           Mean   :1650   Mean   :1646   Poetry   : 84
           ProForm : 19              Prop   : 31   Emp  :103           3rd Qu.:1686   3rd Qu.:1680   Memoirs  : 82
           PronPers: 15              Plant  : 16   Repl :324           Max.   :1719   Max.   :1710   Corresp  : 80
           (Other) : 24              (Other): 27   Subst: 85                                         (Other)  : 89
           AUTHOR2
 Corneille_P    : 59
 Galland_A      : 48
 Urfe_H_d       : 44
 Sevigne_Me_de  : 42
 Courcillon_P_de: 38
 Bossuet_JB     : 22
 (Other)        :734

I forgot to mention that the English corpus I use (namely, texts from Early English Books Online) is in almost no way comparable to the French corpus (Frantext), so using random effects in R2 of the MuPDAR procedure is not an option for the time being.

I thank you in advance.

C.B.

Stefan Th. Gries

Mar 20, 2017, 8:01:28 PM
to StatForLing with R
> I forgot to mention that the English corpus I use (namely, texts from Early English Books Online) is in almost no way comparable to the French corpus (Frantext), so using random effects in R2 of the MuPDAR procedure is not an option for the time being.
It is, just not the random effects from R1; in a minimal nutshell:

library(lme4)
# R1 on the English (reference) data
r1 <- glmer(englishchoices ~ englishpredictors + (1|ENGLISHAUTHOR),
            data=ENGLISH, family=binomial)
# apply R1 to French, dropping R1's random effects (see ?predict.merMod)
preds <- predict(r1, newdata=FRENCH, re.form=NA, type="response")
# TRUE where the French choice is the one the English model predicts
matched <- (preds > 0.5) == (frenchchoices == levels(frenchchoices)[2])
# R2 on the (mis)matches, now with the French random effects
r2 <- glmer(matched ~ frenchpredictors + (1|FRENCHAUTHOR),
            data=FRENCH, family=binomial)

Matías Guzmán Naranjo

Mar 20, 2017, 8:04:14 PM
to statforli...@googlegroups.com
> How about using Random Forests instead of mixed-effects models? Would that solve the problem of (quasi-)complete separation?

No. The model will run, but you'll still have the underlying issue. Try to find out why and where you have complete separation. In my experience, when this happens it is because something is fundamentally wrong.
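One quick way to look is to cross-tabulate each predictor against the outcome and flag levels with empty cells; a sketch, assuming the french.table.3 data frame from your summary above:

predictors <- c("ARTICLE", "COMPSYNT2", "POSITION", "COMPTYPE",
                "SUBSTYPE2", "GENRE2", "AUTHOR2")
for (p in predictors) {
  tab <- table(french.table.3[[p]], french.table.3$VARIANT2)
  # levels at which one of the two outcomes never occurs
  separated <- rownames(tab)[apply(tab == 0, 1, any)]
  if (length(separated) > 0)
    cat(p, ":", paste(separated, collapse=", "), "\n")
}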


cbechet90

Mar 20, 2017, 9:35:34 PM
to StatForLing with R
Thank you very much for the trick! I will try it as soon as I have solved my convergence problem (most likely using Bayesian mixed-effects models).

cbechet90

Mar 20, 2017, 10:27:57 PM
to StatForLing with R
I don't really see what's wrong, except that one or the other outcome sometimes does not occur at some levels of "Year" (e.g., outcome 2 occurs in 1551 whereas outcome 1 does not):

> head(table(YEAR, VARIANT2))
      VARIANT2
YEAR   1 2
  1550 6 4
  1551 0 3
  1552 1 0
  1553 5 0
  1554 2 0
  1555 2 0


The same for predictor "Author":

> head(table(AUTHOR2, VARIANT2))
                      VARIANT2
AUTHOR2                1 2
  Abbadie_J            1 2
  Anonymous            1 0
  Arnauld_A            3 1
  Arnauld_A_Lancelot_C 1 0
  Arnauld_A_Nicole_P   1 1
  Arnauld_D'Andilly_R  5 3


The same for "Complement type" (this is normal and should be fine after conflation):

> tail(table(COMPTYPE, VARIANT2))
        VARIANT2
COMPTYPE   1   2
  Anim     5   7
  Conc    93  61
  Hum     52 378
  Plant   11   5
  Prop    31   0
  SoA    228   0


Here is what I get when trying to fit a mixed-effects logistic regression:

> m1 <- glmer(VARIANT2 ~ YEAR + COMPSYNT2 + POSITION + SUBSTYPE2 + (1|AUTHOR2/GENRE2), data=french.table.3, family=binomial)
Warning messages:
1: In (function (fn, par, lower = rep.int(-Inf, n), upper = rep.int(Inf,  :
  failure to converge in 10000 evaluations
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  unable to evaluate scaled gradient
3: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge: degenerate Hessian with 1 negative eigenvalues

So I have many levels with zero cells, and I don't know whether I should remove such observations or whether there is a way to penalize the model so that no observations have to be taken out.
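One penalization I have come across is Firth's bias-reduced logistic regression (package logistf), which keeps all observations under (quasi-)complete separation, though it offers no random effects, so "Author" could at best enter as a fixed effect. A rough sketch of what I might try:

library(logistf)  # Firth's penalized likelihood handles (quasi-)complete separation
french.table.3$VAR01 <- as.numeric(french.table.3$VARIANT2 == "2")  # 0/1 response
m.firth <- logistf(VAR01 ~ YEAR + COMPSYNT2 + POSITION + SUBSTYPE2 + GENRE2,
                   data=french.table.3)
summary(m.firth)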

cbechet90

Mar 24, 2017, 8:37:56 AM
to StatForLing with R
It may be preferable for me to post a sample of my dataset so that you can get a clear picture of what my data look like.

I have tried many things, but all models return warnings, except when I fit a Bayesian mixed-effects logistic model:

> priors = list(R = list(fix=1, V=(1/k) * (I + J), n = k - 1),
+                 G = list(G1 = list(V = diag(k - 1), n = k - 1)))
> m0<-MCMCglmm(VARIANT2 ~ PERIOD + dim1 + dim2 + dim3, random = ~ AUTHOR2, data=new.french.data, family="categorical", prior=priors,  verbose = TRUE, burnin = 10000, nitt = 60000, thin = 50)



MCMC iteration = 60000

Acceptance ratio for liability set 1 = 0.377207


> summary(m0)

Iterations = 10001:59951
Thinning interval = 50
Sample size = 1000

DIC: 334.0763

G-structure: ~AUTHOR2

        post.mean l-95% CI u-95% CI eff.samp
AUTHOR2      3.06    1.071    5.227    132.4

R-structure: ~units

      post.mean l-95% CI u-95% CI eff.samp
units         1        1        1        0

Location effects: VARIANT2 ~ PERIOD + dim1 + dim2 + dim3

            post.mean l-95% CI u-95% CI eff.samp  pMCMC
(Intercept)  -36.3831 -61.7319  -6.2908    8.713 <0.001 ***
PERIOD2       30.4970  -0.5038  55.8866    9.040  0.042 *
PERIOD3       32.4969   1.3983  57.3040    8.758  0.006 **
PERIOD4       34.2513   3.4148  59.1423    8.640 <0.001 ***
dim1         -10.5058 -12.3884  -8.6683   13.407 <0.001 ***
dim2          -5.0967 -10.9888   1.5600    1.478  0.148
dim3          -2.7005  -3.4382  -1.9629   56.175 <0.001 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


> HPDinterval(m0$Sol)
                  lower     upper
(Intercept) -61.7318921 -6.290827
PERIOD2      -0.5037781 55.886554
PERIOD3       1.3982645 57.304038
PERIOD4       3.4148215 59.142335
dim1        -12.3883913 -8.668259
dim2        -10.9887539  1.560011
dim3         -3.4382181 -1.962925
attr(,"Probability")
[1] 0.95


However, I'm not yet familiar with full Bayesian models (and I still don't know if I will go full Bayesian at all).
French_table_small.txt

Matías Guzmán Naranjo

Mar 24, 2017, 8:41:07 AM
to statforli...@googlegroups.com
MCMCglmm will not give you autocorrelation warnings; you have to check for autocorrelation yourself.
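For instance (a minimal sketch, assuming the m0 object from your post):

library(coda)          # loaded by MCMCglmm anyway
autocorr(m0$Sol)       # lag correlations for the fixed effects
effectiveSize(m0$Sol)  # effective sample sizes; values far below your 1000 samples signal trouble
plot(m0$Sol)           # trace plots should look like white noise, not slow drifts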


cbechet90

Mar 24, 2017, 9:39:56 AM
to StatForLing with R
I've checked autocorrelation, and unfortunately it returns very high values for almost all predictors.

> autocorr(m0$Sol)
, , (Intercept)

         (Intercept)    PERIOD2    PERIOD3    PERIOD4        dim1        dim2        dim3
Lag 0      1.0000000 -0.9979636 -0.9987144 -0.9995128 -0.22475550 -0.05202329 -0.06432539
Lag 50     0.9794671 -0.9772533 -0.9780940 -0.9790762 -0.22131822 -0.05023164 -0.05576551
Lag 250    0.9160509 -0.9138273 -0.9145092 -0.9154895 -0.20607164 -0.04531620 -0.05433830
Lag 500    0.8463777 -0.8448114 -0.8440027 -0.8452010 -0.17869369 -0.03714499 -0.05288979
Lag 2500   0.4127582 -0.4134141 -0.4112979 -0.4106319 -0.03658181  0.06187555 -0.00959807

, , PERIOD2

         (Intercept)   PERIOD2   PERIOD3   PERIOD4       dim1        dim2       dim3
Lag 0     -0.9979636 1.0000000 0.9974794 0.9979766 0.24628615  0.06581125 0.07513141
Lag 50    -0.9774737 0.9773531 0.9767655 0.9775991 0.24308384  0.06427438 0.06838424
Lag 250   -0.9146997 0.9131223 0.9136999 0.9146370 0.22541688  0.05919352 0.06648012
Lag 500   -0.8469621 0.8459833 0.8450712 0.8462735 0.19603440  0.05051837 0.06254785
Lag 2500  -0.4097705 0.4107252 0.4086071 0.4078016 0.04111295 -0.05059724 0.01413806

, , PERIOD3

         (Intercept)   PERIOD2   PERIOD3   PERIOD4       dim1        dim2       dim3
Lag 0     -0.9987144 0.9974794 1.0000000 0.9989450 0.25138796  0.07306676 0.07760686
Lag 50    -0.9788200 0.9773140 0.9788218 0.9791348 0.24801196  0.07120765 0.07066260
Lag 250   -0.9163190 0.9147659 0.9156771 0.9164639 0.23211295  0.06642308 0.06814145
Lag 500   -0.8475002 0.8464786 0.8457743 0.8469185 0.20148328  0.05866247 0.06588473
Lag 2500  -0.4142678 0.4152285 0.4131917 0.4125087 0.05033845 -0.04440010 0.01736605
...

Matías Guzmán Naranjo

Mar 24, 2017, 9:45:02 AM
to statforli...@googlegroups.com
Yeah, it's awful. Your model is basically not converging. You have a serious issue with your predictors. I don't have time to go through your data thoroughly right now; maybe Stefan can help.


cbechet90

Mar 24, 2017, 10:02:11 AM
to StatForLing with R
I was wondering whether I should really use random effects at all, since both levels of the response variable are not present for all authors, let alone for each author in each time period. Within-author variation is certainly worth looking at. The biggest issue here is complete separation, and conflating the levels of the author variable is certainly not an option, for theoretical reasons. I managed to conflate explanatory variables using Multiple Correspondence Analysis (hence the "dim1", "dim2", "dim3" ... in the sample dataset), but it seems to me that reducing the dimensions of variability often increases the C-value in logistic regression.