
Apr 27, 2017, 7:16:52 AM

to StatForLing with R

Hello R users,

I've got a question that is not specifically related to R programming for statistical purposes but rather a conceptual one.

After reading all the papers on the MuPDAR(F) approach, I noticed that the dependent variable of *R2* was operationalised differently in each paper. Let me summarise this for each paper in order of appearance.

2014k (Gries & Deshors)

Method: linear regression

Dependent variable in *R2*: all the predicted numerical deviations (*Dev*), **including** cases of no deviation (0s)

How *Dev* is computed: p_observed - 0.5

2014j (Gries & Adelman)

Method: mixed-effects logistic regression

Dependent variable in *R2*: binary response *Correct* (yes v. no)

2015q (Wulff & Gries)

Method: binary logistic regression

Dependent variable in *R2*: binary response *Correct* (TRUE v. FALSE)

2016e (Gries & Bernaisch)

Method: binary logistic regression

Dependent variable in *R2*: all the predicted numerical deviations (*VarietySpecificity*), **including** cases of no deviation (0s)

How *VarietySpecificity* is computed: 0.5 - p_observed

2016a (Deshors & Gries)

Method: random forest

Dependent variable in *R2*: all the predicted numerical deviations (*DEVIATION*), **excluding** cases of no deviation (0s)

What I would like to know is whether there are theoretical motivations for including or excluding zero cases of deviation in 2014k, 2016a, and 2016e, i.e. whether it depends on the type of question the authors want to answer:

- exploring HOW WELL the non-native speakers did;

- exploring HOW MUCH the non-native speakers deviate.

I don't know whether I'm missing something here, but if someone could help me understand the difference in the operationalisation of the response variable, I would appreciate it.

Thank you very much in advance!

Apr 27, 2017, 7:37:54 AM

to StatForLing with R

You forgot one: ;-)

Heller, Bernaisch, & Gries (2017)

Method: random forest

Dependent variable in RF2: all the predicted numerical deviations (VarietySpecificity), including cases of no deviation (0s)

Plus, Benedikt Heller and Nicholas Lester are working on these things (separately), genitive alternation for the former, that complementation the latter ...

You're not missing anything: this is a question whose exploration is still work in progress and one that I have discussed with many people already ... The issue is this: most of the time, the learners/indigenized variety speakers make most choices in a nativelike fashion, meaning that often >=80% of the DEV scores are 0. That of course makes it very tricky for any parametric approach such as regression to find good structure in the data, so my collaborators and I have been trying to find (a) good way(s) of dealing with this by playing around with the options you're mentioning:

Question 1: what's the depvar in the second regression/rf/...? Is it a) something binary, or b) something numeric (with the 0s), or c) something numeric (without the 0s)?

Question 2: if the answer to Question 1 is b) or c), what's the statistical method in R(F)2? Is it something that we believed/hoped can handle the very skewed distribution arising from the majority of DEV values being 0 well or better than regressions (e.g. random forests), or not (so a regression might do the trick)?

In a sense, what that also amounts to is a slight change in focus:

- if the answer to Question 1 is b), you're asking 'what determines how much the learners/indigenized variety speakers differ from the native speakers?', and the answer can be 'not at all', namely when a DEV value is 0;

- if the answer to Question 1 is c), you're asking 'what determines how much the learners/indigenized variety speakers differ from the native speakers when they deviate?'
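For concreteness, the three candidate dependent variables could be derived from the R1 output along these lines in R; the numbers and object names below are invented for illustration, not taken from any of the papers:

```r
# Hypothetical R1 output for five NNS observations (invented values):
# p_obs = probability that R1 (fitted on the NS data) assigns to the
#         choice the NNS actually made; 0.5 is the decision threshold
p_obs <- c(0.92, 0.31, 0.50, 0.74, 0.08)

correct  <- p_obs >= 0.5           # a) something binary: nativelike choice?
dev_all  <- p_obs - 0.5            # b) something numeric, with the 0s
dev_only <- dev_all[dev_all != 0]  # c) something numeric, without the 0s
```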

At least that's how we always ended up talking about this question when it reliably came up in our discussions. Hope that helps, but I don't yet have an ideal answer to this question, which is why we've been exploring different ways. If you have a better answer, I'd be more than happy to learn what it is because, like I said, it's something that comes up every time you don't do R(F)2 on something binary.

STG

--

Stefan Th. Gries

----------------------------------

Univ. of California, Santa Barbara

http://tinyurl.com/stgries

----------------------------------


Apr 27, 2017, 8:44:28 AM

to StatForLing with R

Thank you very much for making things clearer, Stefan.

Including 0s in the response is indeed a crucial point because, as you pointed out, there may be many 0s, which can affect the fit. Actually, I tried both approaches myself, and including cases of no deviation could change the amount of variation explained (R^2) drastically, e.g. to far below 0.5. Then I don't know whether the deviations are worth exploring in such cases. I wonder whether using true deviations, instead of setting cases where the NNS choice equals the NS choice to 0, wouldn't be at least as informative as merely saying whether the NNS make the same choice or not.

My humble opinion is that if you are just asking what determines how much the learners/indigenized variety speakers differ from the native speakers, you should use actual values of deviation from the start instead of setting cases of no deviation to 0, because you are assuming that the preference for one alternate construction over the other is probabilistic, not binary.


Apr 27, 2017, 9:56:28 AM

to StatForLing with R

The problem, then, would be that if you choose to use actual probability values, you couldn't use 0.5 as the threshold value in cases where the NNS make the same choice as the NS would in the same context. An alternative would be the following:

1) fit a first model on the NS data;

2) apply this model to the NNS data and store the predicted probabilities in a vector P1, one per observation;

3) fit a second model on the NNS data **with exactly the same parameters as in the first model**, since we are not interested in the coefficients of the second model, and store the predicted probabilities in a vector P2, one per observation;

4) compute the DIFF value P1 - P2 and explore how much the NNS deviate from the NS.

This only applies when you assume a probabilistic deviation between NNS and NS.
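A minimal R sketch of those four steps on simulated data; the data frames, predictor names, and the simple glm() specification are hypothetical placeholders rather than anyone's actual analysis:

```r
set.seed(1)
# Simulated stand-ins for the native-speaker and learner data sets
ns_data         <- data.frame(pred1 = rnorm(100), pred2 = rnorm(100))
ns_data$choice  <- rbinom(100, 1, plogis(ns_data$pred1 - ns_data$pred2))
nns_data        <- data.frame(pred1 = rnorm(100), pred2 = rnorm(100))
nns_data$choice <- rbinom(100, 1, plogis(0.5 * nns_data$pred1))

# 1) fit a first model on the NS data
m_ns <- glm(choice ~ pred1 + pred2, data = ns_data, family = binomial)
# 2) apply it to the NNS data; P1 holds the predicted probabilities
P1 <- predict(m_ns, newdata = nns_data, type = "response")
# 3) fit a second model on the NNS data with the same parameters
m_nns <- glm(choice ~ pred1 + pred2, data = nns_data, family = binomial)
P2 <- predict(m_nns, type = "response")
# 4) DIFF scores, to be explored in a follow-up analysis
DIFF <- P1 - P2
```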

What do you think?

Apr 27, 2017, 10:54:06 AM

to StatForLing with R

I see what you mean ...

> [...] because you are assuming that the preference of one alternate construction over the other is probabilistic, not binary.

The preference is, but the actual production is binary: speakers produce, say, A rather than B regardless of whether they arrived at A with 60% or with 90% probability. But again, I am not yet sure myself what the best way is, so I'm not arguing against you here!

Apr 27, 2017, 10:55:12 AM

to StatForLing with R

> 1) fit a first model on the NS data;

> 2) apply this model to the NNS data and store the predicted probabilities in a vector P1, one per observation;

> 3) fit a second model on the NNS data with exactly the same parameters as in the first model, since we are not interested in the second model's fit, and store the predicted probabilities in a vector P2, one per observation;

> 4) compute the DIFF value P1 - P2 and explore how much the NNS deviate from the NS.

> This only applies when you assume a probabilistic deviation between NNS and NS. What do you think?

Questions:

- would step 3 ("with exactly the same parameters") still be possible if there are learners with multiple L1s?

- would step 4 ("explore how much") imply a third model?

Apr 27, 2017, 11:19:08 AM

to StatForLing with R

- First question: I have never worked on SLA and FLA data, so I can't seem to figure out how the several L1s can be operationalized, but I guess that using exactly the same parameters is not always feasible. Now the question is: can we use probabilities of the second model if that model is based on different interactions and with other parameters added/removed?

- Second question: yes, step 4 implies fitting a linear regression or a random forest on the DIFF scores.

Mentioning random forests, I forgot that we can't decide on the important independent variables beforehand since it is, as its name says, a random procedure.

Apr 27, 2017, 1:21:08 PM

to StatForLing with R

> - First question: I have never worked on SLA and FLA data, so I can't seem to figure out how the several L1s can be operationalized, but I guess that using exactly the same parameters is not always feasible. Now the question is: can we use probabilities of the second model if that model is based on different interactions and with other parameters added/removed?

My question entirely ;-)

> - Second question: yes, step 4 implies fitting a linear regression or a random forest on the DIFF scores.

Ok.

> Mentioning random forests, I forgot that we can't decide on the important independent variables beforehand since it is, as its name says, a random procedure.

Not sure that's a problem: random forests sample predictors anyway.

Apr 27, 2017, 1:48:19 PM

to StatForLing with R

>> - First question: I have never worked on SLA and FLA data, so I can't seem to figure out how the several L1s can be operationalized, but I guess that using exactly the same parameters is not always feasible. Now the question is: can we use probabilities of the second model if that model is based on different interactions and with other parameters added/removed?
>
> My question entirely ;-)

Do you mean I rephrased your question or do you mean it's still a matter of debate? In the first case, my apologies. I'm still wondering whether it's good practice to compare two models that are not fit with exactly the same predictors and/or interactions.

> - Second question: yes, step 4 implies fitting a linear regression or a random forest on the DIFF scores.

Ok.

> Mentioning random forests, I forgot that we can't decide on the important independent variables beforehand since it is, as its name says, a random procedure.

Not sure that's a problem: random forests sample predictors anyway.

Agreed, but what if I run a random forest RF1 on the data of the 'model' language, apply the model to the 'replica' language, then run a random forest RF2 on the data of the 'replica' language, and RF1 attributes more importance to predictors A and B while RF2 does not? It seems to me that I end up with exactly the same question as above, i.e. can we compute a DIFF score between two predicted probabilities based on two models that are not comparable with regard to variable importance? BTW, I'm relatively new to random forests, so I might sound dumb with this question.

Apr 27, 2017, 4:34:01 PM

to StatForLing with R

>>> - First question: I have never worked on SLA and FLA data, so I can't seem to figure out how the several L1s can be operationalized, but I guess that using exactly the same parameters is not always feasible. Now the question is: can we use probabilities of the second model if that model is based on different interactions and with other parameters added/removed?

>> My question entirely ;-)

> Do you mean I rephrased your question or do you mean it's still a matter of debate?

The latter :-) Off the top of my head, I'm not sure how one should proceed there, if possible.

> Agreed, but what if I run a random forest RF1 on the data of the 'model' language, apply the model to the 'replica' language, then run a random forest RF2 on the data of the 'replica' language and... RF1 imputes more importance to predictors A and B, while RF2 does not? It seems to me that I come up with exactly the same question as above, i.e. can we compute a DIFF score between two predicted probabilities based on two models which are not comparable with regard to variable importance?

May 5, 2017, 4:01:34 AM

to StatForLing with R

I found an interesting article by J. Scott Long (2009), who claims that predicted probabilities are unaffected by residual variation, so the assumption of equal regression coefficients for some variables shouldn't be a concern when comparing probabilities.

Moreover, the author proposes a follow-up to his flavour of the MuPDAR approach (if we were to call it so), namely computing a z-score of P(group 1) - P(group 2) to see whether the deviation is significant.

I hope this is inspiring work.

Best,

C.B.

May 5, 2017, 4:57:32 AM

to StatForLing with R

Thanks for letting us know!

May 26, 2017, 6:36:49 AM

to StatForLing with R

Hello,

Maybe some food for thought on how and why MuPDAR(F) approaches should be improved:

- In the case of a binary choice between linguistic alternatives, choosing a cut-off probability other than 0.5 when computing deviations, e.g. by looking at the threshold in a ROC curve;

- Can the cut-off of 0.5 be applied in case of polytomous outcome? I would be tempted to set the threshold to 0.333 (3-class problem) and 0.25 (4-class problem), but what if the probabilities are a: 0.55, b: 0.35, c: 0.1 (3-class problem)?

- What if one variety of a language only uses 2 alternate cxs out of the 3 cxs available in the reference language?

I'm trying myself to deal with these issues, but I hope other souls are interested in these questions too.

Best,

C.B.

May 26, 2017, 1:50:07 PM

to StatForLing with R

Quick stuff off the top of my head ...

> - In the case of a binary choice between linguistic alternatives, choosing a cut-off probability other than 0.5 when computing deviations, e.g. by looking at the threshold in a ROC curve;

Definitely a possibility, but I think one needs to be careful there and 'motivate' cut-off points other than the default (so as to avoid getting accused of 'cheating'/good-results-hunting at all costs).

> - Can the cut-off of 0.5 be applied in case of polytomous outcome?

You mean in R(F)1? If yes, that would be quite atypical.

> I would be tempted to set the threshold to 0.333 (3-class problem) and 0.25 (4-class problem), but what if the probabilities are a: 0.55, b: 0.35, c: 0.1 (3-class problem)?

No, that's not how it's usually done, see SFLWR2 on multinomial regression modeling.

May 26, 2017, 3:52:57 PM

to StatForLing with R

On Friday, May 26, 2017 at 7:50:07 PM UTC+2, Stefan Th. Gries wrote:

> Quick stuff off the top of my head ...
>
>> Maybe some food for thought on how and why MuPDAR(F) approaches should be improved:
>
>> - In the case of a binary choice between linguistic alternatives, choosing a cut-off probability other than 0.5 when computing deviations, e.g. by looking at the threshold in a ROC curve;
>
> Definitely a possibility, but I think one needs to be careful there and 'motivate' cut-off points other than the default (so as to avoid getting accused of 'cheating'/good-results-hunting at all costs).

Agreed.

>> Can the cut-off of 0.5 be applied in case of polytomous outcome?
>
> You mean in R(F)1? If yes, that would be quite atypical.

I mean when we compute Dev (P - threshold).

>> I would be tempted to set the threshold to 0.333 (3-class problem) and 0.25 (4-class problem), but what if the probabilities are a: 0.55, b: 0.35, c: 0.1 (3-class problem)?
>
> No, that's not how it's usually done, see SFLWR2 on multinomial regression modeling.

I think I've tackled the relevant material in SFLWR2, and from much reading elsewhere it seems that the threshold is the probability of the most likely outcome. But I read somewhere that the 0.5 cut-off could be kept in 3+-class cases.
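In R terms, that practice amounts to taking the most likely class and its probability rather than comparing each class probability to a fixed cut-off; a toy illustration with the three-class example from above:

```r
# Predicted class probabilities for one observation (3-class problem)
p <- c(a = 0.55, b = 0.35, c = 0.10)

predicted <- names(which.max(p))  # most likely outcome: "a"
p_ref     <- max(p)               # its probability, 0.55 (not 1/3 or 0.5)
```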

Nov 6, 2017, 5:52:41 AM

to StatForLing with R

I explored another option for the computation of DEV scores:

Whatever type of response variable we have (dichotomous or polytomous), one decides on a reference response among the different options and sticks to it throughout the whole process. So this would be a one-versus-all analysis.

Taking an example from LCR:

- if the learner made the same choice (say A) as a native speaker in the same situation, we subtract the probability of that choice occurring for the native speaker from the probability of its occurrence for the learner;

- if the learner made a different choice (B), we subtract the probability of A occurring from the probability of A NOT occurring.

If one wants the probabilities to be 'weighted' by the other events, one can divide the given probability of a chosen event by its complement probability (e.g., if A has a 0.95 probability, one can divide it by 1 - 0.95 = 0.05; so the score would be 19). However, to get symmetrical scores around zero, we should log-transform the scores to get negative as well as positive values.

- if A has 95% chance of occurring (A is favoured) --> ln(0.95/0.05) = ln(19) = 2.94

- if A has 5% chance of occurring (B is favoured) --> ln(0.05/0.95) = ln(0.053) = -2.94 (so we have a nice property here)

If one still wants to compute DEV scores by subtracting a threshold from the probability of a reference outcome, we can extend the present technique as follows:

- a = probability of an outcome in native speaker;

- 1 - a = probability of another outcome in native speaker;

- a' = probability of the same outcome in learner;

- 1 - a' = probability of another outcome in learner.

If the learner made the same choice as a native speaker would have made in the same situation, DEV = ln(a/(1-a)) - ln(a'/(1-a')) (which equals 0 if a and a' have the exact same value)

If the learner made a different choice compared to the native speaker, DEV = ln(a/(1-a)) - ln((1-a')/a') (which is negative or positive, depending on the choice)

Visualizing such DEV scores in a boxplot (DEV scores on y axis, morphological variant on x axis) allows one to uncover overly influential values (outliers) of DEV worthy of further exploration.
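A short R sketch of the log-odds scores just described, with invented probabilities; `a` and `aprime` stand for the a and a' defined above:

```r
a      <- 0.95  # probability of the reference outcome in the NS model
aprime <- 0.95  # probability of the same outcome in the learner model

# learner made the same choice as the native speaker:
dev_same <- log(a / (1 - a)) - log(aprime / (1 - aprime))  # 0 when a == aprime

# learner made a different choice:
dev_diff <- log(a / (1 - a)) - log((1 - aprime) / aprime)

round(log(0.95 / 0.05), 2)  # 2.94, as in the example above
round(log(0.05 / 0.95), 2)  # -2.94, its mirror image
```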

Nov 6, 2017, 7:23:38 PM

to StatForLing with R

If you're saying that

DEV =

predicted probability of choice X by NNS (from a regression on the NNS data)

minus

probability of choice X by NS (from a regression on the NS data)

then that sounds interesting, but I wouldn't do the boxplot analysis you're suggesting because it's monofactorial. Rather, I'd want some regression-like method that can handle multiple predictors of the DEV scores.

Cheers,


Nov 6, 2017, 8:20:48 PM

to statforli...@googlegroups.com

I think we have two options:

- we either use the same formula throughout for all computations of DEV (e.g. P(X_a) - P(X_b) or one of its log-transformations, as suggested in my previous mail);

- or else we apply a different formula which will produce more extreme results between -1 and 1 if the choice made by the learner is different (e.g. P(X_a) - P(1 - X_b) or one of its log-transformations).

One should test both options when trying to predict the deviations (R2), because they yield different R² and adjusted R² values (I experienced this myself, and I would favor the first option in my case).

I think J. Grafmiller, who is more familiar than I am with the MuPDAR techniques, used box plots too to explore the deviations in his paper entitled "Deviant diachrony: Exploring new methods for analyzing language change". But such a tool should certainly be used only for the visual inspection of the deviations, not if we are to analyze the effects of the independent variables. Note, in this respect, how tricky it becomes when we use random forests instead of logistic models: the probabilities returned by a random forest are in principle computed through the interactions of the independent variables, in a way that maximizes the significance of the splits at each node. If no coefficients are returned, the only option left, to the best of my knowledge, is to use the most important variables and their interactions according to the variable importance scores returned by varimp.


There’s a lot to experiment, and I hope other researchers will find their way too into these new methods 😊


**From:** Stefan Th. Gries **Sent:** Tuesday, November 7, 2017, 01:23 **To:** StatForLing with R **Subject:** Re: [StatForLing with R] Re: R2 in MuPDAR(F) approach

