> - First question: I have never worked with SLA and FLA data, so I can't quite see how the several L1s can be operationalized, but I guess that using exactly the same parameters is not always feasible. So the question is: can we use the probabilities from the second model if that model is based on different interactions and has other parameters added or removed?
My question entirely ;-)
> - Second question: yes, step 4 implies fitting a linear regression or a random forest on the DIFF scores.
Ok.
> Mentioning random forests, I forgot that we can't decide on the important independent variables beforehand since it is, as its name says, a random procedure.
Not sure that's a problem: random forests sample predictors anyway.
Quick stuff off the top of my head ...
> Maybe some food for thought on how and why MuPDAR(F) approaches should be improved:
> - In the case of a binary choice between linguistic alternatives, choosing a cut-off probability other than 0.5 when computing deviations, e.g. by looking at the threshold in a ROC curve;
> - Definitely a possibility, but I think one needs to be careful there and 'motivate' cut-off points other than the default (so as to avoid getting accused of 'cheating'/good-results-hunting at all costs).
> - Can the cut-off of 0.5 be applied in case of polytomous outcome?
> - You mean in R(F)1? If yes, that would be quite atypical.
> I would be tempted to set the threshold to 0.333 (3-class problem) and 0.25 (4-class problem), but what if the probabilities are a: 0.55, b: 0.35, c: 0.1 (3-class problem)?
> - No, that's not how it's usually done, see SFLWR2 on multinomial regression modeling.
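To make the two cut-off points above a bit more concrete, here is a rough sketch (in Python rather than R, and with invented toy numbers, so nothing here comes from real MuPDAR data). It assumes that "the threshold in a ROC curve" means the cut-off maximizing Youden's J (sensitivity + specificity - 1), which is only one of several possible criteria, and that in the polytomous case the predicted class is simply the one with the highest predicted probability. The function names `youden_cutoff` and `predicted_class` are hypothetical helpers for illustration.

```python
# Toy illustration only: all probabilities and labels below are invented.

def youden_cutoff(probs, labels):
    """Return (cut-off, J) for the cut-off maximizing Youden's J.

    probs  : predicted probabilities of the positive class
    labels : observed classes coded 1 (positive) / 0 (negative)
    An observation is classified as positive when its probability >= cut-off.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    best_cut, best_j = 0.5, -1.0
    for cut in sorted(set(probs)):  # every observed probability is a candidate
        tp = sum(1 for p, y in zip(probs, labels) if p >= cut and y == 1)
        tn = sum(1 for p, y in zip(probs, labels) if p < cut and y == 0)
        j = tp / pos + tn / neg - 1  # sensitivity + specificity - 1
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j


def predicted_class(class_probs):
    """Polytomous case: pick the class with the highest predicted probability."""
    return max(class_probs, key=class_probs.get)


# Binary case: a data-driven cut-off instead of the default 0.5.
probs = [0.10, 0.30, 0.35, 0.40, 0.45, 0.60, 0.70, 0.90]
labels = [0, 0, 1, 0, 1, 1, 1, 1]
cut, j = youden_cutoff(probs, labels)
print("cut-off:", cut, "Youden's J:", round(j, 3))

# Polytomous case: no fixed threshold such as 0.333 is needed.
print(predicted_class({"a": 0.55, "b": 0.35, "c": 0.10}))  # a
print(predicted_class({"a": 0.40, "b": 0.35, "c": 0.25}))  # a
```

With the highest-probability rule, the 0.55 / 0.35 / 0.1 case is unproblematic: class a is predicted, and the same holds when no class reaches 0.5. Whatever criterion is used for the binary cut-off, it would of course have to be motivated independently of the results it produces.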
I think we have two options:
One should test both options when trying to predict the deviations (R2) because they yield different R² and R²adj (I experienced this myself and I would favor the first option in my case).
I think J. Grafmiller, who is more familiar with the MuPDAR techniques than I am, also used box plots to explore the deviations in his paper entitled “Deviant diachrony: Exploring new methods for analyzing language change”. But such a tool should certainly be used only for visual inspection of the deviations, not for analyzing the effects of the independent variables. Note, in this respect, how tricky things become when we use random forests instead of logistic models. The probabilities returned by a random forest are in principle computed through the interactions of the independent variables in a way that maximizes the significance of the splits at each node. Since no coefficients are returned, the only option left, to the best of my knowledge, is to use the most important variables and their interactions, according to the variable importance scores returned by varimp.
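For readers less used to importance scores, the general idea of a permutation-style variable importance can be sketched as follows. This is a minimal Python illustration under strong assumptions: the "model" is a hypothetical hand-written rule standing in for a fitted forest, the data are invented, and the helper names are mine. It is not a reimplementation of varimp, which works on the fitted trees themselves.

```python
import random

# Toy illustration only: the data and the "model" below are invented.

def accuracy(predict, rows, labels):
    """Share of rows for which the model's prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, col, n_repeats=20, seed=1):
    """Mean drop in accuracy after randomly shuffling one predictor column."""
    rng = random.Random(seed)
    base = accuracy(predict, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[col] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with the shuffled value in position `col`.
        permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled)]
        drops.append(base - accuracy(predict, permuted, labels))
    return sum(drops) / n_repeats


# Hypothetical stand-in for a fitted model: it only ever looks at predictor 0.
def toy_model(row):
    return int(row[0] > 0.5)


rows = [(0.1, 5.0), (0.2, 1.0), (0.3, 4.0), (0.4, 2.0),
        (0.6, 3.0), (0.7, 5.0), (0.8, 1.0), (0.9, 2.0)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

imp0 = permutation_importance(toy_model, rows, labels, col=0)
imp1 = permutation_importance(toy_model, rows, labels, col=1)
print("importance of predictor 0:", imp0)
print("importance of predictor 1:", imp1)
```

A predictor the model never uses gets an importance of exactly zero, whereas shuffling a predictor the model relies on lowers accuracy. Such scores rank predictors but, unlike regression coefficients, say nothing about the direction of an effect, which is why interpreting the deviations becomes harder with forests.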
There’s a lot to experiment with, and I hope other researchers will find their way into these new methods too 😊
From: Stefan Th. Gries
Sent: Tuesday, 7 November 2017 01:23
To: StatForLing with R
Subject: Re: [StatForLing with R] Re: R2 in MuPDAR(F) approach