Automatic model selection?

Ted Manasa

May 14, 2013, 11:59:25 PM

to wizard...@googlegroups.com

Evan,

I was talking to my researcher friend today, and she told me that she uses an R package that runs through all possible combinations of variables in a model and determines which one(s) have the highest relative quality. It uses the AIC (Akaike information criterion) to pick the best model based on a trade-off between model complexity and goodness of fit. http://en.wikipedia.org/wiki/AIC_(statistics)

Think Wizard could do something like that?

THAT could be AMAZING...

Then I wouldn't have to manually permute through my 58 variables to see which combination makes for the best model.

Just a thought!

Ted

Evan Miller

May 15, 2013, 12:29:04 AM

to wizard...@googlegroups.com

Hi Ted,

Most social scientists would abhor this, since it invalidates all of your p-values. ("Data mining" is a dirty word in academia.) But: I am not necessarily opposed. Note though that searching through 58 variables requires estimating 2^58 models! Not quite a particles-in-the-universe number, but certainly prohibitive.
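To put a rough number on that, here is a small Python sketch of the arithmetic. The fitting rate of one million models per second is purely an assumption for illustration, not a measured figure for Wizard or any other tool, and the function names are made up for this example.

```python
# Count the candidate models in an exhaustive subset search.
# Each of the k variables is either in or out, giving 2**k subsets.

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def subset_count(k: int) -> int:
    """Number of distinct variable subsets, including the empty model."""
    return 2 ** k


def years_to_fit_all(k: int, models_per_second: float) -> float:
    """Years needed to fit every subset at an assumed (hypothetical) rate."""
    return subset_count(k) / models_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(subset_count(58))           # 288230376151711744
    print(years_to_fit_all(58, 1e6))  # on the order of thousands of years
```

Even at that optimistic assumed rate, the full search runs for millennia, which is why a greedy hint is the more practical route.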

A smaller feature I have been considering would suggest the next variable to include or exclude based on the current model. So not a full permutation search, but a little hint about what to try next. Tentative name is "The Doodlebug" in honor of the art of divination. In the Summary view, this feature would also suggest interesting correlations to examine, and apply Bonferroni correction if desired (to restore balance to the p-values).
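As a rough illustration of the "suggest the next variable" idea, here is a minimal Python sketch. It is not how Wizard works or would work; it assumes ordinary least squares with an intercept, uses AIC (up to an additive constant) as the score, and all names (`solve`, `ols_aic`, `suggest_next`) are hypothetical.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols_aic(columns, y):
    """AIC of an OLS fit with intercept: n*ln(RSS/n) + 2p, up to an additive constant."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(xtx, xty)
    rss = sum((y[i] - sum(X[i][a] * beta[a] for a in range(p))) ** 2 for i in range(n))
    rss = max(rss, 1e-300)  # guard against log(0) on a perfect fit
    return n * math.log(rss / n) + 2 * p


def suggest_next(data, y, current):
    """Return (action, variable, aic) for the single add/drop that lowers AIC most, or None."""
    base = ols_aic([data[v] for v in current], y)
    best = None
    for v in data:
        if v in current:
            trial, action = [u for u in current if u != v], "drop"
        else:
            trial, action = current + [v], "add"
        aic = ols_aic([data[u] for u in trial], y)
        if aic < base and (best is None or aic < best[2]):
            best = (action, v, aic)
    return best


if __name__ == "__main__":
    # Made-up data: y is roughly 3 * x1, while x2 just alternates.
    data = {"x1": [1, 2, 3, 4, 5, 6, 7, 8], "x2": [2, 1, 2, 1, 2, 1, 2, 1]}
    y = [3.1, 5.9, 9.2, 11.8, 15.1, 18.0, 20.9, 24.2]
    print(suggest_next(data, y, []))  # suggests adding x1
```

Each hint costs one fit per variable, so 58 variables means 58 fits per step rather than 2^58 in total. The trade-off is that a greedy search can miss the best overall subset.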

Evan

--
Evan Miller
http://www.evanmiller.org/

Ted Manasa

May 15, 2013, 10:08:00 AM

to wizard...@googlegroups.com

Suggested variable inclusion/excision sounds wonderful. I see why 2^58 wouldn't exactly make for a down-to-earth experience.

So are you thinking it would work kind of like this?

First, Bonferroni suggests initial variables for the model. (Will have to look that one up; only familiar with naive Bayes.)

Second, we build a model from the variables we select.

Third, Doodlebug suggests which variables to include or exclude that could improve the initial model.

Something like that would be phenomenal.

|Ted Manasa

|Find the answer outside your ocean.

Evan Miller

May 15, 2013, 10:28:54 AM

to wizard...@googlegroups.com

That is the basic idea. You'd start by searching for "Top Correlations" in the data set. Because this is bound to turn up "significant" correlations that are strictly due to chance, Bonferroni is a way to adjust the p-values to account for the fact that you're testing many hypotheses all at once.

https://en.wikipedia.org/wiki/Bonferroni_correction

A slightly better version is called the Šidák correction:

https://en.wikipedia.org/wiki/Bonferroni_correction#.C5.A0id.C3.A1k_correction
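For reference, both corrections fit in a few lines. This is a minimal Python sketch of the standard formulas (adjusted p-values rather than adjusted thresholds); the function names are just for illustration.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: min(1, m * p) for m simultaneous tests."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]


def sidak(p_values):
    """Sidak-adjusted p-values: 1 - (1 - p)**m, exact when the tests are independent."""
    m = len(p_values)
    return [1.0 - (1.0 - p) ** m for p in p_values]


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.5]  # made-up p-values from three tests
    print(bonferroni(raw))
    print(sidak(raw))
```

The Šidák-adjusted value is never larger than the Bonferroni one, which is the sense in which it is slightly less conservative. With 40 variables there are 40 * 39 / 2 = 780 pairwise correlations, so m gets large quickly.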

I think the described workflow will leverage Wizard's unique strengths. The statistical routines are optimized to work almost instantly with a million or so rows, so it will be feasible to do this kind of data-mining on, say, 40 variables and 50,000 rows and still return in the blink of an eye.

Evan

Ted Manasa

May 15, 2013, 11:10:20 AM

to wizard...@googlegroups.com

WOW. That would make Wizard a proactive data interpretation tool. That kind of puts it into a higher category of software.

|Ted Manasa

|Find the answer outside your ocean.

jurgen.vo...@gmail.com

Apr 25, 2014, 10:34:02 AM

to wizard...@googlegroups.com

What model selection functionality exists in Wizard at present? The best algorithm I have been able to come up with is an ad-hoc stepwise routine where I remove variables, starting with the highest p-value, and observe whether or not there was an increase in adjusted R-squared, and/or what effect it has on the p-values of the remaining coefficients. Is there an easy way to perform partial F-tests that I'm missing?
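In case it helps anyone doing this by hand, both quantities can be computed from the residual sums of squares of the two fits. This is a minimal Python sketch of the standard formulas, not a Wizard feature; the function names and the numbers in the example are made up.

```python
def adjusted_r2(rss, tss, n, p):
    """Adjusted R-squared for a model with p predictors plus an intercept."""
    return 1.0 - (rss / (n - p - 1)) / (tss / (n - 1))


def partial_f(rss_reduced, rss_full, n, p_full, q):
    """F statistic for dropping q predictors from a full model.

    The full model has p_full predictors plus an intercept; compare the
    result against an F(q, n - p_full - 1) distribution.
    """
    return ((rss_reduced - rss_full) / q) / (rss_full / (n - p_full - 1))


if __name__ == "__main__":
    # Hypothetical fits on n = 100 rows: 5 predictors vs. the same model minus 2.
    n, tss = 100, 500.0
    print(partial_f(rss_reduced=210.0, rss_full=200.0, n=n, p_full=5, q=2))
    print(adjusted_r2(200.0, tss, n, 5), adjusted_r2(210.0, tss, n, 3))
```

The two checks are linked: dropping a set of variables raises adjusted R-squared exactly when the partial F statistic for that set is below 1. Getting a p-value from the statistic needs the F distribution, which is left out here to keep the sketch dependency-free.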

Evan Miller

Apr 25, 2014, 12:11:33 PM

to wizard...@googlegroups.com

I'm afraid there's not much for model selection at present. I suppose reporting various Information Criterion values would be useful here.
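For a least-squares model, the common information criteria can be computed from the residual sum of squares. Here is a minimal Python sketch using the usual Gaussian forms, each up to an additive constant shared by all models fit to the same data; the function name and the example numbers are assumptions for illustration, not anything Wizard reports.

```python
import math


def information_criteria(rss, n, k):
    """AIC, AICc and BIC for a least-squares fit with k estimated parameters.

    Values are only meaningful for comparing models fit to the same n rows;
    lower is better.
    """
    fit_term = n * math.log(rss / n)
    aic = fit_term + 2 * k
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)  # small-sample correction
    bic = fit_term + k * math.log(n)
    return {"AIC": aic, "AICc": aicc, "BIC": bic}


if __name__ == "__main__":
    # Hypothetical fit: 100 rows, 6 parameters (5 slopes + intercept), RSS = 200.
    print(information_criteria(rss=200.0, n=100, k=6))
```

BIC penalizes each extra parameter by ln(n) rather than 2, so for data sets with more than about 8 rows it favors smaller models than AIC does.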