Variable Selection

Bill Raynor

Feb 28, 2013, 11:57:37 AM

to tetrad-us...@googlegroups.com

When dealing with marketing research data, it is common to see boatloads of similar questions. If you dump those into TETRAD, you get the expected plate-of-spaghetti networks. Are there any common methods for doing variable selection for causal models?

The researchers often see each question as a precious snowflake and resist the idea of doing something along the lines of an item analysis and selecting a reasonable subset. One approach I've fiddled with is fitting a sequence of models with nested significance levels and watching for disconnected subsets, which can be eliminated (for example, setting the p-value threshold very small and then slowly relaxing it). Alternatively, for items that can be represented as a correlation matrix, regularize it using, say, Higham's approach, then get the partial correlation matrix and start eliminating based on the values of the partial correlations. This can be extended to purely ordinal responses using either Kendall's definition of a partial tau or Quade's index of matched correlation.
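
A minimal Python sketch of the partial-correlation idea above, assuming NumPy. The repair step here is plain eigenvalue clipping, used as a simple stand-in and not Higham's nearest-correlation-matrix algorithm. The function names and the 0.1 cutoff are hypothetical choices for illustration only.

```python
import numpy as np

def clip_to_correlation(R, eps=1e-6):
    """Crude repair of an indefinite correlation matrix: clip eigenvalues
    at eps and rescale to unit diagonal. (Stand-in, NOT Higham's method.)"""
    R = np.asarray(R, dtype=float)
    R = (R + R.T) / 2.0                      # enforce symmetry
    w, V = np.linalg.eigh(R)
    A = (V * np.maximum(w, eps)) @ V.T       # V diag(clipped w) V^T
    d = np.sqrt(np.diag(A))
    return A / np.outer(d, d)

def partial_correlations(R):
    """Partial correlation of each pair given all other variables,
    read off the inverse (precision) matrix."""
    P = np.linalg.inv(np.asarray(R, dtype=float))
    d = np.sqrt(np.diag(P))
    pc = -P / np.outer(d, d)
    np.fill_diagonal(pc, 1.0)
    return pc

def weakly_connected(R, cutoff=0.1):
    """Indices whose largest absolute partial correlation with any other
    variable is below the (hypothetical) cutoff: elimination candidates."""
    pc = np.abs(partial_correlations(R))
    np.fill_diagonal(pc, 0.0)
    return [j for j in range(pc.shape[0]) if pc[j].max() < cutoff]
```

Relaxing `cutoff` step by step mirrors the idea of slowly relaxing a significance level and watching which variables stay disconnected.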

Any other suggestions?

Bill

Joseph Ramsey

Mar 1, 2013, 3:28:16 PM

to tetrad-us...@googlegroups.com

I feel I should refer you to someone who has tried to answer that
question. I'll send your message on.

Joe

--
Joseph D. Ramsey
Special Faculty and Director of Research Computing
Department of Philosophy
135 Baker Hall
Carnegie Mellon University
Pittsburgh, PA 15213

jsph....@gmail.com
Office: (412) 268-8063
http://www.andrew.cmu.edu/user/jdramsey

cg...@andrew.cmu.edu

Mar 1, 2013, 3:47:39 PM

to tetrad-us...@googlegroups.com

I am amazed that Joe passed on this question, since he pioneered application of one
of the methods: Markov blanket search. It's in the TETRAD search box.

If all of your variables are continuous then you can use it as implemented.

Clark

Bill Raynor

Mar 3, 2013, 5:53:50 PM

to tetrad-us...@googlegroups.com

Hello Clark & Joe,

Thanks for your time. MBFS only takes one target variable. I have 60. I'm looking for a way to prune that to a smaller number, say, 20. In a different context (completely linked Gaussians), I'd be looking for something like Jolliffe's or McCabe's principal variables algorithm. Those do not use a causal network of the data and only look at dependencies.

McCabe's Principal Variables technique works by progressively selecting variables such that the generalized variance of the residual set is smaller and smaller. The selected set is then a reduced set that represents the full set. An alternate approach would be variable selection through clustering, where one would generate a hierarchical clustering of the variables and pick a variable from each cluster. Neither of those takes advantage of the causal graphs of the data.
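
A rough Python sketch of the residual generalized-variance idea, assuming NumPy and a correlation (or covariance) matrix `R` as input. This is a greedy forward approximation written for illustration, not McCabe's exact procedure, and the function names are hypothetical. Note that with a correlation matrix every single variable ties at the first step, so the first pick is arbitrary.

```python
import numpy as np

def residual_logdet(R, selected):
    """Log generalized variance of the unselected variables after
    conditioning on the selected ones (Schur complement of R)."""
    rest = [j for j in range(R.shape[0]) if j not in selected]
    if not rest:
        return 0.0
    R11 = R[np.ix_(selected, selected)]
    R12 = R[np.ix_(selected, rest)]
    R22 = R[np.ix_(rest, rest)]
    resid = R22 - R12.T @ np.linalg.solve(R11, R12)
    sign, logdet = np.linalg.slogdet(resid)
    return logdet if sign > 0 else -np.inf

def greedy_principal_variables(R, k):
    """Greedily add the variable that most reduces the residual
    generalized variance, until k variables are selected."""
    R = np.asarray(R, dtype=float)
    selected = []
    for _ in range(k):
        candidates = [j for j in range(R.shape[0]) if j not in selected]
        best = min(candidates,
                   key=lambda j: residual_logdet(R, selected + [j]))
        selected.append(best)
    return selected
```

For two highly correlated items plus one unrelated item, asking for `k=2` keeps the unrelated item and one of the correlated pair, which is the kind of redundancy pruning described above.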

For example, consider a sub-graph containing discrete variables A, B, and C, where A -> B, A -> C, and B -> C. B is "contained by" A and C and is redundant. The above techniques would treat that case and the case where B is a shielded collider with no descendants identically. I would like to distinguish them.

Bill

cg...@andrew.cmu.edu

Mar 3, 2013, 9:05:43 PM

to tetrad-us...@googlegroups.com

Bill,

I need to know more about what you care about. For example, are there particular
variables among the 60 whose causes and effects you specifically care about? Perhaps
(no promises) I could help if I had a fuller description of the problem.

Clark

Bill Raynor

Mar 4, 2013, 12:15:08 AM

to tetrad-us...@googlegroups.com

Clark,

The data are from a large (n=1000), multi-block paired comparison study of everyday products. Each person used one pair out of 30 pairs (a balanced 6-item BIB). They were then asked to judge that pair on a large number of attributes (positive qualities) and problems (negative qualities). For the attributes, they were asked to indicate the better item of the pair and could also choose "No difference". Likewise, for each problem, they were asked whether it occurred with the first code only, the second code only, both codes, or neither code. That bivariate binary response is folded into a difference to match the structure of the first part (1st, none/both, 2nd). One of the attributes is an overall preference. This is the main response of interest, and the remaining variables are used as diagnostics. I have been using graphical models (Tetrad for search, Netica and/or Hugin for the Bayesian model) to model the interrelationships of these variables with a view to suggesting and evaluating interventions/manipulations on the products. (The "usual" approach is to compare marginals across the pairs, without regard for the associations among them.)

Since the individual responses are all ordinal, I usually use a polychoric or a partial Kendall correlation matrix as input to Tetrad for the search process, but switch to a fully discrete model in, say, Netica. From time to time Tetrad freezes on larger correlation matrices, or goes into a tight loop (never exiting from a PC search, and not producing a log). If I delete enough variables on an ad hoc basis, the problem usually goes away. I am looking for a better way to do that variable reduction.

Thanks for your time

Bill

Bill Raynor

Mar 4, 2013, 7:55:59 AM

to tetrad-us...@googlegroups.com

Clark,

If I wasn't clear, the "main variable" is the overall preference. However, the thrust of the analysis is not a simple prediction problem, which can be handled quite well by a Markov blanket or a classification tree. The thrust is to understand "what causes what?" At any given moment there are various subgroups jockeying for funds to change various product behaviors. Additionally, some of the responses are hierarchical, so the prediction problem becomes nested...

cg...@andrew.cmu.edu

Mar 4, 2013, 12:43:55 PM

to tetrad-us...@googlegroups.com

Just to make sure I understand, you eventually have sixty variables, each binary
(or ternary if indifference is allowed). Each variable is obtained from a preference
ordering (or partial ordering) of a pair of items.

Are items compared multiple times, that is, is A compared with B and B compared with
C? If so, then you very likely have some dependencies among variables that result
from transitivities of preferences. I don't know that that is any trouble, and from
what you wrote transitivities might be part of what you aim to find.

First thing I would do is examine whether there are variables that are outliers in
the sense that they are not associated with any others. Since the data are
categorical (or maybe ordinal), I would use a criterion that makes it fairly hard to
reject independence. Then, if I wanted to reduce the number of variables, I would
look for pairs whose associations with other variables are nearly the same. You can
make up a measure for that from chi-square or G-square values, e.g., for each pair,
the sum over all other variables of the differences between the two variables'
association measures (just a suggestion). For pairs whose associations are close in
this way, I would choose one and discard the other, repeating with different choices
if the analysis does not turn out well in the end. Then, hoping I got some
reduction from the 60 variables, I would run a Tetrad search over the variables.
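
One way the pairwise measure suggested above might be sketched in Python, assuming NumPy and a data array with one column per categorical variable. The Pearson chi-square statistic is computed by hand here. The use of absolute differences and the function names are assumptions made for illustration, not a prescribed method.

```python
import numpy as np
from itertools import combinations

def chi_square(x, y):
    """Pearson chi-square statistic for two categorical vectors."""
    xs, xi = np.unique(x, return_inverse=True)
    ys, yi = np.unique(y, return_inverse=True)
    table = np.zeros((len(xs), len(ys)))
    np.add.at(table, (xi, yi), 1)            # contingency table of counts
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return float(((table - expected) ** 2 / expected).sum())

def profile_distance(data, i, j):
    """Sum over all other variables k of |chi2(i, k) - chi2(j, k)|."""
    others = [k for k in range(data.shape[1]) if k not in (i, j)]
    return sum(abs(chi_square(data[:, i], data[:, k])
                   - chi_square(data[:, j], data[:, k])) for k in others)

def closest_pairs(data):
    """All variable pairs, most similar association profile first."""
    pairs = combinations(range(data.shape[1]), 2)
    return sorted(pairs, key=lambda ij: profile_distance(data, *ij))
```

Pairs at the head of `closest_pairs(data)` are candidates for keeping one member and dropping the other. Raw chi-square values depend on the number of categories, so a normalized measure may be preferable when variables have different numbers of levels.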

I don't put much stock in categorical clustering, but one could try K-modes; I
think, however, it requires pre-specification of the number of clusters and may be
sensitive to order. There is another algorithm, CD, published about eight years ago
by a guy at York University. I have never used it.

We would be very interested in seeing cases where tetrad gets broken, or logging
fails. If you are willing to send us examples we can see if there is something that
can be fixed.

Clark

Bill Raynor

Mar 9, 2013, 7:52:11 PM

to tetrad-us...@googlegroups.com

Clark & Joe,

Excuse the delay in replying; I ran into some time crunches.

To your points:

The experimental design.

The general layout is as follows:

There are k pairs of products to be evaluated (usually formed from a BIB or PBIB on a smaller # of different products). In the case at hand: 6 products in 15 pairs and two presentation orders, yielding 30 "treatments."
Each subject is randomized into one of the cells, so they only see at most two products.
Each subject uses the 1st product for a specified period, then switches to the second product for the same specified period.
After the end of this last time period, the subject is contacted and a questionnaire is administered.

The 1st question concerns which one they preferred (the first, the second, or no preference)
The 2nd series is a bunch of "which did you prefer for <insert some attribute or marketing fuzz>" questions, in a randomized order.
The 3rd series asks a bunch of "Did you have <a specific or general> problem with either product", also in a randomized order.

Typically the 2nd and 3rd sections ask at least 50-60 questions, but can go over 100. The current one has 103 attributes and problems.
The overall preference and attribute questions are ordinal, and the problems are bivariate binary. I treat ties (both/none) on each problem as intermediate (à la McNemar) and convert those into ordinal trinomials.
Since I am interested in increasing (decreasing) problems and attributes to increase (decrease) preference, I summarize the data in either a Kendall or polychoric correlation matrix. Lately, I've been using Kendall tau because it allows me to partial out the products (and cities and trials), which are only of historical interest.
The matrix gets exported to TETRAD, where I run a search, using knowledge tiers, to find a DAG. This can involve some orienting or reorienting to get sensible results (orienting specific problems before more general problems, etc., and chasing cycles).
The graph gets exported to Graphviz to get a nice layout, which in turn goes to Netica or Hugin, where I estimate the Bayesian network.
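
The folding and correlation steps in the layout above could look roughly like this in Python, assuming NumPy. The sign convention in `fold_problem` and all function names are hypothetical, and the sketch computes the plain Kendall tau-b, not the partial tau (with products, cities, and trials partialled out) mentioned above.

```python
import numpy as np

def fold_problem(first, second):
    """Fold a bivariate binary problem response into an ordinal trinomial.
    Inputs are 0/1 arrays (1 = problem occurred with that product).
    Returns +1 (first only), 0 (both or neither), -1 (second only)."""
    return np.asarray(first, dtype=int) - np.asarray(second, dtype=int)

def kendall_tau_b(x, y):
    """Kendall tau-b for two ordinal vectors, ties handled via the
    usual tau-b denominator. O(n^2) memory, fine for a sketch."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sx = np.sign(x[:, None] - x[None, :])
    sy = np.sign(y[:, None] - y[None, :])
    denom = np.sqrt((sx ** 2).sum() * (sy ** 2).sum())
    return float((sx * sy).sum() / denom) if denom > 0 else float("nan")

def tau_matrix(data):
    """Symmetric matrix of pairwise tau-b values, one column per item."""
    p = data.shape[1]
    T = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            T[i, j] = T[j, i] = kendall_tau_b(data[:, i], data[:, j])
    return T
```

A matrix built this way is not guaranteed to be positive definite, so it may need a repair step before being used as a correlation matrix in a search.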

Variable Reduction

As I mentioned above, some of these trials end up with a large # of questions. They are designed by committees, and the decision makers are managers, not scientists. Thus they just want to keep everyone happy. My initial question was on techniques to reduce the # of questions so that the graph was sensible. After I asked the question, I found a reference that actually addresses graphical models (rather than factor analysis), a variation of McCabe's Principal Variables, developed by Cumming and Wooff. Applying that, and using Jolliffe's 0.7 rule, I was able to reduce the # of variables to 27 and get a sensible graph.

Cumming, J.A. and Wooff, D.A. (2007) Dimension Reduction via Principal Variables. Computational Statistics & Data Analysis 52, 550-565.
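
For readers unfamiliar with Jolliffe's 0.7 rule: it is commonly stated as retaining as many components (or variables) as there are eigenvalues of the correlation matrix above 0.7. A minimal Python sketch under that reading, assuming NumPy. How the rule was combined with the Cumming and Wooff procedure is not spelled out above, so this shows only the counting step, and the function name is hypothetical.

```python
import numpy as np

def n_to_retain(R, cutoff=0.7):
    """Number of eigenvalues of the correlation matrix R above cutoff."""
    eigvals = np.linalg.eigvalsh(np.asarray(R, dtype=float))
    return int((eigvals > cutoff).sum())
```

The returned count would then serve as the target number of variables for a principal-variables selection.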

Tetrad Crashing and no logs.

I solved that problem Thursday. As prep for submitting the data as a problem example, I renamed the variables to something short and retested. Now everything worked. It turns out that my original names must have been too long and, when truncated, ended up as multiple variables with the same name. Fixing the names fixed the problems.
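
A quick pre-flight check for this kind of name collision could be sketched in Python as follows. The truncation length is not stated above, so `max_len` is a hypothetical parameter to be set by experiment, and the function name is made up for illustration.

```python
from collections import defaultdict

def truncation_collisions(names, max_len):
    """Group variable names that become identical when cut to max_len
    characters. Returns {truncated_name: [original names]} for clashes."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:max_len]].append(name)
    return {short: full for short, full in groups.items() if len(full) > 1}
```

An empty result for a given `max_len` means the names stay unique after truncation at that length.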