question about importance maps


Jesse Rissman

Jul 12, 2008, 12:36:49 PM
to mvpa-t...@googlegroups.com
Hi Greg, Sean, and others,

I have a classifier that can discriminate between single trials from
conditions A and B with about 80% accuracy using whole brain data
(16,848 voxels), with no feature selection. When I use
interpret_weights.m to generate importance maps, the maps come out
looking great, highlighting all the regions I'd expect to play a role
in the classification. However, I'm finding myself a little confused
about an aspect of the importance maps, and I was hoping someone could
help clarify the following:

Interpret weights returns a pattern for each of my 10 cross-validation
iterations, and this pattern's .mat field is [16848 x 2]. It was my
understanding that the two columns of this matrix represent the
importance values for each of my two conditions (condition A and
condition B, respectively). However, when I convert them into two
separate brain maps and visualize them, they look virtually identical
(the same regions that are positive in one are positive in the other
and the same regions that are negative in one are negative in the
other, and the values at any given voxel differ only minimally -- in
fact they are correlated at r = .95 over space). This is what's
puzzling me -- I thought that for a binary A vs. B classification,
these maps should essentially be the inverse of each other, since
voxels that load positively onto output unit A will most likely load
negatively onto output unit B, and vice versa. If I were to
conceptualize the positive and negative values of one of the
importance maps as being the loadings on the two output units, then
the map would make complete sense to me, since the regions I expect to
activate outcome A are positively signed and the regions I expect to
activate outcome B are negatively signed. However, it doesn't make
sense to me that the importance maps generated for each condition
(from the two columns of the impmap matrix) are virtually identical --
for both maps the positive importance values are in the regions
associated with condition A and the negative importance values are in
the regions associated with condition B. It seems to me that for a
binary A vs. B classification, it should be possible to create a
single importance map that captures the signed loadings of each voxel
on output A vs. B, since any voxel that loads similarly on both
outcomes should be considered to be unimportant for the
classification. Any guidance on this matter would be much appreciated.

Thanks,
Jesse

-----------------------------------
Jesse Rissman, Ph.D.
Dept. of Psychology
Stanford University
Jordan Hall, Bldg 420
Stanford, CA 94305-2130

Greg Detre

Jul 14, 2008, 10:54:09 PM
to mvpa-t...@googlegroups.com
hey jesse,

sorry for the slowness of the response. the hive mind is thinking.

g


--
Greg Detre
cell: 617 642 3902
email: gr...@gregdetre.co.uk
web: http://www.princeton.edu/~gdetre/

Jesse Rissman

Jul 22, 2008, 5:37:48 PM
to mvpa-t...@googlegroups.com
Hi Greg,

Has the hive mind generated any thoughts on this issue yet?

- Jesse

Jesse Rissman

Aug 7, 2008, 2:57:53 PM
to mvpa-t...@googlegroups.com
I just want to clarify a few things about my previous posting regarding importance maps, since my question has not yet been answered, and hopefully this will shed more light on the nature of my confusion:

My classifier is being trained on an equal number of data points from condition A and condition B.  I've z-scored the data prior to classification, such that each voxel has a mean activity of 0.  What this implies is that the mean activity level that a given voxel exhibits on trials from condition A (e.g., 0.5874) will inherently be the inverse of its mean activity level on trials from condition B (e.g., -0.5874), so that the grand mean is 0.  I've confirmed this to be the case in my data.
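This property is easy to verify in a few lines (a toy numpy sketch with synthetic data, not my actual dataset):

```python
import numpy as np

# Toy demonstration: one voxel, 50 trials per condition (synthetic data).
rng = np.random.default_rng(0)
a_trials = rng.normal(1.0, 0.5, 50)   # condition A activity for this voxel
b_trials = rng.normal(0.2, 0.5, 50)   # condition B activity for this voxel
x = np.concatenate([a_trials, b_trials])

# z-score across all 100 trials, giving the voxel a grand mean of 0
z = (x - x.mean()) / x.std()

mean_a = z[:50].mean()
mean_b = z[50:].mean()
# With equal trial counts and a grand mean of 0, the two condition means
# are forced to be additive inverses: mean_a + mean_b == 0 (up to rounding).
```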

The interpret_weights.m script, used to generate importance maps (as in Polyn et al., 2005, Science), uses the equation: imp_{i,j} = w_{i,j} * a_{i,j}, where w_{i,j} is the weight of voxel i on output unit j, and a_{i,j} is the average activity of voxel i for training patterns from category j.
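In array terms, that computation looks like this (a hypothetical numpy sketch, not the actual interpret_weights.m code; the names W, X, and labels are mine):

```python
import numpy as np

def importance_map(W, X, labels):
    """W: [n_voxels x n_units] weights; X: [n_voxels x n_trials] z-scored
    training data; labels: condition index (0..n_units-1) for each trial.
    Returns imp[i, j] = w[i, j] * (mean activity of voxel i in condition j)."""
    n_voxels, n_units = W.shape
    imp = np.zeros((n_voxels, n_units))
    for j in range(n_units):
        avg_j = X[:, labels == j].mean(axis=1)   # a_{i,j}
        imp[:, j] = W[:, j] * avg_j              # imp_{i,j} = w_{i,j} * a_{i,j}
    return imp

# One voxel with weights [0.0283, -0.0275] and mean activity of +/-0.5874
# in the two conditions (as in the real-data example discussed here):
W = np.array([[0.0283, -0.0275]])
X = np.array([[0.5874, -0.5874]])
labels = np.array([0, 1])
imp = importance_map(W, X, labels)
print(imp)  # both columns come out positive and nearly equal
```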

So according to that formula, if voxel i (e.g., the voxel described above -- real data, by the way) has a weight of 0.0283 to output unit A and a weight of -0.0275 to output unit B, then its importance value for activating output unit A would be 0.0283 * 0.5874 = 0.0166, and its importance value for activating output unit B would be -0.0275 * -0.5874 = 0.0162.

As you can see, because of the opposite signs of the average activity values in condition A and condition B (which is inherent in a z-scored dataset with an equal number of condition A and condition B trials), the importance values come out almost identical for condition A and condition B, when in reality this voxel should be considered to have a positive importance for activating output unit A and a negative importance for activating output unit B (as dictated by the weights alone, as well as by intuition -- this voxel is clearly more engaged during trials from condition A, and thus it should lead to increased activation of output unit A and decreased activation of output unit B).  The issue with this example voxel is the same throughout the brain, such that the resulting importance map for Condition A looks virtually identical to the importance map for Condition B.

So my question is really, why use the average activity from a single condition in determining the importance value?  For properly z-scored data (that is, data z-scored immediately before being submitted to the classifier), couldn't the weights alone be used as the importance values?

To reiterate a point I made in my last posting, it seems to me that for a binary A vs. B classification, it should be possible to generate a single importance map, with positive and negative values, since the only way a voxel can be "important" for classification success is for it to differentially activate the condition A vs. condition B output units.

As a side note, this problem may not be apparent to mvpa users who only z-score their raw data (with rest timepoints and conditions-of-no-interest included), but then don't z-score again after doing any of the following common data manipulations:  excluding rest timepoints, excluding trials from certain conditions, artificially balancing the number of trials in condition A and condition B, performing temporal averaging, etc.  All of these data manipulations that occur after the first z-scoring step will ultimately result in the mean activity of each voxel being a non-zero value.  In fact, just the act of excluding rest timepoints after z-scoring will serve to boost the mean value of most task-related voxels in the brain well above zero.  This will often result in the mean activity levels in Condition A and Condition B both being positive, and thus, when these values are used by interpret_weights to generate importance maps, there will not be the sign inversion problem that I'm experiencing.  Under these circumstances, the importance maps for Condition A and Condition B do end up looking roughly opposite to each other (i.e. positive values in one are negative in the other), but for the wrong reasons, it seems.

Any clarity that mvpa users could provide on these issues would be greatly appreciated.

Thanks,
Jesse

-----------------------------------
Jesse Rissman, Ph.D.
Dept. of Psychology
Stanford University
Jordan Hall, Bldg 420
Stanford, CA 94305-2130
On Mon, Jul 14, 2008 at 7:54 PM, Greg Detre <gde...@princeton.edu> wrote:


--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Princeton MVPA Toolbox for Matlab" group.
To post to this group, send email to mvpa-t...@googlegroups.com
To unsubscribe from this group, send email to mvpa-toolbox-unsubscribe@googlegroups.com
For more options, visit this group at http://groups.google.com/group/mvpa-toolbox?hl=en
-~----------~----~----~----~------~----~------~--~---


Greg Detre

Aug 13, 2008, 2:56:06 PM
to mvpa-t...@googlegroups.com
Dear Jesse + list,

Let me pass on this reply from Sean Polyn:

Dear Jesse,

Greg Detre forwarded your MVPA question to me, perhaps I can help a bit
here. I remember we talked a bunch about the issue of importance values
with a 2-way classification and z-scoring back when I was still at
Princeton. We did come to a similar conclusion as you, that with
certain limiting conditions, the two maps would end up being exactly
inverted from one another. In the end, it is related to the issue that,
for a discriminative classifier, in a binary classification, evidence
for category A is evidence against category B. So in a sense, any
discriminative information is important for both categories. I think
I'm just restating a lot of what you said.

One thing though, looking through your detailed description of the
issue: You say that the average activity for cat B for voxel i is
negative, and it has a negative weight to output B. It seems to me that
this implies that voxel i is trying to activate both A and B (for a
backprop classifier at least, and possibly a logistic regression
classifier)... that doesn't really make sense to me... what kind of
classifier is this? Perhaps I'm just missing something basic though.

Since in the Science paper I was working with a 3-way classification, I
never spent a huge amount of time solving the 2-way case, though it
seems like you've made some headway. I'd be quite interested to hear
Greg's thoughts on the matter, I haven't done anything with importance
maps in 3 years now!

Sean


Jesse Rissman

Aug 13, 2008, 3:04:16 PM
to mvpa-t...@googlegroups.com
Hi Sean,

Thanks for your reply.  What you said in your first paragraph echoes my intuitions about what importance maps should theoretically look like for binary classifications.  However, the issue that I'm having seems to really hinge on the way in which I z-score my data, which has a dramatic effect on the importance maps.  That was essentially what I was trying to convey in my detailed email.

When I only z-score the raw data (all 2030 TRs of my experiment), and then subsequently extract the 400 TRs I wish to use for classification (e.g., the 3rd TR of each of 400 trials), the data going into the classifier (a simple 2-layer backprop network) are no longer truly z-scored.  That is, my temporal selection breaks the z-scoring, since I'm picking the peak timepoint for every trial, which will drive up the mean of all task-related (i.e. non-default mode network) voxels to a non-zero positive value.  This same thing probably happens with many people's experiments, since they exclude rest timepoints after z-scoring, which will inevitably serve to drive up the means of most voxels in the brain.  In the past, I didn't perform a second round of z-scoring, so the data going into the classifier didn't truly have a mean of 0 for each voxel.  Under these circumstances, the importance maps for condition A and condition B do come out roughly as the inverse of each other, as one might expect.  But I believe that this is happening for the wrong reason.  Because most voxels in the brain have positive means for both condition A and condition B, the importance values acquire the sign of the weights, which will typically be positive for one output unit and negative for the other in voxels that differentiate the two conditions (i.e. "important" voxels).

However, when the data are properly z-scored before going into the classifier, the importance maps come out looking virtually identical for condition A and condition B, rather than the inverse.  In my previous email, I explained how this problem arises based on your importance maps equation.  In my example voxel, the average activity for category B is negative and the weight to category B is negative, but this does not mean that this voxel drives up activity in category B.  Rather, it means that the more that activity in this voxel goes up, the more that activity in output unit B goes down.  It just so happens that after z-scoring, this voxel has negative activity in most training trials from category B and positive activity in most training trials from category A, but the essence of its contribution is really that when its activity is high/positive (as in most category A training trials), output unit B gets driven down, and when its activity is low/negative (as in most category B training trials), output unit B gets driven up.

I think the essence of my confusion really boils down to your equation for calculating importance.  I don't understand why a voxel's importance for activating output unit A is a product of its weight to output A and the average activity of training trials from category A.  Why is only the category A mean (or "canonical," as it's called in the interpret_weights script) used in deriving the importance score for activating the category A output unit?  This only serves to bias the importance maps to mostly mirror the univariate activity differences.  It seems to me that a voxel's importance value should be a function of the weights alone.  After all, the weight matrix is sculpted using all training trials, not just the training trials from one category.  When the data are properly z-scored prior to classification, and assuming an equal number of category A and category B training trials, every voxel will have a mean activity of zero (averaged across all training trials), and thus it seems like a scale factor is not necessary.  I may be misinterpreting something here, but it seems to me that if the raw weights do need some sort of scaling to become interpretable, this scaling should be based on all training trials, and not just the training trials from one category.

What are your thoughts on this?  What would be the problem with just saving the weights alone and calling that the importance map?

Thanks,
Jesse

-----------------------------------
Jesse Rissman, Ph.D.
Dept. of Psychology
Stanford University
Jordan Hall, Bldg 420
Stanford, CA 94305-2130
(o) 650-724-9515
(f)  650-725-5699






On Tue, Aug 12, 2008 at 10:47 AM, Sean Polyn <po...@psych.upenn.edu> wrote:

knorman

Aug 15, 2008, 12:09:09 PM
to Princeton MVPA Toolbox for Matlab
hi jesse,
regarding importance maps: we have definitely encountered the issue
that you described (i.e., in a 2-category discrimination, if you only
include TRs from the 2 categories of interest, excluding rest, and you
z-score before classification, you end up with identical importance
maps for the 2 categories). furthermore, you are absolutely correct
that this is a consequence of multiplying the weights by the average
activation values. i still like the approach of multiplying weights by
average activation values, but (subsequent to the polyn et al., 2005
paper) i have come to believe that it's useful to tease apart
"positive weight, positive average value" from "negative weight,
negative average value". according to the polyn et al., 2005,
importance map scheme, these two situations both yield positive
importance values. below, i have included an excerpt from the methods
section of a more recent (under review) paper, which uses a slightly
different scheme for importance maps. this amended scheme would fix
your problem (insofar as it would assign different importance maps to
the 2 categories in your experiment).

=== begin excerpt ===

A more direct way of gaining insight into how the classifier is
discriminating between encoding tasks is to look at the classifier’s
weights (after it has been trained on study-phase data).
Specifically, we wanted to use classifier weight information to
establish which voxels were most important in activating each task’s
output unit, when that task was present. For example, for scans
associated with the artist task, which voxels played the largest role
in (correctly) activating the artist unit? In our neural network
classifier, the net contribution of a voxel to activating a task unit
is a function of the voxel’s activity, multiplied by the weight
between that voxel and the task unit. Logically speaking, there are
two ways for a voxel to make a net positive contribution to activating
a particular task unit:

1) The voxel could have a positive z-scored average activation value
(indicating that it was more active than usual) for scans associated
with that task, and it could have a positive weight to that task
unit. Voxels meeting this criterion were assigned a positive
importance value imp_ij = w_ij * avg_ij, where w_ij is the weight
between input unit i (corresponding to voxel i) and output unit j
(corresponding to task j), and avg_ij is the average activity of input
unit i while subjects were performing task j.

2) The voxel could have a negative z-scored average activation value
(indicating that it was less active than usual) for scans associated
with that task, and it could have a negative weight to that task unit.
In this case, the “double negative” combination of negative activation
and negative weight results in a net positive contribution. Voxels
meeting this criterion were assigned a negative importance value
imp_ij = -w_ij * avg_ij.

Voxels where the sign of w_ij differed from the sign of avg_ij
(indicating a net negative contribution of that voxel to detecting
that task state) were assigned an importance value of zero.
Importance maps were computed using the above equations for each
individual subject. Crucially, note that (with these equations) both
positive and negative importance values indicate a net positive
contribution of that voxel to activating the task unit (when that task
is present). The sign of the importance value indicates whether the
voxel contributes via a characteristic deactivation that is picked up
by the classifier (via a negative weight), or a characteristic
activation that is picked up by the classifier (via a positive
weight). Computing importance values in this way makes it easier to
compare importance maps to the GLM results discussed in Section 4
(which indicate, for each task/cluster combination, whether the
cluster is more or less active for that task, compared to other
tasks). Note that this procedure for computing importance values
differs from the procedure used by Polyn et al. (2005), which measured
whether each voxel i made a net positive or negative contribution to
the activity of output unit j, but did not indicate whether voxels making
net positive contributions did so because they were more or less
active than usual during condition j.

==== end excerpt ==========
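in code, the amended scheme boils down to something like this (a numpy sketch i'm writing for illustration, not the actual toolbox implementation):

```python
import numpy as np

def amended_importance(w, avg):
    """w, avg: same-shape arrays of weights and average activations for one
    category. Sign-matched voxels get a signed w*avg; mismatched get 0."""
    imp = np.zeros_like(w, dtype=float)
    pos = (w > 0) & (avg > 0)          # characteristic activation
    neg = (w < 0) & (avg < 0)          # characteristic deactivation
    imp[pos] = w[pos] * avg[pos]       # positive importance: imp = w * avg
    imp[neg] = -(w[neg] * avg[neg])    # negative importance: imp = -w * avg
    return imp

# three voxels: sign-matched positive, sign-matched negative, mismatched
w = np.array([0.5, -0.5, 0.5])
avg = np.array([0.2, -0.2, -0.2])
print(amended_importance(w, avg))  # -> [0.1, -0.1, 0.0]
```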

so, let me know if this answers your question, or if you have any
follow-up questions. i should have chimed in with this sooner (i was
tuned out for a little while).
take care
ken
> > *From: *Jesse Rissman <riss...@gmail.com>
> > *Date: *August 7, 2008 2:57:53 PM EDT
> > *To: *mvpa-t...@googlegroups.com
> > *Subject: **[mvpa-toolbox] Re: question about importance maps*
> > *Reply-To: *mvpa-t...@googlegroups.com
> ...
>
> read more »

knorman

Aug 16, 2008, 10:18:00 AM
to Princeton MVPA Toolbox for Matlab
hi jesse,

On Fri, Aug 15, 2008 at 6:34 PM, Jesse Rissman <ris...@gmail.com>
wrote:
> I was wondering if you could say a bit more about why you favor the approach of multiplying the weights by the average activation values from each condition?

it all depends on what you want to get out of the importance map.
weights to the category A output unit * average activation values for
category A = "the average contribution of a voxel to activating the
category A output unit, when category A is present". as i said in my
previous post, we now find it useful to distinguish between voxels
where the presence of category A is signaled by above-average
activity, and voxels where the presence of category A is signaled by
below-average activity.

a potential issue with just looking at the weights is that, in
multicategory discriminations (i.e., n_categories >= 3), a voxel's
average activity (when category A is present) sometimes ends up being
discordant with its weight to category A. for example, it may end up
with a positive weight but a negative average activation value. if you
just look at weights, you might think that the voxel's contribution to
recognizing category A is positive, when in actuality (when category A
is present) the voxel is making a net negative contribution.

there is no problem with looking at the weights if the point of your
map is to say "hey, here are the weights".

also, including an average activity term does not force the classifier
importance maps to mirror the univariate contrast between Condition A
and Condition B. i see how the average activity term (itself) is kind
of similar to the univariate contrast, but when you multiply it by the
weight you end up with something different.

the most important point about importance maps derived from backprop
classifiers is that they are *not* a map of where information is
located in the brain. there are myriad reasons why a classifier might
end up with a small weight to an informative voxel. importance maps
(either weights alone or weights * average acts) are useful for
gaining insight into *what the classifier is doing*, but you can't
infer (based on a small importance value or weight) that the voxel is
uninformative. if you want to use classifiers for brain mapping, i
recommend the kriegeskorte approach of sliding a spherical spotlight
around the brain, trying to classify the information within that
spotlight, & then making a map of which sphere locations show good
classification performance.

- ken

Hunar Ahmad

Jul 2, 2013, 5:38:31 PM
to mvpa-t...@googlegroups.com, jesse....@stanford.edu
Dear Jesse,

Even though this thread was posted a long time ago, I have really benefited from the discussions here.  I do have one question regarding the z-scoring: you mentioned that you z-score twice, once before and once after eliminating the rest periods.  When I did that to my data, it seemed to slightly improve the classification score; however, the confusion matrix no longer showed the classification results along the diagonal as usual.  Since you have experience with this, I was wondering what your recommendation for z-scoring is.  Is it better to do it twice, as you mentioned, or is there a better way?  Your help is appreciated.
Thanks a lot in advance

Hunar

Jesse Rissman

Jul 5, 2013, 1:34:08 PM
to mvpa-t...@googlegroups.com
Hi Hunar,

That's strange that the confusion matrix would be affected by z-scoring your data immediately prior to classification.  I have not encountered that issue, so I can't really say what's going on.

As far as z-scoring once vs. twice, I typically do it twice.  My first round of z-scoring occurs when I load the full timeseries of each run into Matlab.  The purpose of this step is essentially to equate for run-to-run shifts in the mean signal, so a simple mean correction (i.e., computing the mean of each voxel for a given run and subtracting that value from all timepoints of that run) should suffice.  The variance normalization provided by the z-scoring is probably not necessary at this stage, since I deal with that later, but in practice I typically do a full z-scoring at this stage anyway.

The reason I later do a second round of z-scoring is that I usually select a subset of my data to use for any given classification analysis, and after doing this, each voxel no longer has a mean of zero.  If you run your classification without ensuring that the mean activity for your Class A patterns in a given voxel is the additive inverse of the mean activity for your Class B patterns in that voxel, then your Class A and Class B importance maps will end up being different.  I always want my Class A and Class B importance maps to be additive inverses of each other, such that there is only one importance value for each voxel, where positive importance indicates that increased activity is predictive of Class A and negative importance indicates that increased activity is predictive of Class B.  I am open to hearing what others have to say about this issue.
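Schematically, the two-pass procedure looks something like this (a rough numpy sketch for illustration; my actual pipeline is in Matlab, and all names here are mine):

```python
import numpy as np

def zscore_voxels(data):
    """z-score each voxel (row) across the timepoints (columns) given."""
    m = data.mean(axis=1, keepdims=True)
    s = data.std(axis=1, keepdims=True)
    return (data - m) / s

def zscore_within_runs(data, run_ids):
    """First pass: z-score each voxel separately within each run, which
    (among other things) removes run-to-run shifts in the mean signal.
    data: [n_voxels x n_timepoints]; run_ids: run label per timepoint."""
    out = np.array(data, dtype=float)
    for r in np.unique(run_ids):
        sel = run_ids == r
        out[:, sel] = zscore_voxels(out[:, sel])
    return out

# toy data: 3 voxels, 2 runs of 6 timepoints each (synthetic)
rng = np.random.default_rng(1)
raw = rng.normal(size=(3, 12))
run_ids = np.repeat([0, 1], 6)
z1 = zscore_within_runs(raw, run_ids)

# Second pass: after selecting the subset of timepoints to classify
# (a hypothetical selection here), z-score again so that each voxel's
# mean over the retained trials is once more exactly zero.
keep = np.array([2, 3, 8, 9])
z2 = zscore_voxels(z1[:, keep])
```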

-- Jesse





J.A. Etzel

Jul 16, 2013, 5:52:23 PM
to mvpa-t...@googlegroups.com
For what it's worth, my 'default' practice is similar to what Jesse
describes: I nearly always do some sort of voxel-wise detrending
(usually mean-subtraction or normalization) on all of the timepoints
within each run, followed by a second round of detrending (sometimes
z-scoring) immediately before classification (after subsetting voxels or
timepoints).

For troubleshooting, be very sure that the z-scoring (or however the
detrending was done) was not data 'peeking': that it did not introduce a
difference between the classes. It's easy to obtain extremely high (or
low) accuracies when the classes are normalized differently!

Jo
--
Joset A. Etzel, Ph.D.
Research Analyst
Cognitive Control & Psychopathology Lab
Washington University in St. Louis
http://mvpa.blogspot.com/