Discrete Binary variables in STAN?


Carlos

Apr 26, 2013, 4:53:18 PM
to stan-...@googlegroups.com
Hi all!

First, thanks for the 1.3.0 update :) (You cannot imagine how useful the rep_matrix function has been for me...)
Second, I was wondering about the plans to implement discrete binary variables in Stan. I know that in principle this is a hard task (I guess because of the gradient?), but it would be very useful for estimating models that have jump variables, or for doing stochastic variable selection. For example, in Korobilis (2013), "Assessing the Transmission of Monetary Policy Using Time-Varying Parameter Dynamic Factor Models", there is a nice way to model changes in volatility (using a discrete variable for the variance at each time period that either turns on or shuts down time variation)... Another example might be http://econpapers.repec.org/article/eeeeconom/v_3a154_3ay_3a2010_3ai_3a1_3ap_3a85-100.htm

Thanks!

Carlos

Apr 26, 2013, 4:56:06 PM
to stan-...@googlegroups.com
PS: I thought that it could be done by using a parameter under a uniform (0,1) instead of using a binary indicator. Thus, the posterior mean of that parameter might be interpreted as the probability that the fictitious binary variable is equal to 1...but I don't know if this is a good approach?

Bob Carpenter

Apr 26, 2013, 5:18:42 PM
to stan-...@googlegroups.com
> On Friday, April 26, 2013 4:53:18 PM UTC-4, Carlos wrote:
>
> First, thanks for the 1.3.0 update :) (You cannot imagine how useful has been for me the rep_matrix function...)

I built rep_matrix after building a bunch of models myself and
getting frustrated that we didn't have it!
I'm going to add indexing by array for subslicing soon
too, and also block operations for vectors.

> Second, I was wondering about the plans to implement discrete binary variables in STAN. I know that in principle
> this is a hard task (I guess that it is because of the gradient?), but it would be very useful to estimate models
> that have jump variables, or to do stochastic variable selection. For example, in Korobilis (2013), "Assessing the
> Transmission of Monetary Policy Using Time-Varying Parameter Dynamic Factor Models", there is a nice way to model
> changes in volatility (using a discrete variable for the variance at each time period that either turns on or shuts
> down time variation)... Other example might be
> http://econpapers.repec.org/article/eeeeconom/v_3a154_3ay_3a2010_3ai_3a1_3ap_3a85-100.htm

We probably won't be getting to discrete variable sampling
any time soon. We know how to do it inefficiently, but doing
it efficiently is going to be a lot of work from where we're
at.

On 4/26/13 4:56 PM, Carlos wrote:
> PS: I thought that it could be done by using a parameter
> under a uniform (0,1) instead of using a binary indicator.

> Thus, the posterior mean of that parameter might be interpreted as the probability
> that the fictitious binary variable
> is equal to 1...but I don't know if this is a good approach?

That's essentially what we suggest in the mixture modeling
chapter in the manual. Marginalize out the actual binary
variable and you're often left with just such a probability
in (0,1) instead of the binary variable. Not sure if that'll
work for the model you care about. And I don't really have
time to dig the details out of the papers you cite. If you
can provide an example of what the model would look like in
Stan if we did have binary parameters, it should be easy
to tell.

- Bob

Carlos

Apr 27, 2013, 12:20:46 PM
to stan-...@googlegroups.com
I built rep_matrix after building a bunch of models myself and 
getting frustrated that we didn't have it! 
I'm going to add indexing by array for subslicing soon 
too, and also block operations for vectors. 

That would be amazing! Also, I recommend implementing the vec and Kronecker operators... :)

We probably won't be getting to discrete variable sampling 
any time soon.  We know how to do it inefficiently, but doing 
it efficiently is going to be a lot of work from where we're 
at. 

I imagined that... I guess that it is difficult to implement (sigh)

That's essentially what we suggest in the mixture modeling 
chapter in the manual.  Marginalize out the actual binary 
variable and you're often left with just such a probability 
in (0,1) instead of the binary variable.  Not sure if that'll 
work for the model you care about.   And I don't really have 
time to dig the details out of the papers you cite.  If you 
can provide an example of what the model would look like in 
Stan if we did have binary parameters, it should be easy 
to tell. 

Sooo, let me consider two examples. The first one would be a stochastic search, where you have a bunch of variables and you do not know which of them should enter as regressors in a model. Imagine that we have some data, Y, and we consider 3 regressors, X1, X2, and X3, with associated coefficients b1, b2, and b3. Then, we would like to model,

Y ~ normal(d1*b1*X1 + d2*b2*X2 + d3*b3*X3, sigma)

d1, d2, and d3 are binary variables that turn on or off the contribution of the variables X1, X2, and X3, respectively. If we model them as Bernoulli variables, then the posterior mean of each parameter gives us the probability that the variable Xi should be included in the model. This is a very, very simple model and obviously can be done using other methods. But if we have tons of variables/combinations of variables, I guess that this procedure might help...

The second example is a state-space model. We can consider, for example, a stochastic volatility model, with

Y[n] ~ normal(mu, exp(h[n]/2))
h[n] ~ h[n-1] + J[n]*hnoise
hnoise ~ normal(0, sigmah)

In this case, J[n] is again a binary variable. If J[n]=0, then h remains the same as in the previous period, while if J[n]=1, then (log) volatility changes by the amount hnoise.
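To make the data-generating process concrete, here is a quick forward simulation in Python (a sketch; the parameter values and the jump probability p_jump are illustrative assumptions, not from the thread):

```python
# Forward simulation of the jump stochastic-volatility process:
#   h[n] = h[n-1] + J[n] * hnoise[n],  J[n] in {0, 1}
#   Y[n] ~ normal(mu, exp(h[n] / 2))
# All parameter values below are illustrative assumptions.
import math
import random

random.seed(42)

N = 200
mu = 0.0        # observation mean
sigmah = 0.5    # scale of log-volatility jumps
p_jump = 0.2    # assumed probability that J[n] = 1

h = [0.0] * N   # latent log-volatility path
J = [0] * N     # binary jump indicators
for n in range(1, N):
    J[n] = 1 if random.random() < p_jump else 0
    hnoise = random.gauss(0.0, sigmah)
    h[n] = h[n - 1] + J[n] * hnoise  # moves only when J[n] = 1

Y = [random.gauss(mu, math.exp(hn / 2)) for hn in h]
```

Note that when J[n] = 0 the log-volatility path is exactly flat over that period, so the "no move" event carries positive probability mass rather than a density.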

Thanks!

Bob Carpenter

Apr 27, 2013, 3:30:52 PM
to stan-...@googlegroups.com


On 4/27/13 12:18 PM, Carlos wrote:
> I built rep_matrix after building a bunch of models myself and
> getting frustrated that we didn't have it!
> I'm going to add indexing by array for subslicing soon
> too, and also block operations for vectors.
>
>
> That would be amazing! Also, I recommend implementing the vec and the kronecker operators... :)
>
> We probably won't be getting to discrete variable sampling
> any time soon. We know how to do it inefficiently, but doing
> it efficiently is going to be a lot of work from where we're
> at.
>
>
> I imagined that... I guess that it is difficult to implement (sigh)
>
> That's essentially what we suggest in the mixture modeling
> chapter in the manual. Marginalize out the actual binary
> variable and you're often left with just such a probability
> in (0,1) instead of the binary variable. Not sure if that'll
> work for the model you care about. And I don't really have
> time to dig the details out of the papers you cite. If you
> can provide an example of what the model would look like in
> Stan if we did have binary parameters, it should be easy
> to tell.
>
>
> Sooo, let me consider two examples. The first one would be an stochastic search, where you have a bunch of variables,
> and you do not know which variables should enter as a regressors in a model, for example. Imagine that we have some
> data, Y, and we consider 3 regressors, X1, X2, and X2, with associated coefficients b1,b2 and b3. Then, we would like to
> model,
>
> Y ~ normal(d1*b1*X1 + d2*b2*X2 + d3*b3*X3, sigma)

This is the one we haven't figured out how to do. It's
easy enough in a small dimensional case, but the marginalization
grows exponentially with the number of possibly-zero variates.
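To see the blow-up concretely, here is a brute-force version of that marginalization in Python rather than Stan (a sketch; all names are illustrative): each observation's likelihood is a log-sum-exp over all 2^K inclusion patterns.

```python
import itertools
import math

def normal_logpdf(y, m, s):
    """Log density of y under Normal(m, s)."""
    return -0.5 * math.log(2 * math.pi) - math.log(s) - 0.5 * ((y - m) / s) ** 2

def marginal_loglik(y, x, b, p_incl, sigma):
    """Log-likelihood of one observation y with regressors x and coefficients b,
    marginalizing over all 2^K binary inclusion indicators with prior
    inclusion probabilities p_incl."""
    K = len(b)
    terms = []
    for d in itertools.product([0, 1], repeat=K):  # 2^K on/off patterns
        # log prior probability of this pattern
        lp = sum(math.log(p_incl[k]) if d[k] else math.log1p(-p_incl[k])
                 for k in range(K))
        mean = sum(d[k] * b[k] * x[k] for k in range(K))
        terms.append(lp + normal_logpdf(y, mean, sigma))
    m = max(terms)  # stable log-sum-exp
    return m + math.log(sum(math.exp(t - m) for t in terms))
```

With K = 3 this is 8 terms per observation; with K = 20 it is already over a million, which is the exponential growth in question.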

For example, with:

Y ~ normal(d1 * b1 * X1 + d2 * b2 * X2, sigma);

you can marginalize out the indicator variables d1 and d2
based on their probabilities of being included, p_d1 and p_d2.
Hang on, it's ugly.

parameters {
  real<lower=0,upper=1> p_d1;
  real<lower=0,upper=1> p_d2;
}
model {
  real log_p_12;
  real log_p_1;
  real log_p_2;
  real log_p_0;
  log_p_12 <- log(p_d1) + log(p_d2);
  log_p_1 <- log(p_d1) + log1m(p_d2);
  log_p_2 <- log1m(p_d1) + log(p_d2);
  log_p_0 <- log1m(p_d1) + log1m(p_d2);

  for (n in 1:size(Y)) {
    real log_p[4];
    log_p[1] <- log_p_12 + normal_log(Y[n], b1 * X1[n] + b2 * X2[n], sigma);
    log_p[2] <- log_p_1 + normal_log(Y[n], b1 * X1[n], sigma);
    log_p[3] <- log_p_2 + normal_log(Y[n], b2 * X2[n], sigma);
    log_p[4] <- log_p_0 + normal_log(Y[n], 0, sigma);
    lp__ <- lp__ + log_sum_exp(log_p);
  }
  ...

where the built-in function log_sum_exp is defined as:

log_sum_exp(xs) == log(sum(exp(xs)));

It's faster and more arithmetically stable than the direct implementation.
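In Python terms (a sketch, not Stan), the stability comes from shifting by the maximum before exponentiating, so no exponent can overflow:

```python
import math

def log_sum_exp(xs):
    """Compute log(sum(exp(x) for x in xs)) without overflow/underflow."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# The naive math.log(sum(math.exp(x) for x in xs)) overflows for
# xs = [1000.0, 1000.0]; the shifted form returns 1000.0 + log(2).
```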

There's not a good way to vectorize over the data items Y[n],
so it's really inefficient on a per-iteration basis compared
to doing discrete sampling. Having said that, it should still
work in Stan, and it may wind up giving you higher effective
sample sizes per iteration than sampling the discrete variable.

It would be possible to calculate all of this in a loop, but
I thought it'd be easier to understand blown out. It won't
be any more efficient in a loop.

For some other alternatives, see the following blog post
and discussion:

http://andrewgelman.com/2013/03/18/tibshirani-announces-new-research-result-a-significance-test-for-the-lasso/


> d1,d2 and d3 are binary variables that turn on or off the contribution of the variables X1, X2 and X3, respectively. If
> we model them as Bernouilli distributions, then the posterior mean of the parameters gives us the probability that the
> variable Xi should be included in the model.
>
> The second example is a state-space model. We can consider for example, a stochastic volatility model, with
> Y[n] ~ normal(mu, exp(h[n]/2))
> h[n] ~ h[n-1] + J*hnoise
> hnoise ~ normal(0, sigmah)
> In this case, J is again a binary variable. If J=0, then h remains the same as in the previous period, while if J=1,
> then (log) volatility changes by the amount hnoise

This one's much easier because the mixture is at the data item level.
But do you really mean it doesn't move at all if J = 0? If so, then
you get this:

parameters {
  real<lower=0,upper=1> lambda;
  ...
}
model {
  for (n in 2:N) {
    if (h[n] == h[n-1])
      lp__ <- lp__ + log_sum_exp(log(1 - lambda),
                                 log(lambda) + normal_log(h[n], h[n-1], sigmah));
    else
      lp__ <- lp__ + log(lambda) + normal_log(h[n], h[n-1], sigmah);
  }
  ...
}

where log_sum_exp(x,y) = log(exp(x) + exp(y)).

I think that captures the model, but someone should doublecheck
my work here. The above model has a log likelihood equivalent
to:

log PROD_n ( lambda * Normal(h[n] | h[n-1], sigmah)
             + (1 - lambda) * I(h[n] == h[n-1]) )

where I(.) is the indicator function.


- Bob

Carlos

Apr 27, 2013, 6:23:58 PM
to stan-...@googlegroups.com
Mmmmm just 3 letters... O.M.G. It is more complicated than I expected. For the first example, yes, that totally makes sense, although it could be quite a task to implement it for more than 3-4 variables. I guess that the approach I suggested was too naive? That is, defining d1, d2, and d3 as parameters, imposing priors,

d1 ~ uniform(0,1);
d2 ~ uniform(0,1);
d3 ~ uniform(0,1);

and then interpreting the posterior as the probability that the variable should be included.

For the second example, I should have explained better. The variable h does not come from the data. It is a parameter vector of size N that changes over time, so it is unobservable. Then, every period, we have,

h[n] <- h[n-1] + J[n]*hnoise;

So J is also a size-N vector whose entries activate or deactivate the change at time n.

Thanks again for everything!

Bob Carpenter

Apr 27, 2013, 6:55:00 PM
to stan-...@googlegroups.com


On 4/27/13 6:23 PM, Carlos wrote:
> Mmmmm just 3 letters... O.M.G. It is more complicated than I expected. For the first example, yes, that totally makes
> sense, although it could be quite a task to implement it for more than 3-4 variables.

Indeed. That's why I only did two. You could put it in a
loop, but it's going to be slow.

> I guess that the approach that I
> suggested was too naive? That is, defining d1, d2 and d3 as parameters, and then impose priors,
>
> d1 ~ uniform(0,1);
> d2 ~ uniform(0,1);
> d3 ~ uniform(0,1);
> and then interpreting the posterior as the probability that the variable should be included.
>
> For the second example, I should have explained better. The variable h does not come from the data. It is a parameter
> vector of size N that changes over time, so that it is unobservable. Then, every period, we have,
>
> h[n] <- h[n-1] + J[n]*hnoise;
>
> So that J is also a size N vector that in each entry activates or deactivates the change at time n.
>
> Thanks again for everything!

That's much trickier -- the obvious approach won't work because
Stan won't let you assign to parameters. And then there are
long-range dependencies rather than local ones.
- Bob
