Model design: detect latent temporal change points in regression parameters


Thomas Moran

Jun 18, 2014, 10:50:04 PM
to stan-...@googlegroups.com

Hi everyone, 

I’m here to ask for some guidance with design of a model that detects latent temporal change points in regression parameters. I’ve been dabbling with RStan for a few months and am really impressed with the documentation and support, but as a novice Bayesian I’ve run into a sticking point that, with any luck, will be trivial to someone more experienced. 

tl;dr - looking for Stan approach to infer step-change of regression parameters before and after a latent temporal change point.

The simple case I’ll use for illustration is univariate linear regression of response variable y as a function of predictor x, which is a random variable drawn once per time step from a normal distribution N(.) . There is also a Gaussian noise term in the response, with standard deviation sig_y. 

y = a + b*x + noise —> y ~ N(a + b*x, sig_y)

x ~ N(.)

Could hardly be simpler. In the system under study, there is reason to believe that the system undergoes a step change in functional response, and thus regression parameters, at some change point time Tcp, such that:

y ~ N(a[1] + b[1]*x, sig_y)  for t <= Tcp

y ~ N(a[2] + b[2]*x, sig_y) for t >   Tcp

My objective is to infer Tcp, a[1:2], and b[1:2]. 

As posed, Tcp is discrete and thus can’t be directly represented in Stan. I implemented a hack solution in Stan that just pretends Tcp is continuous, which essentially treats any value of Tcp between time steps as equally likely. This actually works sometimes (meaning it converges and correctly estimates synthetic data parameters) but other times it fails badly (doesn’t converge, incorrect inference). I imagine this abuse of parameters has any number of undesirable consequences. 

I've attached the Stan code for this hack, as well as synthetic data (n=80, true parameters: Tcp=48, a[1] = 0, b[1] = 1, a[2] = 1, b[2] = 0.5, sig_y = 0.5) and the R code for generating random data.

The manual and other discussions recommend “summing out” latent variables for mixing problems, but this isn’t really a mixing problem because the grouping is a deterministic function of Tcp, and anyway Tcp is the sought parameter. For this phase of my analysis I would be happy to estimate Tcp alone, and perhaps if I were more clever it would be obvious how to sum out the other parameters and end up with a continuous representation of the change point. But so far I’m not that clever. 

Many Thanks,

-Thomas


Note:

My actual model is considerably more complicated, accounting for multiple error sources following the hierarchical approach described in 

Kelly, B. C. (2007), Some Aspects of Measurement Error in Linear Regression of Astronomical Data, Astrophys. J., 665(2), 1489, doi:10.1086/519947.

and extended to a constrained segmented linear model that is applied to hydrologic data with threshold behavior. The approach works well for lumped data (no temporal change points), but the small successes I’ve had with estimating Tcp for the simplified example above entirely fail for the more complex model. 


R version 3.0.2 (2013-09-25)

Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:

[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:

[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:

[1] rstan_2.2.0   inline_0.3.13 Rcpp_0.11.1  

loaded via a namespace (and not attached):

[1] codetools_0.2-8    KernSmooth_2.23-12 stats4_3.0.2       tools_3.0.2 

linfit_data.csv
linfit.stan
linfit_data_1cp.R

Luc Coffeng

Jun 19, 2014, 11:23:55 AM
to
Although your hack may seem to work (sometimes), the posterior you're simulating is not smooth anymore because you are treating Tcp with a step-function. This causes a discontinuity in the posterior wherever the sampler crosses a value of Tcp that is also present in the dataset.

Solution: summing out things after all. What you need to sum out are all possible relevant values of Tcp. I.e., if in your dataset you have observations at times 1, 2, 3, 4, 5, and 6, you need to marginalize over the intervals 1 < Tcp <= 2, 2 < Tcp <= 3, 3 < Tcp <= 4, 4 < Tcp <= 5, and 5 < Tcp <= 6. In other words, you won't ever be able to estimate the exact value of Tcp, as it is interval censored. You can however estimate in which interval Tcp is most likely to be, given the intervals at which you observe the data.

Luc

Bob Carpenter

Jun 19, 2014, 2:51:41 PM
to stan-...@googlegroups.com
What Luc says! And the reason it doesn't work well is that
HMC doesn't deal with discontinuity --- it's following
gradients and resorts to a poor random walk (based solely
on the random momentum) if the function's not differentiable.

The marginalization is possible, but may not be very efficient if
the number of time steps is large. You'll need to create a vector
of the log probs from each choice and use log_sum_exp as in the
mixture chapter of the manual. On the other hand, other
methods are also not likely to be very efficient at this, either.
For instance, you'd need the same sort of computation to do Gibbs.
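For the toy example above, the marginalization might look roughly like this (an untested sketch, assuming one observation per time step so the change point is just an index into 1:(N-1), with a flat prior over the candidate positions and the same a[1:2], b[1:2], sigma parameterization as in the original post):

data {
  int<lower=2> N;                 // number of time steps, one observation per step
  vector[N] x;                    // predictor
  vector[N] y;                    // response
}
parameters {
  real a[2];                      // intercepts before/after the change point
  real b[2];                      // slopes before/after the change point
  real<lower=0> sigma;            // residual sd
}
model {
  vector[N-1] lp;                 // lp[cp]: log prob of the data if the change falls after step cp
  for (cp in 1:(N-1)) {
    lp[cp] <- 0;                  // flat prior over candidate change points (constant dropped)
    for (n in 1:N) {
      if (n <= cp)
        lp[cp] <- lp[cp] + normal_log(y[n], a[1] + b[1] * x[n], sigma);
      else
        lp[cp] <- lp[cp] + normal_log(y[n], a[2] + b[2] * x[n], sigma);
    }
  }
  increment_log_prob(log_sum_exp(lp));
}

The posterior probability of each candidate change point can then be recovered, e.g., by building the same vector of log probabilities in the transformed parameters block and applying softmax to it in generated quantities.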

- Bob




Thomas Moran

Jun 20, 2014, 12:30:03 AM
to stan-...@googlegroups.com
Thanks Luc & Bob, I'll mull this over for a bit and try an implementation. I don't have many time steps to deal with, so it should be manageable. 
Cheers,
-Thomas

Emmanuel Charpentier

Jun 20, 2014, 3:59:54 AM
to stan-...@googlegroups.com
A few remarks below:


On Thursday, June 19, 2014 at 04:50:04 UTC+2, Thomas Moran wrote:

Could hardly be simpler. In the system under study, there is reason to believe that the system undergoes a step change in functional response, and thus regression parameters, at some change point time Tcp, such that:

y ~ N(a[1] + b[1]*x, sig_y)  for t <= Tcp

y ~ N(a[2] + b[2]*x, sig_y) for t >   Tcp

My objective is to infer Tcp, a[1:2], and b[1:2]. 

Ahem! There is no a[2] to estimate: at x = Tcp, y = a[1] + b[1]*Tcp = a[2] + b[2]*Tcp ==> a[2] = a[1] + Tcp*(b[1] - b[2]).

Introducing an unnecessary parameter may lead to serious convergence issues...
 

As posed, Tcp is discrete and thus can’t be directly represented in Stan. I implemented a hack solution in Stan that just pretends Tcp is continuous, which essentially treats any value of Tcp between time steps as equally likely. This actually works sometimes (meaning it converges and correctly estimates synthetic data parameters) but other times it fails badly (doesn’t converge, incorrect inference). I imagine this abuse of parameters has any number of undesirable consequences. 

See above.

I've attached the Stan code for this hack, as well as synthetic data (n=80, true parameters: Tcp=48, a[1] = 0, b[1] = 1, a[2] = 1, b[2] = 0.5, sig_y = 0.5) and the R code for generating random data.

The manual and other discussions recommend “summing out” latent variables for mixing problems, but this isn’t really a mixing problem because the grouping is a deterministic function of Tcp, and anyway Tcp is the sought parameter. For this phase of my analysis I would be happy to estimate Tcp alone, and perhaps if I were more clever it would be obvious how to sum out the other parameters and end up with a continuous representation of the change point. But so far I’m not that clever. 


What would be wrong with a continuous time and modelling by y ~ a + log(e^(S*k1*(x-Tcp)) / (1 + e^(S*k2*(x-Tcp))))/S + epsilon ? This is smooth, has slope k1 for x << Tcp, slope k1-k2 for x >> Tcp, and the junction of the two segments becomes sharper with increasing S.

HTH,

--
Emmanuel Charpentier

Thomas Moran

Jun 20, 2014, 4:49:42 AM
to stan-...@googlegroups.com
Hi Emmanuel, thanks for your remarks. I was perhaps not explicit enough that my change point Tcp is for the latent variable t (time), not x. That is, the functional response y = f(x) changes at some time t = Tcp.

However, the more extensive model that I'm developing does involve change points for x as well, and I'll think about whether the continuous form you suggested could be used there. 




Emmanuel Charpentier

Jun 20, 2014, 5:12:35 AM
to stan-...@googlegroups.com
'Morning, Thomas !


On Friday, June 20, 2014 at 10:49:42 UTC+2, Thomas Moran wrote:
Hi Emmanuel, thanks for your remarks. I was perhaps not explicit enough that my change point Tcp is for the latent variable t (time), not x. That is, the functional response y = f(x) changes at some time t = Tcp.

Ah! Okay. But you still don't have to estimate a[2] if your latent "time" has no discontinuities.

Thomas Moran

Jun 23, 2014, 7:22:35 PM
to stan-...@googlegroups.com
Hi All,

Following Emmanuel’s lead, I found it simpler to use a smooth
analytical approximation of a step function to model the latent change
point. For the simple case of a linear regression y(x,t) with
intercept at 0 and change in slope only, the model can be stated as:

y = (b + db * H(t) )*x + noise

where b is slope, db is the change in slope, and

H(t) = 1/( 1 + exp( -(t - Tcp)*A) )

is a logistic-function approximation to the Heaviside step function that is ~0 for t < Tcp and ~1 for t > Tcp, with transition shape parameter A.

The implementation in Stan:

data {
  int<lower=0> N;             // data count
  vector[N] x;                // predictor (covariate)
  vector[N] y;                // outcome (variate)
  vector[N] t;                // time
}
parameters {
  real b;                     // slope
  real db;                    // slope change
  real<lower=0> sigma;        // noise
  real<lower=1,upper=N> Tcp;  // change point time
}
model {
  vector[N] mu_y;             // modeled y
  real H;                     // smooth logistic approximation to Heaviside function
  for (n in 1:N) {
    H <- 1/(1+exp(-(t[n]-Tcp)*10));  // sampling sensitive to exponent coefficient
    mu_y[n] <- (b + db*H)*x[n];      // modeled y
  }
  y ~ normal(mu_y,sigma);     // observed y with noise
}

An example data file is attached, and Stan recovers the true values b
= 1, db = -0.5, sigma = 0.5, Tcp = 48, with seemingly good
convergence.

So, am I missing any gotchas or pitfalls with this approach?


Incidentally, I made a couple attempts at marginalizing the latent
variable t, but besides the inefficiency that Bob mentioned I also
found it to be conceptually more difficult than the analytic approach
described above. I wonder if I could convince someone (perhaps Luc?) to
apply this type of marginalization of a latent discrete change point
to the Stan manual mixture model as an example case?

Cheers,
-Thomas

data.csv

Luc Coffeng

Jun 24, 2014, 12:14:05 AM
to
Hi Thomas,

I like the idea, and think it could work, provided that the shape parameter for the logistic transition curve (A) is high enough to get a "close to clean" distinction in terms of time (relative to the time interval between observations). Still, I think it should be possible to marginalize things out, which I would also prefer as being "cleaner". I'd love to help write up a model for this; however, I'm in the middle of trying to finish a dozen other things. You've probably seen it already, but just so you know: page 123 of the reference manual for Stan 2.2.0 provides the backbone for mixture models.

Oh, and I'm not on the Stan team, btw :). Just another Stan enthusiast who has benefited enormously from this group so far and is trying to spread the word :).

Luc

Luc Coffeng

Jun 24, 2014, 12:28:17 AM
to
On second thought, I suspect that the smooth ridge that you have introduced in the posterior with your logistic function also hampers sampler efficiency, as it will strongly change the curvature of the log posterior in a very localized area. I'd be interested to see the final estimates based on multiple chains and different starting values of Tcp. Hopefully, all chains converge to the same estimate. However, there might be a chance that your Hamiltonian particle (if you will), does not always have enough momentum to climb up that steep logistic slope of yours, and explores only a part of the posterior for Tcp. Input from the dev people required here :). Anyway, this wouldn't be an issue in a mixture model :).

Bob Carpenter

Jun 24, 2014, 1:59:44 PM
to stan-...@googlegroups.com
There is definitely a limit to Euclidean HMC's ability to
explore posteriors that vary dramatically in density. Michael's
written about it pretty extensively in his papers and I think
there's a discussion in Radford Neal's overview in the MCMC Handbook.

Since the potential energy is negative log density, it's actually
falling "down" toward the mode, not climbing "up" a hill.

- Bob



Luc Coffeng

Jun 24, 2014, 6:51:05 PM
to


On Tuesday, 24 June 2014 10:59:44 UTC-7, Bob Carpenter wrote:

Since the potential energy is negative log density, it's actually
falling "down" toward the mode, not climbing "up" a hill.

Yup! As I was writing that, I was thinking of the situation where the particle just explores the area directly round the mode and doesn't reach the tails of the posterior.

Thomas Moran

Jun 25, 2014, 1:38:29 AM
to stan-...@googlegroups.com
Hi Luc,

Thanks for the thoughts. I've had some (apparent) success with the
smooth ridge approach, which the astute Stan user will recognize as
the built-in inv_logit() function. That is, multiple chains with
different starting values converge to the same, accurate estimate.
However, vis-a-vis your second thought, the sampler is indeed
sensitive to the slope of the ridge; sampling time increases
drastically with transition slope. My heuristic solution to this has
been to keep the slope as shallow as possible, i.e. set the shape
parameter so that the ridge transition span is comparable to the data
interval. So far, so good, but I'll need to check out Bob's suggested
reading to convince myself I really understand what's going on.

Cheers,
-Thomas


Luc Coffeng

Jun 25, 2014, 1:49:16 AM
to stan-...@googlegroups.com
Hi Thomas,

Like Bob said (and I should have realized this as well), all chains will converge to the same mode as the ridges do not prohibit this. However, HMC will probably oversample the area directly around the mode, and undersample values further away from the mode. I.e. the uncertainty of the parameter estimates (not just that of the latent temporal point) will be underestimated. There is no guarantee that decreasing the slope of your logistic function allows appropriate sampling of all regions of the posterior. One more vote for a mixture model :).

Luc

Thomas Moran

Jun 25, 2014, 2:15:53 AM
to stan-...@googlegroups.com
Hi Luc,

Ah, thanks, I didn't catch the full implication of what Bob said.
Ideally I'll compare the two approaches and if there is agreement in
the sampling space then implement the more efficient method. However,
that will have to wait until I develop a decent marginalization
approach. I've indeed been using the mixture model section of the
manual as a guide, and that example makes perfect sense, but I'm
having conceptual difficulty extending it to my case. The problem is
almost certainly operator error: I'm a good example of the perils of
putting powerful and accessible Bayesian modeling tools in the hands
of someone with merely passable theoretical background. But man, what
a great tool it is, and hopefully it won't be long before I can give
something back to this community.

Cheers,
-Thomas

Michael Betancourt

Jun 25, 2014, 4:08:09 AM
to stan-...@googlegroups.com
Multimodality is rarely the answer and almost always indicates a poorly designed model.  In particular, a multimodal model makes sense only if you want to average over all the modes — if you want to condition on a single mode and look at averages in that neighborhood then your model is inconsistent.

If we had an exact integrator then HMC would sample the mode and the tails exactly as necessary (technical note — perhaps not with geometric ergodicity but at the very least with uniform ergodicity).  In practice where we use a numerical integrator the finite step size may cause problems but these are readily diagnosed.  See, for example, http://arxiv.org/abs/1312.0906.  This is exactly what we would see if the ridge were to become very tight but by decreasing the step size appropriately the sampler would be able to explore everything (modulo a few technical details).

In any case, just make the slope a hyperparameter and give it a Gaussian hyperprior around where you are currently setting it by hand.  If the posterior for the slope is in tension with the prior then you know that there is trouble with the smooth turn on and can reevaluate your model.
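For the simple model above, that might look something like the sketch below (untested; the prior location matches the value of 10 that was hand-tuned earlier, and the scale of 5 is just a placeholder to be adjusted):

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
  vector[N] t;
}
parameters {
  real b;                         // slope
  real db;                        // slope change
  real<lower=0> sigma;            // noise
  real<lower=1,upper=N> Tcp;      // change point time
  real<lower=0> A;                // transition sharpness, now a parameter
}
model {
  vector[N] mu_y;
  A ~ normal(10, 5);              // hyperprior centered on the hand-tuned value; scale is a guess
  for (n in 1:N)
    mu_y[n] <- (b + db * inv_logit((t[n] - Tcp) * A)) * x[n];
  y ~ normal(mu_y, sigma);
}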


Bob Carpenter

Jun 25, 2014, 5:21:45 AM
to stan-...@googlegroups.com
Mixture models have their own multimodality, where
the different modes are the same thing with labels
swapped. You get the same kind of behavior in IRT
models with sign switching of alpha, beta, and delta


y ~ bernoulli_logit(delta * (alpha - beta));

One approach is to sample around a single mode --- it's easy
enough to do in BUGS or Stan. The mass doesn't normalize,
but that's not an issue. What does seem like an issue is that
the shape of the density is wrong --- it'll have a trough
and then start heading up toward the other mode, whereas a
proper model's probability should continue to decline.
Now whether there's enough mass in that region to make a
difference is another matter --- if there were, the sampler
will bounce around modes and you have bigger problems.

So my question is how do we deal with this? In IRT models,
we usually just constrain delta > 0, which gets rid of
the multimodality. Is that always the right thing to do?
What's the equivalent for a mixture model?

- Bob

Michael Betancourt

Jun 25, 2014, 5:32:32 AM
to stan-...@googlegroups.com
So my question is how do we deal with this?  In IRT models,
we usually just constrain delta > 0, which gets rid of
the multimodality.  Is that always the right thing to do?
What's the equivalent for a mixture model?

That depends on the model.

The point is that the permutation invariance you see in mixture models has no meaning -- it's
just a consequence of having too many degrees of freedom.  Consequently the "right" answer
would be to find the correct minimal degrees of freedom but that doesn't exactly give a general
algorithm so it falls more into the domain of the folk theorem.  

Multimodal models are hard to fit, but that's usually because you're not fitting the model you're
actually after.

Allen Riddell

Jun 25, 2014, 7:16:03 AM
to stan-...@googlegroups.com

In simple mixtures you can demand that the mixture proportions be strictly
increasing (or decreasing). I recall this working well in practice.
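For two components, one way to impose that is directly through the declared support of the mixing proportion; a minimal sketch (a generic two-component normal mixture, not Thomas' change-point model):

data {
  int<lower=1> N;
  vector[N] y;
}
parameters {
  real<lower=0,upper=0.5> theta;  // proportion of component 1, constrained to be <= that of component 2
  real mu[2];                     // component means (left unordered; the constraint on theta does the identification)
  real<lower=0> sigma;            // shared scale
}
model {
  for (n in 1:N)
    increment_log_prob(log_sum_exp(
      log(theta) + normal_log(y[n], mu[1], sigma),
      log1m(theta) + normal_log(y[n], mu[2], sigma)));
}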

Best,

-a

Bob Carpenter

Jun 25, 2014, 11:44:46 AM
to stan-...@googlegroups.com
There's a discussion in the problematic posteriors chapter of
the manual about how this can bias inference unless the mixture
components are well separated. Andrew originally drew the example
out and we should really have that picture.

For instance, it fixes the order statistics by prohibiting draws where
index 1 is on both sides of index 2, but it would make sense in the
posterior if they were not well separated.

- Bob

Thomas Moran

Jun 25, 2014, 2:30:08 PM
to stan-...@googlegroups.com
I'm glad to have stimulated a discussion more interesting than my
original question.

My naive expectation is that the posterior of the ridge transition
shape hyperparameter should tell me something about my hypothesis that
the transition between states is abrupt, but it's clear I also need to
consider how the sampler and model are interacting. Off to study
Michael's paper...

Luc Coffeng

Jun 26, 2014, 1:26:25 AM
to
Hi Thomas,

Because I find your modeling problem so interesting, I took a go at writing up some code for a mixture model. See attachment.

An assumption we did not discuss so far is the minimum number of data points believed to be in either subset of the data (or whereabouts the change point might be). As we are doing linear regression, I thought a practical starting point was to assume that there are at least three data points in both subsets of the data.

The code requires that the data are sorted in ascending or descending order of x and assumes that x is measured without error. In your original post, you mentioned that x is measured with error. I tried to incorporate this in my original setup, but quickly concluded that it was impossible (or beyond my understanding of math) as I ended up with an infinite number of possible change points and therefore infinitely many mixture components (the true values of the x's can take on infinitely many values). To address this, I came up with another wild idea, though I'm not sure it would work. Maybe some of the more math-savvy folk may want to chime in:

Assume that each data point (x,y) is sampled from either of two bivariate normal distributions MVN1 and MVN2, each with the same scale of measurement error (synonymous to the original model), but with different locations and correlations (synonymous to the intercepts and slopes in the original model). The bivariate normal distributions are just another way of parameterizing a linear regression in which you take account of the measurement error in the independent variable (here: x). In the generated quantities block, regression lines can be cooked up through the standard formula for the conditional expectation E[Y | X = x] = mu_y + rho * (sigma_y / sigma_x) * (x - mu_x)

N.B.: label swapping alert!! To identify the two mixture components, use an ordered vector for the location parameters on the x-axis or y-axis.

We'd like the two bivariate normal distributions to best explain the variation in the data (easy, that's just prior * likelihood for each mixture component), while also making the probability that a data point actually comes from a mixture component inversely proportional to the density of that data point conditional on the other mixture component: P(x,y comes from MVN1) propto 1 / P(x,y | MVN2). This way, points far away from the location of MVN2 are said to probably come from MVN1 (and if this is unlikely, MVN2 should better accommodate that data point in terms of mu and sigma). And vice versa, of course. If a data point is right smack in the middle of the two bivariate distributions, it will contribute equally to both mixture components. Sounds a bit like a lasso, now that I think of it. There is no opportunity for either mixture component to shrink away such that it only explains one data point, as the penalty for all the other data points would be huge and outweigh the likelihood.

This would lead to the following code:

data {
  int<lower=1> N;  // number of data points
  vector[2] Y[N];    // N pairs of data points (x,y)
}

parameters {
  ordered[2] mu_x;  // ordered location of two mixture component along x-axis
  vector[2] mu_y;  // unordered location of two mixture component along y-axis
  vector<lower=0>[2] sigma;  // measurement error in x and y (same for both mixture components)
  corr_matrix[2] C1;  // correlation matrix for first mixture component
  corr_matrix[2] C2;  // correlation matrix for second mixture component
}

transformed parameters {
  vector[2] Mu1;  // location of first mixture component
  vector[2] Mu2;  // location of second mixture component
  matrix[2,2] Sigma1; // covariance matrices for first mixture component
  matrix[2,2] Sigma2; // covariance matrices for first mixture component
 
  Mu1[1] <- mu_x[1];
  Mu1[2] <- mu_y[1];
  Mu2[1] <- mu_x[2];
  Mu2[2] <- mu_y[2];

  Sigma1 <- diag_pre_multiply(sigma,C1);
  Sigma1 <- diag_post_multiply(Sigma1,sigma);
  Sigma2 <- diag_pre_multiply(sigma,C2);
  Sigma2 <- diag_post_multiply(Sigma2,sigma);
}

model {
  real ps[2];

  for (n in 1:N) {
    ps[1] <- multi_normal_log(Y[n], Mu1, Sigma1) - multi_normal_log(Y[n], Mu2, Sigma2);
    ps[2] <- multi_normal_log(Y[n], Mu2, Sigma2) - multi_normal_log(Y[n], Mu1, Sigma1);
    increment_log_prob(log_sum_exp(ps));
  }
}

generated quantities {
  real B0[2];  // intercepts for the two regression lines describing Y|X in the two mixture components
  real B1[2];  // slopes for the two regression lines describing Y|X in the two mixture components
 
  B1[1] <- Sigma1[1,2] / Sigma1[1,1];  // slope of y on x: cov(x,y) / var(x)
  B1[2] <- Sigma2[1,2] / Sigma2[1,1];
  B0[1] <- Mu1[2] - B1[1] * Mu1[1];  // intercept: E[y] - slope * E[x]
  B0[2] <- Mu2[2] - B1[2] * Mu2[1];

}

Fat chance I botched things at some point. Would be interested to hear from others whether this makes any sense :). I'm guessing the model still needs a tuning parameter for the lasso, i.e. a scalar to multiply the log-penalty with (the exponent of the penalty weight). This tuning parameter will determine the scale of the two MVN distributions. Maybe there is even some way to have the tuning parameter depend on the estimated scales of MVN1 and MVN2, and get a reasonable estimate of the scales, but that is really beyond me at this point.

And Thomas, let me know whether the attached code works for you!

Luc
linfit_mix.stan

Luc Coffeng

Jun 26, 2014, 3:18:13 AM
to
I just realized that with that last piece of code (the bivariate normal distributions), I was trying to apply a clustering method without realizing it: generalized soft K-means! This method is also described in the Stan reference manual, p. 130 (v.2.2.0). Adapting the code from the manual to Thomas' problem leads to the code below. I think the major difference is that there is no lassoing in soft K-means as a priori data points are assumed to come from K different mixture components with equal probability. I would be happy to hear comments on the appropriateness of lassoing.


data {
  int<lower=1> N; // number of data points
  int<lower=2> D; // number of dimensions
  int<lower=2> K; // number of clusters
  vector[D] Y[N];  // N sets of D-dimensional data points
}

transformed data {
  real<upper=0> neg_log_K;
  neg_log_K <- -log(K);
}

parameters {
  ordered[K] mu_y;  // ordered location of mixture component along the last dimension
  real mu_x[K,D-1];  // unordered location of mixture components along remaining dimensions
  vector<lower=0>[D] sigma;  // measurement error
  corr_matrix[D] C[K];  // correlation matrices for mixture components
}

transformed parameters {
  vector[D] Mu[K];  // location of mixture components
  matrix[D,D] Sigma[K]; // covariance matrices for mixture component
 
  for (k in 1:K) {
    for (d in 1:(D-1)) {
      Mu[k,d] <- mu_x[k,d];
    }
    Mu[k,D] <- mu_y[k];
    Sigma[k] <- diag_pre_multiply(sigma,C[k]);
    Sigma[k] <- diag_post_multiply(Sigma[k],sigma);
  }
 
}

model {
  vector[K] soft_z;


  for (n in 1:N) {
    for (k in 1:K) {
      soft_z[k] <- neg_log_K + multi_normal_log(Y[n], Mu[k], Sigma[k]);
    }
    increment_log_prob(log_sum_exp(soft_z));
  }
}

generated quantities {
  // The following quantities are extracted from Mu and Sigma so that we can compute the distribution of the last
  // dimension conditional on all the other dimensions, expressed in the form of a linear regression function with
  // a conditional error term. We use the standard formulae for the conditional distribution of one block of a
  // multivariate normal given the other block. This is done separately for each cluster, giving a cluster-specific
  // regression line.

  matrix[D-1,D-1] Sigma22[K];  // marginal covariance matrix of the first D-1 dimensions, which are intended as predictors
  real Sigma11[K];  // marginal variance of the last dimension, which is intended to be predicted
  vector[D-1] Sigma21[K];  // marginal covariance between the first D-1 dimensions and the last dimension
  vector[D-1] mu2[K];  // the means of the clusters on all but the last dimension
  real mu1[K];  // the means of the clusters on the last dimension
  real sigma_cond[K];  // the conditional standard deviation of the last dimension, given all other dimensions

  real B0[K];  // cluster-specific intercepts for the regression line of the last dimension on all other dimensions
  vector[D-1] B1[K];  // cluster-specific slopes for the regression line of the last dimension on all other dimensions

  for (k in 1:K) {
    mu1[k] <- Mu[k,D];

    for (d in 1:(D-1)) {
      mu2[k,d] <- Mu[k,d];
      Sigma21[k,d] <- Sigma[k,d,D];  // covariance between dimension d and the last dimension

      for (dd in 1:(D-1)) {
        Sigma22[k,d,dd] <- Sigma[k,d,dd];
      }

    }
    Sigma11[k] <- Sigma[k,D,D];

    B1[k] <- mdivide_left(Sigma22[k], Sigma21[k]);  // slopes: Sigma22^-1 * Sigma21
    B0[k] <- mu1[k] - dot_product(B1[k], mu2[k]);  // intercept: E[last dim] - slopes * E[other dims]
    sigma_cond[k] <- sqrt(Sigma11[k] - dot_product(Sigma21[k], B1[k]));  // conditional sd
  }
}

If anything crashes, I expect it to be the generated quantities. Apologies in advance :)!.

Bob Carpenter

Jun 26, 2014, 6:10:47 AM
to stan-...@googlegroups.com
You can assume the mixture probability is not
uniform and fit it.

I'm not sure what you mean by lassoing. Do you
want to eliminate dimensions from the clustering?
You can put hierarchical priors on the variances.
We can't technically do the L1 lasso because it's
not a Bayesian method. Our optimization won't deal
with it either because it's gradient based and we
don't do anything special around 0 like truncation.

- Bob

Luc Coffeng

Jun 26, 2014, 10:43:45 AM
to
Sorry for the vague use of "lassoing". By lassoing I meant that the model tries to find two non-trivial mixture components that are as far away from each other as possible, but that also still explain the data, such that for each data point either of the following is as large as possible:
  • multi_normal_log(Y[n], Mu1, Sigma1) - multi_normal_log(Y[n], Mu2, Sigma2);  // Log-probability of occurring under MVN1, penalized propto 1 / P(occurring under MVN2)^lambda
  • multi_normal_log(Y[n], Mu2, Sigma2) - multi_normal_log(Y[n], Mu1, Sigma1);  // Log-probability of occurring under MVN2, penalized propto 1 / P(occurring under MVN1)^lambda

So maybe lasso is not really the best word here. Inverse lasso? In my previous code, the tuning parameter lambda is not specified and is implicitly assumed to be one (this goes beyond your question).

Apologies for the wild thoughts.

Luc

Bob Carpenter

Jun 26, 2014, 11:48:49 AM
to stan-...@googlegroups.com
I would think you'd want to let the data induce the
correct amount of separation in the means.

For K-means, there's classically an issue with initialization.
Strategies like K-means++ attempt to get good coverage in
initialization by maximizing the distance between the
initialized means. But it's a chicken-and-egg problem, because
if you knew where good centers were to start, you wouldn't need
K-means in the first place. K-means++ is neat in that it tries
to balance coverage with distance apart, but it's still pretty
crude. More principled ways to do this are well beyond what
I know how to do.

- Bob

Thomas Moran

Jun 26, 2014, 2:43:19 PM
to stan-...@googlegroups.com
Hi Luc,

Thanks for the efforts, I'll work through the code to make sure I
understand everything then get back with some observations.

Cheers,
-Thomas
