Successful unsupervised hidden Markov model


jonny230774

Jun 21, 2015, 8:06:43 PM
to stan-...@googlegroups.com
Hello All,

I have recently implemented an unsupervised HMM in Stan very successfully.  In the manual a fully unsupervised HMM is not discussed except to say that the posterior is typically very multi-modal.  Having read this I am surprised I got it to work so well.  I am writing this post for three reasons:

1) If anyone is interested in the model, an example is attached.
2) To get feedback on whether other, more experienced Stan users and better statisticians than myself are surprised I got it to work - is there something about my case that makes it work?
3) To get feedback on the model in general - is there anything I could do to make it more efficient?

I ran my model using rstan in R.  I attach the R script.  The script is completely self-contained.  The model is running on simulated data.

My data are a large number (1000) of short time series (length = 10).  I have two latent groups.  The data emitted at each time step are normal with different parameters for the two groups.  I have specified time-invariant transition probabilities.  I am interested in estimates of the model parameters (likelihood parameters, transition probabilities, and the prior for the first time step) and in group membership at each time step of each time series.

My strategy is to embed the forward-backward algorithm in the transformed parameters block.  From this block I keep the estimates of group membership (because I am interested in them) and the one-step prediction probabilities from the forward algorithm (which I need to increment the log probability).

Hopefully this is clear.  I suspect my coding of the forward-backward algorithm might be a bit non-standard.  It could certainly be made more general (e.g. I have hard-coded two groups, which is all I will ever need for this project).
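In case the recursion is easier to follow outside Stan, here is a minimal sketch of the forward pass in Python (illustrative only: the function and variable names are mine, not the ones in the attached script, and it handles a single series with the two hard-coded groups described above):

```python
import math

def log_sum_exp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def normal_lpdf(y, mu, sigma):
    """Log density of y under Normal(mu, sigma)."""
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)

def forward_loglik(y, p0, TP, mu, sigma):
    """Log-likelihood of one series with the hidden states marginalized out.
    y: observations; p0: P(group 0 at t = 1); TP: 2x2 row-stochastic
    transition matrix; mu, sigma: per-group emission parameters."""
    # alpha[k] = log p(y[1..t], state_t = k), carried forward over t
    alpha = [math.log(p0) + normal_lpdf(y[0], mu[0], sigma[0]),
             math.log(1.0 - p0) + normal_lpdf(y[0], mu[1], sigma[1])]
    for t in range(1, len(y)):
        alpha = [log_sum_exp(alpha[0] + math.log(TP[0][k]),
                             alpha[1] + math.log(TP[1][k]))
                 + normal_lpdf(y[t], mu[k], sigma[k])
                 for k in range(2)]
    # summing out the final state gives the marginal likelihood of the series
    return log_sum_exp(alpha[0], alpha[1])
```

Summing this quantity over the 1000 independent series gives the term the model block uses to increment the log probability.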

Thanks,

Jon.

  
HMMStanV2.R

Daniel Lee

Jul 20, 2015, 4:45:50 PM
to stan-...@googlegroups.com, jonny_...@hotmail.com
Thanks! Mind if we put it in the example models repository?

There are a couple of weird things going on with an element of prob_grp being NaN, but otherwise, nicely done!


Daniel

P.S. I like your warning.

jonny230774

Jul 20, 2015, 5:47:27 PM
to stan-...@googlegroups.com, jonny_...@hotmail.com
Hi Daniel,

Very happy for this to be shared however you see fit.

Cheers,

Jon.

Jeffrey Arnold

Jul 20, 2015, 11:27:18 PM
to stan-...@googlegroups.com
Nice. I've found that forward-backward-type algorithms marginalizing over latent states work pretty well with Stan, although I've found it challenging to extend them to more than two discrete states. You can also use the backward sampling algorithm in the generated quantities block to sample the discrete states. Together, I am almost certain that these constitute a partially collapsed Gibbs sampler. Here's an example of a general HMM model with Poisson observations (https://github.com/jrnold/ssmodels-in-stan/blob/master/stan/foo.R) applied to the earthquakes model in Zucchini and MacDonald, Hidden Markov Models for Time Series (https://github.com/jrnold/ssmodels-in-stan/blob/master/stan/foo.R).




Bob Carpenter

Jul 28, 2015, 3:40:04 PM
to stan-...@googlegroups.com
Would you mind including that on our example-models repo?

There's a "misc" directory now for things that aren't parts
of books, or you could start something else. I think you may
have mentioned you had some Kalman filter examples up your sleeve,
too?

- Bob

Jeffrey Arnold

Jul 28, 2015, 9:51:28 PM
to Stan users mailing list
Sorry, I've been meaning to write those up forever and haven't had time to produce clear examples.  Everything is in the repo: https://github.com/jrnold/ssmodels-in-stan/. It includes forward-backward for HMMs (including a version specifically for change points), and Kalman filters and simulation smoothers calculated the usual way and using sequential processing, each with and without missing observations.  I wrote them before user-defined functions existed, so I may be able to simplify the code. I'll try to get them cleaned up and included in the examples, but it may not be in the immediate future.

Weijia You

Dec 18, 2016, 2:52:50 PM
to Stan users mailing list
Dear Jon,

Thank you very much for your code; I'm working on unsupervised HMM estimation myself now. But I have two questions:

1) It seems that only the forward algorithm is applied. In the model block, only the variable "pred" is used and "prob_grp" never appears - am I right?

2) I tried to run the code, but the estimates seem to be rather far from the true values.
The true values are TP <- matrix(c(0.8, 0.2, 0.3, 0.7), nrow = 2, byrow = TRUE) and mu = c(2, 5), but the estimates I got are:
         mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff  Rhat
p0       0.50    0.01 0.02 0.46 0.49 0.50 0.51  0.54    10  1.10
TP[1]    0.75    0.03 0.04 0.69 0.71 0.75 0.79  0.80     2  6.07
TP[2]    0.75    0.03 0.04 0.69 0.71 0.75 0.79  0.80     2  6.12
mu[1]    3.47    1.07 1.51 1.94 1.97 3.48 4.98  5.02     2 82.61
mu[2]    3.47    1.06 1.51 1.94 1.97 3.46 4.98  5.01     2 83.98
sigma[1] 1.00    0.02 0.03 0.96 0.98 1.00 1.02  1.04     3  1.90
sigma[2] 1.00    0.02 0.03 0.96 0.98 1.00 1.02  1.04     3  1.95

Also, I noticed that Rhat is quite far from the ideal value of around 1.

Any hints or suggestions will be much appreciated!

Weijia

Bob Carpenter

Dec 18, 2016, 6:50:38 PM
to stan-...@googlegroups.com

> On Dec 18, 2016, at 2:52 PM, Weijia You <weij...@gmail.com> wrote:
>
> Dear Jon,
>
> Thank you very much for your code for I'm working on an unsupervised HMM model estimation now. But I have two questions here:
>
> 1) it seems that only Forwards algorithm is applied. In the model part, only the variable "pred" is called and "prob_grp" is never used, am I right?

That is right. A Stan program codes up the log posterior
up to an additive constant --- that's usually just the log
joint coded as the log prior plus the log likelihood.
The forward algorithm is an O(N * K^2) algorithm to compute
the log likelihood (N is input length and K is number of
states). If you look at the meaning of the nodes in
the forward algorithm, you'll see that for the final row if you
do log_sum_exp, you get the log density for the data. That is,
it marginalizes out all the hidden state choices leaving you
with the density of the input given the parameters.

The backward algorithm also lets you compute all the per-input
marginals and all the transition marginals, which can really
accelerate training, and in many cases makes it trivial due to
conjugacy. Then what you get looks like the second step of the
EM algorithm.
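As a sketch of what the backward pass buys you (illustrative Python, not the Stan code from this thread): combining the forward quantities with the backward quantities gives the smoothed per-time-step marginals.

```python
import math

def forward_backward(y, p0, TP, mu, sigma):
    """Smoothed marginals gamma[t][k] = P(state_t = k | y[1..T]) for a
    two-state Gaussian HMM, with per-step rescaling to avoid underflow."""
    T = len(y)

    def emit(t, k):
        z = (y[t] - mu[k]) / sigma[k]
        return math.exp(-0.5 * z * z) / (sigma[k] * math.sqrt(2.0 * math.pi))

    # forward (filtering) pass
    alpha = []
    for t in range(T):
        if t == 0:
            a = [p0 * emit(0, 0), (1.0 - p0) * emit(0, 1)]
        else:
            a = [sum(alpha[t - 1][j] * TP[j][k] for j in range(2)) * emit(t, k)
                 for k in range(2)]
        s = sum(a)
        alpha.append([x / s for x in a])

    # backward pass
    beta = [[1.0, 1.0] for _ in range(T)]
    for t in range(T - 2, -1, -1):
        b = [sum(TP[k][j] * emit(t + 1, j) * beta[t + 1][j] for j in range(2))
             for k in range(2)]
        s = sum(b)
        beta[t] = [x / s for x in b]

    # combine: the per-time-step rescalings cancel in the normalization
    gamma = []
    for t in range(T):
        g = [alpha[t][k] * beta[t][k] for k in range(2)]
        s = sum(g)
        gamma.append([x / s for x in g])
    return gamma
```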

I'm going to write all this up in a case study when I get
some time unless someone else beats me to it.

> 2) I tried to run the code but the estimation seems to be a bit far away from the true value.
> TP<-matrix(c(0.8,0.2,0.3,0.7),nrow=2,byrow=TRUE);mu = (2,5), but the estimation I got is:
>          mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff  Rhat
> p0       0.50    0.01 0.02 0.46 0.49 0.50 0.51  0.54    10  1.10
> TP[1]    0.75    0.03 0.04 0.69 0.71 0.75 0.79  0.80     2  6.07
> TP[2]    0.75    0.03 0.04 0.69 0.71 0.75 0.79  0.80     2  6.12
> mu[1]    3.47    1.07 1.51 1.94 1.97 3.48 4.98  5.02     2 82.61
> mu[2]    3.47    1.06 1.51 1.94 1.97 3.46 4.98  5.01     2 83.98
> sigma[1] 1.00    0.02 0.03 0.96 0.98 1.00 1.02  1.04     3  1.90
> sigma[2] 1.00    0.02 0.03 0.96 0.98 1.00 1.02  1.04     3  1.95
>
> Also I noticed that Rhat is quite far from the ideal value around 1.

This looks like a label-switching problem. Look at the traceplots
for mu[1] and mu[2] and see what they look like --- they're probably
mirror images. You can try running just one chain, or including
something in the model that identifies the parameters, like enforcing
mu[1] < mu[2] by declaring mu as an ordered vector. Sometimes you can
solve the problem with inits near the posterior mode you want, though
in this case (label switching) the posterior modes are truly symmetric.
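To make the symmetry concrete: if one chain sits at mu = (2, 5) and another at the mirror-image mode mu = (5, 2), the pooled posterior mean for each component lands near 3.5, which is exactly the pattern in the table above. Here is a toy sketch (illustrative Python, not from the attached script) of relabeling draws so the components are ordered, the post-hoc analogue of the ordered-vector declaration inside Stan:

```python
def relabel_by_mu(draws):
    """Relabel each posterior draw so that mu[0] <= mu[1]. Each draw is a
    pair (mu, sigma) of 2-tuples; any other per-group parameters would be
    permuted the same way so each draw stays internally consistent."""
    out = []
    for mu, sigma in draws:
        if mu[0] <= mu[1]:
            out.append((mu, sigma))
        else:
            # swap the component labels within this draw
            out.append(((mu[1], mu[0]), (sigma[1], sigma[0])))
    return out
```

Relabeling after the fact only works cleanly when the modes are well separated; declaring the constraint in the model is usually the cleaner fix.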

- Bob

jonny230774

Dec 18, 2016, 10:03:20 PM
to Stan users mailing list
Hi Weijia,

Very happy to hear that someone is making use of my example! 

1) "pred" is our prior, which we need in order to compute the posterior density.  "prob_grp" is calculated because it is a parameter of interest.  "pred" is the probability of the state at time t given the data up to time t-1 (i.e. a one-step prediction).  "prob_grp" is the probability of the state given the data from all time points.

2) Label switching is almost certainly your issue.  Try to examine a summary of the samples generated by each chain separately and the issue should become clear.

I hope that this is useful,

Jon.

Weijia You

Dec 19, 2016, 4:18:44 PM
to Stan users mailing list
Dear Bob and Jon,

I do appreciate your kind replies and they are of great help! 

I got your points on label switching, and I do see essentially perfect estimates once I run a single chain.

But on the first question I'm still a bit puzzled. I find that the variable "prob_grp", defined in the transformed parameters block, is never used in the model block. I think this means that only the forward algorithm is used in the estimation.  To test my guess, I revised the code (in the attachment) by deleting the "//backwards algorithm" and "// put it all together" sections and reran it. The estimates are still as good as before:

> fit2<-stan(model_code=stan.code,data=stan.data,iter=2000,chains=1)
DIAGNOSTIC(S) FROM PARSER:
Warning (non-fatal): assignment operator <- deprecated in the Stan language; use = instead.
Warning (non-fatal): increment_log_prob(...); is deprecated and will be removed in the future.
  Use target += ...; instead.

C:/Users/R/win-library/3.2/StanHeaders/include/stan/math/rev/core/set_zero_all_adjoints.hpp:14:17: warning: 'void stan::math::set_zero_all_adjoints()' defined but not used [-Wunused-function]

SAMPLING FOR MODEL 'ea01d2f14349744528258b19753f0d36' NOW (CHAIN 1).

Chain 1, Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1, Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1, Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1, Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1, Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1, Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1, Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1, Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 1, Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1, Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1, Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1, Iteration: 2000 / 2000 [100%]  (Sampling)
 Elapsed Time: 233.459 seconds (Warm-up)
               228.165 seconds (Sampling)
               461.624 seconds (Total)
> print(fit2, pars = c("p0","TP","mu","sigma"))
Inference for Stan model: ea01d2f14349744528258b19753f0d36.
1 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=1000.

         mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
p0       0.51       0 0.02 0.47 0.50 0.51 0.52  0.54   523    1
TP[1]    0.70       0 0.01 0.69 0.70 0.70 0.71  0.72   681    1
TP[2]    0.80       0 0.01 0.79 0.80 0.80 0.81  0.82   698    1
mu[1]    5.02       0 0.02 4.98 5.01 5.02 5.04  5.06   458    1
mu[2]    2.01       0 0.02 1.98 2.00 2.01 2.02  2.04   436    1
sigma[1] 1.00       0 0.02 0.97 0.99 0.99 1.01  1.03   483    1
sigma[2] 1.00       0 0.01 0.98 0.99 1.00 1.01  1.03   644    1

Samples were drawn using NUTS(diag_e) at Mon Dec 19 12:19:04 2016.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).



So my questions are:
1) With the "//Forwards algorithm" part alone, are we already able to do unsupervised HMM estimation?
2) When should I include the "//backwards algorithm" and "// put it all together" parts?

Thanks a lot!

Weijia
HMMStanV2-noback.R

jonny230774

Dec 19, 2016, 6:00:05 PM
to Stan users mailing list
Hi Weijia,

You are right - you only need the forward part to estimate the parameters of the HMM (initial state probabilities, transition probabilities and likelihood function parameters).

The "backwards" and "put it all together" parts are needed if you want estimates of the probability of state membership at each timestep that take all of the data into account.

From what Bob has written above there may also be opportunities to use the state estimates to improve the performance of the sampling algorithm.  

Cheers,

Jon.


Weijia You

Dec 20, 2016, 1:39:57 AM
to stan-...@googlegroups.com
Dear Jon,

Thank you very much for your patient explanations.
It's great that I'm on the right track. 

Thank you again for your code!

Best!

Weijia

