use stan for time series clustering

462 views
Skip to first unread message

Bin Han

unread,
May 29, 2016, 6:06:17 PM5/29/16
to Stan users mailing list
I am trying to use stan to do a time series clustering, using mixture model kind of idea. An example usage is given economic metric time series for a set of states in US, cluster the states based on similarity of their time series. The idea is having a mixture of AR(p) model for each cluster. The prior for cluster can dirichlet, and the prior for AR parameter phi can be multivariate normal.

The model is following:

N: # of states,
T: # of ts data points,
K: # of clusters,
p: parameter for AR(p),
cluster ~ dirichlet, (for each state n)
phi ~ multi_normal(psi0, omega0), (for each cluster k )

likelihood: y_{n, t} ~ product_over_k (normal(mu_k, sigma_k) * prob(cluster_{n, k})), (this is kinda like the soft k mean likelihood)

where mu = phi_{k, 1} + phi_{k, 2} * y_{n, t-1} + ... phi_{k, p+1} * y_{n, t-p},
and sigma ~ gamma(a0, b0).

The stan code is as follows:

data {
  int K; int P; int N; int T;  # K clusters, P auto-reg, N items, T ts data points
  vector[T] y[N];  # ts data
  vector[P + 1] psi0[K];  # prior for psi
  vector<lower=0>[K] e0;  # prior for cluster 
}

parameters {
  vector[P + 1] psi[K];
  real<lower=0> sigma[K];  # std for data
  simplex[K] c[N];
}

model {
  real ps[K];
  
  for (n in 1:N){
    for (t in (P + 1):T){
      for (k in 1:K) {
        ps[k] <- log(c[n, k]) + normal_log(y[n, t], psi[k, 1] + dot_product(psi[k, 2:(P + 1)], y[n, (t - 1):(t - P)]), sigma[k]);
      }
      increment_log_prob(log_sum_exp(ps));  
    }
  }

  sigma ~ gamma(2, 0.5);
  for (k in 1:K)
    psi[k] ~ normal(psi0[k], 100);
  for (n in 1:N)
    c[n] ~ dirichlet(e0);
}

I also attached the data "m-unempstatesAdj.txt".

The R code is here 

library(rstan); library(shinystan)
library(dplyr); library(tidyr)

K = 4
P = 2

da=read.table("m-unempstatesAdj.txt",header=T)
zt=apply(da,2,diff)

# set intial prior from k mean
init = kmeans(t(zt), K)
center = init$center
psi0 = matrix(0, K, P + 1)
for (k in 1:K){
  mfit = arima(center[k,], order=c(P, 0, 0), method='CSS')
  # 1st element of phi is unconditional mean, 
  # 2nd to (p+1)th elements are ar-parameters
  psi0[k, 1] = mfit$coef[P + 1]
  psi0[k, 2:(P + 1)] = mfit$coef[1:P]
}

data_list = list(
  N = dim(zt)[2],
  T = dim(zt)[1],
  K = K,
  P = P,
  y = t(zt),
  psi0 = psi0,
  e0 = rep(3, K)
)


m <- stan('tscluster.stan', 
           data=data_list, 
           chains=2, iter=500+500, warmup=500)


With fairly small amount of data, this code takes forever to run.

Since I'm new to stan, I have the following question:

1. is this code doing what I want at all?
2. how to make it more efficient?
3. better prior settings?
4. better idea for the entire modeling approach?

Really appreciate the help!
m-unempstatesAdj.txt

Bob Carpenter

unread,
May 29, 2016, 7:41:49 PM5/29/16
to stan-...@googlegroups.com
Yes, what you want to do is possible. But you should
think about formulating the entire time series model,
then mixing that. You're mixing at the time point level,
which probably isn't what you're after.

p(y[n] | c, theta) = SUM_k Categorical(k | c) Foo(y[n] | theta[k])

where each y[n] is a time series and theta[k] is a complete
set of parameters for mixture component k.

Then just do that with log_sum_exp() on the log scale.

log p(y[n] | c, theta)
= LOG_SUM_EXP_k log Categorical(k | c) + log Foo(y[n] | theta[k])

- Bob
> --
> You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
> To post to this group, send email to stan-...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
> <m-unempstatesAdj.txt>

Bin Han

unread,
May 29, 2016, 9:42:31 PM5/29/16
to stan-...@googlegroups.com
Thank you for prompt response! Taken your suggestion, I changed the model into following: (moving K loop out of T loop)

model {
  real ps[K];
  real mu;

  for (n in 1:N){
    for (k in 1:K) {
      ps[k] <- 0;
      for (t in (P + 1):T){
        mu <- psi[k, 1];
        for (p in 2:(P + 1))
          mu <- mu + psi[k, p] * y[n, (t - p + 1)];
        ps[k] <- ps[k] + normal_log(y[n, t], mu, sigma[k]);
      }
      ps[k] <- ps[k] + log(c[n, k]);  
    }
    increment_log_prob(log_sum_exp(ps));
  }

  sigma ~ gamma(2, 0.5);
  
  for (k in 1:K)
    psi[k] ~ normal(psi0[k], 100);
  for (n in 1:N)
    c[n] ~ dirichlet(e0);
}

It is still super slow. Is it because of those for loops? If so, is there a way to optimize/vectorize those? 



You received this message because you are subscribed to a topic in the Google Groups "Stan users mailing list" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/stan-users/vF2ZiH-CTQQ/unsubscribe.
To unsubscribe from this group and all its topics, send an email to stan-users+...@googlegroups.com.

Bob Carpenter

unread,
May 29, 2016, 10:25:50 PM5/29/16
to stan-...@googlegroups.com
If you could express mu as a dot-product that'd be better.
Or can you calculate it as a running average?

You don't want to keep setting ps[k] <- 0 each time in the
loop.

But I still don't see the structure of what you're doing here,
but I'm not an expert on time series.

- Bob

Bin Han

unread,
May 30, 2016, 8:50:55 AM5/30/16
to stan-...@googlegroups.com
The idea of this time series clustering comes from here:

which is in chap 6 of the book

Multivariate Time Series Analysis with R and Financial Applications" 
by Ruey S. Tsay

I’m really having a hard time translating this into stan (after spending ~10 hours). If someone could help on writing the modeling part it will be greatly appreciated!

  

Bin Han

unread,
May 30, 2016, 8:53:43 AM5/30/16
to stan-...@googlegroups.com
The book is attached. Please see page 386 - 393 (in chap 6).

Bin Han

unread,
May 30, 2016, 9:00:51 AM5/30/16
to stan-...@googlegroups.com
sorry.. here it is.

tscluster.pdf

Bob Carpenter

unread,
May 30, 2016, 10:47:50 AM5/30/16
to stan-...@googlegroups.com
Each series k is drawn from its own mixture component
according to theta. The draw of mixture component is
marginalized out. I used the notation from the paper.

- Bob

/**
* AR(1) version
*/
data {
int<lower=0> K; // # series
int<lower=0> T; // # observations / series
int<lower=1> H; // # clusters
matrix[K, T] z; // observations
}
parameters {
vector[H] phi0; // AR(1) coefficients
vector[H] phi1;
vector<lower=0>[H] sigma; // innovations scales
simplex[H] theta; // mixture proportions
}
model {
vector[H] lp;
vector[H] log_theta;
log_theta <- log(theta);
for (k in 1:K) {
lp <- log_theta;
for (h in 1:H)
for (t in 2:T)
lp[h] <- lp[h] + normal_log(z[k, t], phi0[h] + phi1[h] * z[k, t-1],
sigma);
increment_log_prob(lp);
}
}


You can make it more efficient by replacing the inner loop over T
with:

lp[h] <- lp[h]
+ normal_log(z[k, 2:T], phi0[h] + phi1[h] * z[K, 1:(T-1)]);

There's no model for z[k, 1] here. But you could add one if you
wanted.

I'd also recommend adding at least weakly informative priors.

- Bob


> On May 30, 2016, at 9:00 AM, Bin Han <danie...@gmail.com> wrote:
>
> sorry.. here it is.
>
>
> --
> You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
> To post to this group, send email to stan-...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
> <tscluster.pdf>
>
>> On May 30, 2016, at 8:54 AM, Bin Han <danie...@gmail.com> wrote:
>>
>> Sorry.. here it is
>> <Multivariate_Time_Series_Analysis_nodrm.pdf>
>>
>>> On May 30, 2016, at 8:50 AM, Bin Han <danie...@gmail.com> wrote:
>>>
Reply all
Reply to author
Forward
0 new messages