Slow convergence of high-dimensional Gaussian process model


Luis Usier

Mar 23, 2015, 5:35:48 PM
to stan-...@googlegroups.com
Hi all,

I am working on a model to predict soccer matches; each team has an ability theta that varies over time as a Gaussian process. The match results are then an ordered logistic regression where the predictor is the difference between the home team's ability and the away team's ability. The full model and the R code used to run it are below:

footbayes <- "
data {
  int matches;
  real rho_sq;
  int home[matches];
  int away[matches];
  vector[matches] date;
  int result[matches];
  int levels;
  ordered[levels-1] cuts;
}
transformed data {
  int team[2*matches];                 // team for each appearance: home teams first, then away
  vector[2*matches] date_t;            // match date for each appearance
  vector[2*matches] mu;
  matrix[2*matches,2*matches] Sigma;
  matrix[2*matches,2*matches] L;
  for (i in 1:matches){
    team[i] <- home[i];                // appearance i is the home team of match i
    team[i+matches] <- away[i];        // appearance i+matches is the away team of match i
    date_t[i] <- date[i];
    date_t[i+matches] <- date[i];
  }
  for (i in 1:(2*matches))
    mu[i] <- 0;
  // squared-exponential covariance within each team, zero across teams
  for (i in 1:(2*matches))
    for (j in 1:(2*matches))
      if (team[i]==team[j])
        Sigma[i,j] <- exp(-rho_sq*pow(date_t[i]-date_t[j],2));
      else
        Sigma[i,j] <- 0;
  L <- cholesky_decompose(Sigma);
}
parameters {
  vector[2*matches] theta;
}
model {
  theta ~ multi_normal_cholesky(mu,L);
  for (i in 1:matches)
    result[i] ~ ordered_logistic(theta[i] - theta[matches+i],cuts);
}"


library(rstan)       # stan_model(), sampling()
library(lubridate)   # dmy(), decimal_date()

footbayes <- stan_model(model_code = footbayes)

england2014 <- read.csv("E0.csv")
england2014$Date <- decimal_date(dmy(england2014$Date))

footbayes_data <- function(database, matches){
  data <- list()
  data$matches <- matches
  data$rho_sq  <- 0.15
  data$home    <- as.numeric(database$HomeTeam)[1:matches]
  data$away    <- as.numeric(database$AwayTeam)[1:matches]
  data$date    <- database$Date[1:matches]
  data$result  <- database$FTGD[1:matches] + 8
  data$levels  <- 16
  data$cuts    <- c(-9.2, -7.4, -5.9, -4.5, -3.4, -2.3, -1.1, 0.1,
                    1.3, 2.5, 3.5, 4.6, 5.6, 6.9, 7.8)
  return(data)
}

footbayes_sample <- sampling(footbayes, footbayes_data(england2014, 20),
                             chains = 1, iter = 11000, warmup = 1000)

The model compiles and runs correctly. When the correlation between the thetas is zero, i.e., when there is only one day of matches, it runs in a fraction of a second and converges to the correct values. However, as soon as I introduce any sort of correlation in the multi_normal sampling statement, things go awry. Using the settings above, where each team plays two games, it takes over ten minutes to run 1000 iterations and the effective sample sizes range from 3 to 120. When I try to fit a whole 38-game season, the sampler doesn't move at all; I gave up on that after trying for 4 hours. Is there any way to speed up the sampling? I believe my Stan code is as optimized as it can possibly be; my hunch is that I will have to tinker with step sizes and acceptance rates, but I have no idea what I should be aiming for when changing those (and to be honest, I'm not even entirely sure how to change them).

Oh, and I'm using R version 3.1.2 and rstan 2.6.0 on a 2.5 GHz Core i5 MacBook.

Luis Usier

Mar 23, 2015, 5:37:48 PM
to stan-...@googlegroups.com
And, of course, here is the dataset I am using:
E0.csv

He-chien Tsai

Mar 24, 2015, 1:45:16 PM
to stan-...@googlegroups.com
Hi, I think that because your parameters vary through time, making them all random variables may be overkill.
I have read several papers on quantitative sports modelling and found that most time-varying models don't use MCMC at all; I guess MCMC is slow when there is a large number of parameters, so they just apply MLE to dynamic models. You could try MAP via Stan's optimizing.
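For example, with the compiled model and the footbayes_data() helper from your post, something like this should give point estimates (just a sketch; the seed is arbitrary):

footbayes_map <- optimizing(footbayes,
                            data = footbayes_data(england2014, 20),
                            seed = 1234)
head(footbayes_map$par)   # point estimates of the abilities
footbayes_map$value       # log density at the optimum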
If you still want to do full sampling, you may also want to look at hourly-billed VPS solutions; Linode, DigitalOcean and Vultr offer high-CPU options.
I sample on a VPS and save the results when my laptop's performance isn't up to the job.

Luis Usier

Mar 24, 2015, 4:27:06 PM
to stan-...@googlegroups.com
Hi He-Chien,

Thanks for your suggestions. I am aware that Stan can do optimization too; however, I found the point estimates to be much inferior to full Bayesian inference for my purposes (predicting the rest of the season). Additionally, the optimizer behaves very erratically for Gaussian process models (maybe I'm doing something wrong?). I just ran the optimizer on this model: it works fine for up to 30 matches, but when I tried it with 40 matches I got this message:

seed = 1586474564
initial log joint probability = -1.961e+13
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
       1  -1.96136e+13             0   1.29671e+13       1e-12       0.001       40
Optimization terminated with error:
  Line search failed to achieve a sufficient decrease, no more progress can be made


I encounter it quite frequently when trying to apply optimization to this sort of Gaussian process model. I'm not entirely sure what it means; maybe someone here could give me a hand with this?

I have already started looking into paid virtual computing solutions. However, because they are paid, before pursuing that avenue I want to make sure there is nothing I can do differently with my model or with the sampler in order to improve convergence. Hence I asked for help here. I really have zero idea about what to do with the sampler parameters such as treedepth and stepsize; I have looked at the discussion in the CmdStan manual but I couldn't understand much of it.

Bob Carpenter

Mar 24, 2015, 5:48:32 PM
to stan-...@googlegroups.com
The usual problem is that there is no posterior mode.

- Bob

Luis Usier

Mar 24, 2015, 5:52:19 PM
to stan-...@googlegroups.com
Hi Bob,

Surely that can't be the problem in this case, right? It would be odd for there to be a posterior mode for 10, 20 and 30 matches but none for 40 or more. All the parameters have Normal(0,1) priors, so there should be some posterior mode regardless, unless I'm getting something very wrong.

Bob Carpenter

Mar 24, 2015, 6:45:29 PM
to stan-...@googlegroups.com
I was hoping someone better at GPs than me could help out here,
but I'll give it a shot.

Also, I didn't follow all the coding of teams in the transformed
data block, but I'd try to make sure that's doing the right thing,
either by using print statements to inspect the results by hand or by
precalculating the quantities in R and passing them in as data.
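For instance, a rough, untested R sketch (reusing the footbayes_data() helper
and data from your first post) to rebuild team and Sigma outside Stan and
eyeball them:

d <- footbayes_data(england2014, 20)
team   <- c(d$home, d$away)     # home appearances first, then away, as in the model
date_t <- rep(d$date, 2)
n <- 2 * d$matches
Sigma <- matrix(0, n, n)
for (i in 1:n)
  for (j in 1:n)
    if (team[i] == team[j])
      Sigma[i, j] <- exp(-d$rho_sq * (date_t[i] - date_t[j])^2)
print(Sigma[1:6, 1:6])
print(kappa(Sigma))   # condition number; very large values signal near-singularity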

It looks like you're allowing each team's ability to change very
drastically across time with this covariance function:

>       if (team[i]==team[j])
>         Sigma[i,j] <- exp(-rho_sq*pow(date_t[i]-date_t[j],2));
>       else
>         Sigma[i,j] <- 0;

In terms of modeling, I'd think something that provides a bit more
of a constraint on a team's performance over time would be necessary,
because you only have one match informing the ability parameter
of each team at each time point; otherwise, it seems you might run
into separability issues.

Also, why aren't you trying to estimate the inverse squared length scale
rho_sq as a parameter?

I suspect making a simple time series of team's ability would be
much faster and give you roughly if not exactly the same behavior.

For a team's first game:

theta[i] ~ normal(0,2);

and then for subsequent games

theta[i] ~ normal(theta[i-last[i]], 0.5);

or something like that. You could make it hierarchical
by adding a shared scale sigma for each team's first
game, and go even further by giving each team a scale
tau[k] for variation over time, with all the tau themselves
given a hierarchical prior.

It seems like it'd run much much faster and give you the same
information without having to factor matrices and use multinormals.
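To sketch that out (untested, keeping your appearance layout of home
appearances 1..matches followed by away appearances matches+1..2*matches;
prev would be computed in R and passed as data, holding for each appearance
the index of that team's previous appearance, or 0 for its first game):

data {
  int matches;
  int levels;
  ordered[levels-1] cuts;
  int result[matches];
  int prev[2*matches];    // index of this team's previous appearance, 0 if first game
}
parameters {
  vector[2*matches] theta;
}
model {
  for (i in 1:(2*matches)) {
    if (prev[i] == 0)
      theta[i] ~ normal(0, 2);                  // team's first game
    else
      theta[i] ~ normal(theta[prev[i]], 0.5);   // random walk from previous game
  }
  for (i in 1:matches)
    result[i] ~ ordered_logistic(theta[i] - theta[matches+i], cuts);
}

The scales 2 and 0.5 are just placeholders; the hierarchical version would
replace them with a shared sigma and per-team tau[k] parameters.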

There are three things you can do to help get the sampler unstuck
if the model's right.

First, provide sensible initial values for the parameters. If you
set each ability to 0, that would probably work, and you can
do that by adding init=0 to your command. Second, you can set the
initial stepsize lower than the default; I'd try stepsize=0.1
and then stepsize=0.01 if that doesn't work. Third, the target
acceptance rate parameter is adapt_delta; it defaults to 0.8,
but you can set it higher, to something like adapt_delta=0.95 or
even adapt_delta=0.99. In RStan, that's done by adding this
to your command:

, init=0, control=list(stepsize=0.01, adapt_delta=0.99),
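i.e., plugged into your original call, something like:

footbayes_sample <- sampling(footbayes, footbayes_data(england2014, 20),
                             chains=1, iter=11000, warmup=1000,
                             init=0, control=list(stepsize=0.01, adapt_delta=0.99))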

- Bob

Luis Usier

Mar 24, 2015, 7:33:20 PM
to stan-...@googlegroups.com
Thanks for your comments; any help is very much appreciated : )

About the covariance function: it is a special case of the one recommended in the Stan manual. That covariance function has three parameters: sigma_sq, eta_sq and rho_sq. sigma_sq is not needed for logistic regression. eta_sq is a constant that multiplies the entire matrix; by omitting it from the model, I have effectively set it to 1. The modeling logic behind this is that, in the case where there is only one game, the priors on the team coefficients should be i.i.d. N(0,1).

Now, you say that something constraining team performance over time is necessary. But that something is already in the model: it is rho_sq! rho_sq can be made arbitrarily small in order to make the performance over time highly correlated, and the coefficients are in fact already very highly correlated. The tests I have been running use only two observations for each team, a week apart, and those observations have a correlation of exp(-0.15*0.04) = 0.994. In other words, the performance is very constrained over time. That is actually the problem! Separability is definitely not an issue here; I can make all the correlations zero, giving a diagonal covariance matrix, and the sampler runs just fine.

Now, that raises the question of why on earth I am modeling the abilities as time-varying only to have the parameters barely change. There are two answers. One, the correlation over a couple of months, or years, would certainly be much smaller, but still significant. Two, even if a team's ability today is 90% correlated with its ability in a few months, the variation is important for predicting game outcomes. Modeling the ability as fixed in time can lead to drastically different result predictions.

Why am I not estimating rho_sq and cuts and doing full Bayesian inference? Well, I haven't even managed to do inference with a subset of the parameters, so I am certainly not going to try doing it with all of them! Maybe if I get this sampler to run faster I can try moving rho_sq and cuts to the parameters block, but for now that seems like it would just complicate things further. The values I use are just "ultra-informative priors" I estimated in other ways: for cuts, by sampling from a time-fixed model, and for rho_sq, by optimizing a model where the values are correlated across seasons but fixed within each season.

I might try working with the time series model. I went with the GP model because it is more intuitive, more elegant, and more accurate, but of course all of that is no good if it can't be fit in less than a day. I agree that the time series model should probably run much faster than the GP model.

Since I started this discussion, I have played around with max_treedepth using tidbits of information from old threads on the list. I found out that increasing max_treedepth significantly improves my ESS/iteration ratio. A tree depth of 15 brought me to 10%. However, it also makes the running time significantly longer; the overall effect on ESS/time is a wash. I will try using smaller step sizes and/or larger acceptance rates then; hopefully I can make some headway with one of those. I don't think initialization will make a significant difference; the sampler has no problem getting estimates that are in the correct range of values very quickly. I can post some traceplots; maybe that will help with diagnosing the pathology.

Thanks for all the help! And if I may ask about the RStan optimization problem: there is no way it can be due to the lack of a posterior mode, so what other possibility is there? Any ideas? I believe that getting point estimates from the model may be the ideal path for me, since full Bayesian inference is likely to take ages.

Herra Huu

Mar 24, 2015, 7:50:45 PM
to stan-...@googlegroups.com
I think one problem here is numerical stability. Sigma is "almost" singular (reciprocal condition number about 1e-10). Adding a small amount of jitter to the diagonal of Sigma should help a bit. Also, maybe increase the value of rho_sq? The current value implies a correlation of almost 1 for each team and time, and that is a big part of these problems. If at some point you want to estimate rho_sq (or other covariance function parameters), then I think a better idea would be to use multiple team-specific covariance matrices instead of putting them all together. As far as I can see, that would give exactly the same results and avoids inverting a much larger matrix while sampling.
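For the jitter, a couple of lines right before the existing cholesky_decompose call in the transformed data block would be enough (1e-6 is just a typical small value; adjust as needed):

  for (i in 1:(2*matches))
    Sigma[i,i] <- Sigma[i,i] + 1e-6;   // jitter on the diagonal for numerical stability
  L <- cholesky_decompose(Sigma);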

Anyway, I think Bob's suggestion about using simple time series models is a good one. I do remember reading an article using exactly that kind of model for football predictions (found a link: https://dspace.lboro.ac.uk/dspace-jspui/bitstream/2134/8929/3/owen_extended_paper_REVISED_29th_Nov_2010_CHANGES_ACCEPTED%5B1%5D.pdf )