GP likelihood calculations with low-rank observations?


John McFarland

Jan 9, 2011, 11:49:44 AM
to PyMC
Hi,

First, congratulations to the developers on excellent software
packages, both pymc and pymc.gp. I have been particularly impressed
by the layout (returning random GP realizations as callable Python
objects -- brilliant!) and computational efficiencies used by pymc.gp
(in particular, the low-rank Cholesky calculations). Not to mention,
the GP Users' guide is outstanding.

My question has to do with whether or not the low-rank Cholesky
simplifications can be taken advantage of when evaluating the
likelihood associated with GP parent parameters. I'm interested in a
very simple case: following Chapter 2 of the GP Users guide, say that
all of the GP's children, K, are observations of the GP on the mesh
x_star (i.e. the equation in Section 2.3.3) and that there is no
observation variance, so V_i is 0. If I am following the notation
correctly, I believe the likelihood term of interest here would be
p(f(x_star)|P), which is a multivariate normal distribution based on
the GP mean and covariance, which are functions of P.

My question is whether or not we can take advantage of incomplete
Cholesky factorization when computing the likelihood ratio for P and
P_p. The problem I run into often is that for reasonable values of
the covariance "scale" parameter, the covariance matrix is numerically
singular (i.e. there are some redundant data in K). I'm interested in
the idea of using incomplete Cholesky factorization for these cases,
but my concern is whether it is valid to compare likelihood values for
P and P_p when the ranks of their respective incomplete Cholesky
factors might differ (e.g. say there are 1000 observations in K and
for one set of GP parameters P the incomplete Cholesky factor retains
80 observations and for P_p the Cholesky factor retains 300).
Intuitively, to me it doesn't seem "fair" to compare likelihood values
based on two systems having different dimension.

If my intuition here is correct, are there any tricks for dealing with
this situation (aside from introducing a nugget/observation variance),
or do we just have to live with the MCMC sampler avoiding regions for
P where the full covariance matrix is numerically singular?
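To make the problem concrete, here is a small NumPy sketch (just an illustration of the numerical issue, not pymc.gp code): a squared-exponential covariance built on a fairly fine mesh is numerically singular in double precision for a moderate value of the scale parameter.

```python
import numpy as np

def sqexp_cov(x, scale, amp=1.0):
    """Squared-exponential covariance matrix on a 1-D mesh."""
    d = x[:, None] - x[None, :]
    return amp ** 2 * np.exp(-0.5 * (d / scale) ** 2)

# 100 observation points on [0, 1] -- a perfectly reasonable mesh
x_star = np.linspace(0.0, 1.0, 100)
K = sqexp_cov(x_star, scale=1.0)

# The eigenvalues decay so quickly that the trailing ones are lost to
# roundoff: the matrix is numerically singular in double precision.
eigs = np.linalg.eigvalsh(K)
print(eigs.min() / eigs.max())  # effectively zero (can even be negative)
```

For a scale like this the ordinary Cholesky factorization typically fails outright, so the log-determinant and triangular solves needed by the likelihood can't be computed at all.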

Thanks,
John

Anand Patil

Jan 10, 2011, 7:17:56 AM
to py...@googlegroups.com
Hi John,

On Sun, Jan 9, 2011 at 4:49 PM, John McFarland <mcfa...@gmail.com> wrote:
Hi,

First, congratulations to the developers on excellent software
packages, both pymc and pymc.gp.  I have been particularly impressed
by the layout (returning random GP realizations as callable Python
objects -- brilliant!) and computational efficiencies used by pymc.gp
(in particular, the low-rank Cholesky calculations).  Not to mention,
the GP Users' guide is outstanding.

Thanks very much! 
 
My question has to do with whether or not the low-rank Cholesky
simplifications can be taken advantage of when evaluating the
likelihood associated with GP parent parameters.  I'm interested in a
very simple case: following Chapter 2 of the GP Users guide, say that
all of the GP's children, K, are observations of the GP on the mesh
x_star (i.e. the equation in Section 2.3.3) and that there is no
observation variance, so V_i is 0.  If I am following the notation
correctly, I believe the likelihood term of interest here would be
p(f(x_star)|P), which is a multivariate normal distribution based on
the GP mean and covariance, which are functions of P.

My question is whether or not we can take advantage of incomplete
Cholesky factorization when computing the likelihood ratio for P and
P_p.  The problem I run into often is that for reasonable values of
the covariance "scale" parameter, the covariance matrix is numerically
singular (i.e. there are some redundant data in K).  I'm interested in
the idea of using incomplete Cholesky factorization for these cases,
but my concern is whether it is valid to compare likelihood values for
P and P_p when the ranks of their respective incomplete Cholesky
factors might differ (e.g. say there are 1000 observations in K and
for one set of GP parameters P the incomplete Cholesky factor retains
80 observations and for P_p the Cholesky factor retains 300).
Intuitively, to me it doesn't seem "fair" to compare likelihood values
based on two systems having different dimension.

You're right, if the covariances have different ranks the likelihoods have different dimensions and it's not valid to divide them. But there's a more serious issue, which is that if the full covariance is singular then the likelihood of the dataset is zero, unless the data happen to fall right within the low-dimensional span of the eigenfunctions of the covariance; so you're dividing zeros of different orders.

The best thing to do is to change the model so that the covariance is full rank. The simplest way is to add an unknown, positive constant to the diagonal using the 'nugget' argument to pymc.gp.FullRankCovariance.__init__.

Cheers,
A



John McFarland

Jan 10, 2011, 11:58:33 AM
to PyMC
On Jan 10, 6:17 am, Anand Patil <anand.prabhakar.pa...@gmail.com>
wrote:
> Hi John,
Thanks for the reply. I often run into cases where the covariance may
be computationally singular (but theoretically, I suppose it is merely
positive definite but with some very small eigenvalues), but the data
may still have non-zero likelihood (although the likelihood would not
be computable because of the numerically singular covariance). A trivial
example of this is if the data are generated from something like y=x.
I realize that modeling such simplistic functions with GPs is
overkill, but we sometimes make use of GPs within other automated
algorithms, and we need to at least do something graceful in cases
like this. Currently I work only with point estimates of covariance
parameters, and in cases like these my MLE algorithms try to push the
correlation length to the best value they can without the covariance
being numerically singular. Ideally, though, I think it would be
preferable to model the GP with a longer correlation length (of course
assuming it is warranted by the data) and then use an incomplete
Cholesky so that the computations are faster (and also less
susceptible to the numerics of handling near-singular matrices).
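To illustrate the kind of savings I have in mind, here is a toy pivoted (incomplete) Cholesky in plain numpy -- my own sketch, not the pymc.gp implementation -- applied to the same kind of numerically singular covariance:

```python
import numpy as np

def pivoted_cholesky(K, tol=1e-10):
    """Incomplete (pivoted) Cholesky: returns L with K ~= L @ L.T,
    stopping once the largest residual diagonal entry drops below tol."""
    n = K.shape[0]
    d = np.diag(K).astype(float).copy()   # residual diagonal
    L = np.zeros((n, n))
    for k in range(n):
        j = int(np.argmax(d))
        if d[j] <= tol:                   # remaining pivots are negligible
            return L[:, :k]
        L[:, k] = (K[:, j] - L[:, :k] @ L[j, :k]) / np.sqrt(d[j])
        d -= L[:, k] ** 2
    return L

x = np.linspace(0.0, 1.0, 100)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)   # scale = 1

L = pivoted_cholesky(K)
print(L.shape[1])                 # far fewer than 100 columns retained
print(np.abs(K - L @ L.T).max())  # yet the approximation error is tiny
```

The factorization stops after a small number of pivots, so downstream solves scale with the retained rank rather than with the full 100 observations.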

By the way, is the pymc MAP solver supported for GP submodels? I made
a brief attempt to use it but got an error that looked like it came
out of the NormalApproximation class.

John

Anand Patil

Jan 12, 2011, 12:11:01 PM
to py...@googlegroups.com
The hard zero isn't important. If the covariance is positive definite but numerically singular, the data are just extremely unlikely rather than impossible. You'd be comparing two very ill-fitting models. Also note that you can be pretty sure that the model with the smaller numerical rank deficiency will have a much higher likelihood.
 
By the way, is the pymc MAP solver supported for GP submodels?  I made
a brief attempt to use it but got an error that looked like it came
out of the NormalApproximation class.

No, it isn't. The problem is that GP objects don't have logps, so MAP doesn't know how to deal with them.

Cheers,
Anand

John McFarland

Jan 13, 2011, 5:36:48 PM
to PyMC
> The hard zero isn't important. If the covariance is positive definite but
> numerically singular, the data are just extremely unlikely rather than
> impossible. You'd be comparing two very ill-fitting models. Also note that
> you can be pretty sure that the model with the smaller numerical rank
> deficiency will have a much higher likelihood.

Maybe I misunderstand you, but I'm not sure I completely agree with
this. Isn't the data covariance matrix only a function of the
covariance parameters and the observation mesh? To me, the covariance
matrix itself (or its rank or condition number) says nothing of the
fit of the model, because it is not a function of the observed
outputs. If the covariance matrix is ill-conditioned, I think all
that means is that numerically you will not be able to evaluate the
likelihood, but not necessarily that the fit is poor.

For example, consider drawing a random realization from a GP and
evaluating it on a very fine mesh to collect "observations." If the
mesh is fine enough, the covariance matrix will be numerically
singular (regardless of the actual observation values). This says
nothing of the "fit" of the data to that GP -- I don't think you would
say the fit is poor, because the data came from that GP. The problem
is, if you had the task of estimating the GP covariance parameters
based on those observations alone, you would have a problem: for
correlation lengths close to the actual value, you can't evaluate the
likelihood because the covariance is singular. Thus, your estimation
procedure will not be able to arrive at the correct correlation length
(from which the data were generated).

Sorry to belabor this, but parameter estimation is a big part of the
problems I work with. You have definitely done some great work with
pymc.gp. I'm learning a lot from looking at the code, and it's nice
to have someone knowledgeable to bounce ideas off of.

Regards,
John

John McFarland

Jan 13, 2011, 7:51:21 PM
to py...@googlegroups.com
Here is an example in pymc to make my above points more concrete. The
first program, gen_data.py, generates a random realization of a GP
having scale=1, evaluates it over a fine mesh (100 points), and writes
these data to a file. The second program, fit_mcmc.py, uses pymc to fit
the generated data to a GP model. Ideally, the fitting process should
be able to identify the original GP parameters (and it will if you set
ndata to a lower number in gen_data.py). As discussed, though, we can't
evaluate the likelihood function for reasonably long correlation lengths
because the resulting covariance matrix is numerically singular. The
MCMC sampler behaves as expected: it cannot explore regions with longer
correlation lengths, because the covariance becomes numerically singular
there, and it converges to a posterior mean of (in my case) scale=0.15.
This doesn't mean that the likelihood is worse in those regions, only
that we can't evaluate it on the computer.

I claim that for this example, theoretically, Lhood(scale=1) >
Lhood(scale=0.15), but because of finite precision on the computer, we
can't compute Lhood(scale=1). I believe if the computer had infinite
floating point precision, we could evaluate Lhood(scale=1) and the MCMC
sampler would correctly identify the posterior distribution for the
scale parameter as having a mode close to 1.
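On a coarser mesh, where both covariances are still factorable in double precision, this claim can actually be checked numerically. Here is a condensed numpy sketch (separate from the attached scripts; the jitter value is only a numerical safeguard):

```python
import numpy as np

rng = np.random.default_rng(0)

def sqexp(x, scale):
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / scale) ** 2)

def gauss_loglik(y, K):
    """log N(y | 0, K) computed via a Cholesky factorization."""
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L, y)
    return (-0.5 * alpha @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(y) * np.log(2.0 * np.pi))

# Coarse mesh: 10 points, so both covariances can be factored
x = np.linspace(0.0, 1.0, 10)
jitter = 1e-10 * np.eye(10)
K_true = sqexp(x, scale=1.0) + jitter
K_alt = sqexp(x, scale=0.15) + jitter

# Draw data from the scale=1 GP and compare the two likelihoods
L_true = np.linalg.cholesky(K_true)
diffs = []
for _ in range(50):
    y = L_true @ rng.standard_normal(10)
    diffs.append(gauss_loglik(y, K_true) - gauss_loglik(y, K_alt))
print(np.mean(diffs) > 0)   # True: the generating scale wins on average
```

So where the likelihood is computable, Lhood(scale=1) does beat Lhood(scale=0.15) for data generated at scale=1; on the fine mesh we simply lose the ability to see this.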

John

fit_mcmc.py
gen_data.py