degenerate covariance in GPs

Victor Hwang

May 7, 2013, 11:08:41 PM
to 10-701-spri...@googlegroups.com
I'm looking through GPs again (http://alex.smola.org/teaching/cmu2013-10-701/slides/12_Gaussian_Processes.pdf) and don't understand why a linear kernel leads to the predictive variance vanishing after n observations (see the slide titled "Linear Gaussian Process Regression"). Could someone enlighten me?

Karthik

May 8, 2013, 10:05:37 AM
to 10-701-spri...@googlegroups.com
Here's a possible (handwavy) explanation. Please correct me (anyone) if I'm wrong.

When you use a linear kernel, the covariance matrix K is a Gram matrix (https://inst.eecs.berkeley.edu/~ee127a/book/login/def_Gram_matrix.html). If I'm not mistaken, a Gram matrix constructed from n-dimensional vectors has rank at most n, which means that the covariance matrix of the GP, given any n+k points, still has rank <= n.
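Here's a quick numpy check of the rank claim (my own sketch, not from the slides): with the linear kernel k(x, x') = x^T x', the Gram matrix on m points stacked as rows of X is K = X X^T, so its rank is at most n no matter how large m gets.

import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 10                      # input dimension n, number of points m > n
X = rng.standard_normal((m, n))   # m points in R^n, one per row
K = X @ X.T                       # linear-kernel Gram matrix, K_ij = x_i . x_j
print(np.linalg.matrix_rank(K))   # prints 3 (= n), even though K is 10 x 10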

What this amounts to is that the GP thinks the covariance between all the elements in the input space (x) is completely determined after seeing the first n points x_{1..n}, since adding another point does not increase the rank of K. I am not sure if there is an easier mathematical way to see this, but I can bet that if you find the variance of point x_{n+1} by conditioning the joint Gaussian on the n training points, you will get 0. As Prof. Smola said in the lecture, this means that the GP is confident of predicting any point given n training points, which is unrealistic.
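You can check the zero-variance claim numerically too (again my own sketch, using the standard noise-free GP posterior variance var(f(x*)) = k(x*, x*) - k_*^T K^{-1} k_*): with n training points that span R^n, the predictive variance at any new point comes out to zero.

import numpy as np

rng = np.random.default_rng(1)
n = 3
X = rng.standard_normal((n, n))   # n training points spanning R^n
x_star = rng.standard_normal(n)   # an arbitrary test point

K = X @ X.T                       # K_ij = x_i . x_j (linear kernel)
k_star = X @ x_star               # cross-covariances k(x_i, x_star)
var = x_star @ x_star - k_star @ np.linalg.solve(K, k_star)
print(var)                        # ~0, up to floating-point error

(Setting x_star equal to one of the rows of X gives the duplicate-row case in the next paragraph.)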

Perhaps one way to think about this is the following: adding a row to the Gram matrix (corresponding to the (n+1)th point) does not change the rank, which means that this new row is a linear combination of the first n rows. Let's take this to an extreme and add a row that is exactly the same as an existing row. This happens when the test point is one of the training points. If there is no noise, the GP is certain about this new test point, since it knows the exact output value.

So in order to keep the rank up, we add small elements to the diagonal of K, i.e. a noise term sigma^2 * I.
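A sketch of that fix (same setup as above, with sigma^2 = 0.1 as an arbitrary choice): with m > n points, K itself is singular, but K + sigma^2 * I is full rank and the predictive variance stays strictly positive.

import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 5
X = rng.standard_normal((m, n))   # m > n points, so K = X X^T is singular
x_star = rng.standard_normal(n)

K = X @ X.T
k_star = X @ x_star
sigma2 = 0.1                      # noise variance (my choice here)
K_noisy = K + sigma2 * np.eye(m)  # small "jitter" on the diagonal
print(np.linalg.matrix_rank(K), np.linalg.matrix_rank(K_noisy))   # 3 5
var = x_star @ x_star - k_star @ np.linalg.solve(K_noisy, k_star)
print(var)                        # strictly positive now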

Hope this makes some sense, and is not going to confuse/mislead anyone.

Rittika

May 8, 2013, 12:01:14 PM
to Karthik, 10-701-spring-2013-cmu
Thank you! That was very helpful!


Karthik

May 8, 2013, 12:08:24 PM
to 10-701-spri...@googlegroups.com, Karthik, ri...@pitt.edu
Sorry, minor edit - in the second paragraph I meant covariance between the outputs (not inputs). You would find the variance of f(x_{n+1}) given x_{1..n} by conditioning the Gaussian, not the variance of x_{n+1} as I had originally written.