Least squares with a covariance matrix


Benjamin L’Huillier

May 30, 2017, 10:03:17 PM
to lmfit-py
Hi Folks,

I have been using lmfit for a while now and I find it very convenient for writing clean, portable code.

However, I need to minimize a chi-square with non-diagonal terms in the covariance matrix, namely:

min( dy.T Cov^{-1} dy ),

where dy = ydata - ymodel,

rather than

min( sum((dy/error)**2) )

I haven't found a way to do it with LMfit, and I am wondering if it is possible.

Note: after reading this post:
https://groups.google.com/forum/#!searchin/lmfit-py/covariance|sort:relevance/lmfit-py/pq1BiwgDGnQ/uJEEMZt4AgAJ

It seems that it is not implemented. I think it would be very useful: something like a wrapper around curve_fit, in the same way that lmfit is a wrapper around scipy.optimize.leastsq.

Thanks,

Benjamin

Matt Newville

May 31, 2017, 8:12:21 AM
to lmfit-py
On Tue, May 30, 2017 at 9:03 PM, Benjamin L’Huillier <benjamin....@obspm.fr> wrote:
Hi Folks,

I have been using lmfit for a while now and I find it very convenient for writing clean, portable code.

However, I need to minimize a chi-square with non-diagonal terms in the covariance matrix, namely:

min( dy.T Cov^{-1} dy ),

where dy = ydata - ymodel,

rather than

min( sum((dy/error)**2) )

I haven't found a way to do it with LMfit, and I am wondering if it is possible.


I believe by "the covariance" you mean the covariance of the input data points. Is that correct?

It might be possible to do this.  Perhaps it is as simple as returning

    return np.sqrt(dy * np.dot(np.linalg.inv(cov), dy))

which would still return an array of the same length as 'dy', and which, when squared and summed, will give the desired result?
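(For context, a rough sketch of how such a residual function could be passed to lmfit's Minimizer; the straight-line model and parameter names are purely illustrative, and later posts in this thread discuss cases where the product under the square root can be negative, so this form needs care.)

<code>
import numpy as np
from lmfit import Minimizer, Parameters

def residual(params, x, data, cov_inv):
    # hypothetical straight-line model; replace with your own model function
    dy = data - (params['slope'] * x + params['intercept'])
    # squaring and summing this array reproduces dy.T @ cov_inv @ dy
    return np.sqrt(dy * np.dot(cov_inv, dy))

params = Parameters()
params.add('slope', value=1.0)
params.add('intercept', value=0.0)

# x, data and the data covariance matrix cov are assumed to exist already:
# mini = Minimizer(residual, params, fcn_args=(x, data, np.linalg.inv(cov)))
# result = mini.minimize()
</code>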




Note: after reading this post:
https://groups.google.com/forum/#!searchin/lmfit-py/covariance|sort:relevance/lmfit-py/pq1BiwgDGnQ/uJEEMZt4AgAJ

It seems that it is not implemented. I think it would be very useful: something like a wrapper around curve_fit, in the same way that lmfit is a wrapper around scipy.optimize.leastsq.

There is no plan to wrap curve_fit().  Lmfit has its own approach to modeling data, using Model.

--Matt

Benjamin L’Huillier

May 31, 2017, 8:37:18 PM
to lmfit-py

I believe by "the covariance" you mean the covariance of the input data points. Is that correct?

Yes, correct.
 
It might be possible to do this.  Perhaps it is as simple as returning

    return np.sqrt(dy * np.dot(np.linalg.inv(cov), dy))

which would still return an array of the same length as 'dy', and which, when squared and summed, will give the desired result?


Thanks, it worked!

Cheers,

Benjamin


Matt Newville

Feb 25, 2020, 11:59:59 AM
to lmfit-py

On Tue, Feb 25, 2020 at 10:36 AM Moein N. Ivaki <physics...@gmail.com> wrote:
Hi,

I am having the same problem. More generally, I could not find any libraries for global fitting (multiple curves with shared parameters) that accept covariances as input.

The solution provided sounds bizarre. Is dy * np.dot(np.linalg.inv(cov), dy) a meaningful quantity to minimize at all? For instance, chisq = r.T @ inv(sigma) @ r seems the way to go, but this cannot be passed as an input to minimize(), since it accepts only arrays with the same dimension as the data!

That seems like a difference in notation to me, but maybe I'm not understanding your point.  You did not define your terms or explain what seems bizarre.  Anyway, if you have a `sigma` (signifying covariance perhaps?) to use with `r` (residual?), why do you think that you cannot pass that in?

Even stranger is that lmfit has not yet been tailored to accept the covariance of the data to be fitted as an input. It appears to be a common problem though.

To whom does it appear to be a common problem?  I count two questions about this in three years, but I'm willing to round that up to 1 question per year.  And the proposed solution from 3 years ago was reported to work.  Do you have a suggestion for "tailoring" lmfit to accept a covariance matrix?   If so, make it.

My view is that passing in covariance would be relatively easy if one is thinking of "variables", and a good deal messier if one is thinking of "parameters" that might be fixed or internally scaled.   In lmfit, we try very hard to not make the user think about any matrix as having a dimension that corresponds to "variable".   So, passing in a covariance matrix is probably the wrong approach for lmfit, and we would probably prefer to have a dictionary with keys of parameter names and values that are the 1d arrays of variance at each sample.

--Matt

Denis Leshchev

Jun 18, 2020, 10:27:08 AM
to lmfit-py
Hi,

This is an old thread, but I wanted to chime in as I have recently worked on a similar problem.

The correct way to produce weighted residuals for a problem where the measurement y has correlated noise with covariance matrix C (i.e. noise ~ N(0, C)) is to use the Cholesky factor L, defined by

C^-1 = L L^T

In Python it can be computed using the numpy library:

L = np.linalg.cholesky(np.linalg.inv(C))

then the residuals are computed as
L.T @ (y_data - y_model)

If the model is correct, the residuals should be iid, and you can test their normality to confirm that your model and noise satisfy the implied statistical assumptions (correctness of the model plus normality of the errors).
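(For concreteness, a compact sketch of this recipe as an lmfit objective; the decaying-exponential model and the names here are illustrative only, and a fuller worked example appears later in this thread.)

<code>
import numpy as np
from lmfit import Minimizer, Parameters

# C is the data covariance matrix; x and y_data are assumed to exist already
# L_chol = np.linalg.cholesky(np.linalg.inv(C))   # C^-1 = L_chol @ L_chol.T

def whitened_residual(params, x, y_data, L_chol):
    # hypothetical decaying-exponential model
    y_model = params['amp'] * np.exp(-x * params['rate'])
    # whitened residuals: their sum of squares equals dy.T @ C^-1 @ dy
    return L_chol.T @ (y_data - y_model)

params = Parameters()
params.add('amp', value=1.0)
params.add('rate', value=1.0, min=0)

# result = Minimizer(whitened_residual, params, fcn_args=(x, y_data, L_chol)).minimize()
</code>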

This is generally referred to as noise "whitening".


The previous answer, suggesting computing the residuals as
return np.sqrt(dy * np.dot(np.linalg.inv(cov), dy))

is fine, but these residuals do not have any of the expected properties (being iid, for one) and also work mostly for positive dy: np.sqrt of a negative number gives nan for otherwise good, non-missing data, making this method unusable for fitting signals with changing sign (oscillatory features and the like).

Best,
Denis.

Matt Newville

Jun 18, 2020, 2:56:44 PM
to lmfit-py
Hi,

On Thu, Jun 18, 2020 at 9:27 AM Denis Leshchev <leshche...@gmail.com> wrote:
Hi,

This is an old thread, but I wanted to chime in as I have recently worked on a similar problem.

The correct way to produce weighted residuals for a problem where the measurement y has correlated noise with covariance matrix C (i.e. noise ~ N(0, C)) is to use the Cholesky factor L, defined by

C^-1 = L L^T

In Python it can be computed using the numpy library:

L = np.linalg.cholesky(np.linalg.inv(C))

then the residuals are computed as
L.T @ (y_data - y_model)

If the model is correct, the residuals should be iid, and you can test their normality to confirm that your model and noise satisfy the implied statistical assumptions (correctness of the model plus normality of the errors).

This is generally referred to as noise "whitening".


The previous answer, suggesting computing the residuals as
return np.sqrt(dy * np.dot(np.linalg.inv(cov), dy))

is fine, but these residuals do not have any of the expected properties (being iid, for one) and also work mostly for positive dy: np.sqrt of a negative number gives nan for otherwise good, non-missing data, making this method unusable for fitting signals with changing sign (oscillatory features and the like).


As with the earlier parts of this conversation, I am probably not understanding the point you are trying to make or your terminology.  Should I know what you mean by 'iid'?  You use it twice. 

Uncertainties, or any weighting, can be used to "whiten" a fit to a model.  And, when doing so, uncertainties are always positive quantities.  Perhaps you can clarify the point you are trying to make here?

--Matt

Denis Leshchev

Jun 19, 2020, 11:49:03 AM
to lmfi...@googlegroups.com
Hi Matt,

I will try to clarify my post.
 
As with the earlier parts of this conversation, I am probably not understanding the point you are trying to make or your terminology.  Should I know what you mean by 'iid'?  You use it twice.

I apologize for using jargon. "iid" means "independent and identically distributed". This means that each point of the data is independent from the others and has the same width of the distribution (standard deviation) as the others. In more technical terms, that means that the covariance matrix of the data is diagonal and all elements on its diagonal are the same.

In the discussion below, when I mention the covariance matrix I will mean the covariance matrix of the data and not the "parameter covariance matrix" that is obtained as an output of the fitting.

Consider a data vector y, normally distributed around y_true with covariance matrix C.
In statistical notation we can write this as:
y ~ N(y_true, C)

If we have a model for y_true, let's call it y_model, then to compute chi-square we need to do the following (since Google Groups does not support math typesetting, I will use pseudo-Python code):

dy = y - y_model
chisq = dy.T @ C^-1 @ dy  # the symbol '@' is shorthand for np.dot

If C is diagonal and all of its diagonal elements equal some variance epsilon**2 (i.e. all(diag(C) == epsilon**2)), then the weighted residual reduces to dy/epsilon and chisq reduces to the familiar
chisq = sum((dy/epsilon)**2)


Uncertainties, or any weighting, can be used to "whiten" a fit to a model.  And, when doing so, uncertainties are always positive quantities.  Perhaps you can clarify the point you are trying to make here?

A whitening transformation is one that turns the covariance matrix into an identity matrix. The goal of this transformation is to have the values of the residual vector distributed around 0 with a standard deviation of 1.

Why is the Cholesky factor a good choice for the whitening transformation? When using the Cholesky factor defined by C^-1 = L @ L.T, we can transform the residuals into weighted ones, dy_w = L.T @ dy, which then have a new covariance matrix. This new covariance matrix can be computed using the uncertainty propagation formula:

C_w = L.T @ C @ L

and since C^-1 = L @ L.T implies C = (L.T)^-1 @ L^-1,

C_w = L.T @ (L.T)^-1 @ L^-1 @ L = I, the identity matrix.

For iid data, i.e. where C is diagonal with all diagonal elements equal to epsilon**2, the Cholesky factor is simply L = diag(1/epsilon). In this case dy_w = L.T @ dy = dy/epsilon, which recovers all the traditional formulas.
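(A quick numerical check of that identity, using a small, hypothetical covariance matrix:)

<code>
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
C = A @ A.T + 5 * np.eye(5)                  # a symmetric, positive-definite "covariance"
L = np.linalg.cholesky(np.linalg.inv(C))     # C^-1 = L @ L.T
print(np.allclose(L.T @ C @ L, np.eye(5)))   # True: the whitened covariance is the identity
</code>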


I also wanted to add an example that illustrates the issues with the approach involving np.sqrt(dy * np.dot(np.linalg.inv(C), dy))

Before getting into it I need to revisit my own post. I did mention that this approach does not work:

work mostly for positive dy: np.sqrt of a negative number gives nan for otherwise good, non-missing data, making this method unusable for fitting signals with changing sign (oscillatory features and the like).

The last part of the sentence is incorrect, since the sign of the signal does not matter; what matters is the sign of dy. So, independent of the model, np.sqrt of a negative value will give nan. One might think that dy * np.dot(np.linalg.inv(C), dy) is always positive. Unfortunately, this is not guaranteed, as np.dot(np.linalg.inv(C), dy) does not necessarily have the same sign as dy, so the element-wise product of the two does not give only positive values. Maybe you can correct me, but I did not manage to make the code work with these residuals (see below).

One can try to make it work by adding abs before taking the sqrt:
np.sqrt(np.abs(dy * np.dot(C_inv, dy)))
This operation, while giving a seemingly right result, does not transform the covariance matrix of the residuals into the identity and hence will not provide weighted residuals that are normally distributed, centered around 0, with a std of 1.

I wrote a quick example that illustrates the issues. I have also included a comparison to the method where we ignore the off-diagonal elements in the data covariance matrix and treat the noise as independent. For brevity I use the following notation:

method "sigma" - we ignore the off-diagonal elements of the covariance matrix
method "sqrtC" - we compute the residuals as np.sqrt(np.abs(dy * np.dot(C_inv, dy)))
method "chol" - we compute the residuals using the Cholesky factor

<code>

# imports
import numpy as np
from scipy import linalg, stats
from lmfit import Minimizer, Parameters
import matplotlib.pyplot as plt

# set seed to 1
np.random.seed(1)

# create true data
x = np.linspace(0, 15, 501)
data_true = (5.0 * np.sin(2.0*x - 0.1) * np.exp(-x*x*0.025)) +10


# create noisy data with covariance that has off-diagonal elements

correlation_width = 1e-1
base = np.exp(-(x/(np.sqrt(2)*correlation_width))**2)
A = linalg.toeplitz(base)
C = A @ A.T *1e-2 + np.eye(x.size)*1e-3
data = np.random.multivariate_normal(data_true, C)

# pre-compute some matrices to speed up calculations
C_inv = np.linalg.inv(C)
L = np.linalg.cholesky(C_inv)
eps_data = np.sqrt(np.diag(C))


def func(params, x):
    amp = params['amp']
    phaseshift = params['phase']
    freq = params['freq']
    decay = params['decay']
    offset = params['offset']
    return amp * np.sin(x*freq + phaseshift) * np.exp(-x*x*decay) + offset


def residual_sigma(params, x, data, sigma):
    dy = data-func(params, x)
    return dy/sigma


def residual_sqrtC(params, x, data, C_inv):
    dy = data-func(params, x)
    # return np.sqrt(dy * np.dot(np.linalg.inv(C), dy)) # <- this produces NaNs and does not work
    return np.sqrt(np.abs(dy * np.dot(C_inv, dy))) # <- this does work

def residual_chol(params, x, data, L):
    dy = data-func(params, x)
    return L.T @ dy



params = Parameters()

params.add('amp', value=1)
params.add('decay', value=0.007)
params.add('phase', value=0.2)
params.add('freq', value=3.0)
params.add('offset', value=0.0)


minner_sigma = Minimizer(residual_sigma, params, fcn_args=(x, data, eps_data))
minner_sqrtC = Minimizer(residual_sqrtC, params, fcn_args=(x, data, C_inv))
minner_chol = Minimizer(residual_chol, params, fcn_args=(x, data, L))

result_sigma = minner_sigma.minimize()
result_sqrtC = minner_sqrtC.minimize()
result_chol = minner_chol.minimize()

# find residuals
resid_sigma = result_sigma.residual
resid_sqrtC = result_sqrtC.residual
resid_chol = result_chol.residual

# the part below is a bit of dirty work, but it gets the job done
# figures
plt.figure(1)
plt.clf()

plt.subplot(2,3,(1,2))
plt.plot(x, data, 'k.')
plt.plot(x, func(result_sigma.params, x), 'g-') # sigma
plt.plot(x, func(result_sqrtC.params, x), 'b-') # sqrt(abs)
plt.plot(x, func(result_chol.params, x), 'r-') # proper
plt.xlabel('x')
plt.ylabel('data')
plt.title('data/fit')



plt.subplot(2,3,(4,5))
#plt.plot(x, result_sigma.residual, 'g-') # sqrt(abs)
plt.plot(x, result_sqrtC.residual, 'b-', label='sqrtC')
plt.plot(x, result_chol.residual, 'r-', label='chol')
plt.plot(x, result_sigma.residual, 'g-', label='sigma')


plt.xlabel('x')
plt.ylabel('residuals')
plt.title('residuals')
plt.legend()

plt.subplot(2,3,(6))
plt.hist(resid_sigma, 25, orientation='horizontal', color='g', histtype='step');
plt.hist(resid_sqrtC, 25, orientation='horizontal', color='b', histtype='step');
plt.hist(resid_chol, 25, orientation='horizontal', color='r', histtype='step');

plt.xlabel('occurrences')
plt.ylabel('residuals ')
plt.title('residual distribution')
plt.tight_layout()



print('')
print('resid sigma distribution mean:', np.mean(resid_sigma))
print('resid sqrtC distribution mean:', np.mean(resid_sqrtC))
print('resid chol distribution mean:', np.mean(resid_chol))
print('')

print('resid sigma distribution std:', np.std(resid_sigma))
print('resid sqrtC distribution std:', np.std(resid_sqrtC))
print('resid chol distribution std:', np.std(resid_chol))
print('')

# testing residual normality
print('D’Agostino normality test results:')
print('resid sigma normally distributed:', stats.normaltest(resid_sigma)[1]>0.05 )
print('resid sqrtC normally distributed:', stats.normaltest(resid_sqrtC)[1]>0.05 )
print('resid chol normally distributed:', stats.normaltest(resid_chol)[1]>0.05 )


print('')
print('Parameter values')
print('\t sigma \t\t sqrtC \t\t chol')
for key in params.keys():
    print(key, '\t %0.6f' %result_sigma.params[key].value,
               '\t %0.6f' %result_sqrtC.params[key].value,
               '\t %0.6f' %result_chol.params[key].value,
                )
print('')
print('Parameter std errors')
print('\t sigma \t\t sqrtC \t\t chol')
for key in params.keys():
    print(key, '\t %0.6f' %result_sigma.params[key].stderr,
               '\t %0.6f' %result_sqrtC.params[key].stderr,
               '\t %0.6f' %result_chol.params[key].stderr,
                )

</code>

This will produce a figure with three panels: data/fit, residuals, and the residual distribution.

On the top panel we can see that the fits are essentially the same.
 
Inspection of the residuals in the lower-left panel shows that the sqrtC method (unsurprisingly) gives values that are all positive and not centered around 0. If you ever use weighted residuals to check the fit quality (and you should), this will not be a good plot to look at.

The output printouts indicate that both the sigma and chol methods give residuals centered around zero; strikingly, all the residuals have a standard deviation of ~1, which means that some of the whitening actually works:

resid sigma distribution mean: -5.8269982730564494e-05
resid sqrtC distribution mean: 1.7250573130189921
resid chol distribution mean: 0.012187697573908895

resid sigma distribution std: 0.8727569387567908
resid sqrtC distribution std: 1.0494047460860807
resid chol distribution std: 0.9829735616819255

To further probe the normality of the residuals, we can use a classic normality test (the scipy implementation of the D’Agostino test works just fine):
D’Agostino normality test results:
resid sigma normally distributed: False
resid sqrtC normally distributed: False
resid chol normally distributed: True


Only the proper whitening operator gives the true normal distribution in the weighted residuals, making it a good indicator of goodness of fit.

Although all of these considerations are interesting, one can ask: why bother going through all the linear algebra, if we get the same fits?

Indeed, all 3 fits provide similar best-fit values:
Parameter values
         sigma           sqrtC           chol
amp      4.743743        4.767942        4.723184
decay    0.022070        0.021600        0.021898
phase    -0.121092       -0.107353       -0.104812
freq     2.002138        1.995802        1.998234
offset   10.054233       10.036156       10.049204

However, the key point is that both of the other methods give severely underestimated uncertainties (even after the rescaling to reduced chi-square = 1 that lmfit applies automatically):

Parameter std errors
         sigma           sqrtC           chol
amp      0.032110        0.000379        0.117574
decay    0.000340        0.000007        0.001262
phase    0.008500        0.000161        0.025366
freq     0.002570        0.000112        0.008460
offset   0.009648        0.000402        0.036033



To conclude: if you know your covariance matrix, you should incorporate it in the fitting. Using its Cholesky factor to compute the residuals is not difficult and produces statistically sound results. Other methods may give you similar fit results while violating a bunch of assumptions and producing incorrect uncertainty estimates, which one may not even notice.

I hope this clarifies my earlier message,
Denis.



Matt Newville

Jun 19, 2020, 12:45:34 PM
to lmfit-py
On Fri, Jun 19, 2020 at 10:49 AM Denis Leshchev <leshche...@gmail.com> wrote:
Hi Matt,

I will try to clarify my post.
 
As with the earlier parts of this conversation, I am probably not understanding the point you are trying to make or your terminology.  Should I know what you mean by 'iid'?  You use it twice.

I apologize for using jargon. "iid" means "independent and identically distributed". This means that each point of the data is independent from the others and has the same width of the distribution (standard deviation) as the others. In more technical terms, that means that the covariance matrix of the data is diagonal and all elements on its diagonal are the same.

In the discussion below, when I mention the covariance matrix I will mean the covariance matrix of the data and not the "parameter covariance matrix" that is obtained as an output of the fitting.

Consider a data vector y, normally distributed around y_true with covariance matrix C.
In statistical notation we can write this as:
y ~ N(y_true, C)

If we have a model for y_true, let's call it y_model, then to compute chi-square we need to do the following (since Google Groups does not support math typesetting, I will use pseudo-Python code):

dy = y - y_model
chisq = dy.T @ C^-1 @ dy  # the symbol '@' is shorthand for np.dot

If C is diagonal and all of its diagonal elements equal some variance epsilon**2 (i.e. all(diag(C) == epsilon**2)), then the weighted residual reduces to dy/epsilon and chisq reduces to the familiar
chisq = sum((dy/epsilon)**2)


Uncertainties, or any weighting, can be used to "whiten" a fit to a model.  And, when doing so, uncertainties are always positive quantities.  Perhaps you can clarify the point you are trying to make here?

A whitening transformation is one that turns the covariance matrix into an identity matrix. The goal of this transformation is to have the values of the residual vector distributed around 0 with a standard deviation of 1.

Yeah.... so.  That's just "(data - model)/epsilon", which is what we use. 


Why is the Cholesky factor a good choice for the whitening transformation? When using the Cholesky factor defined by C^-1 = L @ L.T, we can transform the residuals into weighted ones, dy_w = L.T @ dy, which then have a new covariance matrix. This new covariance matrix can be computed using the uncertainty propagation formula:

C_w = L.T @ C @ L

and since C^-1 = L @ L.T implies C = (L.T)^-1 @ L^-1,

C_w = L.T @ (L.T)^-1 @ L^-1 @ L = I, the identity matrix.

For iid data, i.e. where C is diagonal with all diagonal elements equal to epsilon**2, the Cholesky factor is simply L = diag(1/epsilon). In this case dy_w = L.T @ dy = dy/epsilon, which recovers all the traditional formulas.


If the covariance of the data is diagonal, none of that is necessary. It is handled by "(data - model)/epsilon".   Why do you think that you need to do any factorization? 





I also wanted to add an example that illustrates the issues with the approach involving np.sqrt(dy * np.dot(np.linalg.inv(C), dy))

Before getting into it I need to revisit my own post. I did mention that this approach does not work:

work mostly for positive dy: np.sqrt of a negative number gives nan for otherwise good, non-missing data, making this method unusable for fitting signals with changing sign (oscillatory features and the like).

The last part of the sentence is incorrect, since the sign of the signal does not matter; what matters is the sign of dy.

Um, uncertainties are positive definite.  Actually, they are strictly positive but sometimes very close to zero.  They are never negative.  
  
So, independent of the model, np.sqrt of a negative value will give nan. One might think that dy * np.dot(np.linalg.inv(C), dy) is always positive. Unfortunately, this is not guaranteed, as np.dot(np.linalg.inv(C), dy) does not necessarily have the same sign as dy, so the element-wise product of the two does not give only positive values. Maybe you can correct me, but I did not manage to make the code work with these residuals (see below).

I think you may be over-thinking all of this.


One can try to make it work by adding abs before taking the sqrt:
np.sqrt(np.abs(dy * np.dot(C_inv, dy)))
This operation, while giving a seemingly right result, does not transform the covariance matrix of the residuals into the identity and hence will not provide weighted residuals that are normally distributed, centered around 0, with a std of 1.

If it is purely diagonal, the covariance matrix is exactly an array of uncertainties.  We do not need to do any factorization at all.

So, I have no idea what you are going on about. 

--Matt

Denis Leshchev

Jun 19, 2020, 1:22:19 PM
to lmfit-py
Hi Matt,

If it is purely diagonal, the covariance matrix is exactly an array of uncertainties.  We do not need to do any factorization at all.

So, I have no idea what you are going on about.

The original post in the thread is about using the full covariance matrix and not a diagonal one. My post was about how to incorporate it correctly. As you rightly said, if the covariance matrix is diagonal then none of that is needed and one can just use epsilon and keep enjoying life :-)

Um, uncertainties are positive definite.  Actually, they are strictly positive but sometimes very close to zero.  They are never negative. 

That is almost right. A covariance matrix must be positive semi-definite. What that means is that for a covariance matrix C of size n x n and any vector x of size n x 1, the product x.T @ C @ x >= 0. Note that this is not the same as x * (C @ x) (with element-wise multiplication as the first operator), and the vector produced by that expression is not guaranteed to have all elements positive.
To be positive semi-definite, C must have non-negative elements on its diagonal (which correspond to squares of uncertainties, a.k.a. variances), but it is not required to have them elsewhere.

Now, for chi-square we do not use C but its inverse, C^-1. It is of course also positive semi-definite; this guarantees that dy.T @ C^-1 @ dy is non-negative, but it does not guarantee that every element of dy * (C^-1 @ dy) is positive, hence taking the square root may produce NaNs. In fact, if you try the plain sqrt in the example I provided above (the commented line in the definition of the residual_sqrtC function), the minimizer will not work and Python raises an error saying that the residual returns NaNs. This is why I do not quite understand how the originally proposed solution in your answer works at all.
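(A minimal numerical illustration of that point, with a hypothetical 2x2 covariance unrelated to the example above:)

<code>
import numpy as np

C = np.array([[1.0, 0.9],
              [0.9, 1.0]])          # a perfectly valid (positive-definite) covariance
dy = np.array([1.0, 0.5])

C_inv = np.linalg.inv(C)
print(dy @ C_inv @ dy)              # scalar chi-square: positive (about 1.84)
print(dy * (C_inv @ dy))            # element-wise: about [ 2.89, -1.05]; sqrt of the second element is nan
</code>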

Denis.



Matt Newville

Jun 19, 2020, 3:20:40 PM
to lmfit-py
On Fri, Jun 19, 2020 at 12:22 PM Denis Leshchev <leshche...@gmail.com> wrote:
Hi Matt,

If it is purely diagonal, the covariance matrix is exactly an array of uncertainties.  We do not need to do any factorization at all.

So, I have no idea what you are going on about.

The original post in the thread is about using the full covariance matrix and not a diagonal one. My post was about how to incorporate it correctly. As you rightly said, if the covariance matrix is diagonal then none of that is needed and one can just use epsilon and keep enjoying life :-)

Um, uncertainties are positive definite.  Actually, they are strictly positive but sometimes very close to zero.  They are never negative. 

That is almost right. A covariance matrix must be positive semi-definite. What that means is that for a covariance matrix C of size n x n and any vector x of size n x 1, the product x.T @ C @ x >= 0. Note that this is not the same as x * (C @ x) (with element-wise multiplication as the first operator), and the vector produced by that expression is not guaranteed to have all elements positive.
To be positive semi-definite, C must have non-negative elements on its diagonal (which correspond to squares of uncertainties, a.k.a. variances), but it is not required to have them elsewhere.

Now, for chi-square we do not use C but its inverse, C^-1. It is of course also positive semi-definite; this guarantees that dy.T @ C^-1 @ dy is non-negative, but it does not guarantee that every element of dy * (C^-1 @ dy) is positive, hence taking the square root may produce NaNs. In fact, if you try the plain sqrt in the example I provided above (the commented line in the definition of the residual_sqrtC function), the minimizer will not work and Python raises an error saying that the residual returns NaNs. This is why I do not quite understand how the originally proposed solution in your answer works at all.

Hm, dunno.  I have never had a case where I knew or could ever imagine knowing a covariance matrix for data with any significant off-diagonal components.  I suppose I can picture bands of +/- 1 or 2 data points off of the diagonal, but these seem like they would have to tail off dramatically.   Anyway, it seems to me that if dy * C^-1 @ dy has negative components, something may be wrong with the covariance.  Maybe you want to take the norm first.   Not sure.  Again, I have no real experience with this situation.

Apparently the suggested approach worked for the original poster.  I don't know why it doesn't work for you.  Maybe the approach to factorization matters (seems surprising, no?) or maybe there is an error somewhere. 



Nicholas Zobrist

Jun 19, 2020, 4:27:59 PM
to lmfi...@googlegroups.com
Hi Matt,

Maybe I can add some context. I’ve run into this problem before when fitting time-series data. Often with this kind of data, there exists an excess of noise at particular frequencies (e.g. 60 Hz) or 1/f noise caused by the readout electronics. This situation naturally leads to a non-diagonal covariance matrix since samples at different times become correlated. The covariance matrix can also often be estimated in this case by computing a power spectral density of the noise from calibration data.
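(As an aside, a rough sketch of how such a covariance might be estimated from repeated noise-only calibration traces; the array name is hypothetical, and stationary noise is assumed so that the covariance is Toeplitz.)

<code>
import numpy as np
from scipy import linalg

def covariance_from_noise(noise_traces):
    # noise_traces: hypothetical (n_traces, n_samples) array of calibration
    # measurements with the signal absent
    traces = noise_traces - noise_traces.mean(axis=1, keepdims=True)
    n = traces.shape[1]
    # autocovariance versus lag, averaged over traces and samples
    acov = np.array([np.mean(traces[:, :n - k] * traces[:, k:]) for k in range(n)])
    # stationary noise -> covariance depends only on the lag, i.e. it is Toeplitz
    return linalg.toeplitz(acov)
</code>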

Denis’s implementation is roughly equivalent to what scipy’s curve_fit() does to allow this functionality, and it is the most numerically robust way to approach this problem.
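(For reference, scipy's curve_fit accepts a 2-D `sigma`, which it interprets as the covariance matrix of the errors in ydata and then minimizes r.T @ inv(sigma) @ r; a sketch reusing the model and data from the example earlier in this thread:)

<code>
import numpy as np
from scipy.optimize import curve_fit

def model(x, amp, freq, phase, decay, offset):
    return amp * np.sin(x * freq + phase) * np.exp(-x * x * decay) + offset

# x, data and the data covariance matrix C are assumed to be defined as in the
# earlier example; absolute_sigma=True keeps the covariance scale as given
popt, pcov = curve_fit(model, x, data, p0=[1.0, 3.0, 0.2, 0.007, 0.0],
                       sigma=C, absolute_sigma=True)
</code>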

As Denis was saying, the current Model class doesn’t allow for a covariance matrix input, so an lmfit user must fall back to the Minimizer class in order to incorporate it. I think it would be a useful addition to lmfit to be able to handle the more general covariance so that this isn’t necessary (not that I am volunteering to tackle it haha).

-N


Matt Newville

Jun 19, 2020, 5:37:47 PM
to lmfit-py
Hi Nicholas,


On Fri, Jun 19, 2020 at 3:27 PM Nicholas Zobrist <nzob...@physics.ucsb.edu> wrote:
Hi Matt,

Maybe I can add some context. I’ve run into this problem before when fitting time-series data. Often with this kind of data, there exists an excess of noise at particular frequencies (e.g. 60 Hz) or 1/f noise caused by the readout electronics. This situation naturally leads to a non-diagonal covariance matrix since samples at different times become correlated. The covariance matrix can also often be estimated in this case by computing a power spectral density of the noise from calibration data.

Denis’s implementation is roughly equivalent to what scipy’s curve_fit() does to allow this functionality, and it is the most numerically robust way to approach this problem.

Hm, maybe, I'm not sure.  I didn't really see "numerically robust" as part of the argument.  But, for sure, I was not (and probably still am not) fully understanding the point(s) being made.

I'm not getting why this is actually different from   `np.sqrt(dy * np.dot(np.linalg.inv(cov), dy))`.   If I understand correctly (and that is not at all certain), Denis is saying that taking the square-root is a problem because its argument can be negative.  It can?  Isn't the idea to minimize  `dy * np.dot(np.linalg.inv(cov), dy)` as an extension of `(dy/epsilon)^2`?

I think that you need to take the square root of this quantity because the solver *will* square it (least *squares*).   How does one interpret the **square** of the residual being negative?  Should that covariance be squared somehow?   Like, would a purely diagonal matrix contain the uncertainties or the squares of the uncertainties? 

Basically, saying the square-root is the problem makes me suspicious that something is wrong. But, maybe I'm not understanding.

How the matrix inversion is done seems sort of secondary to me.  Yeah, the method used can be important for performance, but I do not think that this was part of the earlier discussion.    Again, I could easily be missing something.

As Denis was saying, the current Model class doesn’t allow for a covariance matrix input, so an lmfit user must fall back to the Minimizer class in order to incorporate it. I think it would be a useful addition to lmfit to be able to handle the more general covariance so that this isn’t necessary (not that I am volunteering to tackle it haha).

OK, I can see that this could be a reasonable wish.  I did not pick up that desire from Denis.  I am certainly not volunteering to do this, but if someone else wants to take this on, go for it. 

--Matt

Nicholas Zobrist

Jun 19, 2020, 8:33:39 PM
to lmfi...@googlegroups.com
Hi Matt,

On Jun 19, 2020, at 2:37 PM, Matt Newville <newv...@cars.uchicago.edu> wrote:

Hi Nicholas,


On Fri, Jun 19, 2020 at 3:27 PM Nicholas Zobrist <nzob...@physics.ucsb.edu> wrote:
Hi Matt,

Maybe I can add some context. I’ve run into this problem before when fitting time-series data. Often with this kind of data, there exists an excess of noise at particular frequencies (e.g. 60 Hz) or 1/f noise caused by the readout electronics. This situation naturally leads to a non-diagonal covariance matrix since samples at different times become correlated. The covariance matrix can also often be estimated in this case by computing a power spectral density of the noise from calibration data.

Denis’s implementation is roughly equivalent to what scipy’s curve_fit() does to allow this functionality, and it is the most numerically robust way to approach this problem.

Hm, maybe, I'm not sure.  I didn't really see "numerically robust" as part of the argument.  But, for sure, I was not (and probably still am not) fully understanding the point(s) being made.

I'm not getting why this is actually different from   `np.sqrt(dy * np.dot(np.linalg.inv(cov), dy))`.   If I understand correctly (and that is not at all certain), Denis is saying that taking the square-root is a problem because its argument can be negative.  It can?  Isn't the idea to minimize  `dy * np.dot(np.linalg.inv(cov), dy)` as an extension of `(dy/epsilon)^2`?

Yes, as explained by Denis, the number 'dy @ np.linalg.inv(cov) @ dy' is guaranteed to be non-negative because cov is positive semidefinite, but the vector 'dy * np.linalg.inv(cov) @ dy' may have some negative components for a general covariance matrix. In principle this is not an issue, but in numpy 'np.sqrt(-1) = np.nan' instead of '1j'. You could, for example, use numpy.lib.scimath.sqrt, which properly handles negative arguments, and arrive at the same result.


I think that you need to take the square root of this quantity because the solver *will* square it (least *squares*).   How does one interpret the **square** of the residual being negative?  Should that covariance be squared somehow?   Like, would a purely diagonal matrix contain the uncertainties or the squares of the uncertainties? 

A negative square residual is confusing to interpret and may have unintended consequences for optimization algorithms that assume that this value is positive (I’m not sure). If the Cholesky decomposition is used, however, the residual is guaranteed to be real and this interpretation problem goes away.

np.linalg.inv(cov) for a diagonal matrix should have 1 / sigma^2 on the diagonal.


Basically, saying the square-root is the problem makes me suspicious that something is wrong. But, maybe I'm not understanding.

How the matrix inversion is done seems sort of secondary to me.  Yeah, the method used can be important for performance, but I do not think that this was part of the earlier discussion.    Again, I could easily be missing something.

I see your point. There are many ways to take the square root of a matrix and this method is certainly not unique. Perhaps I was a bit ahead of myself saying it is the ‘most robust’ way. The Cholesky decomposition is definitely faster and removes the need for complex numbers in the residual, but I’m not aware of any other benefits (not that there aren’t any).


As Denis was saying, the current Model class doesn’t allow for a covariance matrix input, so an lmfit user must fall back to the Minimizer class in order to incorporate it. I think it would be a useful addition to lmfit to be able to handle the more general covariance so that this isn’t necessary (not that I am volunteering to tackle it haha).

OK, I can see that this could be a reasonable wish.  I did not pick up that desire from Denis.  I am certainly not volunteering to do this, but if someone else wants to take this on, go for it. 

--Matt


Mike Hudson

Sep 18, 2020, 3:53:52 PM
to lmfit-py
I am also interested in this problem, and would be grateful if someone could implement it in lmfit.

Matt Newville

Sep 19, 2020, 12:30:05 PM
to lmfit-py
On Fri, Sep 18, 2020 at 2:53 PM Mike Hudson <mjhu...@uwaterloo.ca> wrote:
I am also interested in this problem, and would be grateful if someone could implement it in lmfit.


I am not aware of anyone working on "this problem", though I am also not sure that there is a single "it" to be implemented.

For sure, I am simply not able to follow email threads that go dormant for months and then re-appear with a "me too".  That resurrection of a dormant conversation has happened multiple times in this thread.  Is the original message even relevant to anyone anymore?   I don't know.

So, if you have a specific "this problem", please state what that is.  If you're interested in working on this, let us know.

From my perspective, the main point of discussion in this thread is that people have suggested using a covariance matrix as an extension of an array of uncertainties; that is, extending the weighting of the residual from an array `epsilon` of uncertainties to the full covariance, roughly

      dy/epsilon  ->   `np.sqrt(dy * np.dot(np.linalg.inv(covariance), dy))`

OK, that seems fine to me.  But, I really do not understand how it is physically/mathematically meaningful for that array to have negative values.  Really, that simply makes no sense to me.  

Anyway, if you or some you know is interested in *working* on this, let us know.

--Matt
