Hi Sebastien,
I have some thoughts on how to do this. It will take me a little while to compose them, so I'll be back either tomorrow or next weekend with a post on them. Sorry about not following up on the comment I left on that issue back in November: I had teaching, research, and personal obligations that all popped up at the same time.
In the meantime, I think it would be good for us to be explicit about the types of users and use cases we want to support.
Users:
- Do we want to support users who want publication-quality fits?
- Do we want to support users with strict time constraints? (e.g. The difference between a developer who needs a fit to be part of a responsive real-time system and a researcher who is willing to wait fifteen seconds for their fit to finish.)
- Do we want to support users who only want predictive fits and don't care much about the corresponding functional form or about parameter values?
- Do we want our API to cater mostly to non-expert users?
- Do we want our API to cater mostly to expert users?
(Personally, I favor an approach that focuses on the first and fourth points (publication-quality fits and an API that caters to non-experts), but regardless, this is something that would be good to decide early.)
Use Cases:
Here is a non-exhaustive list of reasonable things a researcher might want a fitting library to be able to do. I've encountered all of the cases I list below multiple times in the past year alone and have seen them all in papers. We're obviously under no obligation to support all of these, but it's good to understand the ecosystem of things that people might want to do and also good to understand what types of things we don't want to support.
Data
(Assuming 2D data to keep terminology simple)
- The user has no information on the associated errors, but has reason to believe that the X errors are small and that all Y errors are Gaussian and identical.
- The user has Gaussian Y errors on every point.
- The user has Gaussian X and Y errors on every point.
- The user has non-Gaussian errors.
- (A special case of the non-Gaussian bullet above:) Some points are only upper or lower bounds.
- (A special case of the non-Gaussian bullet above:) Some points have asymmetric error bars.
- The user has data which is maliciously distributed (e.g. fitting a power law to data which has an enormous number of measurements at low X and only a few measurements at high X).
- The user has no information on the associated errors, but has no reason to believe that different data points have the same types of errors (e.g. one can see a large number of outliers, different scatter in different parts of the data set, etc.)
- The user has outliers in their data that they want to handle self-consistently (instead of removing them by hand.)
- The user has covariant errors in X and Y for some of their data. (Admittedly, this is pretty uncommon, mostly because people don't check for covariance in their data very often.)
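Most of the Gaussian-Y-error cases above reduce to minimizing a weighted chi-square, which is a useful baseline to keep in mind when comparing the harder cases. A minimal sketch (the function name and signature are mine, not a proposed API):

```go
package main

import "fmt"

// chi2 computes the weighted chi-square of a model curve f against
// data (xs, ys) with per-point Gaussian Y errors sigmas:
//   chi2 = sum_i ((ys[i] - f(xs[i])) / sigmas[i])^2
func chi2(f func(float64) float64, xs, ys, sigmas []float64) float64 {
	var sum float64
	for i, x := range xs {
		r := (ys[i] - f(x)) / sigmas[i]
		sum += r * r
	}
	return sum
}

func main() {
	line := func(x float64) float64 { return 2*x + 1 }
	xs := []float64{0, 1, 2}
	ys := []float64{1.5, 3, 4.5}        // residuals: 0.5, 0, -0.5
	sigmas := []float64{0.5, 0.5, 0.5}  // scaled residuals: 1, 0, -1
	fmt.Println(chi2(line, xs, ys, sigmas)) // 1 + 0 + 1 = 2
}
```

The identical-unknown-error case is the same objective with all weights equal; the X-error and non-Gaussian cases are exactly the ones this objective cannot express.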
Models
- The user is fitting a 1D curve through 2D data.
- The user is fitting an (N-1)D surface through ND data.
- The user is fitting a _distribution_ to their data as opposed to a curve.
- (A special case of the distribution-fitting bullet above:) The user is fitting a curve where there is some (possibly unknown) intrinsic scatter in the points that is part of the model itself and not due to reported measurement errors. (IMO, this is actually more common than the case where there is no intrinsic scatter.)
- (A special case of the distribution-fitting bullet above:) The user is fitting a curve where the intrinsic scatter is a function of the independent variables.
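For the intrinsic-scatter case, the standard move is to add the scatter in quadrature to each point's measurement error and fit it as an extra parameter; the log-variance normalization term then has to be kept, because it depends on the fitted scatter (plain chi-square drops it as a constant). A minimal sketch (names are mine, not a proposed API):

```go
package main

import (
	"fmt"
	"math"
)

// negLogLike is the Gaussian negative log-likelihood for a curve f with
// intrinsic scatter sigInt added in quadrature to the measurement errors.
// Unlike chi-square, the log(variance) term is kept: it depends on sigInt,
// so dropping it would bias the fitted scatter toward infinity.
func negLogLike(f func(float64) float64, xs, ys, sigmas []float64, sigInt float64) float64 {
	var nll float64
	for i, x := range xs {
		v := sigmas[i]*sigmas[i] + sigInt*sigInt
		r := ys[i] - f(x)
		nll += 0.5 * (r*r/v + math.Log(2*math.Pi*v))
	}
	return nll
}

func main() {
	flat := func(x float64) float64 { return 0 }
	xs := []float64{0, 1, 2, 3}
	ys := []float64{-1, 1, -1, 1}             // scatter ~1 around the flat line
	sigmas := []float64{0.1, 0.1, 0.1, 0.1}   // tiny reported errors
	// With intrinsic scatter ~1 the likelihood is far better than with 0:
	fmt.Println(negLogLike(flat, xs, ys, sigmas, 1.0) < negLogLike(flat, xs, ys, sigmas, 0.0))
}
```

The X-dependent-scatter bullet is the same likelihood with sigInt replaced by a function of x.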
Analysis
- The user has no idea what is going on and just wants a qualitatively reasonable set of best fit parameters.
- The user wants the best fit parameter vector that corresponds to the point of maximum likelihood in the joint posterior distribution of all the model parameters.
- The user wants low-sigma error bars on the parameters and is willing to assume that the posterior distribution is Gaussian.
- The user has reason to believe that the posterior distribution will not be Gaussian and wants some other statistic to represent the best fit values and error bars (e.g. the median and 68% contours). (IMO, this is significantly more common than the alternative.)
- The user wants high-sigma error bars.
- The user wants covariance information.
- The user wants the full posterior distribution. (e.g. they want to check for weird features by eye.)
- The user wants to change the initial parameter guess.
- The user has priors on what portions of parameter space the best fit can fall within. (The most common reason for this would be that you want to keep your objective function from blowing up, but there are many other reasons one could choose.)
- The user wants commonly used fit statistics, like R^2 or reduced Chi^2.
- The user wants to compare the evidence ratio of different models to one another.
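Some of these analysis items are nearly free to provide once a fit has run. For instance, reduced chi-square is just a division by the degrees of freedom; a sketch (names mine):

```go
package main

import "fmt"

// redChi2 is the reduced chi-square: chi2 divided by the number of
// degrees of freedom (nPoints - nParams). Values far from 1 suggest
// a poor model or misestimated errors.
func redChi2(chi2 float64, nPoints, nParams int) float64 {
	return chi2 / float64(nPoints-nParams)
}

func main() {
	fmt.Println(redChi2(18.0, 20, 2)) // 18 / (20 - 2) = 1
}
```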
I'm going to eventually suggest an API and backend which supports most of these things out of the box (and all of them if the user is willing to put in some elbow grease), but the vast majority of libraries choose not to do this. I'd say that the bare minimum would be:
- Support 1D curves with different Y errors on every point.
- Support restricting ranges on all fit parameters.
- Support 1 sigma errors and parameter covariances.
(Although I'd personally be pretty uncomfortable signing off on a library that only supported these cases. But I'm just some random guy, so that's not worth much ;-) )
Thoughts on go-hep/fit:
I have some objections to this approach. The first, which you can probably guess from the preceding discussion, is that this library is pretty opinionated about the types of data sets it will accept. I'm less worried about supporting the different model and data types than I am about the assumptions about the shapes of the posterior parameter distributions. Additionally, the approach for fitting distributions (fitting a curve to an intermediate histogram) will give reasonable qualitative results in some cases, but it is not quantitatively correct, particularly in cases where you have a small or moderate number of points, cases where your samples have measurement errors, cases where the distribution is steeply sloped, or cases where the user cares about getting correct errors on the distribution parameters. (I can elaborate on this later if you disagree.) If we want to support fitting distributions, we should do it with a method designed to fit distributions, like MCMC.
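To make the histogram-vs-distribution point concrete: an unbinned likelihood works directly on the samples, so no binning choices enter at all. A toy sketch for an exponential distribution (names are mine; a real backend would hand this objective to a minimizer or sampler rather than eyeball it):

```go
package main

import (
	"fmt"
	"math"
)

// expNLL is the unbinned negative log-likelihood of an exponential
// distribution with rate lambda: -sum_i log(lambda * exp(-lambda*x_i)).
// No intermediate histogram is built, so there is no binning bias.
func expNLL(lambda float64, xs []float64) float64 {
	var nll float64
	for _, x := range xs {
		nll -= math.Log(lambda) - lambda*x
	}
	return nll
}

func main() {
	xs := []float64{0.1, 0.5, 0.9, 1.3, 2.2} // mean = 1.0
	// For an exponential, the unbinned MLE has a closed form: 1/mean(xs) = 1.
	// The NLL at the MLE is lower than at nearby rates:
	fmt.Println(expNLL(1.0, xs) < expNLL(1.1, xs), expNLL(1.0, xs) < expNLL(0.9, xs))
}
```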
I also have some objections to the API which aren't related purely to functionality or to the backend. The error-checking approach seems to stutter to me because the user has to check two pointers against nil. Also, I'm not a fan of having optimize.Settings and optimize.Method in the parameter list for fit.Curve1D. The majority of users are going to leave these as nil, so I think it would work better to leave them out of the parameter list. (The API could still be adapted to let people change these if they want: adding a second function, changing Curve1D to a constructor with Settings and Method methods, variadic functional options, etc.) One benefit of doing this is that the parameter list becomes small enough that the required fields of Func1D (F, X, and Y) can be turned into function parameters, which lets the compiler check that the fields were set and also makes it easier for users to read the documentation for the function.
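For illustration, here is roughly what the variadic-functional-options variant could look like. Every name below is hypothetical, not a concrete proposal for go-hep/fit:

```go
package main

import "fmt"

// Option is the usual Go functional-option type; each option mutates
// a private config struct. All names here are hypothetical.
type Option func(*config)

type config struct {
	err []float64 // per-point Y errors; nil means unweighted
}

// WithYErrors is a hypothetical option supplying per-point Y errors.
func WithYErrors(err []float64) Option {
	return func(c *config) { c.err = err }
}

// Curve1D sketches a signature where the required inputs (f, x, y) are
// ordinary parameters the compiler can check, and everything optional
// rides in as variadic options with sensible defaults.
func Curve1D(f func(x float64, ps []float64) float64, x, y []float64, opts ...Option) error {
	cfg := &config{}
	for _, opt := range opts {
		opt(cfg)
	}
	if len(x) != len(y) {
		return fmt.Errorf("fit: x and y lengths differ: %d != %d", len(x), len(y))
	}
	fmt.Println("weighted:", cfg.err != nil)
	return nil // a real implementation would run the minimizer here
}

func main() {
	f := func(x float64, ps []float64) float64 { return ps[0]*x + ps[1] }
	x := []float64{0, 1, 2}
	y := []float64{1, 3, 5}
	_ = Curve1D(f, x, y)                                  // defaults
	_ = Curve1D(f, x, y, WithYErrors([]float64{1, 1, 1})) // with errors
}
```

The common path (no errors, default optimizer) stays a one-liner, while Settings- and Method-style options can be layered on without touching the signature.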
Also, in your example code the Y errors read from the file in the first code block are never used, which means there are no examples that use the Err field of Func1D.
Best,
-Phil