Hi Sebastien,
I have some thoughts on how to do this. It will take me a little while to compose them, so I'll be back either tomorrow or next weekend with a post on them. Sorry about not following up on the comment I left on that issue back in November: I had teaching, research, and personal obligations that all popped up at the same time.
In the meantime, I think it would be good for us to be explicit about the types of users and use cases we want to support.
Users:
- Do we want to support users who want publication-quality fits?
- Do we want to support users with strict time constraints? (e.g. The difference between a developer who needs a fit to be part of a responsive real-time system and a researcher who is willing to wait fifteen seconds for their fit to finish.)
- Do we want to support users who only want predictive fits and don't care much about the corresponding functional form or about parameter values?
- Do we want our API to cater mostly to non-expert users?
- Do we want our API to cater mostly to expert users?
(Personally, I favor an approach that focuses on the first and fourth points (publication-quality fits and an API that caters to non-experts), but regardless, this is something that would be good to decide early.)
Use Cases:
Here is a non-exhaustive list of reasonable things a researcher might want a fitting library to be able to do. I've encountered all of the cases I list below multiple times in the past year alone and have seen them all in papers. We're obviously under no obligation to support all of these, but it's good to understand the ecosystem of things that people might want to do and also good to understand what types of things we don't want to support.
Data
(Assuming 2D data to keep terminology simple)
- The user has no information on the associated errors, but has reason to believe that the X errors are small and that all Y errors are Gaussian and identical.
- The user has Gaussian Y errors on every point.
- The user has Gaussian X and Y errors on every point.
- The user has non-Gaussian errors.
- (A special case of the non-Gaussian bullet above:) Some points are only upper or lower bounds.
- (A special case of the non-Gaussian bullet above:) Some points have asymmetric error bars.
- The user has data which is maliciously distributed (e.g. fitting a power law to data which has an enormous number of measurements at low X and only a few measurements at high X).
- The user has no information on the associated errors, but has no reason to believe that different data points have the same types of errors (e.g. one can see a large number of outliers, different scatter in different parts of the data set, etc.)
- The user has outliers in their data that they want to handle self-consistently (instead of removing them by hand.)
- The user has covariant errors in X and Y for some of their data. (Admittedly, this is pretty uncommon, mostly because people don't check for covariance in their data very often.)
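Most of the Gaussian-Y-error cases above reduce to minimizing a weighted chi-square, which is a useful baseline to keep in mind when comparing the harder cases. A minimal sketch (the function name and signature are mine, not a proposed API):

```go
package main

import "fmt"

// chi2 computes the weighted chi-square of a model curve f against
// data (xs, ys) with per-point Gaussian Y errors sigmas:
//   chi2 = sum_i ((ys[i] - f(xs[i])) / sigmas[i])^2
func chi2(f func(float64) float64, xs, ys, sigmas []float64) float64 {
	var sum float64
	for i, x := range xs {
		r := (ys[i] - f(x)) / sigmas[i]
		sum += r * r
	}
	return sum
}

func main() {
	line := func(x float64) float64 { return 2*x + 1 }
	xs := []float64{0, 1, 2}
	ys := []float64{1.5, 3, 4.5}        // residuals: 0.5, 0, -0.5
	sigmas := []float64{0.5, 0.5, 0.5}  // scaled residuals: 1, 0, -1
	fmt.Println(chi2(line, xs, ys, sigmas)) // 1 + 0 + 1 = 2
}
```

The identical-unknown-error case is the same objective with all weights equal; the X-error and non-Gaussian cases are exactly the ones this objective cannot express.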
Models
- The user is fitting a 1D curve through 2D data.
- The user is fitting an (N-1)D surface through ND data.
- The user is fitting a _distribution_ to their data as opposed to a curve.
- (A special case of the distribution-fitting bullet above:) The user is fitting a curve where there is some (possibly unknown) intrinsic scatter in the points that is part of the model itself and not due to reported measurement errors. (IMO, this is actually more common than the case where there is no intrinsic scatter.)
- (A special case of the distribution-fitting bullet above:) The user is fitting a curve where the intrinsic scatter is a function of the independent variables.
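For the intrinsic-scatter case, the standard move is to add the scatter in quadrature to each point's measurement error and fit it as an extra parameter; the log-variance normalization term then has to be kept, because it depends on the fitted scatter (plain chi-square drops it as a constant). A minimal sketch (names are mine, not a proposed API):

```go
package main

import (
	"fmt"
	"math"
)

// negLogLike is the Gaussian negative log-likelihood for a curve f with
// intrinsic scatter sigInt added in quadrature to the measurement errors.
// Unlike chi-square, the log(variance) term is kept: it depends on sigInt,
// so dropping it would bias the fitted scatter toward infinity.
func negLogLike(f func(float64) float64, xs, ys, sigmas []float64, sigInt float64) float64 {
	var nll float64
	for i, x := range xs {
		v := sigmas[i]*sigmas[i] + sigInt*sigInt
		r := ys[i] - f(x)
		nll += 0.5 * (r*r/v + math.Log(2*math.Pi*v))
	}
	return nll
}

func main() {
	flat := func(x float64) float64 { return 0 }
	xs := []float64{0, 1, 2, 3}
	ys := []float64{-1, 1, -1, 1}             // scatter ~1 around the flat line
	sigmas := []float64{0.1, 0.1, 0.1, 0.1}   // tiny reported errors
	// With intrinsic scatter ~1 the likelihood is far better than with 0:
	fmt.Println(negLogLike(flat, xs, ys, sigmas, 1.0) < negLogLike(flat, xs, ys, sigmas, 0.0))
}
```

The X-dependent-scatter bullet is the same likelihood with sigInt replaced by a function of x.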
Analysis
- The user has no idea what is going on and just wants a qualitatively reasonable set of best fit parameters.
- The user wants the best fit parameter vector that corresponds to the point of maximum likelihood in the joint posterior distribution of all the model parameters.
- The user wants low-sigma error bars on the parameters and is willing to assume that the posterior distribution is Gaussian.
- The user has reason to believe that the posterior distribution will not be Gaussian and wants some other statistic to represent the best fit values and error bars (e.g. the median and 68% contours). (IMO, this is significantly more common than the alternative.)
- The user wants high-sigma error bars.
- The user wants covariance information.
- The user wants the full posterior distribution. (e.g. they want to check for weird features by eye.)
- The user wants to change the initial parameter guess.
- The user has priors on what portions of parameter space the best fit can fall within. (The most common reason for this would be that you want to keep your objective function from blowing up, but there are many other reasons one could choose.)
- The user wants commonly used fit statistics, like R^2 or reduced Chi^2.
- The user wants to compare the evidence ratio of different models to one another.
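Some of these analysis items are nearly free to provide once a fit has run. For instance, reduced chi-square is just a division by the degrees of freedom; a sketch (names mine):

```go
package main

import "fmt"

// redChi2 is the reduced chi-square: chi2 divided by the number of
// degrees of freedom (nPoints - nParams). Values far from 1 suggest
// a poor model or misestimated errors.
func redChi2(chi2 float64, nPoints, nParams int) float64 {
	return chi2 / float64(nPoints-nParams)
}

func main() {
	fmt.Println(redChi2(18.0, 20, 2)) // 18 / (20 - 2) = 1
}
```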
I'm going to eventually suggest an API and backend which supports most of these things out of the box (and all of them if the user is willing to put in some elbow grease), but the vast majority of libraries choose not to do this. I'd say that the bare minimum would be:
- Support 1D curves with different Y errors on every point.
- Support restricting ranges on all fit parameters.
- Support 1 sigma errors and parameter covariances.
(Although I'd personally be pretty uncomfortable signing off on a library that only supported these cases. But I'm just some random guy, so that's not worth much ;-) )
Thoughts on go-hep/fit:
I have some objections to this approach. The first, which you can probably guess from the preceding discussion, is that this library is pretty opinionated about the types of data sets it will accept. I'm less worried about supporting the different model and data types than I am about the assumptions about the shapes of the posterior parameter distributions. Additionally, the approach for fitting distributions (fitting a curve to an intermediate histogram) will give reasonable qualitative results in some cases, but it is not quantitatively correct, particularly in cases where you have a small or moderate number of points, cases where your samples have measurement errors, cases where the distribution is steeply sloped, or cases where the user cares about getting correct errors on the distribution parameters. (I can elaborate on this later if you disagree.) If we want to support fitting distributions, we should do it with a method designed to fit distributions, like MCMC.
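To make the histogram-vs-distribution point concrete: an unbinned likelihood works directly on the samples, so no binning choices enter at all. A toy sketch for an exponential distribution (names are mine; a real backend would hand this objective to a minimizer or sampler rather than eyeball it):

```go
package main

import (
	"fmt"
	"math"
)

// expNLL is the unbinned negative log-likelihood of an exponential
// distribution with rate lambda: -sum_i log(lambda * exp(-lambda*x_i)).
// No intermediate histogram is built, so there is no binning bias.
func expNLL(lambda float64, xs []float64) float64 {
	var nll float64
	for _, x := range xs {
		nll -= math.Log(lambda) - lambda*x
	}
	return nll
}

func main() {
	xs := []float64{0.1, 0.5, 0.9, 1.3, 2.2} // mean = 1.0
	// For an exponential, the unbinned MLE has a closed form: 1/mean(xs) = 1.
	// The NLL at the MLE is lower than at nearby rates:
	fmt.Println(expNLL(1.0, xs) < expNLL(1.1, xs), expNLL(1.0, xs) < expNLL(0.9, xs))
}
```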
I also have some objections to the API which aren't related purely to functionality or to the backend. The error-checking approach seems to stutter to me because the user has to check two pointers against nil. Also, I'm not a fan of having optimize.Settings and optimize.Method in the parameter list for fit.Curve1D. The majority of users are going to leave these as nil, so I think it would work better to leave them out of the parameter list. (The API could still be adapted to let people change these if they want: adding a second function, changing Curve1D to a constructor with Settings and Method methods, variadic functional options, etc.) One benefit of doing this is that the parameter list becomes small enough that the required fields of Func1D (F, X, and Y) can be turned into function parameters, which lets the compiler check that the fields were set and also makes it easier for users to read the documentation for the function.
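For illustration, here is roughly what the variadic-functional-options variant could look like. Every name below is hypothetical, not a concrete proposal for go-hep/fit:

```go
package main

import "fmt"

// Option is the usual Go functional-option type; each option mutates
// a private config struct. All names here are hypothetical.
type Option func(*config)

type config struct {
	err []float64 // per-point Y errors; nil means unweighted
}

// WithYErrors is a hypothetical option supplying per-point Y errors.
func WithYErrors(err []float64) Option {
	return func(c *config) { c.err = err }
}

// Curve1D sketches a signature where the required inputs (f, x, y) are
// ordinary parameters the compiler can check, and everything optional
// rides in as variadic options with sensible defaults.
func Curve1D(f func(x float64, ps []float64) float64, x, y []float64, opts ...Option) error {
	cfg := &config{}
	for _, opt := range opts {
		opt(cfg)
	}
	if len(x) != len(y) {
		return fmt.Errorf("fit: x and y lengths differ: %d != %d", len(x), len(y))
	}
	fmt.Println("weighted:", cfg.err != nil)
	return nil // a real implementation would run the minimizer here
}

func main() {
	f := func(x float64, ps []float64) float64 { return ps[0]*x + ps[1] }
	x := []float64{0, 1, 2}
	y := []float64{1, 3, 5}
	_ = Curve1D(f, x, y)                                  // defaults
	_ = Curve1D(f, x, y, WithYErrors([]float64{1, 1, 1})) // with errors
}
```

The common path (no errors, default optimizer) stays a one-liner, while Settings- and Method-style options can be layered on without touching the signature.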
Also, in your example code the Y errors read from the file in the first code block are never used, which means there are no examples that use the Err field of Func1D.
Best,
-Phil