Fitting a line


Adam Ginsburg

Mar 3, 2013, 3:26:32 PM
to astroml...@googlegroups.com
I've had some code sitting around for fitting a line to data.  From what I can tell of astroML / sklearn, there is actually no provision for fitting data with errors on both axes using pymc.  The outlier rejection example, in particular, is pretty limited as written, in that it requires Y errors and fits an intercept along with a slope (although that's easy to modify).

So, is there any interest in including more general linear regression (using MCMC) in astroml?  Or better yet, any plans to do so?

Here's the code, with apologies for the svn repository (I'll move it to git if there's interest):
In particular, look at "pymc_linear_fit".
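
(Not the actual pymc_linear_fit code, but a minimal sketch of the kind of model it builds - a straight line with Gaussian errors on both x and y, with the true x positions treated as latent parameters - might look roughly like this in pymc 2.x.  The priors, error values, and synthetic data below are purely illustrative:)

import numpy as np
import pymc  # pymc 2.x

# Synthetic data: a straight line observed with Gaussian errors on both axes.
np.random.seed(0)
N = 30
true_slope, true_intercept = 1.5, 0.5
x_err, y_err = 0.3, 0.5
x_real = np.random.uniform(0, 10, N)
x_obs = x_real + np.random.normal(0, x_err, N)
y_obs = true_slope * x_real + true_intercept + np.random.normal(0, y_err, N)

# Priors on the line parameters.
slope = pymc.Uniform('slope', -10, 10, value=1.0)
intercept = pymc.Uniform('intercept', -10, 10, value=0.0)

# The true x positions are latent variables with a broad prior.
x_true = pymc.Uniform('x_true', -20, 20, value=x_obs, size=N)

@pymc.deterministic
def y_model(slope=slope, intercept=intercept, x=x_true):
    return slope * x + intercept

# Observed x and y scatter about the latent values with known errors.
x_like = pymc.Normal('x_like', mu=x_true, tau=1.0 / x_err ** 2,
                     value=x_obs, observed=True)
y_like = pymc.Normal('y_like', mu=y_model, tau=1.0 / y_err ** 2,
                     value=y_obs, observed=True)

M = pymc.MCMC([slope, intercept, x_true, y_model, x_like, y_like])
M.sample(iter=20000, burn=5000)
print(M.trace('slope')[:].mean(), M.trace('intercept')[:].mean())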

Adam


Jake Vanderplas

Mar 3, 2013, 5:08:53 PM
to astroml...@googlegroups.com
Hi Adam,
We have a few examples for fitting a line to data with noise in both x and y.  I just wanted to make sure you were aware of them:
   http://astroml.github.com/book_figures/chapter8/fig_total_least_squares.html
   http://astroml.github.com/book_figures/chapter8/fig_outlier_rejection.html
It sounds like you looked at the second one already.  My thought on the matter is that the Bayesian approach is very nuanced, and you must make sure your model fits what you know about the data (for example, what distribution do you expect the outliers to be drawn from?).  For this reason, the approach we've taken is to give examples of approaches on toy data rather than to provide a python function that does it automatically.  Any such function would be very likely to be misused.

That being said, I think the code you linked to is great.  If you'd be open to submitting it as an astroML example script for the website, that would be awesome.





Adam Ginsburg

Mar 3, 2013, 5:12:59 PM
to astroml...@googlegroups.com
On Sun, Mar 3, 2013 at 2:49 PM, Jake Vanderplas <jak...@gmail.com> wrote:
> Hi Adam,
> We have a few examples for fitting a line to data with noise in both x and
> y. I just wanted to make sure you were aware of them:
>
> http://astroml.github.com/book_figures/chapter8/fig_total_least_squares.html
> http://astroml.github.com/book_figures/chapter8/fig_outlier_rejection.html

Yes, there's a good part of my code I wouldn't have written had I been
aware of astroML earlier.



> It sounds like you looked at the second one already. My thought on the
> matter is that the Bayesian approach is very nuanced, and you must make sure
> your model fits what you know about the data (for example, what distribution
> do you expect outliers to be drawn from?) For this reason, the approach
> we've taken is to give examples of approaches on toy data rather than to
> provide a python function that does it automatically. Any such function
> would be very likely to be misused.

I'm in general opposed to the philosophy that "likely to be misused"
means you shouldn't distribute (or simplify) code. I like your
examples well enough, though, and can probably use them without much
difficulty.

However, on this specific point - identifying outliers - is there a
more general approach? The two approaches you illustrate from Hogg's
paper are both good & powerful, but is there any reason you couldn't
identify outliers with an uninformative prior, under the assertion
that "outlier" means "probably not described by the model"?

So instead of the Bernoulli prior, you'd just start with a
DiscreteUniform distribution that only allows two variables. And
perhaps set the likelihood such that if the variable is marked
"Outlier", its likelihood is replaced with the likelihood of another
data point sampled from the observations... basically, this becomes a
filter to remove the least-likely data points. That's a rather
complicated set up, though, so perhaps the answer is 'no, there's
nothing more general'.

My problem was dealing with outliers in a set of data where the parent
distribution of the outliers is very poorly understood.

--
Adam

Jake Vanderplas

Mar 3, 2013, 6:14:36 PM
to astroml...@googlegroups.com
On Sun, Mar 3, 2013 at 2:12 PM, Adam Ginsburg <kefl...@gmail.com> wrote:
> On Sun, Mar 3, 2013 at 2:49 PM, Jake Vanderplas <jak...@gmail.com> wrote:
>> Hi Adam,
>> We have a few examples for fitting a line to data with noise in both x and
>> y. I just wanted to make sure you were aware of them:
>>
>> http://astroml.github.com/book_figures/chapter8/fig_total_least_squares.html
>> http://astroml.github.com/book_figures/chapter8/fig_outlier_rejection.html
>
> Yes, there's a good part of my code I wouldn't have written had I been
> aware of astroML earlier.
>
>> It sounds like you looked at the second one already. My thought on the
>> matter is that the Bayesian approach is very nuanced, and you must make sure
>> your model fits what you know about the data (for example, what distribution
>> do you expect outliers to be drawn from?) For this reason, the approach
>> we've taken is to give examples of approaches on toy data rather than to
>> provide a python function that does it automatically. Any such function
>> would be very likely to be misused.
>
> I'm in general opposed to the philosophy that "likely to be misused"
> means you shouldn't distribute (or simplify) code. I like your
> examples well enough, though, and can probably use them without much
> difficulty.

I've had a lot of conversations with folks about this, and I have to admit I still don't have a good idea about the best approach.  On one hand, importable functions for performing analysis are convenient and user-friendly.  But the average user will then use them without necessarily understanding the assumptions that go into them.  This might be alright for some low-level things (e.g. something like a quicksort), but for high-level things like a multi-component Bayesian model, it's more complicated.  We've tended to avoid this in astroML by sharing examples in these circumstances, rather than trying to implement a one-size-fits-all function.  Though, admittedly, we've strayed from that philosophy in some cases (one example is the Bayesian Blocks histogram routines).  It's really tough to figure out the best approach.

One result of this is that the astroML package itself has no dependency on pymc: several of the example scripts use it, but nothing imported from astroML does.  This is a consequence of what I was talking about above: there are so many tunable things in pymc that any one-size-fits-all function relying on it would fall into one of two categories: it would either be so simplistic as to be only very narrowly useful, or it would basically replicate the pymc interface in order to have the flexibility it needs to be useful.  With this in mind, we've opted not to implement one-size-fits-all routines, and to share the code via example scripts instead.
 

> However, on this specific point - identifying outliers - is there a
> more general approach? The two approaches you illustrate from Hogg's
> paper are both good & powerful, but is there any reason you couldn't
> identify outliers with an uninformative prior, under the assertion
> that "outlier" means "probably not described by the model"?
>
> So instead of the Bernoulli prior, you'd just start with a
> DiscreteUniform distribution that only allows two variables. And
> perhaps set the likelihood such that if the variable is marked
> "Outlier", its likelihood is replaced with the likelihood of another
> data point sampled from the observations... basically, this becomes a
> filter to remove the least-likely data points. That's a rather
> complicated set up, though, so perhaps the answer is 'no, there's
> nothing more general'.

Interesting ideas - I haven't thought too deeply about this myself, but I don't see any reason off-hand why this sort of thing wouldn't work.
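
(For concreteness, the flag-per-point part of that idea - leaving aside the business of swapping in the likelihood of another data point - could be sketched in pymc 2.x roughly as below.  This is only an illustration: a DiscreteUniform(0, 1) flag per point, with the likelihood switching between the known measurement error and a deliberately broad "outlier" Gaussian.  It is not code from the astroML examples or from pymc_linear_fit.)

import numpy as np
import pymc  # pymc 2.x

# Synthetic y-error-only data with a few gross outliers injected.
np.random.seed(1)
N = 25
x = np.linspace(0, 10, N)
y_err = 0.5
y = 1.5 * x + 0.5 + np.random.normal(0, y_err, N)
y[[3, 12, 20]] += 8.0

slope = pymc.Uniform('slope', -10, 10, value=1.0)
intercept = pymc.Uniform('intercept', -10, 10, value=0.0)

# One flag per point: 0 = "described by the line", 1 = "outlier".
# DiscreteUniform(0, 1) is the uninformative (50/50) prior on the flag.
is_outlier = pymc.DiscreteUniform('is_outlier', 0, 1,
                                  value=np.zeros(N, dtype=int), size=N)

@pymc.deterministic
def tau(is_outlier=is_outlier):
    # Good points keep the measurement error; flagged points get a much
    # broader Gaussian, so they barely constrain the line.
    sigma = np.where(is_outlier, 10.0, y_err)
    return 1.0 / sigma ** 2

@pymc.deterministic
def y_model(slope=slope, intercept=intercept):
    return slope * x + intercept

y_like = pymc.Normal('y_like', mu=y_model, tau=tau, value=y, observed=True)

M = pymc.MCMC([slope, intercept, is_outlier, tau, y_model, y_like])
M.sample(iter=20000, burn=5000)
print(M.trace('slope')[:].mean(), M.trace('intercept')[:].mean())
print(M.trace('is_outlier')[:].mean(axis=0))  # posterior outlier probability per point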

I do think that the code you linked to would be great to add as examples on the astroML page.  I'm sure people would find it useful.  If you agree, you could put together a github pull request.  Let me know,
   Jake
 

> My problem was dealing with outliers in a set of data where the parent
> distribution of the outliers is very poorly understood.
>
> --
> Adam
