test datasets for regression / robust regression ?

denis

Nov 16, 2012, 4:50:12 AM
to pystat...@googlegroups.com
Stats people,
  can anyone suggest online test datasets for regression / robust regression ?
Scikit-learn has only one of any size (Boston house prices, 500 x 13),
and more testing, especially on datasets with outliers, would help compare and tune
the different algorithms.
(How to plot and compare them, beyond a train/test split and residual plots ?)

Thanks,
cheers
  -- denis

VincentAB

Nov 16, 2012, 8:14:34 AM
to pystat...@googlegroups.com
Try this: Rdatasets is a collection of over 500 data sets that were originally distributed alongside the statistical software environment R and some of its add-on packages. They are now available as CSV files and they are fully documented. 


Best, 

Vincent

Skipper Seabold

Nov 16, 2012, 8:36:36 AM
to pystat...@googlegroups.com
Is the size of the dataset important? Most of the traditional stats
datasets that I know of are not "big." The Duncan prestige data has 45
observations. I just added a dataset to statsmodels based on US state
crime statistics from the late 2000s that has a few problem
observations. There's the stackloss data in statsmodels but it's again
only 21 observations.

I'm not familiar with the boston house prices data. Does it contain
outliers or just long tails?

I'm actually looking at robust regression methods now, so I'll let you
know if I come across any good, big benchmark datasets. Josef
may know of some.

Alternatively, if the idea is to test the estimator more than to have
a clear example, in the past I've simulated data with outliers to test
the assumptions on theoretical breakdown points of estimators. You
could always do this.
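
A minimal sketch of that kind of simulation, assuming NumPy only. The Theil-Sen slope (median of all pairwise slopes) is a stand-in chosen here for illustration, not necessarily an estimator Skipper tested; the function names, the contamination scheme, and the parameter values are all made up. Outliers are placed at the largest x values, where they have the most leverage on the OLS slope.

```python
import numpy as np

def simulate(n, frac_outliers, rng, intercept=1.0, slope=2.0):
    """Draw y = intercept + slope * x + noise, then corrupt a fraction of the points."""
    x = rng.uniform(0.0, 10.0, size=n)
    y = intercept + slope * x + rng.normal(scale=0.5, size=n)
    n_out = int(round(frac_outliers * n))
    # Corrupt the points with the largest x (high leverage). Slicing from
    # n - n_out gives an empty selection when there are no outliers.
    bad = np.argsort(x)[n - n_out:]
    y[bad] = -20.0 + rng.normal(scale=0.5, size=n_out)
    return x, y

def ols_slope(x, y):
    """Least-squares slope of a straight-line fit."""
    return np.polyfit(x, y, 1)[0]

def theil_sen_slope(x, y):
    """Median of the slopes through all pairs of points."""
    i, j = np.triu_indices(len(x), k=1)
    dx = x[j] - x[i]
    ok = dx != 0
    return np.median((y[j] - y[i])[ok] / dx[ok])

rng = np.random.default_rng(0)
results = {}
for frac in (0.0, 0.1, 0.2, 0.4):
    x, y = simulate(200, frac, rng)
    results[frac] = (ols_slope(x, y), theil_sen_slope(x, y))
    print(f"outliers={frac:.0%}  OLS slope={results[frac][0]:6.2f}  "
          f"Theil-Sen slope={results[frac][1]:6.2f}")
```

The true slope is 2.0. The Theil-Sen slope has a theoretical breakdown point of about 29%, so one would expect it to stay near 2 at 10% contamination, drift somewhat at 20%, and fail at 40%, while the OLS slope is pulled away at every nonzero contamination level.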

Skipper

josef...@gmail.com

Nov 16, 2012, 9:40:20 AM
to pystat...@googlegroups.com
My experience is pretty much the same. The main "classical" robust
regression datasets have 15 to 30 observations. Some books on robust
regression have accompanying datasets, but I don't remember if
anything is larger than a few observations.
There might be a few more like Boston housing.

(I think we should include Boston housing in statsmodels, but I have
seen it as an example in microeconometrics, IIRC)

Josef


>
> Skipper

Skipper Seabold

Nov 16, 2012, 9:46:27 AM
to pystat...@googlegroups.com
I assume you've asked on the sklearn ML, but Jake may know of some
examples in astronomy.

http://astroml.github.com/modules/classes.html#astronomy-datasets

Skipper

denis

Nov 16, 2012, 12:55:07 PM
to pystat...@googlegroups.com
Folks,
  Rdatasets looks like a worthy effort. Vincent, would you know which csvs are for regression ?
csv/MASS/Boston.csv looks familiar .
Fwiw, some sizes --

    # nrow  ncol  file  colnames
    2783  1  csv/MASS/SP500.csv     r500
    1000  3  csv/MASS/synth.te.csv  xs ys yc
     506 14  csv/MASS/Boston.csv    crim zn indus chas nox rm age dis rad tax ptrati
     365  9  csv/MASS/gilgais.csv   pH00 pH30 pH80 e00 e30 e80 c00 c30 c80
     314  2  csv/MASS/GAGurine.csv  Age GAG
     299  2  csv/MASS/geyser.csv    waiting duration
     250  3  csv/MASS/synth.tr.csv  xs ys yc
     205  7  csv/MASS/Melanoma.csv  time status sex age year thickness ulcer
     189 10  csv/MASS/birthwt.csv   low age lwt race smoke ptl ht ui ftv bwt
     133  2  csv/MASS/mcycle.csv    times accel
     114  4  csv/MASS/beav1.csv     day time temp activ
     100  4  csv/MASS/beav2.csv     day time temp activ
    ...
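
A listing like the one above can be produced with a few lines of Python. This is a sketch using only the standard library, not necessarily how the table was generated; the directory name csv/MASS is taken from the listing, and the helper names are made up. It counts every header field as a column, so an unnamed row-index column, if the files have one, would be included in ncol.

```python
import csv
from pathlib import Path

def summarize_csv(path):
    """Return (nrow, ncol, column names) for one CSV file with a header row."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        nrow = sum(1 for _ in reader)  # data rows, header excluded
    return nrow, len(header), header

def list_sizes(directory):
    """Summaries for every *.csv directly under directory, largest first."""
    rows = [(*summarize_csv(p), p) for p in Path(directory).glob("*.csv")]
    return sorted(rows, key=lambda r: r[0], reverse=True)

if __name__ == "__main__":
    for nrow, ncol, cols, path in list_sizes("csv/MASS"):
        print(f"{nrow:6d} {ncol:3d}  {path}  {' '.join(cols)}")
```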

I was looking for > say 100 observations
so that I could split them into say 2/3 train, 1/3 test,
regress Xytrain, test that on Xytest;
my goal was just to get an overview of the many *Regress in sklearn,
see github.com/denis-bz/sklearn-regress .
But that may be silly -- I'm not a statistician, hardly know R, terrible.
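
The split, fit, and score loop described above could look like the following sketch. It assumes a recent scikit-learn (sklearn.model_selection did not exist in 2012) and uses synthetic data with some corrupted training targets rather than any of the CSVs listed; the choice of estimators, and of median absolute error as a second metric, is illustrative only.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression, TheilSenRegressor
from sklearn.metrics import mean_squared_error, median_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic linear data (all values here are made up for illustration).
rng = np.random.default_rng(0)
n, p = 300, 3
X = rng.normal(size=(n, p))
true_coef = np.array([1.5, -2.0, 0.5])
y = X @ true_coef + rng.normal(scale=0.3, size=n)

# 2/3 train, 1/3 test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0
)

# Corrupt 10% of the training targets only, so the test set still
# measures how well each fit recovers the clean relationship.
y_train = y_train.copy()
n_bad = len(y_train) // 10
y_train[:n_bad] += 25.0

models = {
    "OLS": LinearRegression(),
    "Huber": HuberRegressor(),
    "Theil-Sen": TheilSenRegressor(random_state=0),
}
scores = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    medae = float(median_absolute_error(y_test, pred))
    scores[name] = (rmse, medae)
    print(f"{name:10s} test RMSE={rmse:.3f}  median abs err={medae:.3f}")
```

Scoring on an uncontaminated test set is one design choice among several; if the test set also contains outliers, RMSE is dominated by them and a resistant metric such as the median absolute error is the more informative column.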

Somewhat offtopic but funny (wry-funny):
http://stats.stackexchange.com/questions/1164/why-havent-robust-and-resistant-statistics-replaced-classical-techniques

cheers
  -- denis

Skipper Seabold

Nov 16, 2012, 1:15:54 PM
to pystat...@googlegroups.com
On Fri, Nov 16, 2012 at 12:55 PM, denis <denis...@t-online.de> wrote:
> Folks,
> Rdatasets looks like a worthy effort. Vincent, would you know which csvs
> are for regression ?

The datasets are arranged by the R package they come from,
i.e., http://cran.r-project.org/web/packages/MASS/index.html

I usually sift through existing examples in software packages and
published papers/textbooks to try to find example datasets. For
instance, you might be interested in the Rousseeuw and Van Driessen
paper on computing Least Trimmed Squares. The DPOSS data has 56k
observations; as I mentioned, there are probably more examples in
astronomy that would be useful. We also have a PR out right now to
include Fast-LTS in statsmodels.

AFAIK, robustbase is the 'best' state of the art robust regression
package for R, but IIRC most of their examples are still using small
data sets. You'll probably want to have a look.

http://cran.r-project.org/web/packages/robustbase/index.html

Just some advertising: you can get all the datasets in Rdatasets into a
DataFrame using statsmodels master now, but robustbase isn't included
yet.

import statsmodels.api as sm
boston = sm.datasets.get_rdataset("Boston", "MASS")  # boston.data is a pandas DataFrame

josef...@gmail.com

Nov 16, 2012, 1:16:34 PM
to pystat...@googlegroups.com
On Fri, Nov 16, 2012 at 12:55 PM, denis <denis...@t-online.de> wrote:
I only got to the 3rd answer, which I upvoted
"So 'classical models' (whatever they are ..."

which robust ?

econometrics is full of "robust" (to mis-specification)
OLS is robust *)
sandwich estimators for covariance are robust
GEE/GMM are robust *)

*) no (?) distributional assumptions
GMM = Generalized Method of Moments
GEE = Generalized Estimating Equations

-------

However, we are on a mission to popularize statsmodels.robust for
"robust to outliers"

Josef


>
> cheers
> -- denis

Skipper Seabold

Nov 16, 2012, 1:20:52 PM
to pystat...@googlegroups.com
Good point. I was waffling a bit about what is meant by a robust
estimator for big data in the first place and how this fits into the
data mining mentality.

Skipper

josef...@gmail.com

Nov 16, 2012, 1:29:39 PM
to pystat...@googlegroups.com
On Fri, Nov 16, 2012 at 1:16 PM, <josef...@gmail.com> wrote:
I forgot one:
Empirical Likelihood: ``statsmodels.emplike`` needs advertising.

Josef

Skipper Seabold

Nov 16, 2012, 1:40:38 PM
to pystat...@googlegroups.com
On Fri, Nov 16, 2012 at 1:15 PM, Skipper Seabold <jsse...@gmail.com> wrote:
> AFAIK, robustbase is the 'best' state of the art robust regression
> package for R, but IIRC most of their examples are still using small
> data sets. You'll probably want to have a look.
>
> http://cran.r-project.org/web/packages/robustbase/index.html

There are actually several datasets in this package with many
observations, so this might be your best bet.

    > nrow(NOxEmissions)
    [1] 8088

Skipper Seabold

Nov 16, 2012, 2:37:22 PM
to pystat...@googlegroups.com
On Fri, Nov 16, 2012 at 1:15 PM, Skipper Seabold <jsse...@gmail.com> wrote:
> Just some advertising, you can get all the datasets in Rdataset into a
> DataFrame using statsmodels master now, but robustbase isn't included
> yet.
>

Thanks to Vincent, all the robustbase datasets are now available also
from statsmodels.

https://github.com/vincentarelbundock/Rdatasets/tree/master/csv/robustbase

VincentAB

Nov 16, 2012, 2:38:38 PM
to pystat...@googlegroups.com


On Friday, November 16, 2012 4:50:12 AM UTC-5, denis wrote:
I followed Skipper's suggestion and added data from the 'robustbase' R package to Rdatasets. You'll find them by searching for 'robustbase' on the html index: https://vincentarelbundock.github.com/Rdatasets

Vincent 
