Massive Scale Mythbusting of Data Science Rules of Thumb with OpenML

Peter van der Putten

Feb 10, 2021, 5:14:44 PM2/10/21
to LIACS thesis projects
Dear students,

Please see below for a thesis project in the area of meta-learning and automated machine learning.

This will be supervised jointly by me and Jan van Rijn.



Massive Scale Mythbusting of Data Science Rules of Thumb with OpenML

Data science practitioners often rely on common rules of thumb, for instance 'feature selection is more important than algorithm selection' or 'nonlinear models are significantly better than linear models'. However, large-scale studies backing up these rules of thumb are often lacking.

In this project you will pick one or two of these myths and investigate them using OpenML, a very large open repository of data sets, machine learning pipelines and experiment results. For example, you will compare pipelines A and A' (e.g. with and without feature selection) and run these across hundreds of data sets and a large number of algorithms. This results in a meta-level data set of results that we can then mine to test these rules of thumb as hypotheses, or to perform actual meta-level knowledge discovery. Typically this will not yield a simple answer on whether a rule of thumb holds, but rather a description of when it holds, for example for certain types of data set / algorithm combinations.
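By way of illustration, a minimal local sketch of such an A vs A' comparison in scikit-learn, using a single data set and learner (an assumption for illustration; in the project itself the runs would go through the OpenML platform across hundreds of data sets and algorithms):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Pipeline A: scaling + logistic regression, no feature selection
pipe_a = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Pipeline A': identical, but with univariate feature selection added
pipe_a_prime = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validated accuracy for each pipeline; repeating this
# over many data sets and learners produces the meta-level data set
score_a = cross_val_score(pipe_a, X, y, cv=5).mean()
score_a_prime = cross_val_score(pipe_a_prime, X, y, cv=5).mean()
print(f"without feature selection: {score_a:.3f}")
print(f"with feature selection:    {score_a_prime:.3f}")
```

Collecting such paired scores per data set and algorithm is what turns the myth into a testable meta-level hypothesis.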

Benjamin Strang, Peter van der Putten, Jan N. van Rijn and Frank Hutter. Don't Rule Out Simple Models Prematurely: a Large Scale Benchmark Comparing Linear and Non-linear Classifiers in OpenML. In: Seventeenth International Symposium on Intelligent Data Analysis (IDA), 2018 (preprint)
Martijn J. Post, Peter van der Putten and Jan N. van Rijn. Does Feature Selection Improve Classification? A Large Scale Experiment in OpenML. In: Fifteenth International Symposium on Intelligent Data Analysis (IDA), 2016 (preprint)

Peter van der Putten

Assistant professor, LIACS, Leiden University, The Netherlands



