Questions regarding the public database

William RAYNAUT

Nov 16, 2015, 2:11:15 PM

to OpenML

Hello,

First, I would like to express my interest in the OpenML framework. By all means it sounds like an ambitious project, but its current state allows for great expectations!

To come to my point, I am currently planning a meta-learning experiment using the OpenML public database as the source for a meta-database, and I have a few questions regarding particular points.

1) What is the difference between the database snapshot available on openml.org and the one from openml.liacs.nl? Which one should I use?

2) I am exploring the public database's tables manually and have trouble finding some key elements. For instance, where and how are flows stored?

3) Is a schema of the database available? It would doubtlessly help a lot with future occurrences of the second point.

4) I intend to build my meta-database using parts of the OpenML public database and potentially additional original data. Am I allowed to do so? Are there any constraints or recommendations regarding the use or publication of such data?

Thank you for your time,

William Raynaut

Joaquin Vanschoren

Nov 23, 2015, 8:38:18 AM

to OpenML

Hi,

Sorry for the slow reply.

1) What is the difference between the database snapshot available on openml.org and the one from openml.liacs.nl? Which one should I use?

I don't quite understand. Do you find two different snapshots? In any case, always use openml.org (openml.liacs.nl may disappear or be renamed at some point in the future).

2) I am exploring the public database's tables manually and have trouble finding some key elements. For instance, where and how are flows stored?

Flows are called implementations in the database (for historical reasons).

3) Is a schema of the database available? It would doubtlessly help a lot with future occurrences of the second point.

Yes: http://www.openml.org/query

4) I intend to build my meta-database using parts of the OpenML public database and potentially additional original data. Am I allowed to do so? Are there any constraints or recommendations regarding the use or publication of such data?

Sure, as long as you properly attribute OpenML: http://www.openml.org/guide#!cite. It would also be very nice if you could contribute by adding your datasets and experiments to OpenML. Let me know if you need help with that. We are also working on a new feature called 'studies' where you can easily generate such meta-databases, and 'circles' to share any resources (e.g. datasets, experiments) in a smaller group of people prior to publication.

Cheers,

Joaquin

William RAYNAUT

Nov 26, 2015, 3:25:26 AM

to OpenML

Hello, and thank you for this answer!

Regarding my first point, I was referring to this guide, which states:

NOTE: Developers are advised to use the development version of the databases instead, see: http://openml.liacs.nl/developers. Only these will include the latest changes needed for the latest website updates.

And I can also download a snapshot from there, so I was wondering whether there was any actual difference between the two downloads...

Regarding contributions, I would love to experiment with this "studies" feature once it is implemented. Any rough ETA?

On a more long-term perspective, I intend to experiment with the automated generation of data mining workflows. Since this would generate a very large number of partly random runs, for now I don't think it wise to upload them to OpenML, and I am instead considering a custom local installation of OpenML to handle such meta-data. If the end results of these generation processes turn out to be of acceptable quality, I will consider contributing them to the main OpenML database. Such runs would, however, not be the product of human expertise; do you think that would contradict OpenML's purpose?

Joaquin Vanschoren

Nov 29, 2015, 8:02:05 PM

to William RAYNAUT, OpenML

Dear William,

Thanks, the wiki was outdated, and I fixed this now.

The studies are largely implemented but still being documented and tested. We are all very very busy at the moment, but I hope to finish it before Christmas.

Regarding your experiments: studies would indeed be great here, since they will encapsulate your randomized experiment nicely. We do need to work more on a proper filtering of Flows, so that the popular ones are high up in the lists, and the random ones much lower. For the time being, it may indeed be better if you do your first experiments locally, but as a rule I think OpenML should support this type of experimentation.

BTW, you may want to get in touch with a student of mine, Jan van Rijn; he is doing something very closely related.

Cheers,

Joaquin

William RAYNAUT

Dec 7, 2015, 11:37:49 AM

to OpenML, j.vans...@tue.nl

Dear Joaquin,

Sorry for the slow response, and thanks for these answers!

I think I can find what I need in the OpenML database now, but I will definitely check out the studies when they become available.

Looking up Mr. van Rijn, I found the very interesting description of the Massively Collaborative Data Mining project, but failed to find any published work on that particular topic. Any idea where I should look?

Cheers,

William

Joaquin Vanschoren

Dec 7, 2015, 6:32:17 PM

to William RAYNAUT, OpenML

Hi William,

That is the project funding Jan's PhD. We later renamed it to OpenML.

All of his publications are done under this project:

https://scholar.google.be/citations?user=O4X5CpwAAAAJ&hl=en&oi=ao

Cheers,

Joaquin

William RAYNAUT

Dec 8, 2015, 6:23:40 AM

to OpenML, j.vans...@tue.nl

Hello again,

OK, I didn't realize that it was the previous version of OpenML. I was hooked by the last paragraph:

To illustrate the obtained benefits, we will exploit the resulting repository and novel meta-learning techniques to perform large-scale meta-learning studies that are nearly impossible today, on complex real-world bioinformatics data.

But I guess it comes back precisely to the studies feature you mentioned, and that communications on that topic are on the way. I'll be following Jan's work and yours closely!