High Volume + Incrementally Growing Datasets


Mark Liversedge

Mar 29, 2018, 1:40:32 PM
to OpenML
Hello,

I am the lead dev for GoldenCheetah, a desktop application used by many thousands of cyclists and triathletes.

To support ML and more general research I am adding features to the desktop app to allow users to post their workout data for public use. I was planning on publishing this data quarterly.

It is likely to be a high-volume set: each athlete is likely to post 500-5000 workouts, each with around 3600-50000 rows of data.
So with a very conservative estimate of 100 athletes' data over the course of a year, we are talking 100 x 500 x 3600 = 180 million rows of data, probably stored as one file per workout, so 50,000 files.
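
As a quick check of the arithmetic above, here is a minimal sketch that reproduces the conservative estimate and also shows the upper end of the quoted ranges. The function name `estimate_volume` is hypothetical, and the upper-end case simply assumes the same 100 athletes at the maximum of both ranges.

```python
def estimate_volume(athletes, workouts_per_athlete, rows_per_workout):
    """Return (files, rows), assuming one file per workout."""
    files = athletes * workouts_per_athlete
    rows = files * rows_per_workout
    return files, rows


# Conservative case from the post: low end of both ranges.
low_files, low_rows = estimate_volume(100, 500, 3600)

# Assumed upper case: same 100 athletes, high end of both ranges.
high_files, high_rows = estimate_volume(100, 5000, 50000)

print(f"conservative: {low_files:,} files, {low_rows:,} rows")
print(f"upper end:    {high_files:,} files, {high_rows:,} rows")
```

The gap between the two cases is large, which is worth keeping in mind when comparing storage limits across hosting services.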

Is OpenML the right place to post this?

If so, any advice on how to manage this, bearing in mind that athletes will be contributing data over time?

Regards,
Mark

Joaquin Vanschoren

Mar 29, 2018, 4:35:41 PM
to Mark Liversedge, OpenML
Hi Mark, 

Right now we don't handle this really well, although you can always upload 
individual datasets and tag them to organize them.

We are looking to integrate data packages (https://frictionlessdata.io/) which work with folders of related data files. We hope to integrate this later in the year. 
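
To illustrate what a data package for a folder of related files might look like, here is a rough sketch that builds a Frictionless-style `datapackage.json` descriptor from a directory of per-workout CSV files. The folder layout (`<athlete>/<workout>.csv`), the function name `build_descriptor`, and the package name are all assumptions made for illustration; this is not how OpenML or GoldenCheetah actually organise data, and only the basic `name` and `resources` fields of the descriptor are shown.

```python
import json
import re
from pathlib import Path


def build_descriptor(root, package_name):
    """Build a minimal data package descriptor for every CSV under root.

    Assumes a hypothetical layout of <root>/<athlete>/<workout>.csv.
    """
    root = Path(root)
    resources = []
    for csv_path in sorted(root.rglob("*.csv")):
        rel = csv_path.relative_to(root)
        # Resource names are kept to lowercase letters, digits, '.', '_' and '-'.
        slug = re.sub(r"[^a-z0-9._-]+", "-", rel.with_suffix("").as_posix().lower())
        resources.append({"name": slug, "path": rel.as_posix(), "format": "csv"})
    return {"name": package_name, "resources": resources}


def write_descriptor(root, package_name):
    """Write datapackage.json at the top of the folder and return its path."""
    out = Path(root) / "datapackage.json"
    out.write_text(json.dumps(build_descriptor(root, package_name), indent=2))
    return out
```

Because the descriptor is regenerated from whatever files are present, rerunning `write_descriptor` after athletes contribute new workouts would pick up the additions, which is one simple way to handle a dataset that grows over time.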

If you want to share data for more general research, maybe data.world is a good place?

Hope that helps,
Joaquin



--
You received this message because you are subscribed to the Google Groups "OpenML" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openml+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
--
Thank you,
Joaquin

Mark Liversedge

Mar 30, 2018, 5:23:54 AM
to OpenML

On Thursday, 29 March 2018 21:35:41 UTC+1, Joaquin Vanschoren wrote:
Right now we don't handle this really well, although you can always upload 
individual datasets and tag them to organize them.

We are looking to integrate data packages (https://frictionlessdata.io/) which work with folders of related data files. We hope to integrate this later in the year. 

If you want to share data for more general research, maybe data.world is a good place?

Thanks for the advice. I have signed up at data.world and will take a look, then check back in with OpenML in 6 months' time.

Kind regards,
Mark

Mark Liversedge

May 14, 2018, 10:23:30 AM
to OpenML
For info, data.world have a 5 GB limit, so I eventually landed at the Open Science Framework (https://osf.io).


Regards,
Mark

Joaquin Vanschoren

May 14, 2018, 10:46:44 AM
to Mark Liversedge, OpenML
OK :). For generic data storage, there is also Zenodo (also gives you a DOI).

If you also want to store machine learning models/experiments created on this data, let me know. For this, you probably want to specify the machine learning task first (classification, regression?), and specific datasets for these tasks?

Best,
Joaquin

