Data formats for disaggregation competition (training data and NILM output)


Jack Kelly

Jan 20, 2015, 3:37:11 PM
to energy-dis...@googlegroups.com
I have a wonderful group of MSc students working on building a proof-of-concept energy disaggregation competition platform.  The full project spec is here.

The basic idea is that the platform will allow researchers to download training data (recording both the whole-house aggregate and individual appliance data) at a number of different sample rates, and then researchers will upload the output of their disaggregation algorithm to the platform for scoring against a set of metrics.
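As a rough illustration of the scoring step, here is a minimal sketch that compares an uploaded per-appliance estimate against ground truth. The metric (mean absolute error over shared timestamps), the function name and the sample values are all assumptions made for illustration; the actual set of metrics is not specified here.

```python
def mean_absolute_error(truth, estimate):
    """Mean absolute error in watts over timestamps present in both series.

    Both arguments are lists of (timestamp, watts) tuples.
    Hypothetical metric: the platform's real metrics are not defined here.
    """
    truth_by_time = dict(truth)
    errors = [
        abs(truth_by_time[t] - watts)
        for t, watts in estimate
        if t in truth_by_time
    ]
    if not errors:
        raise ValueError("no overlapping timestamps to score")
    return sum(errors) / len(errors)


# Made-up readings for one appliance: ground truth vs. a competitor's upload.
truth = [(0, 100.0), (1, 110.0), (2, 0.0)]
estimate = [(0, 90.0), (1, 110.0), (2, 20.0)]
score = mean_absolute_error(truth, estimate)
```

Matching on timestamps rather than on list position means an upload with missing samples is still scored on the samples it does contain.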

Which file format should we use for the training data (the data that competitors will download) and the NILM output (the data that competitors will upload)?  CSV seems like a necessity to make the data as easy as possible to load.  Should we just use the REDD format (i.e. a set of house_X directories, each containing a labels.dat file and several channel_Y.dat files, where each channel file has two columns: timestamp and power demand) but with a controlled vocabulary for appliance names?  Or do we need a more sophisticated file format, e.g. the CSV format defined in NILMTK's CSVDataStore class, which could potentially handle non-power data such as weather data, as well as confidence scores for NILM output?  Or a richer metadata format that can encode things like sample rate, active / apparent power, etc., e.g. the Simple NILM Metadata schema currently stuttering into existence?
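To show how little code the REDD-style layout needs on the competitor's side, here is a minimal sketch of a loader. It assumes whitespace-separated columns (as in REDD's low-frequency files, as far as I know), integer UNIX timestamps and power in watts; the function names and the sample readings are made up for illustration.

```python
from pathlib import Path
import tempfile


def read_labels(house_dir):
    """Map channel number -> appliance label, read from labels.dat."""
    labels = {}
    for line in (Path(house_dir) / "labels.dat").read_text().splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        channel, label = line.split(maxsplit=1)
        labels[int(channel)] = label.strip()
    return labels


def read_channel(house_dir, channel):
    """Return a list of (timestamp, watts) tuples from channel_<n>.dat."""
    readings = []
    with (Path(house_dir) / f"channel_{channel}.dat").open() as f:
        for line in f:
            if not line.strip():
                continue
            timestamp, power = line.split()
            readings.append((int(timestamp), float(power)))
    return readings


# Build a tiny house_1 directory with made-up data, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    house = Path(tmp) / "house_1"
    house.mkdir()
    (house / "labels.dat").write_text("1 mains\n2 fridge\n")
    (house / "channel_1.dat").write_text("1303132929 222.20\n1303132930 223.17\n")
    (house / "channel_2.dat").write_text("1303132929 6.00\n1303132932 6.00\n")
    labels = read_labels(house)
    data = {channel: read_channel(house, channel) for channel in labels}
```

Nothing in this layout records sample rate or whether a channel is active or apparent power, which is the gap a richer metadata format would fill.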

It would also be nice to allow folks to download high frequency data (kHz).  What format should we use for that?  Lucas Pereira's "SURF-PI" format seems to be the only attempt to standardise kHz energy data.  Can we compress SURF-PI using, for example, FLAC?

nipun batra

Feb 1, 2015, 8:45:59 PM
to Jack Kelly, energy-dis...@googlegroups.com
I would imagine that the simplest option might be the best for the competition. So, a variant of REDD looks promising. Something like:

House X has a labels.csv of the following form:

Channel number, Channel name, Channel instance
1, Mains, 1
2, Lights, 1
3, Lights, 2
4, Fridge, 1

The channel name should come from the nilmtk vocabulary.
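Here is a minimal sketch of how the platform could parse such a labels.csv and reject names outside the vocabulary. ALLOWED_NAMES is a hypothetical stand-in, not the actual nilmtk vocabulary, and parse_labels is a made-up helper name.

```python
import csv
import io

# Hypothetical stand-in for the controlled vocabulary (not the real nilmtk list).
ALLOWED_NAMES = {"Mains", "Lights", "Fridge"}

LABELS_CSV = """Channel number, Channel name, Channel instance
1, Mains, 1
2, Lights, 1
3, Lights, 2
4, Fridge, 1
"""


def parse_labels(text, allowed=ALLOWED_NAMES):
    """Return {channel number: (name, instance)}, validating each row."""
    # skipinitialspace drops the space that follows each comma.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    next(reader)  # skip the header row
    channels = {}
    seen = set()
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        number, name, instance = int(row[0]), row[1].strip(), int(row[2])
        if name not in allowed:
            raise ValueError(f"channel {number}: unknown name {name!r}")
        if (name, instance) in seen:
            raise ValueError(f"channel {number}: duplicate {name} {instance}")
        seen.add((name, instance))
        channels[number] = (name, instance)
    return channels


channels = parse_labels(LABELS_CSV)
```

Checking that each (name, instance) pair is unique catches the easy mistake of listing two "Lights, 1" channels in the same house.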

Furthermore, for the competition, I would encourage using `clean` data, so that the algorithms and the competition platform are the main focus.

Additionally, the metadata could be kept as simple as possible, perhaps by assuming that all the readings come from the same type of sensor with identical attributes (sampling rate, etc.). That way, a nilmtk converter should take less than 10 minutes to write.
