I have a wonderful group of MSc students working on building a proof-of-concept energy disaggregation competition platform.
The full project spec is here.
The basic idea is that the platform will allow researchers to download training data (recording both the whole-house aggregate and individual appliance data) at a number of different sample rates, and then researchers will upload the output of their disaggregation algorithm to the platform for scoring against a set of metrics.
Which file format should we use for the training data (the data that competitors will download) and the NILM output (the data that competitors will upload)? CSV seems like a necessity to make the data as easy as possible to load. Should we just use the REDD format (i.e. a set of house_X directories, each of which contains a labels.dat file and several channel_Y.dat files, each with two columns: timestamp and power demand) but with a controlled vocabulary for appliance names? Or do we need a more sophisticated file format, e.g. the CSV format defined in NILMTK's CSVDataStore class, which could potentially handle non-power data like weather data and confidence scores for NILM output? Or a richer metadata format that can encode things like sample rate, active / apparent power, etc., e.g. the Simple NILM Metadata schema currently stuttering into existence?
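To make the REDD-style option concrete, here is a minimal sketch of how a competitor might load one house directory using only the Python standard library. It assumes whitespace-separated columns in both labels.dat (channel number, then appliance label) and channel_Y.dat (timestamp, then power demand). The function names and the `vocabulary` argument are hypothetical, made up for illustration; they are not part of REDD or NILMTK.

```python
from pathlib import Path


def load_labels(house_dir):
    """Map channel number -> appliance label, read from labels.dat."""
    labels = {}
    for line in (Path(house_dir) / "labels.dat").read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Assumption: "<channel number> <label>", separated by whitespace.
        channel, label = line.split(maxsplit=1)
        labels[int(channel)] = label.strip()
    return labels


def load_channel(house_dir, channel):
    """Return a list of (timestamp, power) tuples for one channel file."""
    readings = []
    path = Path(house_dir) / f"channel_{channel}.dat"
    with path.open() as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            # Assumption: two whitespace-separated columns per row.
            timestamp, power = line.split()
            readings.append((float(timestamp), float(power)))
    return readings


def load_house(house_dir, vocabulary=None):
    """Return {channel: (label, readings)} for every channel in labels.dat.

    If a controlled vocabulary is given, reject any label outside it.
    """
    labels = load_labels(house_dir)
    if vocabulary is not None:
        unknown = sorted(set(labels.values()) - set(vocabulary))
        if unknown:
            raise ValueError(f"labels not in controlled vocabulary: {unknown}")
    return {
        channel: (label, load_channel(house_dir, channel))
        for channel, label in labels.items()
    }
```

The result is keyed by channel number rather than by label because a house can have several channels with the same label (two mains channels, for example). Anything beyond timestamp and power, such as sample rate or active versus apparent power, has no home in this layout, which is the gap the richer formats would fill.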
It would also be nice to allow folks to download high-frequency data (kHz). What format should we use for that? Lucas Pereira's "SURF-PI" format seems to be the only attempt to standardise kHz energy data. Can we compress SURF-PI using, for example, FLAC?
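As a rough way to experiment with the compression question, the sketch below stores kHz voltage and current samples as a two-channel, 16-bit PCM WAV file using Python's standard `wave` module. This is not the SURF-PI format itself, just a plain audio container of the kind a FLAC encoder accepts as input. The function name, the channel order, and the full-scale parameters are all assumptions made for illustration.

```python
import struct
import wave


def write_waveform_wav(path, voltage, current, sample_rate,
                       v_full_scale, i_full_scale):
    """Write voltage and current as a 2-channel, 16-bit PCM WAV file.

    Assumption: channel 0 is voltage, channel 1 is current, and each is
    scaled so that +/- full scale maps onto the int16 range.
    """
    if len(voltage) != len(current):
        raise ValueError("voltage and current must have the same length")

    def to_int16(value, full_scale):
        scaled = round(value / full_scale * 32767)
        return max(-32768, min(32767, scaled))  # clip out-of-range samples

    frames = bytearray()
    for v, i in zip(voltage, current):
        # Little-endian interleaved samples, as PCM WAV expects.
        frames += struct.pack(
            "<hh", to_int16(v, v_full_scale), to_int16(i, i_full_scale)
        )

    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)  # bytes per sample, i.e. 16 bits
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
```

A file written this way could then be handed to a FLAC encoder (for example the `flac` command-line tool) to see how well mains waveforms compress. FLAC is lossless for PCM samples, but the scale factors needed to convert back to volts and amps, and any annotations, would have to be stored somewhere as well, and whether they survive the encoder would need checking.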