Hello,
We are using MongoDB 3.2.7 with the WiredTiger storage engine to retain sensor data for a couple dozen remote devices. The retention scheme we've arrived at is to keep a week's worth of all published sensor data, and then perform a 'change only' compression where we delete samples whose sensor value has not changed since the previous sample.
For example, this data set:
Device, Time, Mnemonic, Value
S06, 2017-02-03 09:22:44.420000, CMD_ACCCTR, 224
S06, 2017-02-03 09:22:44.780000, CMD_ACCCTR, 224
S06, 2017-02-03 09:22:46.220000, CMD_ACCCTR, 224
S06, 2017-02-03 09:22:48.380000, CMD_ACCCTR, 225
S06, 2017-02-03 09:22:49.820000, CMD_ACCCTR, 225
S06, 2017-02-03 09:23:47.600000, CMD_ACCCTR, 225
S06, 2017-02-03 09:23:48.590000, CMD_ACCCTR, 225
S06, 2017-02-03 09:23:49.400000, CMD_ACCCTR, 226
S06, 2017-02-03 09:23:50.390000, CMD_ACCCTR, 226
S06, 2017-02-03 09:23:51.200000, CMD_ACCCTR, 226
S06, 2017-02-03 09:23:52.190000, CMD_ACCCTR, 226
S06, 2017-02-03 09:24:42.320000, CMD_ACCCTR, 226
S06, 2017-02-03 09:24:43.400000, CMD_ACCCTR, 227
Would be compressed down to:
Device, Time, Mnemonic, Value
S06, 2017-02-03 09:22:44.420000, CMD_ACCCTR, 224
S06, 2017-02-03 09:22:48.380000, CMD_ACCCTR, 225
S06, 2017-02-03 09:23:49.400000, CMD_ACCCTR, 226
S06, 2017-02-03 09:24:43.400000, CMD_ACCCTR, 227
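For reference, the change-only filtering can be sketched roughly like this. This is a minimal in-memory sketch, not our actual job (which runs against the collection itself); the field names mirror the example data above, and the `compress_changes` helper name is illustrative:

```python
# Sketch of 'change only' compression: within each (device, mnemonic)
# series, keep only the samples whose value differs from the previous
# sample. Assumes the input is sorted by time within each series.

def compress_changes(samples):
    """Return the subset of samples to KEEP."""
    last_value = {}  # (device, mnemonic) -> last seen value
    kept = []
    for s in samples:
        key = (s["device"], s["mnemonic"])
        if key not in last_value or last_value[key] != s["value"]:
            kept.append(s)
            last_value[key] = s["value"]
    return kept

samples = [
    {"device": "S06", "time": "09:22:44.420000", "mnemonic": "CMD_ACCCTR", "value": 224},
    {"device": "S06", "time": "09:22:44.780000", "mnemonic": "CMD_ACCCTR", "value": 224},
    {"device": "S06", "time": "09:22:48.380000", "mnemonic": "CMD_ACCCTR", "value": 225},
    {"device": "S06", "time": "09:23:49.400000", "mnemonic": "CMD_ACCCTR", "value": 226},
]
kept = compress_changes(samples)
# kept holds only the 224, 225 and 226 transitions
```

In the real job, the documents not in the kept set are the ones we delete.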
I am a little concerned about a kind of disk fragmentation caused by our access patterns. My understanding is that MongoDB will reuse freed disk space, so new data will start being interleaved with the existing data. Over time, I think this will hurt query performance. For example, a 24-hour query (2017-02-13 00:00:00 - 2017-02-14 00:00:00) may have to read many different pages to pull the data, since new documents will first fill whatever space was opened up by the compression.
One thought we had to help with this problem: during the data compression, actually delete all documents over the time period, and then re-write the entire batch of saved samples in bulk. This type of compression typically removes most of the data (70% - 90%), so we wouldn't be re-writing a ton of samples. But I think this would open up larger contiguous blocks of disk space, and help keep time-adjacent records localized?