Hello,
We are using MongoDB 3.2.7 with the WiredTiger storage engine to retain sensor data for a couple dozen remote devices. The retention scheme we've arrived at is to keep a week's worth of all published sensor data, and then perform a 'change only' compression where we delete samples whose sensor value has not changed since the previous sample.
For example, this data set:
Device, Time, Mnemonic, Value
S06, 2017-02-03 09:22:44.420000, CMD_ACCCTR, 224
S06, 2017-02-03 09:22:44.780000, CMD_ACCCTR, 224
S06, 2017-02-03 09:22:46.220000, CMD_ACCCTR, 224
S06, 2017-02-03 09:22:48.380000, CMD_ACCCTR, 225
S06, 2017-02-03 09:22:49.820000, CMD_ACCCTR, 225
S06, 2017-02-03 09:23:47.600000, CMD_ACCCTR, 225
S06, 2017-02-03 09:23:48.590000, CMD_ACCCTR, 225
S06, 2017-02-03 09:23:49.400000, CMD_ACCCTR, 226
S06, 2017-02-03 09:23:50.390000, CMD_ACCCTR, 226
S06, 2017-02-03 09:23:51.200000, CMD_ACCCTR, 226
S06, 2017-02-03 09:23:52.190000, CMD_ACCCTR, 226
S06, 2017-02-03 09:24:42.320000, CMD_ACCCTR, 226
S06, 2017-02-03 09:24:43.400000, CMD_ACCCTR, 227
Would be compressed down to:
Device, Time, Mnemonic, Value
S06, 2017-02-03 09:22:44.420000, CMD_ACCCTR, 224
S06, 2017-02-03 09:22:48.380000, CMD_ACCCTR, 225
S06, 2017-02-03 09:23:49.400000, CMD_ACCCTR, 226
S06, 2017-02-03 09:24:43.400000, CMD_ACCCTR, 227
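For reference, the change-only filtering can be sketched roughly like this. This is a minimal in-memory sketch, not our actual job (which runs against the collection itself); the field names mirror the example data above, and the `compress_changes` helper name is illustrative:

```python
# Sketch of 'change only' compression: within each (device, mnemonic)
# series, keep only the samples whose value differs from the previous
# sample. Assumes the input is sorted by time within each series.

def compress_changes(samples):
    """Return the subset of samples to KEEP."""
    last_value = {}  # (device, mnemonic) -> last seen value
    kept = []
    for s in samples:
        key = (s["device"], s["mnemonic"])
        if key not in last_value or last_value[key] != s["value"]:
            kept.append(s)
            last_value[key] = s["value"]
    return kept

samples = [
    {"device": "S06", "time": "09:22:44.420000", "mnemonic": "CMD_ACCCTR", "value": 224},
    {"device": "S06", "time": "09:22:44.780000", "mnemonic": "CMD_ACCCTR", "value": 224},
    {"device": "S06", "time": "09:22:48.380000", "mnemonic": "CMD_ACCCTR", "value": 225},
    {"device": "S06", "time": "09:23:49.400000", "mnemonic": "CMD_ACCCTR", "value": 226},
]
kept = compress_changes(samples)
# kept holds only the 224, 225 and 226 transitions
```

In the real job, the documents not in the kept set are the ones we delete.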
I am a little concerned about a kind of disk fragmentation caused by our access patterns. My understanding is that MongoDB will reuse freed disk space, so new data will start being interleaved with the existing data. Over time, I think this will hurt query performance. For example, a 24-hour query (2017-02-13 00:00:00 - 2017-02-14 00:00:00) may have to read many different pages to pull the data, since new documents will first fill whatever space was opened up by the compression.
One thought we had to help with this problem: during the data compression, actually delete all documents over the time period, and then re-write the entire batch of saved samples in bulk. This type of compression typically removes most of the data (70% - 90%), so we wouldn't be re-writing a ton of samples. But I think this would open up larger contiguous blocks of disk space, and help keep time-adjacent records localized?