Data compression

Peter Speybrouck

Dec 24, 2012, 5:32:00 AM
to open...@googlegroups.com
I am looking into OpenTSDB as a process historian. I am not sure how well HBase/Hadoop would handle this, but I was wondering about data compression.

I know the intention of OpenTSDB is to log everything you send to it, but when comparing to other process historians like OSIsoft PI, GE Proficy Historian and ICONICS Hyper Historian (and others), they all have one thing in common: lossy data compression.

Since process data inherently contains signal noise, you may want to filter that out, either with simple deadband detection or with something more advanced like a swinging-door algorithm to eliminate noise from a slope.
Even with a deadband width of zero, you can potentially eliminate many runs of identical values while still being able to reproduce the original data trend exactly, without loss of information (except for the timestamps of the filtered points).
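A minimal sketch of such a deadband filter (my own illustration, not OpenTSDB or PI code; the function name and the (timestamp, value) tuple format are assumptions):

```python
def deadband_filter(samples, width):
    """Keep only samples whose value deviates from the last *kept* value
    by more than `width`; a width of 0 drops exact repeats only."""
    kept = []
    last = None
    for ts, value in samples:
        if last is None or abs(value - last) > width:
            kept.append((ts, value))
            last = value
    return kept

# With width 0, runs of identical values collapse to their first sample:
points = [(0, 4.0), (1, 4.0), (2, 4.0), (3, 5.0), (4, 5.0)]
print(deadband_filter(points, 0))  # [(0, 4.0), (3, 5.0)]
```

The original trend is still exactly reproducible from the kept points by step interpolation; only the timestamps of the dropped repeats are lost.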

I can imagine that quite a few metrics measured from IT systems have long runs of zeroes or identical values that could be effectively reduced this way.
Could such compression (before storing the data in OpenTSDB) improve performance when retrieving large amounts of data from the system, or does the more sparsely filled database not care about the amount of data, given the way the time-series tables are built?

ManOLamancha

Dec 24, 2012, 7:28:02 PM
to open...@googlegroups.com

Do you mean lossy in terms of rolling up data by frequency to reduce storage, i.e. average the data per hour for a day and delete the original? Or do you mean lossless, as you imply below, where if you have a whole day of data where the value doesn't change, you delete the individual data points and record an entry that says "didn't change all day"? I'm working to implement the former, primarily as a means of retrieving data over large timespans quickly.
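The former (rolling up by frequency) could look roughly like this, a sketch assuming (timestamp, value) tuples and hour-aligned buckets; this is not actual OpenTSDB code:

```python
from collections import defaultdict

def hourly_average(points):
    """Roll fine-grained data up to one average per hour (lossy: in a
    real rollup the original points would then be deleted)."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % 3600].append(value)
    return {hour: sum(vals) / len(vals) for hour, vals in sorted(buckets.items())}

print(hourly_average([(0, 1.0), (1, 3.0), (3600, 10.0)]))  # {0: 2.0, 3600: 10.0}
```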

For the latter, where you delete repetitious data, it would really only have an impact in OpenTSDB if your data didn't change over the course of a day or more. Initially, data is written to a unique "cell" within a row normalized to the hour, so if you wrote a data point every second, you would have 3,600 data points in a single row. If you turn on the Compactions feature for OpenTSDB, then after the hour is up, all 3,600 data points are compressed into a single cell, though you still have every data point recorded within the cell, regardless of its value. You can also enable LZO, Gzip or Snappy compression at the HBase level, which will further reduce the amount of space used; this would help the most with redundant data points.
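The hour-normalized row layout described above can be sketched as follows (a simplified illustration of the idea only; OpenTSDB's actual row keys also encode metric and tag IDs):

```python
def row_key_parts(ts):
    """Split a UNIX timestamp into the hour-aligned row base and the
    in-row offset, mirroring the one-row-per-hour layout."""
    base = ts - ts % 3600
    return base, ts - base

print(row_key_parts(1356372725))  # (1356372000, 725)
```

Every point written in the same hour shares the same base, so a second-resolution feed lands 3,600 offsets (0..3599) in one row.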

I don't think there has been much interest in this since the data points eat up very little space and it's easy to throw more disks/servers at the cluster if space becomes an issue. But if you want to take a crack at implementing such a compression method, that'd be pretty interesting.


Peter Speybrouck

Dec 24, 2012, 7:55:00 PM
to open...@googlegroups.com
Thanks for taking the time to answer.

In fact, what I have in mind is neither of those two options.
I could try to explain, but I recently found a good presentation explaining two different methods of compression for process data:
http://www.slideserve.com/taran/ge-proficy-historian-data-compression

You might think of it as JPEG compression for the data.

Throwing more disks or servers at it is one way of looking at it, but if you can keep the necessary information with fewer servers, that sounds like a win to me.

Perhaps a small example can also be of use: suppose you capture data from a measurement device that is only precise to 0.01, but you get values like 4.54345 with noise in the +/-0.003 range. Knowing the precision of the device tells you how much of the data is actually worth storing in the database.
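To illustrate, quantizing readings to the device's stated precision (a hypothetical helper of my own, not an existing API) shows how little of each noisy value carries information:

```python
def quantize(value, precision):
    """Round a noisy reading to the device's stated precision; readings
    that land on the same step carry no extra information."""
    steps = round(value / precision)
    return round(steps * precision, 10)

print(quantize(4.54345, 0.01))  # 4.54
```

Combined with repeat suppression, a stream of 4.54345, 4.54102, 4.53987, ... would quantize to 4.54 each time and collapse to a single stored point.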

This idea does, however, require the ability to store extra information about a metric (or tag) to allow per-metric compression settings (since not all devices have the same noise). As far as I can tell from my familiarity with OpenTSDB, this is currently not possible.

Anyway, I don't expect this to be implemented soon, but I think it could be a nice addition that would help OpenTSDB reach other domains as well.

Dave Barr

Dec 25, 2012, 1:56:33 PM
to Peter Speybrouck, open...@googlegroups.com
tcollector already does duplicate-value suppression before
submitting data to OpenTSDB. Depending on your data, it can help a lot.
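A simplified sketch of that kind of duplicate-value suppression (tcollector's actual implementation differs in detail; the 600-second re-emit interval here is my assumption, chosen so that gaps stay bounded):

```python
def suppress_repeats(points, max_gap=600):
    """Drop points whose value equals the last emitted one, but re-emit
    at least every `max_gap` seconds so downstream consumers still see
    a bounded gap between stored points."""
    out = []
    last_ts = last_val = None
    for ts, val in points:
        if last_val is None or val != last_val or ts - last_ts >= max_gap:
            out.append((ts, val))
            last_ts, last_val = ts, val
    return out
```

For a metric that is flat most of the day, this keeps one point per `max_gap` window instead of one per scrape.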

On disk, HBase/HDFS does support LZO compression, and improvements
are coming that move compression to the upper layers, such as in RAM
and, I believe, over the wire.

--Dave

Dave Barr

Dec 25, 2012, 2:08:59 PM
to Peter Speybrouck, open...@googlegroups.com
Oh I forgot to say, one of OpenTSDB's core philosophies is never to
throw away, downsample or alter data. For systems work that OpenTSDB
was designed for, there really is no such thing as "signal noise".
Every datapoint is an accurate snapshot of some system metric at the
time it was read. If there is variability in the data, then that in
and of itself is valuable data to be kept, not thrown away. Your
network traffic could be jittery, you could have bursts of traffic, or
spikes in GC. These things change over time, and you never know when
you need to compare the jitteriness of last month's or last year's
data to that of today.

If you want smoothing or outlier suppression, there are good plot-time
algorithms for that, but for us, storage time is the wrong time to do
it. Lossy storage algorithms run fundamentally counter to the goal.
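For example, a simple plot-time moving average smooths the displayed curve while leaving the stored data untouched (an illustrative sketch, not an OpenTSDB feature):

```python
def moving_average(values, window):
    """Plot-time smoothing: average each point with up to `window - 1`
    preceding points. The raw series is never modified."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - window + 1)
        out.append(sum(values[lo:i + 1]) / (i + 1 - lo))
    return out

print(moving_average([1, 2, 3, 4], 2))  # [1.0, 1.5, 2.5, 3.5]
```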

With disk prices so cheap, and with all of the lossless compression
and space saving techniques around, there really is no excuse to throw
away data.

--Dave

Peter Speybrouck

Dec 25, 2012, 4:06:30 PM
to open...@googlegroups.com, Peter Speybrouck
I know that OpenTSDB is designed to store everything, and I'm not saying that such compression (or rather suppression) of data should be applied to everything. It's just that, from my experience with process historians (mainly OSIsoft PI), many signals have noise at a much higher frequency (say, every second) than the actual rate of change of the data (for example ambient temperature, which is unlikely to change faster than 1 degree/second).
If you have 80,000 or more noisy signals generating data every second, then knowledge of each signal's noisiness, with compression adapted per signal, can reduce the amount of data significantly without reducing the information stored in it.
For data that is known to have zero noise, like system monitoring, you could easily disable the compression and store everything.
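A per-signal configuration could be as simple as a lookup table of deadband widths, with system metrics defaulting to zero (purely hypothetical metric names and settings; OpenTSDB has no such per-metric compression setting today):

```python
# Hypothetical per-metric deadband widths; 0.0 means "store everything".
DEADBANDS = {
    "plant.ambient_temp": 0.5,   # noisy analog signal from the field
    "sys.cpu.user": 0.0,         # system metric: never suppress
}

def width_for(metric):
    """Deadband width for a metric; unknown metrics are not suppressed."""
    return DEADBANDS.get(metric, 0.0)
```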

Less data also allows for faster retrieval for trending and monitoring.

Because of my experience with OSIsoft PI, I am actually interested in how OpenTSDB compares to it in input and output performance.
Allow me to quote some numbers from the PI 2012 release, which is supposed to be able to process around 250,000 samples per second (sustained). If you stored all that data for a year, you would end up with more or less 57 TB, which in my opinion is a lot. One strength of PI is that it can still deliver serious output (read) performance while under heavy archiving load.
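A back-of-the-envelope check of those numbers (the 8 bytes per stored sample is my assumption; with it, a year of 250,000 samples/second lands in the same ballpark as the ~57 TB quoted):

```python
samples_per_sec = 250_000
seconds_per_year = 365 * 24 * 3600   # 31,536,000
bytes_per_sample = 8                 # assumed average on-disk footprint

total_bytes = samples_per_sec * seconds_per_year * bytes_per_sample
print(total_bytes / 1e12)  # 63.072 (TB, decimal)
```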

I haven't looked into OpenTSDB in enough detail yet to know its input or output performance, or the amount of space required for a given dataset. I suppose some compression (as in reducing the amount of actually archived data, not lossless compression like zip) could have an impact as well.

Anyway, it is interesting to compare both systems that have been designed for a different purpose, but could perhaps be used for similar purposes :-)

Sergei Rodionov

Jan 11, 2013, 6:01:34 AM
to open...@googlegroups.com, Peter Speybrouck
Someone at an industry function claimed to me that OSIsoft PI can ingest 1 million metrics per second on 4 NT servers, citing a PGE deployment as the use case.
I looked at their PI product, and it seems to have been around for a long while. Now I'm curious what exactly 250K/sec per node really means.

Roger Alexander

Feb 5, 2013, 7:40:09 PM
to open...@googlegroups.com
Hi,

I too am looking at OpenTSDB as a possible process historian. I'm curious what else you've found out about this, or about any other open-source time-series database out there. I also have a lot of signal noise that needs to be filtered, but at storage time rather than after the fact. Given the volumes of data I have to deal with, it's impractical to store everything and then filter on demand.

Thanks,

Roger Alexander.

tsuna

Feb 7, 2013, 3:38:53 PM
to Roger Alexander, open...@googlegroups.com
How big is the data volume that you're looking at?

OpenTSDB doesn't have any built-in filtering mechanism at this time,
so you'd have to filter the noise out upfront.

--
Benoit "tsuna" Sigoure