The simple-8b compression method has been extended so that data can be decompressed in reverse order. Backward scanning queries are common in time-series workloads, so this extension makes those queries run much faster.
Delta encoding reduces the amount of information required to represent a data object by storing only the difference, sometimes referred to as the delta, between that object and one or more reference objects. These algorithms work best where there is a lot of redundant information, so delta encoding is often used in workloads like versioned file systems. For example, this is how Dropbox keeps your files synchronized. Applying delta encoding to time-series data means that you can use fewer bytes to represent a data point, because you only need to store the delta from the previous data point.
For example, imagine you had a dataset that collected CPU, free memory, temperature, and humidity over time. If your time column was stored as an integer value, like seconds since the UNIX epoch, your raw data would look a little like this:
With delta encoding, you only need to store how much each value changed from the previous data point, resulting in smaller values to store. So after the first row, you can represent subsequent rows with less information, like this:
Applying delta encoding to time-series data takes advantage of the fact that most time-series datasets are not random, but instead represent something that is slowly changing over time. The storage savings over millions of rows can be substantial, especially if the value changes very little, or doesn't change at all.
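The delta transform itself is only a few lines of code. This is a minimal sketch in Python, not TimescaleDB's implementation, assuming a list of integer timestamps:

```python
def delta_encode(values):
    """Keep the first value, then store only the change from the previous value."""
    return values[:1] + [cur - prev for prev, cur in zip(values, values[1:])]

def delta_decode(deltas):
    """Rebuild the original values by summing the deltas back up."""
    values = deltas[:1]
    for d in deltas[1:]:
        values.append(values[-1] + d)
    return values

# Timestamps collected every 5 seconds compress to a run of small deltas.
timestamps = [1640000000, 1640000005, 1640000010, 1640000015]
print(delta_encode(timestamps))  # [1640000000, 5, 5, 5]
```

Decoding simply accumulates the deltas, so the transform is fully lossless.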
Delta-of-delta encoding takes delta encoding one step further and applies delta encoding over data that has previously been delta-encoded. With time-series datasets where data collection happens at regular intervals, you can apply delta-of-delta encoding to the time column, which results in only needing to store a series of zeroes.
In this example, delta-of-delta further compresses the 5 seconds in the time column down to 0 for every entry after the second row, because the five-second gap remains constant for each entry. Note that you see two entries in the table before the delta-of-delta values reach 0, because you need two deltas to compare.
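Delta-of-delta can be sketched as two passes of the same transform. This is an illustrative Python sketch, not TimescaleDB's code, assuming regularly spaced integer timestamps:

```python
def deltas(values):
    """One pass of delta encoding: differences between neighbouring values."""
    return [cur - prev for prev, cur in zip(values, values[1:])]

def delta_of_delta_encode(timestamps):
    """Keep the first timestamp and first delta, then delta-encode the deltas."""
    first_deltas = deltas(timestamps)
    return timestamps[0], first_deltas[0], deltas(first_deltas)

# A reading every 5 seconds: after the first timestamp and first delta,
# everything left to store is 0.
print(delta_of_delta_encode([100, 105, 110, 115, 120]))  # (100, 5, [0, 0, 0])
```

The first timestamp and the first delta must be kept so that decoding has a starting point, which is why the zeroes only begin after the second row.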
With delta and delta-of-delta encoding, you can significantly reduce the number of digits you need to store. But you still need an efficient way to store the smaller integers. The previous examples used a standard integer datatype for the time column, which needs 64 bits to represent the value of 0 when delta-of-delta encoded. This means that even though you are only storing the integer 0, you are still consuming 64 bits to store it, so you haven't actually saved anything.
Simple-8b is one of the simplest and smallest methods of storing variable-length integers. In this method, integers are stored as a series of fixed-size blocks. Within each block, every integer is represented by the minimal bit length needed to represent the largest integer in that block. The first bits of each block denote the bit length used for that block.
This technique has the advantage of only needing to store the length once for a given block, instead of once for each integer. Because the blocks are of a fixed size, you can infer the number of integers in each block from the size of the integers being stored.
In this example, both blocks store about 10 digits worth of data, even though some of the numbers have to be padded with a leading 0. You might also notice that the second block only stores 9 digits, because 10 is not evenly divisible by 3.
Simple-8b works in this way, except it uses binary numbers instead of decimal, and it usually uses 64-bit blocks. In general, the longer the integers, the fewer of them can be stored in each block.
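The packing idea can be sketched in Python. This is a simplified illustration, not the real Simple-8b format: it stores the bit width directly in the top 4 bits of a 64-bit word (so it assumes widths of at most 15 bits), whereas real Simple-8b uses a fixed selector table:

```python
PAYLOAD_BITS = 60  # a 64-bit word minus a 4-bit width selector

def pack_block(ints):
    """Pack as many integers as fit into one 64-bit word, all at the same width."""
    width = max(1, max(ints).bit_length())  # width needed by the largest integer
    count = min(len(ints), PAYLOAD_BITS // width)
    word = width << PAYLOAD_BITS            # selector: width in the top 4 bits
    for i, v in enumerate(ints[:count]):
        word |= v << (i * width)
    return word, count

def unpack_block(word, count):
    """Read the width from the selector, then slice out each integer."""
    width = word >> PAYLOAD_BITS
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(count)]

word, n = pack_block([3, 1, 2, 3])  # width 2: up to 30 such values per word
print(unpack_block(word, n))        # [3, 1, 2, 3]
```

Note how the trade-off from the text shows up directly: at 2 bits per value one word holds 30 integers, but at 15 bits it holds only 4.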
Simple-8b compresses integers very well. However, if you have a large number of repeats of the same value, you can get even better compression with run-length encoding. This method works well for values that don't change very often, or if an earlier transformation removes the changes.
Run-length encoding is one of the classic compression algorithms. For time-series data with billions of contiguous zeroes, or even a document with a million identically repeated strings, run-length encoding works incredibly well.
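The core idea is to collapse each run of identical values into a (value, run length) pair. A minimal Python sketch, not TimescaleDB's implementation:

```python
def rle_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1            # extend the current run
        else:
            runs.append([v, 1])         # start a new run
    return [tuple(r) for r in runs]

def rle_decode(runs):
    """Expand each (value, run_length) pair back into the original sequence."""
    return [v for v, n in runs for _ in range(n)]

# A billion contiguous zeroes would collapse to the single pair (0, 1_000_000_000).
print(rle_encode([0, 0, 0, 0, 7, 7, 0]))  # [(0, 4), (7, 2), (0, 1)]
```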
Run-length encoding is also used as a building block for many more advanced algorithms, such as Simple-8b RLE, which combines run-length encoding with Simple-8b. TimescaleDB implements a variant of Simple-8b RLE that uses different block sizes from standard Simple-8b, in order to handle 64-bit values and RLE.
The standard XOR-based compression method has been extended so that data can be decompressed in reverse order. Backward scanning queries are common in time-series workloads, so this extension makes queries that use backward scans run much faster.
Floating point numbers are usually more difficult to compress than integers. Fixed-length integers often have leading zeroes, but floating point numbers usually use all of their available bits, especially if they are converted from decimal numbers, which can't be represented precisely in binary.
Techniques like delta encoding don't work well for floats, because they do not reduce the number of bits sufficiently. This means that most floating-point compression algorithms tend to be either complex and slow, or truncate significant digits. One of the few simple and fast lossless floating-point compression algorithms is XOR-based compression, built on top of Facebook's Gorilla compression.
XOR is the binary exclusive-or function. In this algorithm, successive floating point numbers are compared with XOR, and only the bits that differ need to be stored. The first data point is stored without compression, and subsequent data points are represented by their XOR with the previous value.
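The XOR stage can be illustrated in Python by reinterpreting each double's bytes as a 64-bit integer. This sketch shows only the XOR step, assuming little-endian struct packing; the real Gorilla format goes further and bit-packs the leading and trailing zero counts of each XOR result:

```python
import struct

def float_to_bits(f):
    """Reinterpret a double's bytes as an unsigned 64-bit integer."""
    return struct.unpack('<Q', struct.pack('<d', f))[0]

def bits_to_float(b):
    return struct.unpack('<d', struct.pack('<Q', b))[0]

def xor_encode(floats):
    """First value is stored raw; each later value is XORed with its predecessor."""
    bits = [float_to_bits(f) for f in floats]
    return bits[:1] + [prev ^ cur for prev, cur in zip(bits, bits[1:])]

def xor_decode(words):
    bits = [words[0]]
    for w in words[1:]:
        bits.append(bits[-1] ^ w)
    return [bits_to_float(b) for b in bits]

# Identical consecutive readings XOR to 0; similar values share many zero bits.
words = xor_encode([22.5, 22.5, 22.75])
print(words[1])  # 0
```

Because XOR is its own inverse, decoding simply re-applies XOR against the running value, making the scheme lossless.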
One of the earliest lossless compression algorithms, dictionary compression is the basis of many popular compression methods. Dictionary compression can also be found in areas outside of computer science, such as medical coding.
Instead of storing values directly, dictionary compression works by making a list of the possible values that can appear, and then storing an index into a dictionary containing the unique values. This technique is quite versatile, can be used regardless of data type, and works especially well when you have a limited set of values that repeat frequently.
For a dataset with a lot of repetition, this can offer significant compression. In the example, each city name is on average 11 bytes long, while the indices are never more than 4 bytes long, reducing space usage by nearly a factor of three. In TimescaleDB, the list of indices is compressed even further with the Simple-8b+RLE method, making the storage cost even smaller.
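A minimal sketch of the technique in Python, using a hypothetical city-name column rather than TimescaleDB's internal representation:

```python
def dict_encode(values):
    """Replace each value with an index into a list of unique values."""
    dictionary, positions, indices = [], {}, []
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)  # first time seen: add to dictionary
            dictionary.append(v)
        indices.append(positions[v])
    return dictionary, indices

def dict_decode(dictionary, indices):
    """Look each index back up in the dictionary."""
    return [dictionary[i] for i in indices]

cities = ["New York", "San Francisco", "New York", "New York"]
print(dict_encode(cities))
# (['New York', 'San Francisco'], [0, 1, 0, 0])
```

Each repeated string is stored once, and the repeats become small integers, which is exactly why the index list is such a good candidate for further Simple-8b+RLE compression.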
Dictionary compression doesn't always result in savings. If your dataset doesn't have a lot of repeated values, then the dictionary is the same size as the original data. TimescaleDB automatically detects this case, and falls back to not using a dictionary in that scenario.