Data compression


alexandre...@gmail.com

Jul 19, 2017, 5:52:34 AM
to ClickHouse
Hello,

I see that data is indeed compressed on ClickHouse servers, and I would like to improve the performance of this task if possible.

I see in the config.xml file that I can use the <compression> setting:

<!-- Uncomment if you want data to be compressed 30-100% better.
     Don't do that if you just started using ClickHouse.
  -->
<compression>
    <!-- Set of variants. Checked in order. Last matching case wins. If nothing matches, lz4 will be used. -->
    <case>
        <!-- Conditions. All must be satisfied. Some conditions may be omitted. -->
        <min_part_size>10000000000</min_part_size>         <!-- Min part size in bytes. -->
        <min_part_size_ratio>0.01</min_part_size_ratio>    <!-- Min size of part relative to whole table size. -->

        <!-- What compression method to use. -->
        <method>zstd</method>    <!-- Keep in mind that zstd compression library is highly experimental. -->
    </case>
</compression>

When you say that it's 30-100% better, what do you mean exactly?
Is the data compressed more? Is it faster to query data with this compression?


Are the min_part_size and min_part_size_ratio values in the configuration above optimal?


Are there methods for compression other than zstd?

Thank you for your help.

man...@gmail.com

Jul 24, 2017, 1:23:28 PM
to ClickHouse
Hello.

There are just two supported compression methods: lz4 and zstd. lz4 is the default and is set implicitly.

lz4 is fast and light.
zstd is stronger: it compresses and decompresses data more slowly, but with a better compression ratio.
When data is compressed with zstd, queries can execute more slowly. It depends on the amount of computation in the query and on the efficiency of the disk subsystem.
When a simple query reads data from the page cache, the performance penalty due to zstd can be up to three times, but this is an extreme case.
When a query reads data from HDDs, it can even run faster with zstd.
And when a query has to perform many difficult calculations, there will be almost no difference in performance between lz4 and zstd, because decompression takes little time relative to the overall computation.

In typical cases, it is reasonable to compress "cold" data more strongly and leave "hot" data with lighter compression.
"min_part_size" and "min_part_size_ratio" are intended for that purpose. In MergeTree tables, data consists of "parts", and older data resides in larger parts. So you can enable zstd only for sufficiently large parts.
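To make that concrete, a minimal sketch of such a config could look like the following (the threshold values here are only illustrative, not recommendations):

```xml
<compression>
    <!-- Parts of at least 10 GB that also make up at least 1% of the table
         are likely "cold" merged data, so compress them with zstd.
         Anything that matches no case falls through to the implicit lz4 default,
         which keeps fresh, small, "hot" parts on the lighter codec. -->
    <case>
        <min_part_size>10000000000</min_part_size>
        <min_part_size_ratio>0.01</min_part_size_ratio>
        <method>zstd</method>
    </case>
</compression>
```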

To examine data parts, run the following query:
SELECT * FROM system.parts WHERE active
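If you want to see how well each part actually compresses, a query along these lines may help. (The column names data_compressed_bytes and data_uncompressed_bytes are as found in system.parts on recent ClickHouse versions; check the columns available on your server.)

```sql
SELECT
    table,
    name,
    data_compressed_bytes,
    data_uncompressed_bytes,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.parts
WHERE active
ORDER BY data_compressed_bytes DESC
```

Comparing the ratios of large (older) parts against small (fresh) ones can help you pick a sensible min_part_size threshold.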