Pinning MergeTree data in memory


S M

Jul 28, 2016, 3:01:20 AM7/28/16
to ClickHouse
Hello,

We have been evaluating ClickHouse and are very impressed by its performance, but there seem to be some things that can be tuned even further.

Our servers currently have 256GB-1TB of RAM. Assuming we shard our dataset so that it always fits in RAM, we see a significant performance difference between 'Memory' tables and 'MergeTree' tables, even when the entire MergeTree dataset fits in RAM.

Some small tests:

SELECT
    uniq(session_id) AS total,
    post_id
FROM hits
WHERE (ts >= '2016-06-01 23:45:00') AND (sr_domain LIKE '%face%')
GROUP BY post_id
ORDER BY total DESC
LIMIT 25

On MergeTree (after ensuring all data is in the FS cache, and after running OPTIMIZE a few times to get the data fully sorted):
25 rows in set. Elapsed: 0.313 sec. Processed 42.53 million rows, 1.47 GB (135.87 million rows/s., 4.69 GB/s.)

On Memory:
25 rows in set. Elapsed: 0.187 sec. Processed 108.24 million rows, 3.64 GB (578.89 million rows/s., 19.46 GB/s.)

So, even though the Memory table has to scan about 2.5x as much data, it is still much faster.

In the case of count():
SELECT
    count() AS total,
    post_id
FROM ld_hits_mem
WHERE (ts >= '2016-06-01 23:45:00') AND (sr_domain LIKE '%face%')
GROUP BY post_id
ORDER BY total DESC
LIMIT 25
..

MergeTree: 25 rows in set. Elapsed: 0.166 sec. Processed 42.53 million rows, 1.13 GB (255.84 million rows/s., 6.78 GB/s.)
Memory: 25 rows in set. Elapsed: 0.121 sec. Processed 108.24 million rows, 2.77 GB (892.62 million rows/s., 22.87 GB/s.)

This is after already making some changes to the server:
Changed uncompressed_cache_size to 100GB.
Changed max_memory_usage to 200GB.
Set use_uncompressed_cache to 1.

I also tried setting merge_tree_max_rows_to_use_cache to a very high value (200000000), but it made no difference -- this setting I changed only in a single session; the rest are in config.xml.
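For reference, the per-session overrides above look like this in clickhouse-client (a sketch using the values from this post; uncompressed_cache_size itself is a server-wide setting and goes in config.xml instead):

```sql
-- Session-level settings (values from this post).
SET use_uncompressed_cache = 1;
SET max_memory_usage = 200000000000;               -- 200 GB
SET merge_tree_max_rows_to_use_cache = 200000000;  -- raised cache-use limit
```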

Server has 256GB of RAM, compressed dataset size: 17GB. CSV size: 41GB. Around 25 columns.

My question -- is there any way to force MergeTree tables to be stored completely uncompressed and deserialized, so we get the benefits of indexing + MVCC + the speed of Memory tables?

Thank you!

P.S: The dataset is replicated elsewhere outside of clickhouse, if that makes a difference.

man...@gmail.com

Jul 28, 2016, 6:36:36 PM7/28/16
to ClickHouse
Hello.
Thanks for interesting observations!


> is there any way to force-store mergetrees completely uncompressed and deserialized
Currently there is no way, but an implementation is possible in the future.

Let's look at the performance difference in detail.
When reading from MergeTree, data needs to be (all steps are done per small block of data):

1. Copied from the page cache to userspace.
This is done by the copy_user_* functions in the Linux kernel. It is like memcpy, but slightly less efficient. Performance is about 5 GB/sec/core of compressed data.
Sometimes there can be memory management issues: for example, (minor) page faults, or the kernel searching for free pages.

2. Decompressed.
By default, LZ4 is used. Depending on the dataset, performance is about 700-2000 MB/sec/core of decompressed data.

3. Deserialized.
For fixed-width data types, deserialization is trivial, but it is an additional memcpy (from the IO buffer to the column object in memory).
For String/Array it is less trivial, though highly optimized. Expect 3-6 GB/sec/core of decompressed data.

After these steps, the data is in the same form as in a Memory table.

To avoid the first two steps, there is the uncompressed cache.
Usually our datasets are much larger than RAM even in compressed form (for our servers, 128 GiB RAM and ~30 TB of storage is typical).
For us, the uncompressed cache helps with repeating queries or different queries over the same ranges of data.
So we introduced limits for it, for example merge_tree_max_rows_to_use_cache, to avoid cache thrashing by large queries.
And we enable the uncompressed cache manually only for some types of queries.

So, first you need to check that the uncompressed cache is working.
Is there a performance difference between use_uncompressed_cache = 0 and 1?
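A quick way to check, sketched here with the query from the first post: run the same query with the cache off and on, and compare timings (with the cache enabled, the first run populates it and the second should be faster):

```sql
SET use_uncompressed_cache = 0;
SELECT uniq(session_id) AS total, post_id FROM hits
WHERE (ts >= '2016-06-01 23:45:00') AND (sr_domain LIKE '%face%')
GROUP BY post_id ORDER BY total DESC LIMIT 25;

SET use_uncompressed_cache = 1;
-- run the query twice: the second run should hit the cache
SELECT uniq(session_id) AS total, post_id FROM hits
WHERE (ts >= '2016-06-01 23:45:00') AND (sr_domain LIKE '%face%')
GROUP BY post_id ORDER BY total DESC LIMIT 25;
```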

To get an overview of the performance impact of the steps mentioned above, you can use the 'perf top' utility on Linux.
Just run 'perf top' in one terminal and run your queries in another. You will see items like:

copy_user_* - reading from page cache
LZ4_decompress_fast - decompressing
DataType*::deserializeBinary - deserialization
and so on.

When the uncompressed cache is used, you should not see a large impact from the first two items.

man...@gmail.com

Jul 28, 2016, 6:58:33 PM7/28/16
to ClickHouse
What else could you tune for maximum performance?

1. max_block_size.
For some queries, it may be better to lower this setting to be more CPU-cache friendly.
But it is not always a win, because sometimes there is a large per-block overhead.
By default it is 65536. You can lower it to any value greater than or equal to the table's index granularity; if you set a lower value, it will be rounded up internally.
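As a sketch, assuming a table created with the default index granularity of 8192:

```sql
-- Lower max_block_size toward the table's index granularity for better
-- CPU-cache locality; values below the granularity are rounded up internally.
SET max_block_size = 8192;
SELECT count(), post_id FROM hits GROUP BY post_id ORDER BY count() DESC LIMIT 25;
```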

2. max_threads.
By default, ClickHouse uses a number of threads equal to the number of physical CPU cores, without hyper-threading.
For example, on Intel CPUs with 2-way hyper-threading, if you have 32 logical cores, ClickHouse will use 16 threads.
But hyper-threading is not useless; it helps, for example, when your queries use large hash tables for aggregation.
The win from hyper-threading is sub-linear, though: you might get a 1.5x performance boost when using 32 threads instead of 16.
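To measure the hyper-threading gain on the 32-logical-core example above, the override is per session:

```sql
-- Default would be 16 (physical cores); try all logical cores and compare
SET max_threads = 32;
```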

3. compile.
This option optimizes the inner loop of GROUP BY with runtime code generation. It is disabled by default because the performance impact is low on average. But for simple queries it is sometimes a big win (a query may speed up several times). Documentation is only in Russian so far: https://clickhouse.yandex/reference_ru.html#compile
To test this option, first set compile = 1, min_count_to_compile = 0. This compiles the query synchronously, with a pause while compilation runs. If compilation fails, you will get an error message in the client. For production usage, set min_count_to_compile to a higher value.
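A minimal test sequence, using the hits table from earlier in the thread:

```sql
-- Force synchronous compilation so any errors surface immediately in the client
SET compile = 1;
SET min_count_to_compile = 0;
-- a simple GROUP BY inner loop is where code generation helps most
SELECT count() AS total, post_id FROM hits GROUP BY post_id ORDER BY total DESC LIMIT 25;
```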

S M

Jul 28, 2016, 10:09:11 PM7/28/16
to ClickHouse
Hello!

I will test the performance of max_block_size, max_threads, compile later today and report back my findings.

A few things of note:
The dataset size was actually 4.7GB under ClickHouse MergeTree, not 17GB as reported (I thought waiting 8 minutes was enough for all old tables to be removed, but I was wrong).
After my post yesterday, I built a custom version of clickhouse-server with LZ4 using LZ4_compress_fast, with acceleration set to 50. Details about acceleration are available here: https://fastcompression.blogspot.com/2015/04/sampling-or-faster-lz4.html

After re-loading the dataset (into a new table, preserving the original), the same table was 6.2GB. Query times were absolutely identical between LZ4 with and without acceleration. It seems LZ4 is not the bottleneck. I don't think it is serialization either, as the types in the queries above are simple (UInt32 for post_id, UInt64 for session_id).

Thank you for your detailed response!

S M

Jul 28, 2016, 11:07:23 PM7/28/16
to ClickHouse
Just wanted to add, I was mistaken in my earlier statement:
> the types in above queries are simple (UInt32 for post_id, UInt64 for session_id)
The filter is on sr_domain, which is a String. I guess ts (DateTime) is simply stored as a UInt32, so not a concern.

> When uncompressed_cache is used, you should not see large impact from first two items.
Does this mean data is still stored in serialized form in the uncompressed cache?

Finally, is deserialization of fixed-length strings (FixedString) expected to be much faster than String?

man...@gmail.com

Jul 29, 2016, 12:26:30 AM7/29/16
to ClickHouse
Great that you have time to test LZ4_compress_fast; I was also interested in it.

On Friday, July 29, 2016 at 5:09:11 UTC+3, S M wrote:

man...@gmail.com

Jul 29, 2016, 12:30:46 AM7/29/16
to ClickHouse
DateTime is stored as UInt32 (4 bytes).

Deserialization of FixedString is expected to be faster than String only if using FixedString does not add much overhead to the average length.
I recommend using FixedString only when the data has a naturally fixed length (examples: IPv6 addresses, UUIDs, cryptographic hashes, stock ticker IDs).
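A sketch of that recommendation, using the 2016-era MergeTree syntax and a hypothetical table:

```sql
-- Good fit: values are naturally fixed-length (raw 32-byte SHA-256 hashes),
-- so FixedString adds no padding overhead.
CREATE TABLE content_hashes
(
    d Date,
    sha256 FixedString(32)
) ENGINE = MergeTree(d, (sha256), 8192);

-- Poor fit: domain names vary in length, so FixedString would pad short
-- values and truncate long ones; keep a column like sr_domain as String.
```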

man...@gmail.com

Jul 29, 2016, 12:31:09 AM7/29/16
to ClickHouse
> Does this mean data is still stored in serialized format in uncompressed cache?

Yes.

S M

Aug 2, 2016, 10:23:41 PM8/2/16
to ClickHouse
Quick update:

For queries not referencing any string columns, with a simple group by:
for 7 mln rows:
memory (compile=1): 137 statements/sec
mergetree with uncompressed cache (compile=1): 157 statements/sec
mergetree without uncompressed cache (compile=0): 107 statements/sec
mergetree without uncompressed cache (compile=1): 118 statements/sec

for 38 mln rows: (all with compile=1)
memory: 26 statements/sec
mergetree with uncompressed cache: 21 statements/sec
mergetree without uncompressed cache: 23 statements/sec

for queries referencing a string column (compile=1):
for 7 mln rows:
memory: 93 statements/sec
mergetree with uncompressed cache: 37.9 statements/sec
mergetree without uncompressed cache: 38 statements/sec

for 38 mln rows (compile=1):
memory: 18.8 statements/sec
mergetree with uncompressed cache: 7.2 statements/sec
mergetree without uncompressed cache: 7.1 statements/sec

Notes: for queries referencing a string column, deserialization is much more expensive than decompression. For very short queries not referencing strings, Memory is outperformed by MergeTree (?).

How much work would it be to make the uncompressed cache hold deserialized data? Could you point me to the right places in the source code? All other options had minimal impact.

Thank you!

man...@gmail.com

Aug 3, 2016, 3:11:23 PM8/3/16
to ClickHouse
> For very short queries not referencing strings, memory is outperformed by mergetree (?).

There may be two reasons.

1. MergeTree stores data in a different order.
Aggregation benefits from identical consecutive keys, both for a natural reason (cache locality) and because there is an optimization that skips recalculating the key and its hash for identical consecutive keys.

2. MergeTree allows reading data in blocks of arbitrary size, which can be tuned per query via 'max_block_size' (= 65536).
(A block is a subset of rows and columns - the element that is passed through the query execution pipeline and processed by many operations as a whole.)
By contrast, a Memory table stores data in the blocks that were formed during insertion - 'max_insert_block_size' (= 1048576) rows. Block sizes read from Memory tables cannot be tuned per query. Such a big block size is bad for cache locality during query processing (it usually doesn't fit in L3 cache).

Such a big 'max_insert_block_size' is good for insertion into tables like MergeTree, because MergeTree-family tables form a 'data part' (a directory with many files) for each inserted block.

To see the effective block sizes during query execution, write SELECT avg(blockSize()) FROM ...
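For example, comparing the two table types from this thread (table names as used earlier):

```sql
SELECT avg(blockSize()) FROM hits;         -- MergeTree: bounded by max_block_size (65536)
SELECT avg(blockSize()) FROM ld_hits_mem;  -- Memory: insert-time blocks, up to 1048576 rows
```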



> How much work will it be to make the uncompressed cache hold deserialized data? Could you point me to the right places in the source code? All other options had minimal impact.

The uncompressed cache is exactly a cache for the map: (file path, offset of compressed block) -> decompressed data.
Code to look at:

CompressedReadBuffer: reads compressed data from anywhere;
CompressedReadBufferFromFile: additionally allows seeks;
CompressedReadBufferBase: implements common decompression functions;
CachedCompressedReadBuffer: a drop-in replacement for CompressedReadBufferFromFile that uses the cache;
UncompressedCache: the cache itself;
MergeTreeReader: reads data for MergeTree tables; it selects between CachedCompressedReadBuffer and CompressedReadBufferFromFile.

You would need a different cache for deserialized data. For example, it could be a cache for the map:
(data part in MergeTree, column name, mark in the sparse index) -> ColumnPtr.

The cache would need to be used somewhere in MergeTreeReader, maybe in the readRange function, maybe with support from MergeTreeBlockInputStream.
The implementation is estimated to be of medium difficulty. It is not completely straightforward; there will be some obstacles.
