For low-cardinality fields (Country, ...) — store strings as-is or use dictionary and store IDs?

256 views
Skip to first unread message

Max M

unread,
Sep 14, 2016, 3:02:39 AM9/14/16
to ClickHouse
Hello. How ClickHouse fits such case with necessary of storing of dictionary-like items (with cardinality of 1-10k)? Like site countries, cities, browser names, vendor names? Does ClickHouse make internal dictionary like druid.io does, how string matching affects on query speed and increases DB stored size?

man...@gmail.com

unread,
Sep 14, 2016, 2:39:50 PM9/14/16
to ClickHouse
Hello.

It is more efficiently to store low cardinality fields as integers or enums than as strings.
In terms of performance, it is much more efficiently. In terms of database size, it it just slightly more efficiently.

ClickHouse does compression with lz4 library by default, that is implementation of LZ77 algorithm, which do substring matching and works somewhat like dictionary.
But to process data, ClickHouse decompresses it first and works on decompressed data. It means, that GROUP BY, comparison and so on, for low cardinality string fields, work on strings rather than integers, that is worse.

There are Enum8, Enum16 data types, which works like integers internally and looks like strings to user.
https://clickhouse.yandex/reference_en.html#Enum
Reply all
Reply to author
Forward
0 new messages