Using nested structures for metrics: performance and best practices when working with time-series data


ste...@activitystream.com

Jun 7, 2017, 12:45:01 PM
to ClickHouse
Hi,

We are storing metrics without knowing exactly which metrics there are, and we have roughly two options with ClickHouse:
  • Dynamically maintain the schema
  • Store the metrics in a nested structure
We are trying to understand whether there is a considerable performance penalty involved in using the second approach (it would be easier for us) and what the most efficient way to deal with it would be.

A simple nested structure might look something like this:
CREATE TABLE metrics
(
  entity_id String,
  occurred_date Date,
  occurred_at DateTime,
  metricsMap Nested (
    name String,
    value Float64
  )
) ENGINE = ReplacingMergeTree(occurred_date, (entity_id, occurred_at), 8192)


Inserting data in JSON format would then look something like this:
insert into etactica.entitysamples_metrics FORMAT JSONEachRow { "entity_id":"some-id", "occurred_date" : "2017-06-07", "occurred_at" : "2017-06-07 00:03:00", "metricsMap.name" : ["current", "kwh", "price"], "metricsMap.value" : [0.1, 0.3, 0.0003]}

Querying for aggregated metrics (possibly more than one at a time) could look something like this:
  • select entity_id, sum(metricsMap.value[indexOf(metricsMap.name, 'kwh')]) as kwh, countIf(has(metricsMap.name,'kwh')) as kwh_values from etactica.entitysamples_metrics where has(dimensionsMap.name, 'value') group by entity_id;

Can anyone please tell me if there is another, faster and more elegant, way to achieve this, and whether this is known to degrade performance to a considerable degree when working with large data sets?


Best regards,
 -Stefán

Vitaliy Lyudvichenko

Jun 8, 2017, 9:19:59 AM
to ClickHouse
The efficiency of the nested-structure approach significantly depends on how many keys you plan to store inside each metricsMap.
Say you plan to store N unique keys in each row on average.
As I can see, each of your queries touches only a single key in metricsMap, so the remaining N-1 will be read from disk and consume CPU cache without any real need.
If N is big, you will lose the benefit of a column-oriented DB, which reads only the columns that are really required (in effect you emulate "row-oriented" storage).
You will also lose the ability to drop unnecessary or irrelevant keys from events.
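
As a sketch for comparison, here is roughly what the first option (a dynamically maintained schema) could look like; the table and column names are illustrative, not from this thread. A query for a single metric then reads only that metric's column:

-- Hypothetical per-metric schema, one Float64 column per metric
-- (uses the same old-style engine parameters as the example above).
CREATE TABLE metrics_flat
(
  entity_id String,
  occurred_date Date,
  occurred_at DateTime,
  current Float64,
  kwh Float64,
  price Float64
) ENGINE = ReplacingMergeTree(occurred_date, (entity_id, occurred_at), 8192)

-- Reading one metric touches only that column:
SELECT entity_id, sum(kwh) AS kwh FROM metrics_flat GROUP BY entity_id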

On Wednesday, June 7, 2017 at 19:45:01 UTC+3, ste...@activitystream.com wrote:

ste...@activitystream.com

Jun 8, 2017, 9:28:35 AM
to ClickHouse
Hi Vitaliy,

These will always be few (1–5). We have a schema structure that does not require this, but we also need some flexibility (regarding parameter names, custom metrics and custom dimensions).

I'm hoping our tests reveal that this is doable, as the only alternative (as I see it) is to store these as multiple records, more like traditional TS databases that would keep them in separate time series.

So far we are pretty happy, but I would love to get some assistance with aggregating:
  • metrics.name = ['a','b','d'] metrics.value = [0.1, 0.2, 0.4] with a slightly different set of metrics.name = ['a','b','c'] metrics.value = [0.1, 0.2, 0.3]
  • to become: metrics.name = ['a','b','c','d'] metrics.value = [0.2, 0.4, 0.3, 0.4]
I'm reading the docs trying to find suitable, not too draining, ways to do this, and I think it's there (the aggregation support is excellent), but some example code from someone would be highly appreciated :)

Regards,
   -Stefán

Vitaliy Lyudvichenko

Jun 8, 2017, 3:17:57 PM
to ClickHouse
In your case it makes sense to try the ARRAY JOIN clause.
We actively use it with nested fields.
Also pay attention to aggregate function combinators like -Merge and to the aggregate function groupUniqArray.
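
For example, applied to the metrics table from the first message, an ARRAY JOIN over the nested column could look roughly like this (a sketch, untested against that exact schema):

-- Unfold metricsMap into one row per (name, value) pair,
-- then aggregate per metric name.
SELECT
    entity_id,
    name,
    sum(value) AS total
FROM metrics
ARRAY JOIN
    metricsMap.name AS name,
    metricsMap.value AS value
GROUP BY entity_id, name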




Also, there is the following bad possibility; don't use it:

SELECT
    ['a', 'b', 'd'] AS k1,
    [0.1, 0.2, 0.4] AS v1,
    ['a', 'b', 'c'] AS k2,
    [0.1, 0.2, 0.3] AS v2,
    arrayReduce('groupArrayArray', [k1, k2]) AS k_concat,
    arrayReduce('groupArrayArray', [v1, v2]) AS v_concat,
    arrayEnumerateUniq(k_concat) AS k_is_uniq,
    arrayFilter(x -> (x != ''), arrayMap(x -> if(k_is_uniq[x] = 1, k_concat[x], ''), arrayEnumerate(k_concat))) AS k_result,
    arrayFilter(x -> (x = x), arrayMap(x -> if(k_is_uniq[x] = 1, v_concat[x], nan), arrayEnumerate(k_concat))) AS v_result


On Thursday, June 8, 2017 at 16:28:35 UTC+3, ste...@activitystream.com wrote:

Vitaliy Lyudvichenko

Jun 8, 2017, 3:27:27 PM
to ClickHouse
Slightly better option:
SELECT
    ['a', 'b', 'd'] AS k1,
    [0.1, 0.2, 0.4] AS v1,
    ['a', 'b', 'c'] AS k2,
    [0.1, 0.2, 0.3] AS v2,
    arrayReduce('groupArrayArray', [k1, k2]) AS k_concat,
    arrayReduce('groupArrayArray', [v1, v2]) AS v_concat,
    arrayEnumerateUniq(k_concat) AS k_is_uniq,
    arrayFilter((x, y) -> (y = 1), k_concat, k_is_uniq) AS k_result,
    arrayFilter((x, y) -> (y = 1), v_concat, k_is_uniq) AS v_result



On Thursday, June 8, 2017 at 22:17:57 UTC+3, Vitaliy Lyudvichenko wrote:

ste...@activitystream.com

Jun 8, 2017, 4:08:01 PM
to ClickHouse

Thank you Vitaliy, this is great.

Can you clarify for me how you would use these with aggregations (sorry to be thick)?

Regards,
 -Stefan

Vitaliy Lyudvichenko

Jun 13, 2017, 8:20:12 AM
to ClickHouse
I don't remember exactly what I meant, but probably something like the following:

CREATE TABLE a (k Array(String), v Array(Float32)) ENGINE = Log

INSERT INTO a VALUES (['a', 'b', 'd'], [0.1, 0.2, 0.4]) (['a', 'b', 'c'], [0.1, 0.2, 0.3])

SELECT * FROM a ARRAY JOIN k, v

SELECT uniq_pair.1 AS k, uniq_pair.2 AS v
FROM (SELECT groupUniqArray((k, v)) AS uniq_pair FROM a ARRAY JOIN k, v)
ARRAY JOIN uniq_pair
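
Translated back to the nested metricsMap schema from the first message, that last query would look roughly like this (a sketch, not tested against that schema):

-- Collect the unique (name, value) pairs across rows,
-- then unfold them back into rows.
SELECT uniq_pair.1 AS name, uniq_pair.2 AS value
FROM
(
    SELECT groupUniqArray((name, value)) AS uniq_pair
    FROM metrics
    ARRAY JOIN metricsMap.name AS name, metricsMap.value AS value
)
ARRAY JOIN uniq_pair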



On Thursday, June 8, 2017 at 23:08:01 UTC+3, ste...@activitystream.com wrote:

ste...@activitystream.com

Jun 13, 2017, 9:17:12 AM
to ClickHouse
Thanks a lot. I will give this a try :)

ste...@activitystream.com

Jun 21, 2017, 12:18:10 PM
to ClickHouse
Hi Vitaliy,

This does indeed return the unique group values.
I'm looking for something that would return the sum of the values in v.

It now returns:
┌─uniq_pair─────────────────────────────────┐
│ [('c',0.3),('d',0.4),('b',0.2),('a',0.1)] │
└───────────────────────────────────────────┘

But with summing it would return:
┌─uniq_pair─────────────────────────────────┐
│ [('c',0.3),('d',0.4),('b',0.4),('a',0.2)] │
└───────────────────────────────────────────┘

That would also avoid selecting all the unique combinations if the values differ a lot.

Any ideas?

Regards,
  -Stefán

ste...@activitystream.com

Jul 2, 2017, 1:41:44 PM
to ClickHouse
Anyone? :)

Vitaliy Lyudvichenko

Jul 2, 2017, 3:06:43 PM
to ClickHouse
Hi-)

You could try the following:
SELECT
    [('c', 0.3), ('d', 0.4), ('b', 0.2), ('a', 0.1)] AS uniq_pair,
    arrayReduce('sum', arrayMap(x -> x.2, uniq_pair))
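
Combined with the earlier groupUniqArray subquery over table a, the same idea could be written as (a sketch along the same lines, not from the thread):

-- Deduplicate the (k, v) pairs, then sum the values.
SELECT arrayReduce('sum', arrayMap(x -> x.2, uniq_pair)) AS total
FROM
(
    SELECT groupUniqArray((k, v)) AS uniq_pair
    FROM a
    ARRAY JOIN k, v
)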



On Sunday, July 2, 2017 at 20:41:44 UTC+3, ste...@activitystream.com wrote:

Vitaliy Lyudvichenko

Jul 2, 2017, 3:34:45 PM
to ClickHouse
BTW

Also consider the following way:

SELECT sum(v_uniq)
FROM
(
    SELECT
        n AS n_uniq,
        any(v) AS v_uniq
    FROM array_kv
    ARRAY JOIN
        name AS n,
        value AS v
    GROUP BY n
)


┌─sum(v_uniq)─┐
│           1 │
└─────────────┘
 

Where array_kv is:
SELECT *
FROM array_kv


┌─name──────────┬─value─────────────────────────┐
│ ['a','b','d'] │ [0.1,0.2,0.4]                 │
│ ['a','b','c'] │ [0.1,0.2,0.30000000000000004] │
└───────────────┴───────────────────────────────┘
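
If the goal is the per-key sums asked about earlier (a = 0.2, b = 0.4, c = 0.3, d = 0.4) rather than deduplicated values, a small variation of the query above might be (a sketch, not from the thread):

-- Sum values per name instead of deduplicating them.
SELECT
    n AS name,
    sum(v) AS total
FROM array_kv
ARRAY JOIN
    name AS n,
    value AS v
GROUP BY n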

On Sunday, July 2, 2017 at 22:06:43 UTC+3, Vitaliy Lyudvichenko wrote: