Using nested structures for metrics: performance and best practices when working with time-series data


ste...@activitystream.com

Jun 7, 2017, 12:45:01 PM
to ClickHouse
Hi,

We are storing metrics without knowing exactly which metrics there are, and we have roughly two options with ClickHouse:
  • Dynamically maintain the schema
  • Store the metrics in a nested structure
We are trying to understand whether there is a considerable performance penalty involved in using the second approach (it would be easier for us) and what the most efficient way to deal with it would be.

A simple nested structure might look something like this:
CREATE TABLE metrics
(
  entity_id String,
  occurred_date Date,
  occurred_at DateTime,
  metricsMap Nested (
    name String,
    value Float64
  )
) ENGINE = ReplacingMergeTree(occurred_date, (entity_id, occurred_at), 8192)


Inserting data in JSON format would then look something like this:
insert into etactica.entitysamples_metrics FORMAT JSONEachRow { "entity_id":"some-id", "occurred_date" : "2017-06-07", "occurred_at" : "2017-06-07 00:03:00", "metricsMap.name" : ["current", "kwh", "price"], "metricsMap.value" : [0.1, 0.3, 0.0003]}

Querying for aggregated metrics (possibly more than one at a time) could look something like this:
  • select entity_id, sum(metricsMap.value[indexOf(metricsMap.name, 'kwh')]) as kwh, countIf(has(metricsMap.name,'kwh')) as kwh_values from etactica.entitysamples_metrics where has(dimensionsMap.name, 'value') group by entity_id;

Can anyone please tell me if there is another, faster and more elegant, way to achieve this, and whether this is known to degrade performance to a considerable degree when working with large data sets?


Best regards,
 -Stefán

Vitaliy Lyudvichenko

Jun 8, 2017, 9:19:59 AM
to ClickHouse
The efficiency of the nested-structure approach significantly depends on how many keys you plan to store inside each metricsMap.
Say you plan to store N unique keys in each row on average.
As I can see, each of your queries touches only a single key in metricsMap, so the remaining N-1 will be read from disk and consume CPU cache without any real need.
If N is big, you will lose the benefit of a column-oriented DB, which reads only the columns that are really required (in effect you emulate "row-oriented" storage).
You will also lose the ability to drop unnecessary or irrelevant keys from events.
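
As a sketch for comparison, here is roughly what the first option (a dynamically maintained schema) could look like; the table and column names are illustrative, not from this thread. A query for a single metric then reads only that metric's column:

-- Hypothetical per-metric schema, one Float64 column per metric
-- (uses the same old-style engine parameters as the example above).
CREATE TABLE metrics_flat
(
  entity_id String,
  occurred_date Date,
  occurred_at DateTime,
  current Float64,
  kwh Float64,
  price Float64
) ENGINE = ReplacingMergeTree(occurred_date, (entity_id, occurred_at), 8192)

-- Reading one metric touches only that column:
SELECT entity_id, sum(kwh) AS kwh FROM metrics_flat GROUP BY entity_id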

On Wednesday, June 7, 2017 at 19:45:01 UTC+3, ste...@activitystream.com wrote:

ste...@activitystream.com

Jun 8, 2017, 9:28:35 AM
to ClickHouse
Hi Vitaliy,

These will always be few (1–5). We have a schema structure that does not require this, but we also need some flexibility (regarding parameter names, custom metrics and custom dimensions).

I'm hoping our tests reveal that this is doable, as the only alternative (as I see it) is to store these as multiple records, more like traditional TS databases that would keep them in separate time series.

So far we are pretty happy, but I would love to get some assistance with aggregating:
  • metrics.name = ['a','b','d'] metrics.value = [0.1, 0.2, 0.4] with a slightly different set of metrics.name = ['a','b','c'] metrics.value = [0.1, 0.2, 0.3]
  • to become: metrics.name = ['a','b','c','d'] metrics.value = [0.2, 0.4, 0.3, 0.4]
I'm reading the docs trying to find suitable, not too draining, ways to do this, and I think it's there (the aggregation support is excellent), but some example code from someone would be highly appreciated :)

Regards,
   -Stefán

Vitaliy Lyudvichenko

Jun 8, 2017, 3:17:57 PM
to ClickHouse
In your case it makes sense to try the ARRAY JOIN clause.
We actively use it with nested fields.
Also pay attention to aggregate function combinators like -Merge and to the aggregate function groupUniqArray.
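
For example, applied to the metrics table from the first message, an ARRAY JOIN over the nested column could look roughly like this (a sketch, untested against that exact schema):

-- Unfold metricsMap into one row per (name, value) pair,
-- then aggregate per metric name.
SELECT
    entity_id,
    name,
    sum(value) AS total
FROM metrics
ARRAY JOIN
    metricsMap.name AS name,
    metricsMap.value AS value
GROUP BY entity_id, name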




Also, there is the following bad possibility; don't use it:

SELECT
    ['a', 'b', 'd'] AS k1,
    [0.1, 0.2, 0.4] AS v1,
    ['a', 'b', 'c'] AS k2,
    [0.1, 0.2, 0.3] AS v2,
    arrayReduce('groupArrayArray', [k1, k2]) AS k_concat,
    arrayReduce('groupArrayArray', [v1, v2]) AS v_concat,
    arrayEnumerateUniq(k_concat) AS k_is_uniq,
    arrayFilter(x -> (x != ''), arrayMap(x -> if(k_is_uniq[x] = 1, k_concat[x], ''), arrayEnumerate(k_concat))) AS k_result,
    arrayFilter(x -> (x = x), arrayMap(x -> if(k_is_uniq[x] = 1, v_concat[x], nan), arrayEnumerate(k_concat))) AS v_result


On Thursday, June 8, 2017 at 16:28:35 UTC+3, ste...@activitystream.com wrote:

Vitaliy Lyudvichenko

Jun 8, 2017, 3:27:27 PM
to ClickHouse
Slightly better option:
SELECT
    ['a', 'b', 'd'] AS k1,
    [0.1, 0.2, 0.4] AS v1,
    ['a', 'b', 'c'] AS k2,
    [0.1, 0.2, 0.3] AS v2,
    arrayReduce('groupArrayArray', [k1, k2]) AS k_concat,
    arrayReduce('groupArrayArray', [v1, v2]) AS v_concat,
    arrayEnumerateUniq(k_concat) AS k_is_uniq,
    arrayFilter((x, y) -> (y = 1), k_concat, k_is_uniq) AS k_result,
    arrayFilter((x, y) -> (y = 1), v_concat, k_is_uniq) AS v_result



On Thursday, June 8, 2017 at 22:17:57 UTC+3, Vitaliy Lyudvichenko wrote:

ste...@activitystream.com

Jun 8, 2017, 4:08:01 PM
to ClickHouse

Thank you Vitaliy, this is great.

Can you clarify for me how you would use these with aggregations (sorry to be thick)?

Regards,
 -Stefan

Vitaliy Lyudvichenko

Jun 13, 2017, 8:20:12 AM
to ClickHouse
I don't remember exactly what I meant, but probably something like the following:

CREATE TABLE a (k Array(String), v Array(Float32)) ENGINE = Log

INSERT INTO a VALUES (['a', 'b', 'd'], [0.1, 0.2, 0.4]) (['a', 'b', 'c'], [0.1, 0.2, 0.3])

SELECT * FROM a ARRAY JOIN k, v

SELECT uniq_pair.1 AS k, uniq_pair.2 AS v
FROM (SELECT groupUniqArray((k, v)) AS uniq_pair FROM a ARRAY JOIN k, v)
ARRAY JOIN uniq_pair
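
Translated back to the nested metricsMap schema from the first message, that last query would look roughly like this (a sketch, not tested against that schema):

-- Collect the unique (name, value) pairs across rows,
-- then unfold them back into rows.
SELECT uniq_pair.1 AS name, uniq_pair.2 AS value
FROM
(
    SELECT groupUniqArray((name, value)) AS uniq_pair
    FROM metrics
    ARRAY JOIN metricsMap.name AS name, metricsMap.value AS value
)
ARRAY JOIN uniq_pair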



On Thursday, June 8, 2017 at 23:08:01 UTC+3, ste...@activitystream.com wrote:

ste...@activitystream.com

Jun 13, 2017, 9:17:12 AM
to ClickHouse
Thanks a lot. I will give this a try :)

ste...@activitystream.com

Jun 21, 2017, 12:18:10 PM
to ClickHouse
Hi Vitaliy,

This does indeed return the unique group values.
I'm looking for something that would return the sum of the values in v.

It now returns:
┌─uniq_pair─────────────────────────────────┐
│ [('c',0.3),('d',0.4),('b',0.2),('a',0.1)] │
└───────────────────────────────────────────┘

But with summing it would return:
┌─uniq_pair─────────────────────────────────┐
│ [('c',0.3),('d',0.4),('b',0.4),('a',0.2)] │
└───────────────────────────────────────────┘

That would also avoid selecting all the unique combinations if the values differ a lot.

Any ideas?

Regards,
  -Stefán

ste...@activitystream.com

Jul 2, 2017, 1:41:44 PM
to ClickHouse
Anyone? :)

Vitaliy Lyudvichenko

Jul 2, 2017, 3:06:43 PM
to ClickHouse
Hi-)

You could try the following:
SELECT
    [('c', 0.3), ('d', 0.4), ('b', 0.2), ('a', 0.1)] AS uniq_pair,
    arrayReduce('sum', arrayMap(x -> x.2, uniq_pair))
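
Combined with the earlier groupUniqArray subquery over table a, the same idea could be written as (a sketch along the same lines, not from the thread):

-- Deduplicate the (k, v) pairs, then sum the values.
SELECT arrayReduce('sum', arrayMap(x -> x.2, uniq_pair)) AS total
FROM
(
    SELECT groupUniqArray((k, v)) AS uniq_pair
    FROM a
    ARRAY JOIN k, v
)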



On Sunday, July 2, 2017 at 20:41:44 UTC+3, ste...@activitystream.com wrote:

Vitaliy Lyudvichenko

Jul 2, 2017, 3:34:45 PM
to ClickHouse
BTW

Also consider the following way:

SELECT sum(v_uniq)
FROM
(
    SELECT
        n AS n_uniq,
        any(v) AS v_uniq
    FROM array_kv
    ARRAY JOIN
        name AS n,
        value AS v
    GROUP BY n
)


┌─sum(v_uniq)─┐
│           1 │
└─────────────┘
 

Where array_kv is:
SELECT *
FROM array_kv


┌─name──────────┬─value─────────────────────────┐
│ ['a','b','d'] │ [0.1,0.2,0.4]                 │
│ ['a','b','c'] │ [0.1,0.2,0.30000000000000004] │
└───────────────┴───────────────────────────────┘
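
If the goal is the per-key sums asked about earlier (a = 0.2, b = 0.4, c = 0.3, d = 0.4) rather than deduplicated values, a small variation of the query above might be (a sketch, not from the thread):

-- Sum values per name instead of deduplicating them.
SELECT
    n AS name,
    sum(v) AS total
FROM array_kv
ARRAY JOIN
    name AS n,
    value AS v
GROUP BY n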

On Sunday, July 2, 2017 at 22:06:43 UTC+3, Vitaliy Lyudvichenko wrote: