How to properly use the external dictionary with millions of keys?


Ivan Ladelschikov

Aug 3, 2017, 4:30:46 AM
to ClickHouse
I have a classical wide table with about a hundred columns, and I need to store some additional data (scores) about some of the rows. These scores should be updatable, so I decided to implement an external dictionary to meet this goal:

``` 
  the_data (ClickHouse table)
Id, field1, field2, .... field99
     |
     |  connect using external dictionary
     |  layout = cache, id = Id
     |
  scores (PostgreSQL table)
Id, score1, score2, score3
```
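For reference, a cache-layout dictionary like the one sketched above might be configured roughly as follows. This is a minimal sketch, not my actual config: the ODBC DSN name, lifetime, and cell count are assumptions.

```xml
<dictionaries>
    <dictionary>
        <name>scores</name>
        <source>
            <odbc>
                <!-- DSN name is an assumption; point it at the PostgreSQL database -->
                <connection_string>DSN=postgres_scores</connection_string>
                <table>scores</table>
            </odbc>
        </source>
        <lifetime>300</lifetime>
        <layout>
            <cache>
                <!-- number of cached keys; see size_in_cells discussion below -->
                <size_in_cells>1000000</size_in_cells>
            </cache>
        </layout>
        <structure>
            <id>
                <name>Id</name>
            </id>
            <attribute>
                <name>score1</name>
                <type>Float32</type>
                <null_value>0</null_value>
            </attribute>
            <attribute>
                <name>score2</name>
                <type>UInt16</type>
                <null_value>0</null_value>
            </attribute>
            <attribute>
                <name>score3</name>
                <type>Float32</type>
                <null_value>0</null_value>
            </attribute>
        </structure>
    </dictionary>
</dictionaries>
```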

Now I want to get the total report about those scores, so I query:

```
SELECT 
    avg(dictGetFloat32OrDefault('scores', 'score1', Id, toFloat32(0))) AS score1,
    avg(dictGetUInt16OrDefault('scores', 'score2', Id, toUInt16(0))) AS score2
FROM the_data

SELECT 
    field1,
    avg(dictGetFloat32OrDefault('scores', 'score1', Id, toFloat32(0))) AS score1,
    avg(dictGetUInt16OrDefault('scores', 'score2', Id, toUInt16(0))) AS score2
FROM the_data
GROUP BY field1
```

The problem arises when the number of rows with scores reaches the millions, because these simple reports take too long to finish.

How can I pre-aggregate those values, or should I use another storage option for this dictionary?
Or is this use case simply not possible for ClickHouse to implement efficiently?

Nikolai Kochetov

Aug 3, 2017, 9:03:56 AM
to ClickHouse
Increasing the <size_in_cells> setting for the cache layout may help if there is enough memory.

If possible, another solution would be to create that table inside ClickHouse with a simple engine like Log.
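A sketch of that approach, assuming the scores can be periodically reloaded into ClickHouse (the table name `scores_local` is hypothetical, and the reload mechanism itself is left out):

```sql
-- Keep a local copy of the scores. Log is a simple engine suited to a
-- small table that is fully rewritten on each reload.
CREATE TABLE scores_local
(
    Id UInt64,
    score1 Float32,
    score2 UInt16,
    score3 Float32
) ENGINE = Log;

-- The report then becomes a plain JOIN instead of per-row dictGet calls:
SELECT
    field1,
    avg(score1) AS score1,
    avg(score2) AS score2
FROM the_data
ANY LEFT JOIN scores_local USING Id
GROUP BY field1
```

Rows of `the_data` with no matching score contribute default (zero) values to the averages, matching the `dictGet*OrDefault` behavior in the original queries.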

On Thursday, August 3, 2017 at 11:30:46 UTC+3, Ivan Ladelschikov wrote: