Hello. First of all let me say you thx for such great product, it looks very promissing.
We are going to use CH as timeseries DB and we have worst case as usual -> we don't have predefined list of tags or metrics.
To avoid problems with performance (our current rate is 120000 datapoints per second (each with 3-10 or more tags), and will be in 5-10 times more) we reject schema like:
CREATE TABLE metrics
(
accountId Int,
timestamp DateTime,
measures Nested (
name String,
value Float64
),
tags Nested (
name String,
value Float64
)
) ENGINE = MergeTree(timestamp, (accountId, timestamp), 8192)
Our current idea is use wide table with 900 (or 500) columns for measures and 100 (or 50) columns for tags + mapping in dictionary with information
what exactly we store in measures and tags columns. We will have something like:
Measures Table columns:
ShardKey (accountId or better add accountId->shardKeys table one to many, to have ability scale in the future)
Timestamp
M1
M2
...
M900
T1
T2
...
T100
Metainfo table columns:
ShardKey
Name
ColumnName
As you see depending on "ShardKey" we may have different metrics in Mx and Tx columns. For example, for ShardKey=1 M1 may contain count of apples, but for ShardKey=2 M1 may contain count of oranges.
We have couple open questions here, maybe somebody will able make them more clear:
1. Will this approach work in general with CH? Looks like this is good scenario, wide table with sparse data, please confirm.
2. I see enough flexibility (for us at least), does somebody see some underwater stones in this approach?
3. I see/expect some problems with MetaInfo table, will be tricky keep it is actual state, mapping must be unique, but I didn't find some way how to do this in CH. Is there maybe some function like insertOnlyIfNotExist(pk, row) which will return existing row if already exist? Will be very helpful. Or maybe we must use external source here for MetaInfo table? Can somebody suggest something?
4. Very ofter (almost always) we have 2-10 datapoint with the same set of tags. There is 2 ways to insert them:
a: like separate rows, so we will produce 2-10 very sparse rows.
b: group by tags on our side and store as 1 row with 2-10 filled columns.
(we will cache and insert in batches in any case (on our side or use Buffer on CH side, we didn't decide yet), Q is should we group before, will it give good speedup?)
Thx again, looking forward to try CH.