SAMPLE N, statistical "uniform" sampling, later non-PK components


Thomas Dullien

Mar 15, 2021, 8:12:53 AM3/15/21
to ClickHouse
Hey everybody,

I have multiple questions about the effective use of SAMPLE N.

Background:
We collect telemetry from a large number of sources and store it in ClickHouse, with one timestamped row every 5 seconds from each source. The telemetry is prone to "seasonality" and "spikes" -- i.e. the distribution of the data across the different sources is correlated and changes over time.
In order to analyze the data, we currently use SAMPLE N (we know that we need ~200k samples to have statistically valid estimates for our use case, so N in this case is ~200k).

If I understand the SAMPLE N source code correctly, it does the following (a rough SQL sketch of my reading follows the list):

1. Use the partition key to prune parts that are not needed.
2. Use the primary key to identify the "parts" (granules?) that need to be read. Call this the "universe".
3. Estimate how many rows are in these parts.
4. Calculate what fraction F of the row count from step 3 would be approximately N.
5. Calculate a lower and an upper bound for sampling from the universe, based on the fraction F and the specified offset into the universe.
6. Convert the lower and upper bounds to extra WHERE conditions.
7. Apply data skipping indices etc.
8. Read the data as in a non-sampling query.
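To make sure I'm reading this right, here is a minimal sketch of the kind of setup and query I have in mind. The table, the column names, and the rewritten condition are all made up for illustration; only the general shape (sampling expression inside the primary key, SAMPLE N turned into a range condition on that expression) is what I'm asking about.

    -- Hypothetical schema, names made up for illustration:
    CREATE TABLE telemetry
    (
        ts        DateTime,
        source_id UInt64,
        value     Float64
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(ts)
    ORDER BY (toDate(ts), intHash32(source_id))
    SAMPLE BY intHash32(source_id);

    -- The kind of query we run (N ~ 200k):
    SELECT avg(value)
    FROM telemetry
    SAMPLE 200000
    WHERE ts >= now() - INTERVAL 7 DAY;

    -- My reading of the steps above: after estimating the fraction
    -- F = N / (rows in the selected parts), the SAMPLE clause is
    -- effectively rewritten into an extra range condition on the
    -- sampling expression, roughly
    --   intHash32(source_id) >= lower AND intHash32(source_id) < upper
    -- and from there the query is executed like any other range query.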

This leads to my first question:

A) Given that SAMPLE N converts to a range query on the primary key, and then reads a certain amount of data sequentially from there, it seems to me that the sampled data would not be reasonably randomly distributed over the time period?
E.g. if we had data for a year, ordered by day, it looks to me as if a SAMPLE N query would sample a consecutive range of days rather than data points well-distributed over the full year?

Would an implementation that randomly discards parts-to-be-read until approximately N rows are expected to remain not be better from a statistical point of view (with performance tradeoffs, of course)?

If SAMPLE N does sample consecutive data points, then my next question becomes relevant:

B) From a statistical point of view (not from the implementation point of view), a user who issues a SAMPLE N usually wants a sample that (a) is as uniformly distributed over the data range as possible and (b) contains approximately N items, even when there are additional non-primary-key conditions. At the moment, the only way I can see of doing that would be:
   B1. Choose a value k. Run k different SAMPLE N/(100*k) queries with random offsets and the additional non-PK conditions.
   B2. From the size of the combined results, estimate how much "bigger" I need to make N (call it "factor") to compensate for the non-PK conditions.
   B3. Run k different SAMPLE N*factor/k queries with random offsets.
That should give me both a more uniform distribution and a number of results close to N, but it seems complicated / cumbersome? (A rough sketch of what I mean is below.)
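To make B1-B3 concrete, here is a rough sketch against the hypothetical telemetry table from the sketch above. The non-PK condition and the numeric offsets are placeholders, and I have written the sampling clauses in the fractional SAMPLE k OFFSET m form (deriving the fraction from N and the estimated universe size), since that is the form I am sure supports an offset:

    -- B1: pilot run -- k small sub-samples with the non-PK condition,
    --     each issued with a different random offset (0.37 is just an example):
    SELECT count()
    FROM telemetry
    SAMPLE 0.0001 OFFSET 0.37
    WHERE ts >= now() - INTERVAL 7 DAY
      AND value > 0.5;                  -- placeholder non-PK condition

    -- B2: from the k pilot counts, estimate the selectivity of the non-PK
    --     condition and derive factor = 1 / selectivity.

    -- B3: k sub-samples, each covering a slice of the keyspace worth about
    --     N*factor/k rows, again with random offsets; combine the results
    --     client-side (or via UNION ALL):
    SELECT ts, source_id, value
    FROM telemetry
    SAMPLE 0.002 OFFSET 0.61            -- fraction ~ (N*factor/k) / universe size
    WHERE ts >= now() - INTERVAL 7 DAY
      AND value > 0.5;

Even if this works, it is k pilot queries plus k sampling queries plus client-side glue, which is what makes it feel cumbersome.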

Is there a better way of obtaining a good random sample of ~N elements in the presence of non-PK conditions?

Cheers,
Thomas