Hello.
Sampling works this way:
For each value of the primary key prefix that precedes the sampling key (ts in your case),
only part of the data is read from the table: a part of all possible values of the sampling key (session_id in your case).
Data in the table is ordered by the primary key ((ts, session_id) in your case).
If there are many rows (at least a hundred thousand) for a single ts, then SAMPLE can skip part of them.
But if there are only a few rows for a single ts, the data cannot be skipped.
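To make this concrete, here is a minimal sketch using your column names. The table name is made up, and the DDL uses the current ClickHouse syntax (ORDER BY / SAMPLE BY); adapt it to your schema.

```sql
-- Hypothetical table for illustration: data is ordered by (ts, session_id),
-- and session_id is declared as the sampling key.
CREATE TABLE sessions
(
    ts UInt32,
    session_id UInt64
)
ENGINE = MergeTree
ORDER BY (ts, session_id)
SAMPLE BY session_id;

-- Reads roughly 1/10 of the session_id range for each value of ts.
-- Skipping is only effective when many rows share a single ts.
SELECT count() FROM sessions SAMPLE 1/10;
```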
For comparison.
In Yandex.Metrica, one of the tables has the primary key (CounterID, EventDate, intHash32(UserID)) and the sampling key intHash32(UserID).
CounterID is the identifier of a tracked web site. Thus sampling allows reading part of the data for each day for each web site.
And sampling is efficient if the web site is large, i.e. has a lot of data for each day.
Another table is used for global reporting, and its key is just (EventDate, intHash32(UserID)).
Sampling for that table works efficiently, because we have a lot of data for each EventDate.
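A sketch of the Metrica-style layout described above (table name and column types are assumed, and the DDL uses the current ClickHouse syntax):

```sql
-- Illustrative table with the (CounterID, EventDate, intHash32(UserID)) key.
CREATE TABLE hits
(
    CounterID UInt32,
    EventDate Date,
    UserID    UInt64
)
ENGINE = MergeTree
ORDER BY (CounterID, EventDate, intHash32(UserID))
SAMPLE BY intHash32(UserID);
```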
And keep in mind that when you use SAMPLE, the sampling key must be read from the table. If the sampling key is not otherwise needed by the query, sampling can even make it slower.
Also note that for good results the sampling key must be uniformly distributed across the whole range of its data type.
Look at intHash32(UserID) as a sampling key: with it, sampling gives a uniformly pseudorandom sample of all possible users.
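For example, a query like the following (against the hypothetical Metrica-style table above; the CounterID value is made up) already reads UserID, so applying SAMPLE costs almost no extra I/O:

```sql
-- intHash32(UserID) must be read to apply SAMPLE, but UserID is needed
-- by uniq() anyway, so sampling here is essentially free.
SELECT uniq(UserID)
FROM hits SAMPLE 1/10
WHERE CounterID = 34 AND EventDate = today();
```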
How to take advantage of sampling in your case?
You may use a rounded timestamp instead of the exact one in the primary key (it may be stored in a separate column).
Round it so that you have millions of rows for each rounded value.
For the sampling key, use something like a hash of the full ts and session_id. It is better to precalculate this hash and store it in a separate column, because otherwise it will be calculated every time sampling is used.
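One possible layout, just as a sketch: all names are illustrative, hourly rounding and cityHash64 are assumptions, and you should pick a rounding granularity that gives millions of rows per rounded value.

```sql
-- ts_rounded: ts rounded so each value covers millions of rows (assumed hourly).
-- sample_hash: precalculated hash of (ts, session_id), used as the sampling key.
CREATE TABLE sessions
(
    ts          DateTime,
    ts_rounded  DateTime DEFAULT toStartOfHour(ts),
    session_id  UInt64,
    sample_hash UInt64   DEFAULT cityHash64(ts, session_id)
)
ENGINE = MergeTree
ORDER BY (ts_rounded, sample_hash)
SAMPLE BY sample_hash;
```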
As this is quite involved, you should decide whether you really need sampling.
Note that the primary key does not have to be unique in ClickHouse, so there is no need to add the full ts and session_id to the primary key.