EC2: 1 x i3.large node (2 cores, 15.25 GB RAM, 1 x 0.475 TB NVMe SSD)
Dataset: 3 months of data, 1.41 billion rows, Table size ~100GB on disk, Engine: MergeTree
ClickHouse server: v1.1.54383
OS: Ubuntu 16.04
SELECT uniqExact(user_id) AS uniques FROM test.merge_tree_logs

┌──uniques─┐
│ 32287758 │
└──────────┘

1 rows in set. Elapsed: 232.710 sec. Processed 1.41 billion rows, 61.75 GB (6.06 million rows/s., 265.33 MB/s.)
SELECT uniq(user_id) AS uniques FROM test.merge_tree_logs

┌──uniques─┐
│ 32273105 │
└──────────┘

1 rows in set. Elapsed: 90.572 sec. Processed 1.41 billion rows, 61.75 GB (15.57 million rows/s., 681.73 MB/s.)
> SELECT uniqExact(user_id) AS uniques FROM test.merge_tree_logs

To solve this query, CH will read all 100GB. If your system can read 1GB/s, the query execution time will be ~100s.
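The on-disk size behind this back-of-envelope estimate can be checked on the server itself. A sketch (the `bytes_on_disk` column name is from newer ClickHouse releases; on old 1.1.x builds the column may simply be `bytes`):

```sql
-- Estimate full-scan time from the table's size on disk.
SELECT
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk,
    sum(bytes_on_disk) / 1e9 AS est_seconds_at_1GBps
FROM system.parts
WHERE database = 'test' AND table = 'merge_tree_logs' AND active;
```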
b. Is this a real requirement or just a speed test? In real life I would create an aggregating MV and pre-calculate these numbers. If you store counts by day (or month, hour, ...), you only need to sum a couple of numbers, which takes ~0.0 seconds.
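The MV suggestion can be sketched like this, assuming the log table has an `event_time` column (the view and column names here are illustrative, not from the thread):

```sql
-- Pre-aggregate unique users per day into an AggregatingMergeTree.
-- uniqState stores an intermediate sketch instead of a final number.
CREATE MATERIALIZED VIEW test.uniques_daily
ENGINE = AggregatingMergeTree
ORDER BY day
AS SELECT
    toDate(event_time) AS day,
    uniqState(user_id) AS uniques_state
FROM test.merge_tree_logs
GROUP BY day;

-- At query time, merge a handful of per-day states
-- instead of scanning 1.41 billion rows.
SELECT uniqMerge(uniques_state) AS uniques
FROM test.uniques_daily;
```

Because the states are mergeable, the same view also answers per-day, per-month, or whole-range questions with a `GROUP BY` over `day`.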
> 100GB is the size of the whole table. Query output is 'Processed 1.41 billion rows, 61.75 GB'.

Ah, yes, my mistake. CH reads only one column (compressed) to solve this query. Check the system.columns table (columns data_compressed_bytes / data_uncompressed_bytes); data_uncompressed_bytes will match 61.75 GB.
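The per-column sizes mentioned here can be read directly from system.columns (database/table/column names as used in the thread):

```sql
-- Compare on-disk (compressed) vs processed (uncompressed) bytes
-- for the single column the query actually reads.
SELECT
    name,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.columns
WHERE database = 'test'
  AND table = 'merge_tree_logs'
  AND name = 'user_id';
```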
> Another thing I don't understand is the difference in processing speed between uniq and uniqExact functions: 265.33 MB/s vs 681.73 MB/s. 681.73 MB/s is faster than max disk read speed, so what does it actually show?

The MB/s figure is computed from uncompressed bytes, while the disk only has to deliver the compressed column, so it can legitimately exceed raw disk throughput.