Population of AggregatingMergeTree MV from existing table of large size


Maxim Ulanovskiy

Aug 6, 2018, 11:33:23 AM
to ClickHouse

Hello,

I'm trying to aggregate a dataset of historical user events by session from an existing ClickHouse table, but the query fails. Details below:

Test env:

EC2: 1 x i3.large node (2 cores, 15.25 GB RAM, 1 x 0.475 TB NVMe SSD)

Clickhouse-server: v 1.1.54383

OS: Ubuntu 16.04


Source table:

1.41 billion rows, size ~100GB compressed


Target MV:

Engine: AggregatingMergeTree

Query: GROUP BY on 16 columns, including metadata joined from dicts and high-cardinality columns (session / user_id). Each group is expected to aggregate 1-180 rows.


CREATE MATERIALIZED VIEW test.src_aggr_view
ENGINE = AggregatingMergeTree(date, (col1, .., col_n), 8192)
POPULATE AS
SELECT ... FROM source_table GROUP BY col1, .., col_n


Result


The query result doesn't fit into RAM, and execution fails very quickly:


↘ Progress: 17.69 million rows, 3.14 GB (294.63 thousand rows/s., 52.25 MB/s.) █▌ 1%

Received exception from server (version 1.1.54383):
Code: 173. DB::Exception: Received from 172.18.37.183:9000. DB::Exception: Allocator: Cannot mmap 128.00 MiB., errno: 12, strerror: Cannot allocate memory.

0 rows in set. Elapsed: 60.640 sec. Processed 17.69 million rows, 3.14 GB (291.80 thousand rows/s., 51.75 MB/s.)


Questions
  1. Is this a correct use case for AggregatingMergeTree: a query that groups by high-cardinality columns like session/user_id?
  2. What should the general approach be for such a task: populating a new MV with a large historical dataset that doesn't fit in memory? Is it possible at all?
  3. Does CH actually read ALL the selected rows from the source table into RAM before populating the MV?
  4. Is it possible to make CH split the read -> insert -> aggregate sequence into chunks to solve this?
Thanks,
Max

Denis Zhuravlev

Aug 8, 2018, 8:43:57 PM
to ClickHouse
Without POPULATE:

CREATE TABLE z (a Date, b Int64) ENGINE = MergeTree PARTITION BY toYYYYMM(a) ORDER BY a;
INSERT INTO z SELECT today(), number FROM numbers(1000000000);
INSERT INTO z SELECT yesterday(), number FROM numbers(1000);

CREATE TABLE mv_z_store (a Date, max_b AggregateFunction(max, Int64)) ENGINE = AggregatingMergeTree PARTITION BY toYYYYMM(a) ORDER BY a;
CREATE TABLE temp (a Date, b Int64) ENGINE = Null;
CREATE MATERIALIZED VIEW mv_z TO mv_z_store AS SELECT a, maxState(b) AS max_b FROM temp GROUP BY a;
INSERT INTO temp SELECT * FROM z;
DROP TABLE mv_z;
DROP TABLE temp;
CREATE MATERIALIZED VIEW mv_z TO mv_z_store AS SELECT a, maxState(b) AS max_b FROM z GROUP BY a;

INSERT INTO z SELECT yesterday() - 1, number FROM numbers(100);
SELECT a, maxMerge(max_b) FROM mv_z GROUP BY a;

┌──────────a─┬─maxMerge(max_b)─┐
│ 2018-08-07 │              99 │
│ 2018-08-08 │             999 │
│ 2018-08-09 │       999999999 │
└────────────┴─────────────────┘

Denis Zhuravlev

Aug 8, 2018, 8:51:22 PM
to ClickHouse
In this case [insert into temp select * from z;], the SELECT will read the data in blocks of ~1 million rows (see the max_block_size / max_insert_block_size settings) and insert it in those batches. The MV then runs its GROUP BY over each ~1 million-row block, and eventually, during the merge process, the AggregatingMergeTree engine finishes the aggregation.
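A minimal sketch of inspecting and tuning that batch size, assuming session-level settings (the value below is an example, not a recommendation):

```sql
-- Check the current batch sizes (defaults: max_block_size is ~65k rows,
-- max_insert_block_size is ~1M rows)
SELECT name, value
FROM system.settings
WHERE name IN ('max_block_size', 'max_insert_block_size');

-- Raise the insert batch size for this session, then run the feed query
SET max_insert_block_size = 10000000;
INSERT INTO temp SELECT * FROM z;
```

Larger blocks mean fewer, bigger parts written to the target table, at the cost of more memory per batch.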


tatiana....@revjet.com

Aug 8, 2018, 9:04:06 PM
to ClickHouse
1. Create the MV without POPULATE. Or better yet, create a normal table and then create your MV pointing to that table (the MV is easier to manage this way):

CREATE TABLE test.src_aggr_table (...) ENGINE = AggregatingMergeTree(date, (col1, .., col_n), 8192);
CREATE MATERIALIZED VIEW test.src_aggr_view TO test.src_aggr_table
AS SELECT ... FROM source_table GROUP BY col1, .., col_n

2. Insert the data into your table WITHOUT aggregation. Do not forget to increase max_block_size:

INSERT INTO test.src_aggr_table SELECT ... FROM source_table

The data will be aggregated (merged) in background by the AggregatingMergeTree engine.
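Since those background merges happen eventually rather than immediately, reads from the target table should still merge the states at query time. A sketch, assuming a hypothetical aggregate column max_b of type AggregateFunction(max, Int64) in test.src_aggr_table:

```sql
-- Finalize aggregate states at read time instead of relying on
-- background merges having already collapsed each group to one row
SELECT col1, maxMerge(max_b)
FROM test.src_aggr_table
GROUP BY col1;

-- Optionally force an unscheduled merge (expensive on a ~100 GB table)
OPTIMIZE TABLE test.src_aggr_table FINAL;
```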