Hi Fangjin,
We learned a new use case and I probably need your advice.
The use case is a nested groupby. The total number of rows in the final query result is guaranteed to be small, but the inner groupby query generates an intermediate table with a huge number of rows. Here is an example.
Suppose we have a schema like the following.
D_1, D_2, D_3, .., D_k, M_1, M_2, ..., M_l
where the D_i are dimensions and the M_i are metrics.
The nested groupby works like this.
Step 1: aggregate SUM(M_1) as AggM_1, grouped by D_1, D_2, D_3:
D_1, D_2, D_3, SUM(M_1) AS AggM_1
Step 2: from the table produced in step 1, aggregate the sum of squares of AggM_1, grouped by D_1, D_2:
D_1, D_2, SUM(AggM_1^2)
The cardinalities of D_1 and D_2 are very small, so the final result size is guaranteed to be small. However, the cardinality of D_3 is large, so the intermediate table is very large.
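To make the two steps concrete, here is a minimal pure-Python sketch of the nested aggregation on toy data (the rows and values are made up for illustration; in practice both steps would run as Druid groupby queries):

```python
from collections import defaultdict

# Toy rows: (D_1, D_2, D_3, M_1). D_3 is the high-cardinality dimension.
rows = [
    ("a", "x", 1, 2.0),
    ("a", "x", 1, 3.0),
    ("a", "x", 2, 4.0),
    ("b", "y", 1, 1.0),
]

# Step 1: group by (D_1, D_2, D_3), computing AggM_1 = SUM(M_1).
# This intermediate table has one row per distinct (D_1, D_2, D_3),
# which is what blows up when D_3 has high cardinality.
inner = defaultdict(float)
for d1, d2, d3, m1 in rows:
    inner[(d1, d2, d3)] += m1

# Step 2: group the intermediate table by (D_1, D_2),
# computing SUM(AggM_1^2). The result is small because
# D_1 and D_2 have low cardinality.
outer = defaultdict(float)
for (d1, d2, _d3), agg in inner.items():
    outer[(d1, d2)] += agg ** 2

print(dict(outer))  # {('a', 'x'): 41.0, ('b', 'y'): 1.0}
```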
Queries like this are slow. I suspect it's because the inner query produces too many rows in its result. Does that relate to memory cost, and could it be improved by allocating more memory to the historical node in its configuration file? Is there any other way to improve performance?
We cannot pre-compute all the intermediate tables since there are too many of them.
Thanks!
Qi