Can I (pre)aggregate the result of this select somehow?


kriticar

Oct 14, 2019, 8:26:54 AM
to ClickHouse
Hi,

I have a select that counts how many customerIds from the previous 15-minute period do not exist in the current 15-minute period.
Something like this:

select toDate(subtractMinutes(toDateTime('2019-10-14 13:15:00'), 15)) startOf15Min, count() nr_of_custs_not_in_next_interval
from (
        select distinct customerId from default.indicators_15_min
        where startOf15MinDay = toDate(subtractMinutes(toDateTime('2019-10-14 13:15:00'), 15))
        and startOf15Min = subtractMinutes(toDateTime('2019-10-14 13:15:00'), 15)
        and customerId not in (
            select distinct customerId from default.indicators_15_min
            where startOf15MinDay = toDate('2019-10-14 13:15:00') -- partition by is startOf15MinDay
            and startOf15Min = toDateTime('2019-10-14 13:15:00')
            )
        )


Because default.indicators_15_min is a huge distributed table, can I somehow pre-aggregate its result for each 15-minute period of every day?

For now, I am using cron every 15 minutes to execute a select similar to the one above, with a delay of 5 hours to be sure all records have reached the database.

Maybe there is a better solution.
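The logic the query above implements is a plain set difference between two intervals' customer sets. A minimal Python sketch (with hypothetical sample data, not ClickHouse code) of what is being counted:

```python
# Customers observed in each 15-minute interval (hypothetical sample data)
prev_interval = {"A", "B", "C"}   # 13:00:00 - 13:15:00
curr_interval = {"A", "C", "D"}   # 13:15:00 - 13:30:00

# Customers from the previous interval that are absent from the current one
missing = prev_interval - curr_interval
print(len(missing))   # the count the outer select returns
```

This is the quantity the `not in` subquery computes per pair of adjacent intervals.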

Regards.

Denis Zhuravlev

Oct 14, 2019, 1:11:29 PM
to ClickHouse

kriticar

Oct 15, 2019, 3:15:47 AM
to ClickHouse
Can you please be a little bit more precise?
How can uniqState help with the customerId not in (next 15-minute period) condition?

What would help is an aggregation that pre-computes the unique customerIds for each 15-minute interval, in order to speed up that not in condition.

Or am I missing something?

Regards.

kriticar

Oct 15, 2019, 3:24:47 AM
to ClickHouse
It would help if I could check whether all customerIds in one 15-minute period are (as a whole) a subset of all customerIds in the next 15-minute period, and if not, which ones are missing from the next period.

kriticar

Oct 15, 2019, 5:09:41 AM
to ClickHouse
I have created an example:

insert into default.missing_custs
(startOf15Min, customerId)
values
('2019-10-15 07:30:00', 'A'),
('2019-10-15 07:30:00', 'B'),
('2019-10-15 07:45:00', 'A'),
('2019-10-15 07:45:00', 'C'),
('2019-10-15 08:00:00', 'A'),
('2019-10-15 08:00:00', 'C'),
('2019-10-15 08:00:00', 'D'),
('2019-10-15 08:15:00', 'A')


select          *
from             (
                select     startOf15MinNew startOf15Min
                , groupUniqArrayMerge(prevUniqCustomerIds) prevUniqCustomerIds
                , groupUniqArrayMerge(currUniqCustomerIds) currUniqCustomerIds
                , hasAll(currUniqCustomerIds, prevUniqCustomerIds) hasAllPrevInCurr
                , hasAny(currUniqCustomerIds, prevUniqCustomerIds) hasAnyPrevInCurr
                , arrayUniq(prevUniqCustomerIds) prevUniqCustIds
                , arrayUniq(currUniqCustomerIds) currUniqCustIds
                from (
                                    SELECT startOf15Min startOf15MinNew
                                        ,groupUniqArrayState(customerId) prevUniqCustomerIds
                                    FROM default.missing_custs
                                    where startOf15MinDay = today()
                                    GROUP BY startOf15Min
                                    ORDER BY startOf15Min
                    )
                any left join         (
                                    SELECT subtractMinutes(startOf15Min, 15) startOf15MinNew
                                    ,groupUniqArrayState(customerId) currUniqCustomerIds
                                    FROM default.missing_custs
                                    where startOf15MinDay = today()
                                    GROUP BY startOf15Min
                                    ORDER BY startOf15Min
                ) using startOf15MinNew
                group by startOf15Min
                )
order by        startOf15Min




For the first row I need to know how many prevUniqCustomerIds do not exist in currUniqCustomerIds, which is 1 (B does not exist in [A, C]).
I also need all the elements that do not exist in curr, which for the first row is B.

For the second row these two values would be 0 (all previous [A, C] exist in curr [D, A, C]) and an empty list of missing elements.

For the third row, from [D, A, C] only A exists in curr, so the count of missing customers is 2 (D and C), and the missing elements are [D, C].

Table missing_custs is huge.
I am looking for a way to aggregate everything as fast as possible.
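The expected per-row results above can be checked with a small Python sketch over the inserted sample rows (interval labels shortened for readability; this only illustrates the set logic, not the ClickHouse query):

```python
from collections import defaultdict

# Sample rows from the insert above: (interval start, customerId)
rows = [
    ("07:30", "A"), ("07:30", "B"),
    ("07:45", "A"), ("07:45", "C"),
    ("08:00", "A"), ("08:00", "C"), ("08:00", "D"),
    ("08:15", "A"),
]

by_interval = defaultdict(set)
for interval, cust in rows:
    by_interval[interval].add(cust)

# For each consecutive pair of intervals: which previous customers are missing?
intervals = sorted(by_interval)
missing = {
    curr: sorted(by_interval[prev] - by_interval[curr])
    for prev, curr in zip(intervals, intervals[1:])
}
print(missing)
```

The result matches the three rows described: B missing in the first transition, nothing missing in the second, and [C, D] missing in the third.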

Regards.

kriticar

Oct 15, 2019, 7:13:33 AM
to ClickHouse
The closest I have come is this:

select          *
from             (
                select     addMinutes(startOf15MinNew, 15) startOf15Min
                , groupUniqArrayMerge(prevCustomerIds) prevCustomerIds
                , groupUniqArrayMerge(currCustomerIds) currCustomerIds
                , hasAll(currCustomerIds, prevCustomerIds) hasAllPrevInCurr
                , arrayUniq(prevCustomerIds) prevCustIdsCnt
                , arrayUniq(currCustomerIds) currCustIdsCnt
                , arrayIntersect(prevCustomerIds, currCustomerIds) commonElements
                , length(commonElements) commonElementsCnt
                , prevCustIdsCnt - commonElementsCnt missCustomerIdsCnt

                from (
                                    SELECT startOf15Min startOf15MinNew
                                        ,groupUniqArrayState(customerId) prevCustomerIds

                                    FROM default.missing_custs
                                    where startOf15MinDay = today()
                                    GROUP BY startOf15Min
                                    ORDER BY startOf15Min
                    )
                any left join         (
                                    SELECT subtractMinutes(startOf15Min, 15) startOf15MinNew
                                    ,groupUniqArrayState(customerId) currCustomerIds

                                    FROM default.missing_custs
                                    where startOf15MinDay = today()
                                    GROUP BY startOf15Min
                                    ORDER BY startOf15Min
                ) using startOf15MinNew
                group by startOf15Min
                )
order by        startOf15Min   


But instead of commonElements, I need the elements from prevCustomerIds that do not exist in currCustomerIds.
I don't see a ClickHouse function that would be the opposite of arrayIntersect.
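For reference, ClickHouse can express a set difference from its existing building blocks even without a dedicated opposite of arrayIntersect: arrayFilter(x -> not has(currCustomerIds, x), prevCustomerIds) keeps exactly the elements of prev that are absent from curr (arrayFilter and has are standard ClickHouse array functions). The same logic in Python, as a sketch:

```python
def array_difference(prev, curr):
    """Elements of prev that are not present in curr --
    the counterpart of arrayIntersect (order of prev preserved)."""
    return [x for x in prev if x not in curr]

# Third row of the example: prev [D, A, C], curr [A]
print(array_difference(["D", "A", "C"], ["A"]))
```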

Regards.

Denis Zhuravlev

Oct 15, 2019, 7:47:22 PM
to ClickHouse
If we had 1100 uniqs for today and yesterday in total, and today we had only 1000 uniqs, then today we got 100 new uniqs.

create table z(d Date, z String, u String)
Engine=MergeTree partition by tuple() order by tuple();

CREATE MATERIALIZED VIEW mvz
ENGINE = AggregatingMergeTree order by (z,d) settings index_granularity = 8 
as select d, z,uniqState(u) as us from z group by z,d;

insert into z select today()-1, 'g1' , toString(number) from numbers(1000);
insert into z select today(), 'g1' , toString(number+100) from numbers(1000);

select d, uniqMerge(us) from mvz group by d
┌──────────d─┬─uniqMerge(us)─┐
│ 2019-10-14 │          1000 │
│ 2019-10-15 │          1000 │
└────────────┴───────────────┘

select uniqMerge(us) from mvz
┌─uniqMerge(us)─┐
│          1100 │
└───────────────┘


select (select uniqMerge(us) from mvz where d in (today(), today()-1)) - (select uniqMerge(us) from mvz where d in (today()))
┌─minus(_subquery23945, _subquery23946)─┐
│                                   100 │
└───────────────────────────────────────┘

The same result as yours
select count() from z where d=today() and u not in (select u from z where d=today()-1)
┌─count()─┐
│     100 │
└─────────┘
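The arithmetic behind this trick -- uniq over both days minus uniq over one day yields the count unique to the other day -- can be sanity-checked in Python with the same synthetic data:

```python
# Mirrors the two inserts: numbers(1000) yesterday, numbers(1000)+100 today
yesterday = {str(n) for n in range(1000)}      # '0' .. '999'
today = {str(n + 100) for n in range(1000)}    # '100' .. '1099'

total_uniq = len(yesterday | today)            # uniqMerge over both days
only_yesterday = total_uniq - len(today)       # the subtraction in the query
not_in_yesterday = len(today - yesterday)      # the "not in" count
print(total_uniq, only_yesterday, not_in_yesterday)
```

Both counts happen to be 100 here because the two days overlap symmetrically; in general the subtraction gives the count unique to whichever day is left out of the second subquery.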

kriticar

Oct 16, 2019, 6:06:57 AM
to ClickHouse
Yes, this works, but how will I know which prevCustomerIds were not carried over to the current period?

Denis Zhuravlev

Oct 16, 2019, 9:10:03 AM
to ClickHouse
it's the same but backward.

kriticar

Oct 16, 2019, 9:22:24 AM
to ClickHouse
I don't need a number, I need customerIds like A, B...

Denis Zhuravlev

Oct 16, 2019, 9:48:30 AM
to ClickHouse
Then AggregatingMergeTree is not applicable. Use not in.

kriticar

Oct 21, 2019, 4:12:23 AM
to ClickHouse
Denis,

in the example you provided, you have hard-coded today() and today()-1 in:

select (select uniqMerge(us) from mvz where d in (today(), today()-1)) - (select uniqMerge(us) from mvz where d in (today()))

If I have to provide the number of unique customers between two successive intervals (in my case the intervals are 15-minute starts) for historical data, I need something like a runningUnique over a window of two 15-minute periods (half an hour), from which I would subtract the current number of uniques.

In my case a row for a certain 15-minute start can arrive a few hours later, so I need a kind of moving window (Python terminology) of half an hour plus a uniq function.

Does ClickHouse have a solution for such a purpose?
I cannot find any.

Regards.

Denis Zhuravlev

Oct 21, 2019, 10:03:20 PM
to ClickHouse
I don't know. I think CH does not have such a solution.

Maybe something like this:

select m1, m2 from (
SELECT
        groupArray(d) AS gd,
        arrayMap(x -> toString(gd[x+1])||' - '||toString(gd[x+2]), range(toUInt64(length(gd)-1))) m1,
        groupArray(us) AS gus,
        arrayMap(x -> (arrayReduce('uniqMerge', [gus[x+1]+gus[x+2]]) - arrayReduce('uniqMerge', [gus[x+2]])) , range(toUInt64(length(gd)-1))) m2
          from (select d, us FROM mvz  order by d ) )
    Array Join m1, m2

kriticar

Oct 22, 2019, 8:23:14 AM
to ClickHouse
Very nice solution, but I cannot execute it due to the memory limit.
Thanks anyway. By the way, very nice usage of the array functions.

Best regards.

kriticar

Oct 22, 2019, 9:29:19 AM
to ClickHouse
The problem is that I have a raw table on each node, as well as its aggregating table that keeps pre-aggregated states.
Now I need a distributed table that points to the aggregated tables with pre-aggregated states on all nodes.

The subselect in your select above has to collect all pre-aggregated states from all nodes, and as the tables are quite big, it hits the memory limit (with 10 GB per node, 10 servers, 5 shards, and the last 5 hours of data).

Anyway, as an idea, I really like it.

Best regards.

Denis Zhuravlev

Oct 23, 2019, 9:41:07 AM
to ClickHouse
Fixed version: select d, uniqMergeState(us) us FROM mvz group by d order by d

select m1, m2 from (
SELECT
        groupArray(d) AS gd,
        arrayMap(x -> toString(gd[x+1])||' - '||toString(gd[x+2]), range(toUInt64(length(gd)-1))) m1,
        groupArray(us) AS gus,
        arrayMap(x -> (arrayReduce('uniqMerge', [gus[x+1]+gus[x+2]]) - arrayReduce('uniqMerge', [gus[x+2]])) , range(toUInt64(length(gd)-1))) m2
          from (select d, uniqMergeState(us) us FROM mvz group by d  order by d ) )
    Array Join m1, m2
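The sliding pairwise window that the arrayMap builds -- for each adjacent pair of periods, uniq over both periods minus uniq over the second one -- can be sketched in Python (with hypothetical per-period sets, mirroring the earlier sample data):

```python
# Unique-customer sets per period (hypothetical sample data)
periods = {
    "07:30": {"A", "B"},
    "07:45": {"A", "C"},
    "08:00": {"A", "C", "D"},
    "08:15": {"A"},
}

labels = sorted(periods)
# m2 in the query: uniq(prev + curr) - uniq(curr) for each adjacent pair,
# i.e. how many of prev's customers did not appear in curr
lost = {
    f"{prev} - {curr}": len(periods[prev] | periods[curr]) - len(periods[curr])
    for prev, curr in zip(labels, labels[1:])
}
print(lost)
```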

kriticar

Oct 25, 2019, 3:16:57 AM
to ClickHouse
If you are selecting from (select d, uniqMergeState(us) us FROM mvz group by d order by d), then arrayReduce('uniqMerge', [gus[x+1]+gus[x+2]]) will not work anymore, because you no longer have the uniq of two periods: the uniques have already been merged per d upfront.

The idea would be to merge the states of two consecutive periods and subtract the merged state of the second period.

Regards.

Denis Zhuravlev

Oct 25, 2019, 10:40:59 PM
to ClickHouse
Probably we have some misunderstanding.

In my examples I use a period equal to one day, and my aggregating table groups by day (order by day).
It does not really matter what you use as the dimension; you can use days, 15-minute intervals, or oranges. Just replace group by day with group by orange.