Sessionizing Clickstream Log

866 views
Skip to first unread message

Владислав Денисов

unread,
Feb 26, 2017, 1:02:09 PM2/26/17
to ClickHouse
What is the best way to extract session from clickstream table?

It is possible in Postgres using window functions: http://randyzwitch.com/sessionizing-log-data-sql/

I've tried to define session's start using runningDifference:
SELECT
    event_time,
    runningDifference(event_time),
    (runningDifference(event_time) > 900) OR (runningDifference(event_time) <= 0) AS is_new,
    device_id
FROM apps_screens
ORDER BY
    device_id ASC,
    event_time ASC


┌──────────event_time─┬───diff─┬─is_new─┬─device─┐
│ 2016-12-08 10:35:40 │      0 │      1 │ id1    │
│ 2016-12-08 10:35:53 │     13 │      0 │ id1    │
│ 2017-01-22 09:02:19 │      0 │      1 │ id2    │
│ 2017-01-22 09:02:29 │     10 │      0 │ id2    │
│ 2017-01-22 09:03:11 │     42 │      0 │ id2    │
│ 2017-01-22 09:03:59 │     48 │      0 │ id2    │
│ 2017-01-26 16:57:05 │      0 │      1 │ id2    │
│ 2017-01-26 16:57:32 │     27 │      0 │ id2    │


How can I add column with session start time (is_new = 1) to all (is_new = 0) rows before next 1?

Is there a better way to get session id?

Alex Zatelepin

unread,
Mar 2, 2017, 2:54:01 PM3/2/17
to ClickHouse
There no window functions at the moment in ClickHouse.

Actually, I cannot think of any way to assign session id to the events using only ClickHouse functions. The solution suggested in the article (use running sum of is_new as session_id) looks elegant though, maybe we should add runningSum() function.

Alex Zatelepin

unread,
Mar 6, 2017, 8:52:21 AM3/6/17
to ClickHouse
Turns out there is a (slightly ugly) way to assign session ids by calculating the running sum of is_new column using the undocumented runningAccumulate() function. It operates on AggregationStates, so an additional subselect is needed to run a GROUP BY event_time, device_id. Here goes:

SELECT
    device_id,
    event_time,
    runningAccumulate(s) AS session_id
FROM
(
    SELECT
        device_id,
        event_time,
        sumState(is_new) AS s
    FROM
    (
        SELECT
            event_time,
            (runningDifference(event_time) > 900) OR (runningDifference(event_time) <= 0) AS is_new,
            device_id
        FROM apps_screens
        ORDER BY
            device_id ASC,
            event_time ASC
    )
    GROUP BY
        device_id,
        event_time
    ORDER BY
        device_id ASC,
        event_time ASC
)

┌─device_id─┬──────────event_time─┬─session_id─┐
│ id1       │ 2016-12-08 10:35:40 │          1 │
│ id1       │ 2016-12-08 10:35:53 │          1 │
│ id2       │ 2017-01-22 09:02:19 │          2 │
│ id2       │ 2017-01-22 09:02:29 │          2 │
│ id2       │ 2017-01-22 09:03:11 │          2 │
│ id2       │ 2017-01-22 09:03:59 │          2 │
│ id2       │ 2017-01-26 16:57:05 │          3 │
│ id2       │ 2017-01-26 16:57:32 │          3 │
└───────────┴─────────────────────┴────────────┘


Vladislav Denisov

unread,
Mar 6, 2017, 9:02:20 AM3/6/17
to ClickHouse
That's it! Thanks!

понедельник, 6 марта 2017 г., 16:52:21 UTC+3 пользователь Alex Zatelepin написал:
Reply all
Reply to author
Forward
0 new messages