Forward fill column values when NULL


Dan Walters

Feb 12, 2020, 9:25:59 PM
to ClickHouse
Hi all-

I'm looking for some guidance on how to forward fill missing data, basically an "if NULL, use the previous row's value" rule, applied recursively.

The neighbor() function works if just one row is missing data (at least within a block) - but sometimes I have several rows of NULLs.  As an example, I can generate some dummy data:

$ clickhouse-client -q "SELECT
  t,
  v1
FROM (
  WITH
    toDateTime('2020-01-01 12:10:00') AS start_datetime
  SELECT
    addMinutes(start_datetime, number) AS t,
    nullIf(toMinute(t), 0) AS v1
  FROM system.numbers
  WHERE number NOT IN (3, 4)
  LIMIT 5
) ORDER BY t WITH FILL STEP 60"

2020-01-01 12:10:00    10
2020-01-01 12:11:00    11
2020-01-01 12:12:00    12
2020-01-01 12:13:00    \N
2020-01-01 12:14:00    \N
2020-01-01 12:15:00    15
2020-01-01 12:16:00    16

So the 12:13 and 12:14 rows are NULL. To forward fill from the last non-NULL value at 12:12, neighbor() fills in 12:13 fine, but not 12:14:

$ clickhouse-client -q "SELECT
  t,
  coalesce(v1, neighbor(v1, -1)) AS v1
FROM (
  SELECT
    t,
    v1
  FROM (
    WITH
      toDateTime('2020-01-01 12:10:00') AS start_datetime
    SELECT
      addMinutes(start_datetime, number) AS t,
      nullIf(toMinute(t), 0) AS v1
    FROM system.numbers
    WHERE number NOT IN (3, 4)
    LIMIT 5
  ) ORDER BY t WITH FILL STEP 60
)"

2020-01-01 12:10:00    10
2020-01-01 12:11:00    11
2020-01-01 12:12:00    12
2020-01-01 12:13:00    12
2020-01-01 12:14:00    \N
2020-01-01 12:15:00    15
2020-01-01 12:16:00    16

There may be several rows of NULLs, not just 2, and I need to forward fill several (10-20) columns.  Given the column-oriented storage model, I can't help but think there has to be a simple way to do this that I'm missing?  Maybe something using runningAccumulate()?

Any help is much appreciated!

Thanks,
-Dan

Артем Зуйков

Feb 13, 2020, 2:14:39 AM
to Dan Walters, ClickHouse
Hi Dan

If you have enough memory you can use ASOF JOIN for this task.
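A minimal sketch of the idea (assuming the same test_data table with columns t and v1; note that ClickHouse's ASOF JOIN also requires at least one equality condition, so a constant key is used here):

```sql
-- Sketch only: for each t, take v1 from the latest row at or before t
-- where v1 is not NULL. The constant key k satisfies ASOF JOIN's
-- requirement of at least one equality condition.
SELECT a.t, b.v1
FROM
(
    SELECT 1 AS k, t FROM test_data
) AS a
ASOF LEFT JOIN
(
    SELECT 1 AS k, t, v1
    FROM test_data
    WHERE v1 IS NOT NULL
) AS b ON a.k = b.k AND a.t >= b.t
ORDER BY a.t
```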

On Thu, Feb 13, 2020 at 05:26, Dan Walters <walt...@gmail.com> wrote:
--
Артем

Dan Walters

Feb 13, 2020, 8:12:21 AM
to ClickHouse
Hi, thanks for the reply!  I’m playing with ASOF JOIN now, and it works well for just one column.  The thing is I have some queries with ~10-20 columns that I’m hoping to forward fill, and each column may need to go back a different number of rows to find its last known value - most often it's just 1 row, but sometimes it's 100.  I guess I can do an ASOF JOIN query for each column and UNION the results, but for a large number of columns it gets pretty crazy.

I've looked at the source for neighbor() and it looks like it'd be pretty easy to implement a fillForward() function with similar limitations. e.g., if the first value in a block is NULL, it'd just have to be set to a default (or maybe backfilled). Does that seem like a reasonable thing to implement?

I thought there might be a way to accomplish this with runningAccumulate() and maybe anyLast(), since the data is ordered in the subquery, but I can't seem to figure it out...

Dan Walters

Feb 13, 2020, 9:23:33 AM
to ClickHouse
OK, I think I finally figured it out with runningAccumulate()!  Also seems to perform well on a large query.

For the benefit of anyone else looking to forward fill several columns, the pattern is:

SELECT *
FROM test_data

┌───────────────────t─┬───v1─┬───v2─┐
│ 2020-01-01 12:10:00 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │
│ 2020-01-01 12:11:00 │   56 │   34 │
│ 2020-01-01 12:12:00 │   57 │   35 │
│ 2020-01-01 12:13:00 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │
│ 2020-01-01 12:14:00 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │
│ 2020-01-01 12:15:00 │ ᴺᵁᴸᴸ │   38 │
│ 2020-01-01 12:16:00 │   61 │ ᴺᵁᴸᴸ │
│ 2020-01-01 12:17:00 │   62 │   40 │
│ 2020-01-01 12:18:00 │   63 │   41 │
│ 2020-01-01 12:19:00 │   64 │ ᴺᵁᴸᴸ │
│ 2020-01-01 12:20:00 │ ᴺᵁᴸᴸ │   43 │
│ 2020-01-01 12:21:00 │   66 │   44 │
└─────────────────────┴──────┴──────┘


SELECT
    t,
    runningAccumulate(v1State) AS v1,
    runningAccumulate(v2State) AS v2
FROM
(
    SELECT
        t,
        anyLastState(v1) AS v1State,
        anyLastState(v2) AS v2State
    FROM test_data
    GROUP BY t
    ORDER BY t ASC
)

┌───────────────────t─┬─v1─┬─v2─┐
│ 2020-01-01 12:10:00 │  0 │  0 │
│ 2020-01-01 12:11:00 │ 56 │ 34 │
│ 2020-01-01 12:12:00 │ 57 │ 35 │
│ 2020-01-01 12:13:00 │ 57 │ 35 │
│ 2020-01-01 12:14:00 │ 57 │ 35 │
│ 2020-01-01 12:15:00 │ 57 │ 38 │
│ 2020-01-01 12:16:00 │ 61 │ 38 │
│ 2020-01-01 12:17:00 │ 62 │ 40 │
│ 2020-01-01 12:18:00 │ 63 │ 41 │
│ 2020-01-01 12:19:00 │ 64 │ 41 │
│ 2020-01-01 12:20:00 │ 64 │ 43 │
│ 2020-01-01 12:21:00 │ 66 │ 44 │
└─────────────────────┴────┴────┘

