Nulls, missing values, empty strings, empty lists, empty lists with empty strings

542 views
Skip to first unread message

Fangjin Yang

unread,
Nov 26, 2014, 1:41:20 PM11/26/14
to druid-de...@googlegroups.com
Hey guys,

There have been several discussions about this topic and we've never come to some kind of agreement. There's also an outstanding PR to make this behavior a bit more consistent: https://github.com/metamx/druid/issues/665

The behavior defined in the PR is:
  • for multi-value dimensions, if the dimension has multiple values:

    1. dim extraction returns null -> null
    2. missing column -> null
    3. [] -> []
    4. [""] -> [null]
    5. ["", "a"] -> [null, "a"]
  • for single value dimensions

    1. missing column -> null
    2. dim extraction returns null -> null
    3. [""] -> null
    4. [] -> null

How does everyone feel about this definition?

Fangjin Yang

unread,
Nov 26, 2014, 1:46:48 PM11/26/14
to druid-de...@googlegroups.com

Eric Tschetter

unread,
Nov 26, 2014, 2:49:19 PM11/26/14
to druid-de...@googlegroups.com
Fwiw, I think that

multi-value:

null -> null
[] -> null

is much more realistic in terms of implementability. One piece of
evidence towards that is that the PRs linked to actually implement
[]->null, not []->[].

--Eric
> --
> You received this message because you are subscribed to the Google Groups
> "Druid Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to druid-developm...@googlegroups.com.
> To post to this group, send email to druid-de...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/druid-development/39a4de58-f391-4d9c-8891-a29ff3c989f0%40googlegroups.com.
>
> For more options, visit https://groups.google.com/d/optout.

Xavier

unread,
Nov 26, 2014, 3:03:55 PM11/26/14
to druid-de...@googlegroups.com
The PRs are not directly related to the issue, they were created to address one particular use case, but before we started to define things in #665.
We can close those PRs and define what we want to do first, and then we'll create new PRs to address what we've decided.

Gian Merlino

unread,
Nov 26, 2014, 3:24:58 PM11/26/14
to druid-de...@googlegroups.com
Being blissfully unaware of what is and isn't easy to implement in querying, my opinion from the ingestion side of things is that life is simpler if (a) null and missing behave the same; and (b) single-element arrays and that single element behave the same. So ideally JSONs in the same bullet below should behave the same when ingested:

- {"x":null} and {"x":[null]} and {}
- {"x":""} and {"x":[""]}
- {"x":"a"} and {"x":["a"]}
- {"x":[null, "a"]} and {"x":["a"]}

I'm not a fan of "" and null being the same thing, but I could live with it. Meaning ideally I think JSONs in the same bullet below should behave *differently* but I could live with them behaving the same:

- {"x":null} and {"x":""}
- {"x":[null]} and {"x":[""]}
- {} and {"x":""}
- {"x":[null, "a"]} and {"x":["", "a"]}

Eric Tschetter

unread,
Nov 26, 2014, 3:29:25 PM11/26/14
to druid-de...@googlegroups.com
> - {"x":[null, "a"]} and {"x":["a"]}

Can you elaborate more on this one?

It seems like you are throwing away the fact that there was a null.
In all other cases, you are preserving null, but in this one you are
determining that the null doesn't matter.

--Eric

Gian Merlino

unread,
Dec 1, 2014, 1:29:21 PM12/1/14
to druid-de...@googlegroups.com
That comes from deciding that nulls mean "nothing", so nulls and missing values should behave the same. So if you believe that makes sense, you actually do want to throw away the fact that there is a null.

FWIW I think it's not super important which behavior things actually have as long as it is useful (people find it easy to get the query results they want) and coherent (query behavior depends only on each individual incoming JSON and not things internal to the indexer like whether a column ends up single or multi-valued). I think you could also find a coherent set of behavior that treats nulls-inside-lists differently from top-level nulls (maybe nulls-inside-lists are treated like empty strings, but top level nulls are treated the same as missing values) if that would make more sense to people.

I think the things that are interesting in terms of 'is the behavior useful?' and 'is the behavior coherent?' are missing values, empty lists, top-level nulls, top-level empty strings, nulls-inside-lists, and empty-strings-inside-lists. My opinion is for missing values and all nulls to behave the same (be ignored, basically), for empty strings to be their own thing, for empty lists to be their own thing, and for singleton lists to behave the same as the thing inside the list. But I'm OK with anything that works for people.

Fangjin Yang

unread,
Dec 1, 2014, 2:31:33 PM12/1/14
to druid-de...@googlegroups.com
I'm fine with what was described in the initial post and 

null -> null 
[] -> null 

because I think it is an easier story to tell people and understand versus having to consider different null and missing scenarios and what they mean when you query.

Charles Allen

unread,
Dec 1, 2014, 5:44:38 PM12/1/14
to druid-de...@googlegroups.com
Imagine this behavior:

X = ['a','b','c']

Y = [ ]

Z = null

Depending on which language you are referring to, the following may yield different things:

'c' NOT IN X : false
'c' NOT IN Y : true
'c' NOT IN Z : null (false)

The big annoying one is that last one, 'c' NOT IN Z when Z is null

What this ask is saying is that 'c' NOT IN Z would return TRUE since null and [ ] would be the same.

I'm all for that because 9/10 that's what the query-er is wanting for the results. This simply would need to be very clear.

Eric Tschetter

unread,
Dec 1, 2014, 5:56:47 PM12/1/14
to druid-de...@googlegroups.com

Charles, I don't think I understand.  Which filter operator are you referring to with "NOT IN Z"?

--Eric
--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-developm...@googlegroups.com.
To post to this group, send email to druid-de...@googlegroups.com.

Charles Allen

unread,
Dec 1, 2014, 6:54:40 PM12/1/14
to druid-de...@googlegroups.com
@Eric, it is not necessarily one that would be common.

The most straightforward one would be the regex filter.

In general though, anytime there is a set, it would need to be understood that druid would make no distinction between an empty set and a null value.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-development+unsubscribe@googlegroups.com.
To post to this group, send email to druid-development@googlegroups.com.

Eric Tschetter

unread,
Dec 23, 2014, 8:41:54 PM12/23/14
to druid-de...@googlegroups.com
I just created a PR to fix various inconsistencies with null handling in Druid:

https://github.com/druid-io/druid/pull/995

--Eric
>>> email to druid-developm...@googlegroups.com.
>>> To post to this group, send email to druid-de...@googlegroups.com.
> --
> You received this message because you are subscribed to the Google Groups
> "Druid Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to druid-developm...@googlegroups.com.
> To post to this group, send email to druid-de...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/druid-development/a8279b6f-44f9-4b69-be25-be82d32d6487%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages