Dedup empty and null in Druid aggregation?


jbae

Oct 8, 2013, 6:27:54 PM
to druid-de...@googlegroups.com, George Abraham
For some queries, Druid returns duplicate rows for the same dimension values with different resCounts. This happens because on one row the dimension's value is an empty string, while on the other it is null.

This splits our aggregations into two rows. I notice that once some time has passed, the rows are deduped and the result shows up on a single row; i.e., if the same query (with the same end time) is run again later, the results show up correctly.

For instance - for the query

{
    "dataSource": "customer_message_tracker",
    "dimensions": [
        {
            "type": "default",
            "dimension": "CHECKPOINT",
            "outputName": "checkpoint"
        },
        {
            "type": "default",
            "dimension": "TRACKER_MESSAGE",
            "outputName": "tracker_message"
        }
    ],
    "queryType": "groupBy",
    "orderBy": {
        "type": "default",
        "columns": [
            {
                "dimension": "resCount",
                "direction": "DESCENDING"
            }
        ],
        "limit": 20
    },
    "intervals": {
        "intervals": [
            "2013-10-08T21:00:00/2013-10-08T22:14:00"

The response was

[
    {
        "event": {
            "checkpoint": "ARCHIVED",
            "resCount": 1563374,
            "tracker_message": "SENT"
        },
        "timestamp": "2013-10-08T21:00:00.000Z"
    },
    {
        "event": {
            "checkpoint": "CONSUMED",
            "resCount": 1375112
        },
        "timestamp": "2013-10-08T21:00:00.000Z"
    },
    {
        "event": {
            "checkpoint": "PROCESSED",
            "resCount": 1349975
        },
        "timestamp": "2013-10-08T21:00:00.000Z"
    },
    {
        "event": {
            "checkpoint": "CONSUMED",
            "resCount": 398168,
            "tracker_message": ""
        },
        "timestamp": "2013-10-08T21:00:00.000Z"
    },
    {
        "event": {
            "checkpoint": "PROCESSED",
            "resCount": 391711,
            "tracker_message": ""
        },
        "timestamp": "2013-10-08T21:00:00.000Z"
    },
    {
        "event": {
            "checkpoint": "ARCHIVED",
            "resCount": 178268,
            "tracker_message": "HOLDOUT"
        },
        "timestamp": "2013-10-08T21:00:00.000Z"
    },
    {
        "event": {
            "checkpoint": "ARCHIVED",
            "resCount": 31112,
            "tracker_message": "FAILED"
        },
        "timestamp": "2013-10-08T21:00:00.000Z"
    },
    {
        "event": {
            "checkpoint": "RETRIED",
            "resCount": 58,
            "tracker_message": "SUBSCRIBER_CLIENT_EXCEPTION||"
        },
        "timestamp": "2013-10-08T21:00:00.000Z"
    },
    {
        "event": {
            "checkpoint": "ARCHIVED",
            "resCount": 10,
            "tracker_message": "FILTER"
        },
        "timestamp": "2013-10-08T21:00:00.000Z"
    }
]
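
Until null and empty string are handled consistently on the server, a client could merge the split rows itself. The following is a minimal sketch, assuming the response shape shown above and that a missing dimension key means null. The function name merge_null_and_empty is hypothetical and is not part of any Druid API; the sample rows are copied from the response.

```python
from collections import defaultdict


def merge_null_and_empty(rows, dims=("checkpoint", "tracker_message"), metric="resCount"):
    """Sum the metric across rows whose dimensions differ only by null vs ""."""
    totals = defaultdict(int)
    for row in rows:
        event = row["event"]
        # A missing key, None, and "" all collapse to "" in the grouping key.
        key = (row["timestamp"],) + tuple(event.get(d) or "" for d in dims)
        totals[key] += event[metric]

    merged = []
    for (timestamp, *values), total in totals.items():
        # Emit only the dimensions that carry a real value.
        event = {d: v for d, v in zip(dims, values) if v != ""}
        event[metric] = total
        merged.append({"event": event, "timestamp": timestamp})

    # Re-apply the descending order on the metric, as in the original query.
    merged.sort(key=lambda r: r["event"][metric], reverse=True)
    return merged


ts = "2013-10-08T21:00:00.000Z"
rows = [
    {"event": {"checkpoint": "ARCHIVED", "resCount": 1563374, "tracker_message": "SENT"}, "timestamp": ts},
    {"event": {"checkpoint": "CONSUMED", "resCount": 1375112}, "timestamp": ts},
    {"event": {"checkpoint": "PROCESSED", "resCount": 1349975}, "timestamp": ts},
    {"event": {"checkpoint": "CONSUMED", "resCount": 398168, "tracker_message": ""}, "timestamp": ts},
    {"event": {"checkpoint": "PROCESSED", "resCount": 391711, "tracker_message": ""}, "timestamp": ts},
]

merged = merge_null_and_empty(rows)
for row in merged:
    print(row["event"])
```

With these five rows, the two CONSUMED rows combine to 1773280 and the two PROCESSED rows to 1741686. Note that the "limit": 20 in the query is applied before this merge, so a client-side merge can only combine rows that made it into the response.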

We are using 0.4.32.2. If this is a known issue that is fixed in 0.5.x, please let me know and I will upgrade.

Thank you
Best, Jae

jbae

Oct 8, 2013, 6:42:02 PM
to druid-de...@googlegroups.com, George Abraham
As a quick fix, the client can define an alias for null or empty strings, such as "N/A", but I cannot push all data producers to define aliases.

One more observation: after about 5 minutes of latency, null and empty string are deduped. In other words, when the end of the time interval is NOW, Druid does not dedup empty and null.

Fangjin Yang

Oct 9, 2013, 12:36:11 PM
to druid-de...@googlegroups.com, George Abraham
Hi Jae,

Druid should treat nulls and empty strings as the same value. Are you seeing these issues on the real-time node? When you see correct results over some period of data, is that after the real-time node has handed that data off, or after some persist occurs?

Thanks,
FJ



Eric Tschetter

Oct 9, 2013, 3:18:03 PM
to druid-de...@googlegroups.com, George Abraham
Jae,

What you are seeing is a difference in the way the "realtime" index and the "persisted" index process nulls. The two have been made a lot more consistent in the current 0.5.x code. If you could try that out and see whether those fixes work for you, we could cut an actual tag and let you run with it.

--Eric

George Abraham

Oct 9, 2013, 3:48:19 PM
to Eric Tschetter, druid-de...@googlegroups.com
Hi Eric,

FYI:

After reading your post in the thread referenced below:


I set a default value for empty/null strings (e.g. "NONE").

Thanks,
/george

Eric Tschetter

Oct 9, 2013, 4:21:42 PM
to George Abraham, druid-de...@googlegroups.com
Yeah, that's the simplest fix for everything while we get null/empty-string handling sorted out.

--Eric
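
For readers who want to apply the default-value workaround discussed above, here is a minimal sketch of what it could look like on the producer side, before events are sent to Druid. The helper name fill_missing_dimensions is hypothetical, the "NONE" placeholder is the example value from this thread, and the event layout is an assumption based on the dimension names in the query.

```python
DEFAULT_VALUE = "NONE"  # example placeholder from the thread; any non-empty string would do


def fill_missing_dimensions(event, dimensions, default=DEFAULT_VALUE):
    """Return a copy of the event with null, empty, or missing dimensions set to a default."""
    cleaned = dict(event)
    for dim in dimensions:
        value = cleaned.get(dim)
        if value is None or value == "":
            cleaned[dim] = default
    return cleaned


dimensions = ["CHECKPOINT", "TRACKER_MESSAGE"]

# All three variants end up with the same TRACKER_MESSAGE value.
missing = fill_missing_dimensions({"CHECKPOINT": "CONSUMED"}, dimensions)
null = fill_missing_dimensions({"CHECKPOINT": "CONSUMED", "TRACKER_MESSAGE": None}, dimensions)
empty = fill_missing_dimensions({"CHECKPOINT": "CONSUMED", "TRACKER_MESSAGE": ""}, dimensions)
print(missing, null, empty)
```

Because every event then carries the same explicit value, the grouping no longer depends on how the realtime and persisted indexes treat null versus empty string.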