Druid - Using Timestamps in Aggregations

1,023 views
Skip to first unread message

Bassel Dagher

unread,
Jan 13, 2015, 4:34:33 PM1/13/15
to druid-de...@googlegroups.com
Hello everyone.

I am a rookie using Druid and before start I would to thank in advance any one that can help me with this doubts, I would appreciate any kind of help.


### First Problem

I am trying to use the MIN aggregator
 
 { "type" : "min", "name" : <output_name>, "fieldName" : <metric_name>

with timestamps, like for example

 { "type" : "min", "name" : "timestamp", "fieldName" : "timestamp" 

However, it is always returning zero, so I would like to know if there is another way to get the min or max of a timestamp? 


### Second Problem:

I am trying to compare timestamps with some dynamic parameters inside a javascript aggregator function, like for example

"aggregations": [
                {
                    "type": "javascript",
                    "name": "sessionsInsideRange",
                    "fieldNames": [
                        "timestamp"
                    ],
                    "fnAggregate": "function newVisitor(current, timestamp) {
                        if (timestamp >= param_start_range && timestamp <= param_end_range) {
                            return current + 1;
                        } else {
                            return current + 0;
                        }
                    }",
                    "fnCombine": "function combinePartialResults(partialA, partialB) {
                        return partialA + partialB;
                    }",
                    "fnReset": "function reset() {
                        return 0;
                    }"
                }
            ]

However the result of sessionsInsideRange is always returning zero (I already made sure that the intervals for this query are making sense). Therefore, it is something that I am doing wrong or missing here or perhaps is there any workaround. Moreover, what would be the format of param_start_range and param_end_range it would be this correct YYYY-MM-ddTHH:mm:ss

Thanks again for all your help, support and time.

ste...@activitystream.com

unread,
Jan 13, 2015, 4:53:51 PM1/13/15
to druid-de...@googlegroups.com
Hi,

This is from an earlier discussion on this topic:

The dimension extraction functions should apply just fine to the "__time" dimension in 0.6 on all but "realtime" segments.  The realtime indexing code doesn't know about the "__time" dimension in 0.6 (that's added in 0.7) so it won't actually return that column.  This will mean that on 0.6 you will get results for everything that has been persisted (note, this doesn't mean hand off, it just mean that it has been persisted to local disk at least).

Regards,
 -Stefan

basdag

unread,
Jan 13, 2015, 5:56:22 PM1/13/15
to druid-de...@googlegroups.com
Hi Stefan.

First, thanks for taking the time and send me this information I appreciated! Moreover, I tried the use of __time on the dimension extraction functions but I did not got the expected result just zeros. It would be possible to see a really simple example of what you had in mind. 

Thanks again for everything!

------

Fangjin Yang

unread,
Jan 13, 2015, 7:50:23 PM1/13/15
to druid-de...@googlegroups.com
Hi Basdag,

If you try:

"aggregations": [
                {
                    "type": "javascript",
                    "name": "sessionsInsideRange",
                    "fieldNames": [
                        "__time"
                    ],
                    "fnAggregate": "function newVisitor(current, timestamp) {
                        if (timestamp >= param_start_range && timestamp <= param_end_range) {
                            return current + 1;
                        } else {
                            return current + 0;
                        }
                    }",
                    "fnCombine": "function combinePartialResults(partialA, partialB) {
                        return partialA + partialB;
                    }",
                    "fnReset": "function reset() {
                        return 0;
                    }"
                }
            ]

does that make a difference?

basdag

unread,
Jan 14, 2015, 5:32:01 PM1/14/15
to druid-de...@googlegroups.com
Hi Fangjin,

Thanks for all your help, it did work the example provided. Moreover, I would like to highlight that __time is coming in the form of UNIX TimeStamp.

Thanks again everyone!

-----------------------------------------------

Charles Allen

unread,
Jan 14, 2015, 8:45:25 PM1/14/15
to druid-de...@googlegroups.com
I don't think the original time format is saved once it has been ingested. All values return as milliseconds since epoch iirc (as opposed to seconds as per unix timestamp)

basdag

unread,
Jan 15, 2015, 1:43:18 PM1/15/15
to druid-de...@googlegroups.com
Hi Charles,

Yes you are right, times returned by Druid are in milliseconds since epoch time instead of seconds.

----------

Agustin Schapira

unread,
Jan 30, 2015, 1:06:35 PM1/30/15
to druid-de...@googlegroups.com
Hi Fangjin, Charles, Basdag,

I have tried running pretty much the exact same query as above, but __time doesn't seem to be set at all. The query is:

{
"dataSource": "operational_events",
"queryType": "timeseries",
"intervals": [
"2015-01-01T00:00:00/2015-01-01T01:00:00"
],
"granularity": {
"type": "period",
"period": "PT1M",
"origin": "2015-01-01T00:00:00"
},
"aggregations": [
   {
       "type": "javascript",
       "name": "sessionsInsideRange",
       "fieldNames": [
           "__time"
       ],
       "fnAggregate": "function X(current, timestamp) { if (timestamp > 1) { return current + 1; } else { return current + 0;} }",
       "fnCombine": "function combinePartialResults(partialA, partialB) { return partialA + partialB; }",
       "fnReset": "function reset() { return 0;}"
   }
]
}

The result, for every minute, is 0. Notice that I am comparing timestamp with 1 --I get the same result when the check is timestamp !== null ... and I do get a valid count when the check is for timestamp === null, which shows that the Javascript aggregator itself is working. I also add a rowCount aggregator and its results make sense. 

It's as if the __time column didn't exist (I get the same result if I use "foobar" or other non-existent dimension). However, the segment query does report the existence of the __time column:

{
"queryType":"segmentMetadata",
"dataSource":"operational_events",
"intervals":["2015-01-01T00/2015-01-01T01"],
"toInclude":{ "type": "list", "columns":["__time"]}
}

===> 

[ {
  "id" : "operational_events_2015-01-01T00:00:00.000Z_2015-01-01T01:00:00.000Z_2015-01-01T00:01:27.514Z",
  "intervals" : [ "2015-01-01T00:00:00.000Z/2015-01-01T01:00:00.000Z" ],
  "columns" : {
    "__time" : {
      "type" : "LONG",
      "size" : 480,
      "cardinality" : null,
      "errorMessage" : null
    }
  },
  "size" : 5238
} ]


There are no errors in the logs on the historical nodes.  This is using v. 0.6.152.

Any thoughts? Does anything stand out to you? 

Thanks much,

+ Agustin


Fangjin Yang

unread,
Feb 3, 2015, 1:25:03 AM2/3/15
to druid-de...@googlegroups.com
Hi Agustin, I think there were some bugs with this that we fixed in later versions. Can you try to reproduce the problem using 0.6.172?

Xavier Léauté

unread,
Feb 3, 2015, 2:13:20 AM2/3/15
to druid-de...@googlegroups.com
Hi Augustin, the following query works with recent versions of Druid.
I don't see a reason why yours wouldn't, but see if this one works for you.

{
    "intervals": [
        "2015-01-01T00:00:00.000+00:00/2015-01-05T00:00:00.000+00:00"
    ],
    "aggregations": [
        {
            "type": "count",
            "name": "cnt"
        },
        {
            "type": "javascript",
            "fieldNames": [
                "__time"
            ],
            "name": "aggtime",
            "fnReset": "\nfunction reset() { return 0 }\n",
            "fnCombine": "\nfunction combine(a,b) { return a + b }\n",
            "fnAggregate": "function aggregate(current, time) {\nif(time > 1420243200000) {\nreturn(current + 1);\n} else {\nreturn(current);\n}\n}"
        }
    ],
    "dataSource": "test",
    "granularity": "all",
    "queryType": "timeseries"
}


--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-developm...@googlegroups.com.
To post to this group, send email to druid-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/6466f9b9-6dc9-4585-bb43-13d50add225c%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Agustin Schapira

unread,
Feb 3, 2015, 3:17:15 PM2/3/15
to druid-de...@googlegroups.com
Thanks, Fangjin, Xavier

I will try 0.6.172.    On 0.6.171 it definitely doesn't work. I tried a slightly modified version of Xavier's query, and these are the results

[ {
  "timestamp" : "2015-02-03T18:22:00.000Z",
  "result" : {
    "aggtime" : 0.0,
    "cnt" : 101
  }
} ]

This the exact query that I am running:

{
    "intervals": [
        "2015-02-01T00:00:00.000+00:00/2015-02-05T00:00:00.000+00:00"
    ],
    "aggregations": [
        {
            "type": "count",
            "name": "cnt"
        },
        {
            "type": "javascript",
            "fieldNames": [
                "__time"
            ],
            "name": "aggtime",
            "fnReset": "function reset() { return 0 }",
            "fnCombine": "function combine(a,b) { return a + b }",
            "fnAggregate": "function aggregate(current, time) {if(time > 1) {return(current + 1);} else {return(current);}}"
        }
    ],
    "dataSource": "operational_events_temp",

Xavier Léauté

unread,
Feb 3, 2015, 7:46:16 PM2/3/15
to druid-de...@googlegroups.com
Hi Austin, are you querying a historical segment or a realtime segment?
Real-time segments currently don't support querying the time column.
Do you mind filing a github issue if that is the case?

Thanks,
Xavier

--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-developm...@googlegroups.com.
To post to this group, send email to druid-de...@googlegroups.com.

Agustin Schapira

unread,
Feb 4, 2015, 4:20:26 AM2/4/15
to druid-de...@googlegroups.com
Thanks, Xavier. I am pretty sure I am querying a historical segment. For completeness sake, 

- here's the ingest spec (created for this test)

  {
    "type" : "index_realtime",
    "schema": {
      "dataSource": "operational_events_temp",
      "aggregators": [
        {
          "type": "count",
          "name": "rowCount"
        }
      ],
      "indexGranularity": "minute",
      "shardSpec": {
        "type": "none"
      }
    },
    "config": {
      "maxRowsInMemory": 500000,
      "intermediatePersistPeriod": "PT10m"
    },
    "firehose": {
      "type": "kafka-0.8",
      "consumerProps": {
        "group.id": "xxx",
        "zookeeper.connect": "xxxx",
        "zookeeper.connection.timeout.ms": "15000",
        "zookeeper.session.timeout.ms": "15000",
        "zookeeper.sync.time.ms": "5000",
        "fetch.message.max.bytes": "1048586",
        "auto.offset.reset": "largest",
        "auto.commit.enable": "true"
      },
      "feed": "operational_events",
      "parser": {
        "timestampSpec": {
          "column": "timestamp",
          "format": "millis"
        },
        "data": {
          "format": "json"
        },
        "dimensionExclusions": [
          "timestamp"
        ]
      }
    },
    "plumber": {
      "type": "realtime",
      "windowPeriod": "PT10m",
      "segmentGranularity": "hour"
    }
  }

-- here's a segment query for the segment I am trying to query

{
"queryType":"segmentMetadata",
"dataSource":"operational_events_temp",
"intervals":["2015-02-03T18:00:00.000+00:00/2015-02-03T19:00:00.000+00:00"],
"toInclude":{ "type": "list", "columns":["__time"]}
}

==> 
[ {
  "id" : "operational_events_temp_2015-02-03T00:00:00.000Z_2015-02-04T00:00:00.000Z_2015-02-03T18:22:46.586Z",
  "intervals" : [ "2015-02-03T00:00:00.000Z/2015-02-03T23:59:00.000Z" ],
  "columns" : {
    "__time" : {
      "type" : "LONG",
      "size" : 2880,
      "cardinality" : null,
      "errorMessage" : null
    }
  },
  "size" : 36971
} ]


and the query

{
    "intervals": [
        "2015-02-03T18:00:00.000+00:00/2015-02-03T19:00:00.000+00:00"
    ],
    "aggregations": [
        {
            "type": "count",
            "name": "cnt"
        },
        {
            "type": "javascript",
            "fieldNames": [
                "__time"
            ],
            "name": "aggtime",
            "fnReset": "function reset() { return 0 }",
            "fnCombine": "function combine(a,b) { return a + b }",
            "fnAggregate": "function aggregate(current, time) {if(time > 1) {return(current + 1);} else {return(current);}}"
        }
    ],
    "dataSource": "operational_events",
    "granularity": "all",
    "queryType": "timeseries"
}

==> 

[ {
  "timestamp" : "2015-02-03T18:22:00.000Z",
  "result" : {
    "aggtime" : 0.0,
    "cnt" : 22
  }
} ]


This is against 0.6.171 --haven't tried 0.6.172 yet.

I'll file a github issue. Thanks!

+ A

Agustin Schapira

unread,
Feb 4, 2015, 4:27:37 AM2/4/15
to druid-de...@googlegroups.com
Filed github issue: https://github.com/druid-io/druid/issues/1086

Thanks,

+ A

Xavier Léauté

unread,
Feb 4, 2015, 3:25:46 PM2/4/15
to druid-de...@googlegroups.com
This has been resolved, please see GitHub issue https://github.com/druid-io/druid/issues/1086 for more information.

--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-developm...@googlegroups.com.
To post to this group, send email to druid-de...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages