Hi Eric,
Also, a few of my dimensions have values stored as comma-separated strings. To query such a dimension, is a regex an appropriate approach?
That depends on whether you consider the comma-delimited list a single value or each item in the list an individual value. If it is the former, then keeping it as a comma-delimited list and using regex or search style filters works. A search style filter would look something like {"type": "search", "dimension": "cat", "query": {"type": "insensitive_contains", "value": "iab25"}}.
If you want to treat each of the entities between the commas as a separate token, you can do that as well. If you are using the JSON input, then instead of a comma-delimited string, format it as a JSON array of individual values; Druid will then treat each value individually and you can filter the same way as if the dimension weren't multi-valued.
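The two encodings described above can be illustrated with a quick sketch. This is only an illustration of the input-row shapes, not a full Druid ingestion spec; the field names are the ones used in this thread:

```python
import json

# 1) Single comma-delimited string: Druid sees ONE dimension value,
#    so you need regex/search-style filters to match items inside it.
single_value_event = {
    "timestamp": "2013-05-16T00:00:00.000Z",
    "cat": "iab2,iab3,iab4,iab5",
    "serv_count": 1,
    "nserv_count": 0,
}

# 2) JSON array of individual values: Druid treats each element as a
#    separate value of a multi-valued dimension, so a plain filter on
#    "iab4" works with no regex needed.
multi_value_event = {
    "timestamp": "2013-05-16T00:00:00.000Z",
    "cat": ["iab2", "iab3", "iab4", "iab5"],
    "serv_count": 1,
    "nserv_count": 0,
}

print(json.dumps(multi_value_event))
```

Note the difference is in the input row itself: a JSON string containing commas versus a real JSON array.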
>> I tried both ways:
1) creating a JSON array of the dimension values as a string
2) creating and passing a JSON array object [net.sf.json.JSONArray] of the dimension values
[Also, which approach is correct: the stringified JSON array or the JSONArray object?]
but in both cases, when I try running a query like:
{
  "queryType": "groupBy",
  "dataSource": "analytics",
  "granularity": "day",
  "dimensions": ["cat"],
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "regex",
        "dimension": "cat",
        "pattern": ".*\"iab4\".*" // or "pattern": ".*iab4.*"
      }
    ]
  },
  "aggregations": [
    {"type": "longSum", "name": "serv_count", "fieldName": "serv_sum"},
    {"type": "longSum", "name": "nserv_count", "fieldName": "nserv_sum"}
  ],
  "intervals": ["2013-04-29T00:00/2013-05-16T23:00"]
}
And the cat data looks like:
{
  "version" : "v1",
  "timestamp" : "2013-05-16T00:00:00.000Z",
  "event" : {
    "cat" : "[\"iab2\",\"iab3\",\"iab4\",\"iab5\"]",
    "serv_count" : 1,
    "nserv_count" : 0
  }
}
I got the above result by doing a groupBy on the cat dimension without using any filters.
But when I do a regex-based query like the one above, I get a bunch of exceptions; I have attached those in a separate file.
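A side note on the two pattern variants in the query above: the exceptions themselves are in the attached file, but it is worth checking what each JSON-escaped pattern decodes to, and that both should match the stored stringified-array value shown in the event. A quick sketch (demonstrated in Python; Java regex behaves the same way for these constructs):

```python
import json
import re

# The dimension value as stored above: a stringified array, i.e. one
# long string, not a real multi-valued dimension.
stored = '["iab2","iab3","iab4","iab5"]'

# The bytes written in the query file, and what the JSON parser
# actually hands to the regex filter:
raw_json = '".*\\"iab4\\".*"'      # the literal text in the query
pattern = json.loads(raw_json)     # decodes to: .*"iab4".*

assert pattern == '.*"iab4".*'
assert re.match(pattern, stored)   # matches the quoted token
assert re.match('.*iab4.*', stored)  # the unescaped variant also matches
```

So both pattern variants are plausible matches against this value; the exceptions likely come from somewhere other than the pattern text itself.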
>> Also, when I run a number-based regex query I get the correct result. The query is below:
{
  "queryType": "groupBy",
  "dataSource": "analytics",
  "granularity": "day",
  "dimensions": ["bid"],
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "regex",
        "dimension": "bid",
        "pattern": "^[0-1].[0|5|7]$"
      }
    ]
  },
  "aggregations": [
    {"type": "longSum", "name": "serv_count", "fieldName": "serv_sum"},
    {"type": "longSum", "name": "nserv_count", "fieldName": "nserv_sum"}
  ],
  "intervals": ["2013-04-29T00:00/2013-05-16T23:00"]
}
Empty string and null are considered the same thing by Druid. The thinking behind this is that I cannot think of a meaningful semantic difference between the empty string and null for a dimension value. I generally feel that building in a semantic difference by overloading "" to mean something should instead be done by filling out the string with a meaningful identifier. This also keeps our handling of nulls consistent between numbers and Strings: numbers have no way of representing an "empty" number, so it doesn't necessarily make sense for Strings to, either.
I know that there might be disagreement here, and I'm open to hearing practical use cases for the other direction (which could change my opinion), but the above is my religion ;).
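The suggestion above of "filling out the string with a meaningful identifier" could be sketched as a small normalization step before ingestion. The sentinel name "unknown" here is an arbitrary choice for this example, not anything Druid prescribes:

```python
# Since Druid treats "" and null as the same thing, map both to an
# explicit sentinel rather than overloading "" to carry meaning.
def normalize_dim(value, sentinel="unknown"):
    if value is None or value == "":
        return sentinel
    return value

assert normalize_dim(None) == "unknown"
assert normalize_dim("") == "unknown"
assert normalize_dim("iab4") == "iab4"
```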
For now I will let you keep your religion ;) but I will definitely revisit it if something arguable comes up on this issue :)
Thanks and Regards,
-- NiTiN