Dimension and values


Nitin Mane

May 15, 2013, 10:49:17 AM
to druid-de...@googlegroups.com
Hi,

    I have 20 dimensions, and some of them will have no value or a null (Java null) value. My questions are:
     1)   What is the impact of having empty as well as null dimension values?
     2)   How are null-valued fields treated?

   Also, I have a few dimensions whose values are comma-separated strings. Is regex an appropriate approach for querying such dimensions?

Thanks and Regards,
 -- N iTi N

Eric Tschetter

May 15, 2013, 12:42:13 PM
to druid-de...@googlegroups.com
Just to be clear, null handling in Druid is a little inconsistent right now, but we are working on making it completely consistent.  The answers below are what the end result of our endeavors will be, but might not reflect what actually happens right now.  In general, if you care about null as a meaningful value, we recommend working around the current consistency issues by assigning it a real value of some sort.  If null is equivalent to "it doesn't exist" then I believe the current handling is generally acceptable.
 
    I have 20 dimensions and some of those won't have any value or null (java's null) value, so my questions are 
     1)   What is the impact of having empty as well as null dimension values??

Empty string and null are considered the same thing by Druid.  The thinking behind this is that I cannot think of a meaningful semantic difference between the empty string and null for a dimension value.  If "" needs to mean something, I generally feel that should be done by filling out the string with a meaningful identifier rather than by overloading the empty string.  This also makes our handling of nulls consistent between numbers and Strings: numbers do not have a way of representing an "empty" number, so it doesn't necessarily make sense for Strings to.

I know that there might be disagreement here, and I'm open to hearing practical use cases for the other direction (which could change my opinion), but the above is my religion ;).

 
     2)   How are the null value fields treated ??

null fields are treated as just another "token" and you should be able to filter by using '{ "type": "selector", "dimension": "dim", "value": null}' in a filter.  Note that this filter sometimes works and sometimes doesn't right now (there is a rule to when it does and doesn't work, but it's so complex and tied into internal implementation details that it is best to consider it magic and random ;) ).
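As a rough illustration of the workaround suggested above (assigning null a real value before ingestion), here is a minimal Python sketch. The placeholder value "unknown" and the helper names normalize_dimensions and selector_filter are hypothetical and not part of Druid; only the selector filter's JSON shape comes from this thread.

```python
import json

# Hypothetical placeholder; pick any value that cannot collide with real data.
PLACEHOLDER = "unknown"


def normalize_dimensions(event, dimensions, placeholder=PLACEHOLDER):
    """Return a copy of the event with null/empty dimension values replaced."""
    out = dict(event)
    for dim in dimensions:
        value = out.get(dim)
        if value is None or value == "":
            out[dim] = placeholder
    return out


def selector_filter(dimension, value):
    """Build a selector filter in the JSON shape shown in this thread."""
    return {"type": "selector", "dimension": dimension, "value": value}


event = {"timestamp": "2013-05-15T00:00:00Z", "country": "", "device": None, "cat": "iab4"}
clean = normalize_dimensions(event, ["country", "device", "cat"])

# Filter on the placeholder instead of relying on null handling.
print(json.dumps(clean))
print(json.dumps(selector_filter("device", PLACEHOLDER)))
```

With the data normalized this way, a filter on the placeholder behaves like a filter on any ordinary value and does not depend on how null is handled.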

 
   Also I've few dimensions which are having values in comma separated string form so to query such dimension , is using regex appropriate approach ??

That depends on whether you consider the comma-delimited list a single value or each item in the list an individual value.  If it is the former, then keeping it as a comma-delimited list and using regex or search style filters works.  A search style filter would look something like  {"type": "search", "dimension": "cat", "query": {"type": "insensitive_contains", "value": "iab25"}}.

If you want to treat each of the entities between the commas as a separate token, you can do that as well.  If you are using the JSON input, format the field as a JSON array of individual values instead of a comma-delimited String; each value is then treated individually and you can filter the same way as if it wasn't multi-valued.
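A minimal Python sketch of that conversion, assuming the events are prepared as JSON before ingestion. The helper name split_multi_value and the sample event are hypothetical; the point is only that the dimension becomes a real JSON array rather than one delimited string.

```python
import json


def split_multi_value(event, dimension, sep=","):
    """Return a copy of the event with a delimited string turned into a list."""
    out = dict(event)
    raw = out.get(dimension)
    if isinstance(raw, str):
        # Drop surrounding whitespace and empty tokens such as "a,,b".
        out[dimension] = [tok.strip() for tok in raw.split(sep) if tok.strip()]
    return out


event = {"timestamp": "2013-05-15T00:00:00Z", "cat": "iab2, iab3,iab4"}
multi = split_multi_value(event, "cat")

# "cat" is now serialized as a JSON array of individual values.
print(json.dumps(multi))

# A filter on one value then has the same shape as for a single-valued dimension.
print(json.dumps({"type": "selector", "dimension": "cat", "value": "iab4"}))
```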

--Eric
 


--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-developm...@googlegroups.com.
To post to this group, send email to druid-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/9da79152-fc6f-4627-b625-9c5f73b26f16%40googlegroups.com?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Nitin Mane

May 16, 2013, 11:18:08 AM
to druid-de...@googlegroups.com
Hi Eric,


   Also I've few dimensions which are having values in comma separated string form so to query such dimension , is using regex appropriate approach ??

That depends on whether you consider the comma-delimited list a single value or each item in the list an individual value.  If it is the former, then keeping it as a comma-delimited list and using regex or search style filters works.  A search style filter would look something like  {"type": "search", "dimension": "cat", "query": {"type": "insensitive_contains", "value": "iab25"}}.

If you want to treat each of the entities between the commas as a separate token, you can do that as well.  If you are using the JSON input, format the field as a JSON array of individual values instead of a comma-delimited String; each value is then treated individually and you can filter the same way as if it wasn't multi-valued.


>> I tried both ways:
1) creating a JSON array of dim values as a string
2) creating and passing a JSON array object [net.sf.json.JSONArray] of dim values
[Also, which approach is right: a stringified JSON array or a JSONArray object?]

In both cases I try running a query like:
{
    "queryType": "groupBy",
    "dataSource": "analytics",
    "granularity": "day",
    "dimensions": ["cat"],
    "filter": {
        "type": "and",
        "fields": [
            {
                "type": "regex",
                "dimension": "cat",
                "pattern": ".*\"iab4\".*"
            }
        ]
    },
    "aggregations": [
        {"type": "longSum", "name": "serv_count", "fieldName": "serv_sum"},
        {"type": "longSum", "name": "nserv_count", "fieldName": "nserv_sum"}
    ],
    "intervals": ["2013-04-29T00:00/2013-05-16T23:00"]
}
(I also tried "pattern": ".*iab4.*".)
And the cat data looks like:
{
  "version" : "v1",
  "timestamp" : "2013-05-16T00:00:00.000Z",
  "event" : {
    "cat" : "[\"iab2\",\"iab3\",\"iab4\",\"iab5\"]",
    "serv_count" : 1,
    "nserv_count" : 0
  }
}
I got the above result by doing a groupBy on dimension cat without any filters.
But when I run a regex-based query like the one above, I get a bunch of exceptions, which I have attached separately.

>> Also, when I run a regex query on a numeric dimension, I get the correct result. The query is below:
{
    "queryType": "groupBy",
    "dataSource": "analytics",
    "granularity": "day",
    "dimensions": ["bid"],
    "filter": {
        "type": "and",
        "fields": [
            {
                "type": "regex",
                "dimension": "bid",
                "pattern": "^[0-1].[0|5|7]$"
            }
        ]
    },
    "aggregations": [
        {"type": "longSum", "name": "serv_count", "fieldName": "serv_sum"},
        {"type": "longSum", "name": "nserv_count", "fieldName": "nserv_sum"}
    ],
    "intervals": ["2013-04-29T00:00/2013-05-16T23:00"]
}
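
A side note on the bid pattern above: inside a character class the | is a literal character rather than alternation, and an unescaped . matches any character, so the pattern accepts more than values like 0.5 or 1.7. The sketch below checks this with Python's re module, which is an assumption standing in for the Java regex engine; for a pattern this simple the two should agree. The tighter pattern is only a suggestion, not something from this thread.

```python
import re

# Pattern exactly as used in the bid query above.
BID_PATTERN = re.compile(r"^[0-1].[0|5|7]$")

# Suggested tighter variant (an assumption about the intent): literal dot, no "|".
TIGHTER_PATTERN = re.compile(r"^[01]\.[057]$")


def matches(pattern, value):
    """True if the pattern is found in the value."""
    return pattern.search(value) is not None


for candidate in ["0.5", "1.7", "1x|", "2.5", "0.55"]:
    print(candidate, matches(BID_PATTERN, candidate), matches(TIGHTER_PATTERN, candidate))
```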

Empty string and null are considered the same thing by Druid.  The thinking behind this is that I cannot think of a meaningful semantic difference between the empty string and null for a dimension value.  If "" needs to mean something, I generally feel that should be done by filling out the string with a meaningful identifier rather than by overloading the empty string.  This also makes our handling of nulls consistent between numbers and Strings: numbers do not have a way of representing an "empty" number, so it doesn't necessarily make sense for Strings to.

I know that there might be disagreement here, and I'm open to hearing practical use cases for the other direction (which could change my opinion), but the above is my religion ;).

  For the moment I will let you keep your religion ;) but I will definitely challenge it if I come across an arguable use case :)

Thanks and Regards,
-- NiTiN
regex-exception-1.txt
regex-exception-2.txt

Fangjin Yang

May 16, 2013, 2:26:00 PM
to druid-de...@googlegroups.com
Hi Nitin, I just responded to your other email without realizing you answered my questions already in this one. Let me look through the logs and see if it is possible to recreate these problems and I will get back to you. 

Thanks!
FJ



Fangjin Yang

May 16, 2013, 4:57:15 PM
to druid-de...@googlegroups.com
Hi Nitin,

Based on the structure of the result of your cat data, it would appear dimension values for the cat dimension are single value strings of the form:
"[\"iab2\",\"iab3\",\"iab4\",\"iab5\"]"

I believe this means that the regex you may want is actually .*\\\"iab4\\\".*
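
To make the escaping layers concrete, here is a small Python sketch that decodes the JSON-encoded value and both JSON-encoded patterns discussed in this thread, then tests them. It uses Python's json and re modules as stand-ins for Druid's JSON parsing and Java regex matching (an assumption), so it only checks the pattern logic and says nothing about the exceptions in the attached logs.

```python
import json
import re

# The dimension value as it appears in the groupBy result (JSON-encoded text).
stored = json.loads(r'"[\"iab2\",\"iab3\",\"iab4\",\"iab5\"]"')

# The pattern from the original query, as written in the JSON query body.
pattern_a = json.loads(r'".*\"iab4\".*"')

# The triple-backslash form, as it would be written in the JSON query body.
pattern_b = json.loads(r'".*\\\"iab4\\\".*"')

print(stored)     # the decoded value, one string containing literal quotes
print(pattern_a)  # the regex the engine would receive from the first form
print(pattern_b)  # the regex the engine would receive from the second form

# An escaped quote in a regex still matches a literal quote, so after JSON
# decoding both forms describe the same set of strings.
print(re.search(pattern_a, stored) is not None)
print(re.search(pattern_b, stored) is not None)
```

Under these assumptions both forms match the decoded value, which suggests that if the query still fails, the cause is more likely in how the value was ingested or in the query path than in the escaping.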

Let me know if this works.

Thanks!
FJ

Nitin Mane

May 17, 2013, 2:54:44 AM
to druid-de...@googlegroups.com
FJ,

Based on the structure of the result of your cat data, it would appear dimension values for the cat dimension are single value strings of the form:
"[\"iab2\",\"iab3\",\"iab4\",\"iab5\"]"
I believe this means that the regex you may want is actually .*\\\"iab4\\\".*
Let me know if this works.

The regex .*\\\"iab4\\\".* does not work. It is interesting to see how Druid's regex feature seems to treat number-only values and mixed string+number values differently: I have a dimension with string values such as at=6,10 or at=8,9,11, and regex works fine for those.
It also works fine for price fields. But within the single-value strings above, even a regex for just the number (e.g. 4 for iab4) does not work.

Thanks and Regards,
-- N iTi N

Nitin Mane

May 17, 2013, 4:09:40 AM
to druid-de...@googlegroups.com
Also, which approach is correct for building a JSON array for a multi-valued dimension?
 1) stringified json array (like gson.toJson(String[]))
 2) JSONArray object from net.sf or gson libs (JSONArray.fromObject(String[]))

Thanks and Regards,
--- N iTi N



Fangjin Yang

May 17, 2013, 1:14:19 PM
to druid-de...@googlegroups.com
Hi Nitin,

If your data is in JSON format, you should be able to specify a multi-valued dimension like so:
{"ts":"2012-01-01", "dim" : ["dimVal1", "dimVal2"]}
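
The difference between a real JSON array and a stringified one shows up in the serialized event. The following Python sketch (standard json module only; the cat values are taken from this thread, the variable names are made up) prints both forms. The stringified form reproduces the escaped single string seen in the groupBy result earlier in the thread, while the nested form matches the array format shown above.

```python
import json

cats = ["iab2", "iab3", "iab4", "iab5"]

# Nested array: the dimension is serialized as a JSON array of values.
as_array = json.dumps({"ts": "2012-01-01", "cat": cats})

# Stringified array: the list is encoded to text first, then embedded as one string.
as_string = json.dumps({"ts": "2012-01-01", "cat": json.dumps(cats, separators=(",", ":"))})

print(as_array)
print(as_string)

# After parsing, only the first form yields a list of individual values.
print(type(json.loads(as_array)["cat"]).__name__)
print(type(json.loads(as_string)["cat"]).__name__)
```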

For your cat dimension, are iab1 to iabn all individual dimension values?
Btw, can you just do a regex for .*iab4.*?

Let me know.

Thanks!
FJ

