Histogram()


Ben Johnson

Apr 7, 2013, 12:49:13 AM
to sk...@googlegroups.com
I pushed support for histogram generation to the unstable branch. It's a pretty sweet aggregate function that can be used on Integer and Float types.

I wanted the histogram() function to be fast, so it does a nice little double-pass trick. When a query with a histogram is executed, a sample aggregation is performed over the first 1000 objects to select some values and generate the binning structure. The sampled data is then cleared and the query is rerun across all the data. Since all the threads share the same binning structure, their results can be merged together at the end.
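To make the double pass concrete, here is a rough Python sketch of the idea, not the actual implementation. The function names, the `bin_count` parameter, and the equal-width layout derived from the sample's min and max are assumptions for illustration; only the 1000-object sample size and the shared binning structure come from the description above.

```python
from collections import Counter

SAMPLE_SIZE = 1000  # from the post: the first 1000 objects are sampled


def build_bins(sample, bin_count=4):
    """Pass 1: derive a shared binning structure from the sample (assumed layout)."""
    lo, hi = min(sample), max(sample)  # raises ValueError on an empty sample
    width = (hi - lo) / bin_count or 1  # fall back to 1 if all values are equal
    return {"count": bin_count, "min": lo, "max": hi, "width": width}


def bin_index(bins, value):
    """Map a value to a bin, clamping out-of-range values to the edge bins."""
    idx = int((value - bins["min"]) // bins["width"])
    return max(0, min(bins["count"] - 1, idx))


def histogram(partitions, bin_count=4):
    # Pass 1: sample the first SAMPLE_SIZE values to fix the bin layout.
    sample = [v for part in partitions for v in part][:SAMPLE_SIZE]
    bins = build_bins(sample, bin_count)
    # Pass 2: each partition (standing in for a thread) counts independently
    # against the same layout...
    partials = [Counter(bin_index(bins, v) for v in part) for part in partitions]
    # ...so merging is just a per-bin sum.
    merged = sum(partials, Counter())
    return {**bins, "bins": {i: merged.get(i, 0) for i in range(bin_count)}}
```

Because every partition bins against the same min and width, the partial counts can be added without any re-bucketing, which is what makes the final merge cheap.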

This setup has the downside that a histogram over a rare event will not be very useful, since very few values will be available in the sample to create bins. I feel like it's a good first cut though.

Here's the usage in JSON query form (with accompanying results):

$ curl -X POST http://localhost:8585/tables/users/query -d '{
  "steps": [
    {"type":"selection","fields":[{"name":"myHistogram","expression":"histogram(age)"}]}
  ]
}'
# => {"myHistogram":{"count":4,"min":20,"max":80,"width":15,"bins":{"0":32,"1":40,"2":14,"3":3}}}
So this query runs a histogram over the age property and, based on the sampling, returns a histogram with 4 bins covering values from 20 to 80 years old. Each bin spans 15 years, so the returned data shows that there are thirty-two 20 - 34 year olds, forty 35 - 49 year olds, fourteen 50 - 64 year olds, and three 65 - 80 year olds. Out-of-range data gets placed into the first and last bins, although I may separate out those values.
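If you want to turn the result into labeled ranges on the client side, something like the following Python sketch would work. The `bin_ranges` helper is hypothetical, and it assumes integer values (so a bin ends one below the next bin's start) and that the last bin extends to "max", which matches the ranges described above.

```python
import json

# The response body from the example query above.
body = '{"myHistogram":{"count":4,"min":20,"max":80,"width":15,"bins":{"0":32,"1":40,"2":14,"3":3}}}'


def bin_ranges(h):
    """Return (low, high, count) for each bin, assuming integer values."""
    rows = []
    for i in range(h["count"]):
        lo = h["min"] + i * h["width"]
        # The last bin is stretched to include "max"; others end before the next bin.
        hi = h["max"] if i == h["count"] - 1 else lo + h["width"] - 1
        # Bin keys are strings in the JSON; treat a missing key as zero (assumption).
        rows.append((lo, hi, h["bins"].get(str(i), 0)))
    return rows


for lo, hi, n in bin_ranges(json.loads(body)["myHistogram"]):
    print(f"{lo} - {hi}: {n}")
# 20 - 34: 32
# 35 - 49: 40
# 50 - 64: 14
# 65 - 80: 3
```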

You can use the histogram like any other selection, so you can nest it inside a funnel analysis or select other fields alongside it.

Full details:



Ben