count distinct

2 views
Skip to first unread message

Hui Peng Hu

unread,
Nov 25, 2011, 11:05:27 PM11/25/11
to programthecloud
Hi,

Suppose that I have a collection with schema [:col1, :col2, :col3].
What's the best way to do the equivalent of
SELECT COUNT(DISTINCT(col2)) FROM table GROUP BY col1;

Thanks.

Hui Peng Hu

unread,
Nov 25, 2011, 11:22:21 PM11/25/11
to programthecloud

Peter Alvaro

unread,
Nov 26, 2011, 11:59:40 AM11/26/11
to program...@googlegroups.com
what is it that you're trying to do?  you want to group by a column that's not included in the output?  won't the output be hard to interpret (a row for each distinct col1 value, but unlabeled)? 

Yoriyasu Yano

unread,
Nov 26, 2011, 5:30:25 PM11/26/11
to program...@googlegroups.com
We want something like this, except want it to only count the distinct :d

class B
  include Bud

  state do
    scratch :t, [:a, :d, :c]
    scratch :b, [:a, :dp]
  end

  bloom :test do
    b <= t.group([:a], count(:d))
  end
end

b = B.new
b.t <+ [[1, "a", "a"], [1, "a", "c"], [1, "b", "d"]]
b.tick
puts b.b.inspected

This outputs:
b: [
  (1, 3)]
but we want:
b: [
  (1, 2)]
since theres only 2 unique values for :d, "a" and "b"
--
Yoriyasu Yano
University of California, Berkeley
BA Cognitive Science and Computer Science

Peter Alvaro

unread,
Nov 26, 2011, 6:53:19 PM11/26/11
to program...@googlegroups.com
oh -- it was the "group by col1" bit of the SQL query that confused me.  you are just asking how to do count distinct.

one workaround would be to use a scratch to project away the irrelevant columns, and use set semantics to de-dup the column before counting over it.

scratch :c, [:a, :d]

c <= t {|tee| [tee.a, tee.d]}
b <= c.group([:a], count(:d))


long term, bud should probably have a countdistinct aggregate.

Joseph Hellerstein

unread,
Nov 27, 2011, 2:11:54 AM11/27/11
to program...@googlegroups.com
When you can't find a built-in agg you like, you can always use Ruby's reduce.

Something like this should work for count distinct (though I agree a built-in would be nice):

scratch :b_vals, [:grp, :vals]

# collect the distinct vals in a hash for each group
b_vals <= t.reduce({}) { |memo,o| memo[o.a] ||= {}; memo[o.a][o.d]=true; memo }
# count the distinct vals
b <= b_vals {|t| [t.grp, t.vals.length]}

Reply all
Reply to author
Forward
0 new messages