count distinct

Hui Peng Hu

unread,

Nov 25, 2011, 11:05:27 PM11/25/11

to programthecloud

Hi,

Suppose that I have a collection with schema [:col1, :col2, :col3].
What's the best way to do the equivalent of
SELECT COUNT(DISTINCT(col2)) FROM table GROUP BY col1;

Thanks.

Hui Peng Hu

unread,

Nov 25, 2011, 11:22:21 PM11/25/11

to programthecloud

Peter Alvaro

unread,

Nov 26, 2011, 11:59:40 AM11/26/11

to program...@googlegroups.com

what is it that you're trying to do? you want to group by a column that's not included in the output? won't the output be hard to interpret (a row for each distinct col1 value, but unlabeled)?

Yoriyasu Yano

unread,

Nov 26, 2011, 5:30:25 PM11/26/11

to program...@googlegroups.com

We want something like this, except want it to only count the distinct :d

class B

include Bud

state do

scratch :t, [:a, :d, :c]

scratch :b, [:a, :dp]

end

bloom :test do

b <= t.group([:a], count(:d))

end

b = B.new

b.t <+ [[1, "a", "a"], [1, "a", "c"], [1, "b", "d"]]

b.tick

puts b.b.inspected

This outputs:

b: [

(1, 3)]

but we want:

b: [

(1, 2)]

since theres only 2 unique values for :d, "a" and "b"

--

Yoriyasu Yano

University of California, Berkeley

BA Cognitive Science and Computer Science

Peter Alvaro

unread,

Nov 26, 2011, 6:53:19 PM11/26/11

to program...@googlegroups.com

oh -- it was the "group by col1" bit of the SQL query that confused me. you are just asking how to do count distinct.

one workaround would be to use a scratch to project away the irrelevant columns, and use set semantics to de-dup the column before counting over it.

scratch :c, [:a, :d]

c <= t {|tee| [tee.a, tee.d]}

b <= c.group([:a], count(:d))

long term, bud should probably have a countdistinct aggregate.

Joseph Hellerstein

unread,

Nov 27, 2011, 2:11:54 AM11/27/11

to program...@googlegroups.com

When you can't find a built-in agg you like, you can always use Ruby's reduce.

Something like this should work for count distinct (though I agree a built-in would be nice):

scratch :b_vals, [:grp, :vals]

# collect the distinct vals in a hash for each group
b_vals <= t.reduce({}) { |memo,o| memo[o.a] ||= {}; memo[o.a][o.d]=true; memo }
# count the distinct vals
b <= b_vals {|t| [t.grp, t.vals.length]}

Reply all

Reply to author

Forward