generalize distinct

Eugen Dueck

unread,

Feb 21, 2010, 7:25:01 AM2/21/10

to Clojure

Hi,

Clojure is great! The gain in productivity from more low level
languages like Java, but also more functional languages like Ruby and
Common LISP etc. amazes me every day. Like how adding a simple "map"
in front of the count here:
(count colls)
changes the code from counting the number of collections to
enumerating the number of items in those collections. It's all these
little things that take one minute or 5 in Java, but only 5 seconds in
Clojure (in case you are a slow typer) that make you so much more
efficient. With clojure, I spend most of my time thinking about
problems at a conceptual level, rather than at the implementation
level (Java: int[] vs. Iterable<Integer> vs. List<Integer> vs.
Integer[] etc. - arrg!). Thanks Rich!

That said, every now and then I come across a function in clojure.core
that could be vastly improved in its range of applicability by just
adding one or more optional parameters, defaulting to the currently
hard-coded values (like the often implicit 'identity' function). Take
'distinct'. I'd like to be able to specify the keyfn, that's an
important one for me, and while we're at it, I'd like to pass in the
set that that function builds up incrementally, and normally starts
with an empty set, so that I can pre-initialize it, say with #{ nil },
so that it also filters out nils.

This 2nd point is not that important, and I'm not sure if it is that
great an idea in terms of "orthogonality" of the functions, as we have
filter. But you'd get rid off one level of indirection. distinct
basically gives us a filter for free.

But the 1st point I think certainly makes sense, and we have a couple
of other fns in clojure that have a variant with a keyfn parameter,
like sort-by etc. I guess they normally get a different name.

I'm not so sure about what the best order of parameters is in clojure,
and named parameters would make this a no brainer, but this is what I
currently use, the first parameter being coll, at least in the variant
with only one parameter that makes it a drop-in replacement candidate
for distinct:

(defn distinkt
"Returns a lazy sequence of the elements of coll with duplicates
removed"
([coll] (distinkt coll identity))
([coll keyfn] (distinkt coll keyfn #{}))
([coll keyfn seen-items]
(let [step (fn step [xs seen]
(lazy-seq
((fn [[f :as xs] seen]
(when-let [s (seq xs)]
(let [key (keyfn f)]
(if (contains? seen key)
(recur (rest s) seen)
(cons f (step (rest s) (conj seen key)))))))
xs seen)))]
(step coll seen-items))))

I don't mind writing - or better - copy-and-pasting this code and
keeping it in my project, but I think it could be useful for others,
so I wouldn't mind at all if this makes it into clojure.core... :)

Or is the reason for hard coding an (implicit) 'identity' performance?
Or did I miss some other way to achieve the same goal?

Eugen

Wilson MacGyver

unread,

Feb 21, 2010, 9:53:36 PM2/21/10

to clo...@googlegroups.com

Any reason why you can't use
distinct?

http://richhickey.github.com/clojure/clojure.core-api.html#clojure.core/distinct

--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Wilson MacGyver

unread,

Feb 21, 2010, 10:08:52 PM2/21/10

to clo...@googlegroups.com

let me expand on my comment. What I meant is, the type of functionality you
are adding into distinct, makes it feel less "distinct". to me. If I
read what you are saying
correctly, you want to be also be able to pass a set, so in
addition to filter duplicate elements, it also will filter out
anything in the set.

to me that doesn't feel clojureish. I've always felt that clojure code is about
a lot of small functions that does one thing, but does the one thing really
well, so you can compose the functions together to do what you want.

adding what "feels" like an unrelated capability to a core function seems
"odd" to me. Just my 2 cents.

> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clo...@googlegroups.com
> Note that posts from new members are moderated - please be patient with your first post.
> To unsubscribe from this group, send email to
> clojure+u...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en

--
Omnem crede diem tibi diluxisse supremum.

Eugen Dueck

unread,

Feb 22, 2010, 7:22:30 AM2/22/10

to Clojure

Yes, for the reasons stated.

First, and most important, clojure.core/distinct does not let me pass
in a keyfn, it's hard-coded.

Second, and as mentioned that's debatable, distinct could make more of
the filtering functionality it already has available to the caller,
for free.

On Feb 22, 11:53 am, Wilson MacGyver <wmacgy...@gmail.com> wrote:
> Any reason why you can't use
> distinct?
>

> http://richhickey.github.com/clojure/clojure.core-api.html#clojure.co...

> clojure+u...@googlegroups.com<clojure%2Bunsu...@googlegroups.com>

Sean Devlin

unread,

Feb 22, 2010, 9:13:30 AM2/22/10

to Clojure

It sounds to me that you want to use c.c.seq-utils/group-by, not
distinct.

Eugen Dück

unread,

Feb 22, 2010, 9:55:29 AM2/22/10

to Clojure

group-by does not filter, in contrast to distinct.

And I'm really only interested in one of the values considered equal
according to my keyfn. So if I sed group-by, I'd have to wrap it with
a call to map or similar. This is certainly an option, but I'd rather
not create all the intermediate vectors if I don't need to.

Michał Marczyk

unread,

Feb 22, 2010, 1:15:22 PM2/22/10

to clo...@googlegroups.com

The Haskell approach is to have two functions, nub and nubBy, which
perform the task of distinct and the task of distinct with a
user-provided keyfn, respectively. I'd love to see a distinct-by in
core or contrib. (Mostly for completeness, it's easy enough to
write...)

As for the extra set argument, I'd rather leave it out. Use a
composition of distinct-by and filter.

Sincerely,
Michał

Sean Devlin

unread,

Feb 22, 2010, 1:25:13 PM2/22/10

to Clojure

What does distinct-by return in case of a collision?

(keys (group-by...))

(map first (vals (group-by ...)))

(map other-fn (vals (group-by ...)))

You're still better off w/ group-by, and manipulating the resulting
map appropriately.

Sean

Michał Marczyk

unread,

Feb 22, 2010, 1:39:23 PM2/22/10

to clo...@googlegroups.com

On 22 February 2010 19:25, Sean Devlin <francoi...@gmail.com> wrote:
> What does distinct-by return in case of a collision?

I'm not sure what you mean by that. distinct-by would do precisely
what distinct does, only instead of comparing the items of the
argument collection themselves, it would compare values of (keyfn
item). The return value ought to be a lazy-seq, in keeping with that
of distinct.

Sincerely,
Michał

Michał Marczyk

unread,

Feb 22, 2010, 1:52:33 PM2/22/10

to clo...@googlegroups.com

My apologies for misrepresenting Haskell's nubBy:

On 22 February 2010 19:15, Michał Marczyk <michal....@gmail.com> wrote:
> The Haskell approach is to have two functions, nub and nubBy, which
> perform the task of distinct and the task of distinct with a
> user-provided keyfn, respectively.

Actually nubBy takes a binary function as its first argument and
performs equality comparisons of each successive element to each of
the elements collected into the result list using that function. A
corresponding Clojure function:

(defn distinct-by
"A version following Haskell's nubBy."
[f coll]

(let [step (fn step [xs seen]
(lazy-seq

((fn [[x :as xs] seen]
(when-let [s (seq xs)]
(if (some #(f x %) seen)
(recur (rest s) seen)
(cons x (step (rest s) (conj seen x))))))
xs seen)))]
(step coll #{})))

Here's one example of use:

user> (distinct-by #(= (class %1) (class %2)) [1 2 3 :a :b :c 'a 'b 'c])
(1 :a a)

And here's a Clojure version of the coolest Haskell one-liner I've ever seen:

user> (take 10 (distinct-by #(< 1 (gcd %1 %2)) (iterate inc 2)))
(2 3 5 7 11 13 17 19 23 29)

The alternative is to use a unary keyfn-based approach as I originally
suggested. This would work like so:

(defn distinct-by
"NB. this isn't like Haskell's nubBy. f is supposed to be unary."
[f coll]

(let [step (fn step [xs seen]
(lazy-seq

((fn [[x :as xs] seen]
(when-let [s (seq xs)]
(if (some #(f x %) seen)
(recur (rest s) seen)
(cons x (step (rest s) (conj seen (f x)))))))
xs seen)))]
(step coll #{})))

user> (distinct-by class [1 2 3 :a :b :c 'a 'b 'c])
(1 :a a)

Now I'm not really sure which one is more deserving of being called
distinct-by...

Sincerely,
Michał

Michał Marczyk

unread,

Feb 22, 2010, 1:53:48 PM2/22/10

to clo...@googlegroups.com

PS. The code for both functions is lifted straight from clojure.core's distinct.

Sean Devlin

unread,

Feb 22, 2010, 1:59:24 PM2/22/10

to Clojure

Generally when you have a "*-by" fn, you return the a sequence of
original values, not the mapped values (e.g. sort-by)

However, in the case you suggest, you really should just be using map
& distinct, and not creating your own fn. Using the built ins as much
as possible is more idiomatic Clojure.

(distinct (map f coll))

Sean

On Feb 22, 1:39 pm, Michał Marczyk <michal.marc...@gmail.com> wrote:

Michał Marczyk

unread,

Feb 22, 2010, 2:19:04 PM2/22/10

to clo...@googlegroups.com

On 22 February 2010 19:59, Sean Devlin <francoi...@gmail.com> wrote:
> Generally when you have a "*-by" fn, you return the a sequence of
> original values, not the mapped values (e.g. sort-by)

Which is precisely what both versions of distinct-by are doing. They
return a sequence of values taken from the original collection,
although determining which values those are involves calling a
user-supplied function and performing a bunch of comparisons of the
value thus obtained to values obtained from calls to said function at
previous steps.

E.g. (I'm repeating an example; this uses the second version of distinct-by):

(distinct-by class [1 2 3 :a :b :c 'a 'b 'c])

; => (1 :a a)

whereas

(distinct (map class [1 2 3 :a :b :c 'a 'b 'c])
; => (java.lang.Integer clojure.lang.Keyword clojure.lang.Symbol)

The primes one-liner using the nubBy-like version returns a list of
numbers, whereas the "mapped values" are Booleans.

Sincerely,
Michał

Sean Devlin

unread,

Feb 22, 2010, 2:28:37 PM2/22/10

to Clojure

Okay, your longer post makes more sense.

Then is the seq (1 :a a) guaranteed? How do I know that I won't get
(2 :b b), (1 :b c), etc? What if I want a specific combination
instead? I've had to actually code this specific problem, and I found
that using group-by & some secondary mapping operation was the only
thing that gave me the flexibility I needed (manufacturing is fun!).

Sean

On Feb 22, 2:19 pm, Michał Marczyk <michal.marc...@gmail.com> wrote:

Michał Marczyk

unread,

Feb 22, 2010, 3:13:52 PM2/22/10

to clo...@googlegroups.com

On 22 February 2010 20:28, Sean Devlin <francoi...@gmail.com> wrote:
> Then is the seq (1 :a a) guaranteed? How do I know that I won't get
> (2 :b b), (1 :b c), etc? What if I want a specific combination
> instead? I've had to actually code this specific problem, and I found
> that using group-by & some secondary mapping operation was the only
> thing that gave me the flexibility I needed (manufacturing is fun!).

The ordering guarantees distinct-by makes are exactly those that
distinct makes, because it uses the same code (as mentioned
previously, I lifted it all from clojure.core, then tweaked to take
the keyfn / eqfn into account). Basically this means that if your
collection has an intrinsic ordering, it will be preserved (the result
will include, for each equivalence class of items from the sequence
modulo the user-defined equivalence relation, the one earliest w.r.t.
that ordering). If it's a hash-map or a hash-set instead, you'll get
whatever ordering (seq coll) happens to produce.

As for group-by giving you more flexibility -- well, it gives you a
lot of flexibility where it's appropriate to use it, but because of
its choice of data structure for the result, you can't use it to
reimplement distinct-by directly:

user=> (group-by class [1 2 3 :a :b :c 'a 'b 'c])
java.lang.ClassCastException: java.lang.Class cannot be cast to
java.lang.Comparable (NO_SOURCE_FILE:0)

So no way to use non-Comparables as keys...

And then there's the fact that you can't tell in which order the keys
discovered by group-by appeared in the original collection, which is
again because of its use of sorted-map, which has the consequence that
order is being mangled on purpose! E.g.:

user=> (seq (group-by #(- %) [1 2 3 4 5]))
([-5 [5]] [-4 [4]] [-3 [3]] [-2 [2]] [-1 [1]])

In other words: (seq (group-by f coll)) has an ordering possibly
completely unrelated to that of coll (so you'd have to make a separate
traversal through the coll to discover the original ordering of the
keys), whereas (distinct-by f coll), for either version of
distinct-by, preserves the ordering of coll. That's a desirable
property for when that's what you want to do, whereas group-by will, I
suppose, be more useful on other occasions. ;-)

To sum it up, (1) distinct-by actually behaves in a very predictable
way (which may or may not be useful for any particular purpose), (2)
it cannot be implemented directly in terms of group-by. I'd say it's
pretty orthogonal to the existing library functions (that I know of)
actually...

Sincerely,
Michał

Sean Devlin

unread,

Feb 22, 2010, 3:25:17 PM2/22/10

to Clojure

Excellent points.

+1 for orthogonality.

Sean

On Feb 22, 3:13 pm, Michał Marczyk <michal.marc...@gmail.com> wrote:

Eugen Dück

unread,

Feb 22, 2010, 6:17:10 PM2/22/10

to Clojure

I agree with you, Michal.

But let me rephrase the question, maybe my initial long-winded post
wasn't clear enough on that.

Rather than having a separate fn 'distinct-by' in addition to the
existing 'distinct', which, apart from the hard-coded keyfn would be
EXACTLY the same, shouldn't we just generalize distinct to default to
hard-coded if no keyfn is specified? (Btw I'm not considering my other
suggestion to allow the set to be passed in here, which I personally
like, but also understand can be seen as unorthogonal)

And this is not only true for distinct-by, it's true for some of the
other * / *-by pairs. It just seems this should be generalized in
order not to duplicate code for these cases. If we copy-and-paste
code, what's the justification? I'd say orthogonality is an argument
against copy-and-paste. Do we copy and paste rather than generalize in
order to have distinct a little bit faster due to it being hard-coded?

Eugen

On Feb 23, 5:13 am, Michał Marczyk <michal.marc...@gmail.com> wrote:

Michał Marczyk

unread,

Feb 23, 2010, 12:37:59 PM2/23/10

to clo...@googlegroups.com

On 22 February 2010 20:28, Sean Devlin <francoi...@gmail.com> wrote:
> I've had to actually code this specific problem, and I found
> that using group-by & some secondary mapping operation was the only
> thing that gave me the flexibility I needed (manufacturing is fun!).

Incidentally, have you ever benefitted from group-by using a
sorted-map, as opposed to, say, a hash-map? I'm asking because I'm a
bit puzzled by the reasoning behind using this particular structure
with the resulting ordering only vaguely related (i.e. in a manner
which might be difficult to predict) to that of the input collection
anyway...

I haven't got a strong opinion on this, by the way, it's just that if
there isn't some huge benefit to sorted-map which I can't yet see,
perhaps a hash-map would make for a more versatile function?

Sincerely,
Michał

Michał Marczyk

unread,

Feb 23, 2010, 12:44:59 PM2/23/10

to clo...@googlegroups.com

On 23 February 2010 00:17, Eugen Dück <eu...@dueck.org> wrote:
> Rather than having a separate fn 'distinct-by' in addition to the
> existing 'distinct', which, apart from the hard-coded keyfn would be
> EXACTLY the same, shouldn't we just generalize distinct to default to
> hard-coded if no keyfn is specified? (Btw I'm not considering my other
> suggestion to allow the set to be passed in here, which I personally
> like, but also understand can be seen as unorthogonal)
>
> And this is not only true for distinct-by, it's true for some of the
> other * / *-by pairs. It just seems this should be generalized in
> order not to duplicate code for these cases. If we copy-and-paste
> code, what's the justification? I'd say orthogonality is an argument
> against copy-and-paste. Do we copy and paste rather than generalize in
> order to have distinct a little bit faster due to it being hard-coded?

That seems like a very good point to me! Especially if
clojure.core/identity could be inlined by the compiler (which I
thought it is, although now that I look the inlining-related entries
are missing from (meta #'identity)), both distinct and sort could be
arity overloaded to accept an optional keyfn or not (sort would
basically have to be replaced with sort-by + an added unary variant
with defaults for both keyfn and comparator).

Then perhaps we could reuse the *-by names for Haskell-style binary
predicate-based versions. Haskell's sortBy, groupBy, nubBy all take
binary functions to decide ordering / equality of pairs of elements
from the input sequence... That's enables some neat code sometimes,
cf. the primes on liner. How about that?

Sincerely,
Michał

Sean Devlin

unread,

Feb 23, 2010, 2:26:15 PM2/23/10

to Clojure

Take a look at sort/sort-by in core. That would be starting point for
fn style.

Michał Marczyk

unread,

Feb 23, 2010, 2:40:35 PM2/23/10

to clo...@googlegroups.com

On 23 February 2010 20:26, Sean Devlin <francoi...@gmail.com> wrote:
> Take a look at sort/sort-by in core. That would be starting point for
> fn style.

Not sure what you mean by that.

If anything, having an arity-overloaded distinct (optionally accepting
a binary predicate to use in place of =) and a separate distinct-by
(accepting an extra keyfn argument plus optionally the binary
predicate) would make for a more uniform standard lib.

(I guess I suggested switching to the Haskell naming convention
before, but there's no real reason to do that if the same
functionality is available, plus I'd be confused by this myself,
having gotten used to the Clojure names already... I retract that
part.)

Sincerely,
Michał

Sean Devlin

unread,

Feb 23, 2010, 3:03:04 PM2/23/10

to Clojure

I guess I was getting at how to do the airity overloading. My first
instinct when developing is do something like this

(def distinct
([coll] (distinct identity coll))
([f coll] (lot-of-code)))

However, if you look around core, you'll see that Rich "repeats
himself" a lot. This really obvious w/ fns like juxt & partial. It
turns out that this is results in faster code. So I'd implement
distinct like this.

(def distinct
([coll] (lots-of-code))
([f coll] (lot-of-similar-code)))

That's all I was saying.

Sean

On Feb 23, 2:40 pm, Michał Marczyk <michal.marc...@gmail.com> wrote:

Michał Marczyk

unread,

Feb 23, 2010, 3:30:16 PM2/23/10

to clo...@googlegroups.com

On 23 February 2010 21:03, Sean Devlin <francoi...@gmail.com> wrote:
> I guess I was getting at how to do the airity overloading. My first
> instinct when developing is do something like this
>
> (def distinct
> ([coll] (distinct identity coll))
> ([f coll] (lot-of-code)))
>
> However, if you look around core, you'll see that Rich "repeats
> himself" a lot. This really obvious w/ fns like juxt & partial. It
> turns out that this is results in faster code. So I'd implement
> distinct like this.
>
> (def distinct
> ([coll] (lots-of-code))
> ([f coll] (lot-of-similar-code)))
>
> That's all I was saying.
>
> Sean

That's a very valid point! (Which I previously completely failed to grok.)

For a "private" utility function, I guess I normally prefer DRYer,
perhaps slightly slower code, but for a library function it's a
different matter... Anyway, thanks for the clarification!

How about that sorted-map vs hash-map thing with group-by? I'm
genuinely curious.

All best,
Michał

Eugen Dück

unread,

Feb 23, 2010, 6:18:36 PM2/23/10

to Clojure

Agree, it might make sense to do that performance-wise. Although I'm
hoping that the compiler would be able to inline that as well
(optimizing away calls to identity).

Btw, what does :inline-arities do? It sounds like this could do what
we want.

Reply all

Reply to author

Forward