intersection and union on sequences/lists

Christian Schuhegger

unread,

Mar 13, 2011, 7:17:48 AM3/13/11

to Clojure

Hi all,

I am coming from common lisp and was wondering if clojure supports
intersection and union not only on sets, but also on normal sequences/
lists? I just did a google search but without a result. For the moment
I'm writing those functions myself.

Thanks for any hints,
Christian

Alan

unread,

Mar 13, 2011, 3:26:35 PM3/13/11

to Clojure

All this will do for you is give you a way to compute intersections
that is O(N) instead of O(logN).

If you don't care about performance, just (defn intersect [a b]
(clojure.set/intersection (set a) (set b))).
If you care about performance and plan to work with something as a set
later, then make it a set to begin with.
If you care about the *order* of the entries in one of your sets (it
makes no sense to care about order in both of them), then you can
(defn ordered-intersect [a b] (filter (set b) a)).

On Mar 13, 3:17 am, Christian Schuhegger

Christian Schuhegger

unread,

Mar 13, 2011, 3:40:09 PM3/13/11

to Clojure

I would like to keep duplicates, e.g. (intersect '( 1 2 3 3 3) '(3 4 3
5)) should give '(3 3). As I've said above, I've solved the issue for
the moment by writing the required functions myself. I was just
wondering if this functionality already exists somewhere in the
clojure core/contrib space?

Alan

unread,

Mar 13, 2011, 3:50:32 PM3/13/11

to Clojure

Why shouldn't it give '(3 3 3)? It looks like you've introduced a
fairly arbitrary distinction between the first and second arguments to
a normally-symmetric function, intersect. And you would get the same
behavior by using one of my suggestions:

user> (filter (set [1 2 3 3 3]) [3 4 3 5])
(3 3)
user> (filter (set [3 4 3 5]) [1 2 3 3 3])
(3 3 3)

As a side note, Clojure rarely uses quoted lists for data structures:
vectors are easier for a number of reasons.

On Mar 13, 12:40 pm, Christian Schuhegger

Ken Wesson

unread,

Mar 13, 2011, 4:41:27 PM3/13/11

to clo...@googlegroups.com

On Sun, Mar 13, 2011 at 3:50 PM, Alan <al...@malloys.org> wrote:
> Why shouldn't it give '(3 3 3)? It looks like you've introduced a
> fairly arbitrary distinction between the first and second arguments to
> a normally-symmetric function, intersect. And you would get the same
> behavior by using one of my suggestions:
>
> user> (filter (set [1 2 3 3 3]) [3 4 3 5])
> (3 3)
> user> (filter (set [3 4 3 5]) [1 2 3 3 3])
> (3 3 3)

I'm pretty sure what he wants is for the intersection to be "as many
1s as are common to both lists; as many 2s as are common to both
lists; as many 3s ...", so [1 2 3 3 3 4 5] intersect [1 1 3 3 4 6]
would be [1 3 3 4], for instance.

For that sort of thing, what you really want is a "bag" datatype (an
unordered collection with duplicates that is stored and searched more
efficiently than a vector or a list), but here's quick-and-dirty
intersect that does what he wants on anything seqable:

(defn intersect [s1 s2]
(let [f1 (frequencies s1)
f2 (frequencies s2)
d1 (apply dissoc f1 (keys f2))
d2 (apply dissoc f2 (keys f1))
f1 (apply dissoc f1 (keys d1))
f2 (apply dissoc f2 (keys d2))]
(mapcat
(fn [[k v]]
(repeat v k))
(merge-with min f1 f2))))

user=> (intersect [1 2 3 3 3 4 5] [1 1 3 3 4 6])
(1 3 3 4)
user=> (apply str
(intersect
"the quick brown fox jumped over the lazy dog"
"it was the best of times, it was the worst of times"))
" abeeeefhhimooorttw"

Christian Schuhegger

unread,

Mar 14, 2011, 3:34:11 AM3/14/11

to Clojure

You're right. Your version does what I want. Actually I've just seen
that common lisps behaviour in the case of duplicates is undefined:
http://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node152.html
I was certain that it behaved the way your intersect behaves.

Ken Wesson

unread,

Mar 14, 2011, 3:46:09 AM3/14/11

to clo...@googlegroups.com

Here is a bag implementation:

(defprotocol SetOps
(disjoin* [this obj])
(has* [this obj])
(total [this obj])
(counts [this]))

(defn disjoin
([s] s)
([s obj]
(disjoin* s obj))
([s obj & more]
(apply disjoin (disjoin s obj) more)))

(defn difference
([s] s)
([s other]
(apply disjoin s other))
([s other & more]
(apply difference (difference s other) more)))

(defn intersection
([s] s)
([s other]
(difference s (difference s other)))
([s other & more]
(apply intersection (intersection s other) more)))

(defn union
([s] s)
([s other]
(if (empty? other) s (apply conj s other)))
([s other & more]
(apply union (union s other) more)))

(defn has?
([s] true)
([s obj]
(has* s obj))
([s obj & more]
(and (has? s obj) (apply has? s more))))

(extend-type clojure.lang.IPersistentSet
SetOps
(disjoin* [this obj] (disj this obj))
(has* [this obj] (contains? this obj))
(total [this obj] (if (contains? this obj) 1 0))
(counts [this] (zipmap this (repeat 1))))

(deftype Bag [m c]
clojure.lang.IObj
(withMeta [this, mtd]
(Bag. (with-meta m mtd) c))
(meta [this] (meta m))
clojure.lang.IPersistentCollection
(count [this] c)
(empty [this] (Bag. {} 0))
(equiv [this other] (= (seq this) other))
(seq [this]

(mapcat
(fn [[k v]]
(repeat v k))

m))
(cons [this obj]
(Bag. (merge-with + m {obj 1}) (inc c)))
SetOps
(disjoin* [this obj]
(if-let [n (m obj)]
(if (= 1 n)
(Bag. (dissoc m obj) (dec c))
(Bag. (assoc m obj (dec n)) (dec c)))
this))
(has* [this obj] (contains? m obj))
(total [this obj] (m obj 0))
(counts [this] m)
Object
(toString [this]
(apply str
(interpose " "
(seq this)))))

(defn bag [& objects]
(Bag. (frequencies objects) (count objects)))

user=> (difference (bag 1 6 3 6 8 4 5 6 6 9 9) (bag 9 9 6 8 1))
#<Bag 6 6 6 3 4 5>

The difference, union, disjoin, intersection etc. functions are also
made to work on Clojure sets, and via extend-type or extend-protocol
can be extended to other types. (I didn't make Bag implement
IPersistentSet because I considered nonduplication to be part of the
latter's contract. So SetOps is a broader umbrella: unordered
collection with efficient membership test.)

The has? function is like contains?, but works on bags and sets rather
than on sets and maps and may take more arguments, in which case it
checks for all being contained. Note that (has? (bag 1) 1 1 1) will
return true; to check for containing at least a certain number of
occurrences use (not (empty? (difference bag-1 bag-2))) where bag-2
has the items we want to check for in the appropriate quantities.

Well, except that count and empty are broken for some reason:

user=> (.count (Bag. {} 0))
0
user=> (count (Bag. {} 0))
1

I don't understand what's causing this, but empty bags are always
returning a count of 1 (and false from empty?) although the .count
method correctly returns zero. I've traced the problem as far as the
Java method clojure.lang.RT/count but no farther at this time.

Ken Wesson

unread,

Mar 14, 2011, 3:49:42 AM3/14/11

to clo...@googlegroups.com

> Well, except that count and empty are broken for some reason:
>
> user=> (.count (Bag. {} 0))
> 0
> user=> (count (Bag. {} 0))
> 1
>
> I don't understand what's causing this, but empty bags are always
> returning a count of 1 (and false from empty?) although the .count
> method correctly returns zero. I've traced the problem as far as the
> Java method clojure.lang.RT/count but no farther at this time.

Argh. This same bug causes union to blow up in the case of (union (bag
1) (bag)). It works with (union #{1} #{}), by contrast. It traces back
to empty? wrongly returning true, and thus to count wrongly returning
1 even though .count correctly returns 0.

Ken Wesson

unread,

Mar 14, 2011, 4:07:17 AM3/14/11

to clo...@googlegroups.com

Simply adding "clojure.lang.Counted", without any methods, to the
protocol list makes it work again. The count function in RT is this
odd little thing:

public static int count(Object o){
if(o instanceof Counted)
return ((Counted) o).count();
return countFrom(Util.ret1(o, o = null));
}

If it's not counted, countFrom is called, and for some reason does not
work on empty Bags even though it apparently works correctly on, among
others, empty LazySeqs:

user=> (instance? clojure.lang.Counted (map + '()))
false
user=> (count (map + '()))
0

Taking the countFrom expression apart, Util.ret1 just seems to return
its first argument. I suppose it's being used to null o before
countFrom does all of its work, to avoid holding onto the head of a
lazy sequence that gets passed in. It's not the source of the problem.

The cause seems to be this in countFrom:

else if(o instanceof IPersistentCollection) {
ISeq s = seq(o);
o = null;
int i = 0;
for(; s != null; s = s.next()) {
if(s instanceof Counted)
return i + s.count();
i++;
}
return i;
}

If "o instanceof IPersistentCollection" then o has a .count method,
yet this doesn't call it. Instead it calls seq, and if seq returns a
non-null, empty seq that loop returns 1 instead of 0. So another fix
that could be made to the original bag would be to call seq on the
return value of mapcat. (Why mapcat is not returning nil for an empty
input seq is itself somewhat mysterious.) I'm not sure whether the
loop above should be considered correct (seq should never return a
non-nil, empty seq) or incorrect (the corner case of seq's return
being empty but not null should be handled). I'm not even sure whether
that loop should exist at all, or should instead call the .count
method of IPersistentCollection (with the default implementation being
the loop above, of course). If the .count method of
IPersistentCollection is not going to be used, on the other hand,
unless the thing is also a Counted, then it should be removed from the
former interface.

Regardless, it looks like (count a-bag) is going to be inefficient
despite the .count method implementation unless the Counted type is
added to the protocol list, and seq should probably be returning nil
for empty bags.

So the version I'd suggest using is:

(deftype Bag [m c]
clojure.lang.IObj
(withMeta [this, mtd]
(Bag. (with-meta m mtd) c))
(meta [this] (meta m))
clojure.lang.IPersistentCollection
(count [this] c)
(empty [this] (Bag. {} 0))
(equiv [this other] (= (seq this) other))
(seq [this]

(seq

(mapcat
(fn [[k v]]
(repeat v k))

m)))

(cons [this obj]
(Bag. (merge-with + m {obj 1}) (inc c)))
SetOps
(disjoin* [this obj]
(if-let [n (m obj)]
(if (= 1 n)
(Bag. (dissoc m obj) (dec c))
(Bag. (assoc m obj (dec n)) (dec c)))
this))
(has* [this obj] (contains? m obj))
(total [this obj] (m obj 0))
(counts [this] m)

clojure.lang.Counted

Alan

unread,

Mar 14, 2011, 4:10:42 AM3/14/11

to Clojure

clojure.lang.RT.count checks whether the object being counted
implements Counted, and only calls .count on it if so.
IPersistentCollection does not implement Counted, so it calls (seq) on
your object and walks over it to get the count. You return
(mapcat ...), when asked for a seq, which is fully lazy and thus non-
nil even for an empty bag. You can fix the immediate problem by
adjusting the seq function to return (seq (mapcat ...)): in my REPL
this causes (count (bag)) to return 0 instead of 1.

You could also make Bag derive Counted, so that your .count method
will be called to save some time.

Alan

unread,

Mar 14, 2011, 4:12:13 AM3/14/11

to Clojure

By the way, thanks for the very thorough Bag implementation. Your
implementations of the very detailed things that people wish Clojure
had are always interesting reading.

On Mar 14, 12:49 am, Ken Wesson <kwess...@gmail.com> wrote:

Ken Wesson

unread,

Mar 14, 2011, 4:32:13 AM3/14/11

to clo...@googlegroups.com

On Mon, Mar 14, 2011 at 4:12 AM, Alan <al...@malloys.org> wrote:
> By the way, thanks for the very thorough Bag implementation. Your
> implementations of the very detailed things that people wish Clojure
> had are always interesting reading.

You're welcome.

I solved the count problem independently, but thanks for the advice
anyway. The presence of the count method in the specification of
IPersistentCollection puzzles me, if it's never called unless an
object also implements Counted (why not ICounted?), which specifies
the same method anyway.

Ken Wesson

unread,

Mar 14, 2011, 4:35:30 AM3/14/11

to clo...@googlegroups.com

(defn bag-of [coll]
(apply bag coll))

just so we have analogues of both vec and vector here.

Reply all

Reply to author

Forward