
intersection


Tuang

Jan 28, 2004, 4:14:41 AM
I'm a bit confused about how "intersection" is supposed to work in
Common Lisp.

The Hyperspec says the following about intersection:

"The intersection operation is described as follows. For all possible
ordered pairs consisting of one element from list-1 and one element
from list-2, :test or :test-not are used to determine whether they
satisfy the test [of equality]....For every pair that satisfies the
test, exactly one of the two elements of the pair will be put in the
result."

That tells me that if you have 3 A's in one list and 2 A's in the
other, you'll have six possible ordered pairs containing one A from
each list. Taking one from each pair, it sounds as though there should
be six A's in the result. Of course you could remove a pair, after
adding one A to the result, as soon as you get a match, but then you
wouldn't be considering "all possible ordered pairs".

Trying it out with CLISP (on Win32) to help myself understand it, I
get the following:

> (intersection '(A A A D E) '(A A B C))
(A A A)
> (intersection '(A A B C) '(A A A D E))
(A A)

Well, now I'm even more confused. At the very least, it would seem to
me that (intersection X Y) ought to be the same as (intersection Y X),
though potentially in a different order.

I think that the second is what I would normally consider
intersection, which is sharing an element in common in a venn diagram.
If set X contains 2 A's and set Y contains 3 A's, then drawing the
intersecting circles, you could put two A's in the shared (overlapped)
part with one more A in Y but outside the shared part. The
intersection would be those items in the shared region, meaning 2 A's
in the intersection.

I don't know. Is this just a bug in CLISP, or a misstatement in the
Hyperspec, or some misunderstanding on my part? I've been assuming
that something fundamental like intersection would be implemented
everywhere with the same 5-6 lines of code that have been used since
the dawn of Lisp.

Thoughts?

Marco Gidde

Jan 28, 2004, 5:28:15 AM
tuan...@hotmail.com (Tuang) writes:

> I'm a bit confused about how "intersection" is supposed to work in
> Common Lisp.
>
> The Hyperspec says the following about intersection:
>
> "The intersection operation is described as follows. For all possible
> ordered pairs consisting of one element from list-1 and one element
> from list-2, :test or :test-not are used to determine whether they
> satisfy the test [of equality]....For every pair that satisfies the
> test, exactly one of the two elements of the pair will be put in the
> result."

It also says:

"If one of the lists contains duplicate elements, there may be
duplication in the result."

> That tells me that if you have 3 A's in one list and 2 A's in the
> other, you'll have six possible ordered pairs containing one A from
> each list. Taking one from each pair, it sounds as though there should
> be six A's in the result. Of course you could remove a pair, after
> adding one A to the result, as soon as you get a match, but then you
> wouldn't be considering "all possible ordered pairs".
>
> Trying it out with CLISP (on Win32) to help myself understand it, I
> get the following:
>
> > (intersection '(A A A D E) '(A A B C))
> (A A A)
> > (intersection '(A A B C) '(A A A D E))
> (A A)
>
> Well, now I'm even more confused. At the very least, it would seem to
> me that (intersection X Y) ought to be the same as (intersection Y X),
> though potentially in a different order.
>
> I think that the second is what I would normally consider
> intersection, which is sharing an element in common in a venn diagram.
> If set X contains 2 A's and set Y contains 3 A's, then drawing the
> intersecting circles, you could put two A's in the shared (overlapped)
> part with one more A in Y but outside the shared part. The
> intersection would be those items in the shared region, meaning 2 A's
> in the intersection.

You assume the A's are different, while in fact they are
equal. Remember the definition of an intersection: the
intersection of two sets A and B is the set of all x, where x is
an element of A and x is an element of B. Taking this into account, the
solution could also be '(A) or '(A A A A A A A A A).

Erik Naggum

Jan 28, 2004, 10:02:26 AM
* Tuang

| The Hyperspec says the following about intersection:

If one of the lists contains duplicate elements, there may be
duplication in the result.

Sets implemented as lists are defined on the presence of members of
the set in the list. It is strictly speaking an error to duplicate
members of the set in the list representation, but this error is hard
to detect inexpensively, so the rational course of action is to allow
it with suitable hand-waving. The above is an instance of precisely
such hand-waving.

--
Erik Naggum | Oslo, Norway 2004-028

Act from reason, and failure makes you rethink and study harder.
Act from faith, and failure makes you blame someone and push harder.

Marco Antoniotti

Jan 28, 2004, 12:10:47 PM

That depends on your definition of set intersection and set equality.

If you do not consider multiplicity of elements, basic set theory says that

{A, A, A} = {A}

hence CLISP (and Common Lisp) are right in returning what they return.
Of course you may have problems if you want to do

(equal '(A A A) '(A A))

but then, nobody ensured that EQUAL implements set equality as intended
in basic set theory. To do that, you are better off with

(defun set-equal (s1 s2)
  (and (subsetp s1 s2) (subsetp s2 s1)))

Of course, you can add all the &key you need etc etc...
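For instance, a minimal sketch of such a keyword-accepting variant (the
name SET-EQUAL* and its argument list are made up here, not anything the
spec defines):

(defun set-equal* (s1 s2 &key (test #'eql) key)
  ;; Two lists denote the same set when each is a subset of the other.
  (and (subsetp s1 s2 :test test :key key)
       (subsetp s2 s1 :test test :key key)))

;; (set-equal* '(a a b) '(b a))                              => T
;; (set-equal* '("a" "b") '("B" "A") :test #'string-equal)   => T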

Note also that

(union '(a) '(a a c))

yields different results on different CL implementations. But then
again the basic set semantics is correct and SET-EQUAL will work as
expected. It is the LIST interpretation of sets that yields incorrect
results.

The use of lists as sets is a nice, simple, quick-and-dirty way to
do set theory in CL. It is not necessarily the right way, especially
when the sets become large. In that case you are better off writing your
own set-manipulation library.


Cheers
--
Marco

Kaz Kylheku

Jan 28, 2004, 12:14:00 PM
tuan...@hotmail.com (Tuang) wrote in message news:<df045d93.04012...@posting.google.com>...

> Trying it out with CLISP (on Win32) to help myself understand it, I
> get the following:
>
> > (intersection '(A A A D E) '(A A B C))
> (A A A)
> > (intersection '(A A B C) '(A A A D E))
> (A A)

Cluestick: a mathematical set has no order and no concept of
duplicates. An element is either a member of the set or is not. If you
care about the difference between (a a a) and (a), then you are not
working with a set. You are conceptually working with two sets and a
relation between them: a set of symbols, and the set of integers. The
relation associates an element from the first set with an element from
the second, representing its multiplicity. If, additionally, you care
about the difference between (a b) and (b a) then you are working with
sequences, not sets.

> I don't know. Is this just a bug in CLISP, or a misstatement in the
> Hyperspec, or some misunderstanding on my part? I've been assuming

The HyperSpec is crystal clear:

``If one of the lists contains duplicate elements, there may be
duplication in the result.''

``May'' means ``It's not wrong if it is so, but is not required to be,
and portable programs can't depend on either behavior''.

So what this says in effect is ``if you feed non-set-like sequences to
INTERSECTION, don't count on it to normalize the results to a proper
set-like sequence with no duplicates, but also don't count on it to
preserve duplication.''

There is no requirement that the exact multiplicity of the duplication
from both lists be preserved if the implementation chooses to not
normalize the result.

> that something fundamental like intersection would be implemented
> everywhere with the same 5-6 lines of code that have been used since
> the dawn of Lisp.

Efficient implementations of intersection require hashing, or
something like it. The intersection can be computed in linear time if
you build an index out of one of the sets, and then iterate through
the second set, doing quick lookups in the index to see whether the
items are in the first set.

It can be readily seen how this type of implementation can easily
produce a different answer based on the order of the two operands. The
hashing would reduce any duplicates in the list that is indexed, so
any duplicates would come from the pass over the second one.

The simple, naive INTERSECTION works in O(M * N) time rather than O(M
+ N) time: take elements from one list one by one, and linearly search
for them in the other list. This naive implementation will also have an
order dependency in its reproduction of duplicates if it stops
scanning the target list whenever it finds a match. I.e. (intersection
'(A) '(A A A)) will find the first A in (A A A) and stop, thereby not
seeing the remaining two A's. But (intersection '(A A A) '(A)) will
iterate over the first list, and find A in the second list three times
over again.
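
To make that concrete, here is a minimal sketch of the naive scheme just
described (MY-INTERSECTION is a made-up name, not CLISP's actual source;
key/test arguments omitted):

(defun my-intersection (list-1 list-2)
  (loop for elt in list-1
        ;; MEMBER stops at the first match, so duplicates in LIST-2 are
        ;; never seen; duplicates in LIST-1 are collected once each.
        when (member elt list-2)
          collect elt))

;; (my-intersection '(a a a d e) '(a a b c))  => (A A A)
;; (my-intersection '(a a b c) '(a a a d e))  => (A A)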

Tuang

Jan 28, 2004, 2:53:14 PM
Erik Naggum <er...@naggum.no> wrote in message news:<32842909464...@naggum.no>...

> * Tuang
> | The Hyperspec says the following about intersection:
>
> If one of the lists contains duplicate elements, there may be
> duplication in the result.
>
> Sets implemented as lists are defined on the presence of members of
> the set in the list. It is strictly speaking an error to duplicate
> members of the set in the list representation, but this error is hard
> to detect inexpensively, so the rational course of action is to allow
> it with suitable hand-waving. The above is an instance of precisely
> such hand-waving.

and Marco said:

> Remember the definition of an intersection: the
> intersection of two sets A and B is the set of all x, where x is
> an element of A and x is an element of B. Taking this into account the
> solution could also be '(A) or '(A A A A A A A A A)


Both of these replies (thank you, BTW) seem to be saying that the
result of performing an intersection on lists containing duplicate
members is undefined, at least with respect to the number of
duplicates that may show up in the result. That's a much stronger
statement than the "may have duplicates" line in the spec.

The spec made a big point about how "nintersection" was destructive to
one of its inputs, and how that destruction was implementation
dependent, and provided examples of different results of destruction,
etc. With that much description of an implementation dependency that
could have been covered with "the first input to nintersection is
destroyed", I didn't think "may have duplicates" would imply "the
number of members in the result is so undefined that even within the
same implementation the length of X i'sect Y isn't guaranteed to be
the same as the length of Y i'sect X."

I did understand that intersection (and other set operations) had
several possible definitions in the presence of duplicates, but I was
thinking that the Lisp world had probably chosen the one that was most
useful as a building block for higher abstractions long ago -- like
defining multiplication of no arguments to equal 1 instead of making
it implementation dependent.

Of course, Lisp makes it easy for me to define it any way I like and
implement it in about half a dozen lines of code, so it doesn't matter
in that respect. I was thinking, though, that I was going to learn how
I *ought* to define it, mistakenly thinking that all of the Lisp
builtins, except where explicitly noted, had evolved to very rigidly
defined algorithms years ago.

Pascal Bourguignon

Jan 28, 2004, 3:18:09 PM
tuan...@hotmail.com (Tuang) writes:

Yes, but all of this has no importance if you use this list->set
function:

(defun list->set (list) (remove-duplicates list))
(setq p (list->set '(a a b a c a a d)))
(setq q (list->set '(a a a a a d a d)))
(intersection p q) --> (d a) ;; or (a d)...

--
__Pascal_Bourguignon__ http://www.informatimago.com/
There is no worse tyranny than to force a man to pay for what he doesn't
want merely because you think it would be good for him.--Robert Heinlein
http://www.theadvocates.org/

Paul Dietz

Jan 28, 2004, 3:18:01 PM
Tuang wrote:

> Both of these replies (thank you, BTW) seem to be saying that the
> result of performing an intersection on lists containing duplicate
> members is undefined, at least with respect to the number of
> duplicates that may show up in the result. That's a much stronger
> statement than the "may have duplicates" line in the spec.

No, they're about the same.

> The spec made a big point about how "nintersection" was destructive to
> one of its inputs,

No, the spec does not say that nintersection is destructive
to list-1. It says it *may* be destructive. This is an important
difference: a conforming program cannot count on the first list
being modified. The implementation has the freedom to modify it,
but it is not required to do so. Making nintersection do exactly the
same thing as intersection would be acceptable in a conforming
implementation of Common Lisp.
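
A sketch of that point: the following (hypothetical) definition would be
a conforming way to provide NINTERSECTION even though it never modifies
anything:

(defun my-nintersection (list-1 list-2 &rest keyword-args)
  ;; Defer to the non-destructive version; the standard permits,
  ;; but does not require, NINTERSECTION to reuse LIST-1's cells.
  (apply #'intersection list-1 list-2 keyword-args))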


> I didn't think "may have duplicates" would imply "the
> number of members in the result is so undefined that even within the
> same implementation the length of X i'sect Y isn't guaranteed to be
> the same as the length of Y i'sect X.

Where did it say they had to be the same? This is not stated
in the spec, so why were you imagining it is required?

Paul

Erik Naggum

Jan 28, 2004, 3:57:47 PM
* Tuang

| Both of these replies (thank you, BTW) seem to be saying that the
| result of performing an intersection on lists containing duplicate
| members is undefined, at least with respect to the number of
| duplicates that may show up in the result. That's a much stronger
| statement than the "may have duplicates" line in the spec.

Please, this requires attention to detail. It is not «undefined» just
because what you want to have defined is not defined. This is like
literary analysis in that you can choose one of two possible modi
operandi in your approach to a text: (A) to regard the text as the
true source of its meaning and to spend your energy on understanding
what the author meant, or (B) to regard yourself as the true source of
meaning in your personal universe and to spend your energy on proving
that what you mean is in the text. There is no limit to what you can
find support for in a text if you set your mind to it. The challenge
is therefore not to find support for what you believe, but to explain
why you believe it in the face of relevant counter-evidence.



| With that much description of an implementation dependency that could
| have been covered with "the first input to nintersection is destroyed",
| I didn't think "may have duplicates" would imply "the number of members
| in the result is so undefined that even within the same implementation
| the length of X i'sect Y isn't guaranteed to be the same as the length
| of Y i'sect X.

You draw some pretty annoyingly misguided conclusions from the evidence
before you, and it is actually hard for me to muster all the energy to
respond to the clear impression that you have already made up your mind
about something that you do not have sufficient information to make up
your mind about. I just don't have the energy to fight someone who
insists that his misguided beliefs must be correct and therefore the
rest of the world is wrong, but the more you insist, the more the rest
of the world needs to be shown that you are misguided.

The obvious implementation of INTERSECTION is the trivial

(defun intersection (list-1 list-2)
  (loop for elt in list-1
        if (member elt list-2)
          collect elt))

which happens to produce exactly the results you observe. With the
exception of the key and test arguments, which I have omitted for the
sake of clarity, but which cry out for optimization, I would be very
surprised if anyone has implemented this differently. (Barring, of
course, LOOP allergies.)

| Of course, Lisp makes it easy for me to define it any way I like and
| implement it in about half a dozen lines of code, so it doesn't matter
| in that respect. I was thinking, though, that I was going to learn
| how I *ought* to define it, mistakenly thinking that all of the Lisp
| builtins, except where explicitly noted, had evolved to very rigidly
| defined algorithms years ago.

They generally have. Just because you don't find them intuitively
evident does not mean that they have not evolved thusly.

Henrik Motakef

Jan 28, 2004, 4:18:39 PM
tuan...@hotmail.com (Tuang) writes:

> The spec made a big point about how "nintersection" was destructive to
> one of its inputs, and how that destruction was implementation
> dependent, and provided examples of different results of destruction,
> etc. With that much description of an implementation dependency that
> could have been covered with "the first input to nintersection is
> destroyed", I didn't think "may have duplicates" would imply "the
> number of members in the result is so undefined that even within the
> same implementation the length of X i'sect Y isn't guaranteed to be
> the same as the length of Y i'sect X.

Interpreting the spec sometimes requires some experience with it. The
spec is generally pretty explicit about potentially destructive
operations - for example, it often defines both an operator that
/might/ be destructive and one that /must never/ be, but rarely one
that /must always/ be destructive (except when the destructive
operation is the whole point, as with CHANGE-CLASS for example). In
other regards, it is less specific, for example in what it means to
treat a list as a set.

BTW, you might notice that some list-as-set functions are defined in a
way that is explicitly non-set-like. (SET-DIFFERENCE s1 s2) and
(SET-DIFFERENCE s2 s1) are very different, which is probably not
what you would expect.
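
For example (these are set results, so the order of elements within each
result may vary between implementations):

(set-difference '(1 2 3) '(2 3 4))   ; => (1)
(set-difference '(2 3 4) '(1 2 3))   ; => (4)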

My impression (which may be completely off-base; when I came to Lisp,
the CL standard already was an old one) is that this might be because
expressing constraints and non-constraints on /implementations/ was
considered more important than those on /conforming programs/. In
general, if you want to write Lisp programs, you should either never
assume anything not explicitly guaranteed in the spec, or assume a
specific implementation's behaviour and be prepared for some porting
effort when you want to run it in a different
implementation. Fortunately, porting problems are pretty rare in
practice since many implementations implement
implementation-dependent behaviour similarly. (Do I get a prize for
that sentence?)

> I did understand that intersection (and other set operations) had
> several possible definitions in the presence of duplicates, but I was
> thinking that the Lisp world had probably chosen the one that was most
> useful as a building block for higher abstractions long ago -- like
> defining multiplication of no arguments to equal 1 instead of making
> it implementation dependent.

In a kind of way, it has. If you /consistently/ treat a list as a set,
there won't be many problems. If you mix lists-as-sets and
lists-as-lists, it may become more complicated, but, well, don't do that.

> Of course, Lisp makes it easy for me to define it any way I like and
> implement it in about half a dozen lines of code, so it doesn't matter
> in that respect. I was thinking, though, that I was going to learn how
> I *ought* to define it, mistakenly thinking that all of the Lisp
> builtins, except where explicitly noted, had evolved to very rigidly
> defined algorithms years ago.

Whenever something is defined rigidly, you will probably find a rigid
definition in the spec. If you don't, don't count on it. Even if all
implementations you bother to test with behave consistently,
implementations change their behaviour in corner cases all the time,
and completely new implementations continue to emerge (how much code
is properly tested on ArmedBear? Or even ECLS?).

Paul Dietz

Jan 28, 2004, 4:29:56 PM
Erik Naggum wrote:

> The obvious implementation of INTERSECTION is the trivial
>
> (defun intersection (list-1 list-2)
>   (loop for elt in list-1
>         if (member elt list-2)
>           collect elt))
>
> which happens to produce exactly the results you observe. With the
> exception of the key and test arguments, which I have omitted for the
> sake of clarity, but which cry out for optimization, I would be very
> surprised if anyone has implemented this differently. (Barring, of
> course, LOOP allergies.)


If the lists are long enough, it becomes preferable to use
a linear time algorithm with eql hash tables.

Paul

Paul Dietz

Jan 28, 2004, 4:43:23 PM
Henrik Motakef wrote:

> (how much code is properly tested on ArmedBear? Or even ECLS?).

Maybe they found some bloody idiot to write a comprehensive
test suite they could use?

Paul

Henrik Motakef

Jan 28, 2004, 5:12:36 PM
Paul Dietz <paul.f...@motorola.com> writes:

There is some bloody idiot writing a comprehensive test suite for
/programs/ that "purport to conform", rather than for implementations?
I'm impressed, that seems to be even more of a herculean task than a
test suite for implementations.

Paul Dietz

Jan 28, 2004, 5:36:14 PM
Henrik Motakef wrote:

> There is some bloody idiot writing a comprehensive test suite for
> /programs/ that "purport to conform", rather than for implementations?
> I'm impressed, that seems to be even more of an herculean task than a
> test suite for implementations.

Ah, I misinterpreted what you wrote. Never mind.

Paul

Erik Naggum

Jan 28, 2004, 6:11:33 PM
* Paul Dietz

| If the lists are long enough, it becomes preferable to use a linear
| time algorithm with eql hash tables.

Yes, this is true, but I somehow expect a programmer who works with
large sets to use a more efficient representation than lists. E.g.,
if the set has a finite number of candidate elements, a bit vector is
a suitable representation, where UNION is implementable with BIT-IOR,
INTERSECTION with BIT-AND, etc. I have no idea how long «long enough»
is, though. You wouldn't have any empirical data on the threshold?
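
A minimal sketch of that bit-vector representation, assuming a small
fixed universe of candidate elements (all names here are made up for
illustration):

(defparameter *universe* #(a b c d e))

(defun list->bits (list)
  ;; One bit per candidate element, 1 if the element is present in LIST.
  (map 'bit-vector
       (lambda (x) (if (member x list) 1 0))
       *universe*))

;; (bit-and (list->bits '(a a d)) (list->bits '(a b d)))  => #*10010
;; (bit-ior (list->bits '(a))     (list->bits '(b c)))    => #*11100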

Kaz Kylheku

Jan 28, 2004, 8:07:06 PM
> Both of these replies (thank you, BTW) seem to be saying that the
> result of performing an intersection on lists containing duplicate
> members is undefined, at least with respect to the number of
> duplicates that may show up in the result. That's a much stronger
> statement than the "may have duplicates" line in the spec.

No, it isn't. ``Undefined with respect to the number of duplicates''
means exactly the same thing as ``may have duplicates'', which means
exactly the same thing as ``The number of duplicates can be any
non-negative integer''.

> The spec made a big point about how "nintersection" was destructive to
> one of its inputs, and how that destruction was implementation
> dependent, and provided examples of different results of destruction,
> etc. With that much description of an implementation dependency that
> could have been covered with "the first input to nintersection is
> destroyed", I didn't think "may have duplicates" would imply "the
> number of members in the result is so undefined that even within the
> same implementation the length of X i'sect Y isn't guaranteed to be
> the same as the length of Y i'sect X.
>
> I did understand that intersection (and other set operations) had
> several possible definitions in the presence of duplicates, but I was
> thinking that the Lisp world had probably chosen the one that was most
> useful as a building block for higher abstractions long ago -- like
> defining multiplication of no arguments to equal 1 instead of making
> it implementation dependent.

There is a mathematical basis for this, because exponentiation to the
power of zero is 1.

(= (expt 42 3) (* 42 42 42))
(= (expt 42 2) (* 42 42))
(= (expt 42 1) (* 42))
(= (expt 42 0) (*))

Similar reasoning allows a zero-dimensional array A to have a single
element that is specified by no subscripts whatsoever. The size of an
array is the product of all the dimensions. The product of the
dimensions in this case, since there aren't any, is (*), so the size is
1. Thus there is a single cell in the array, and it is named by the
place (AREF A).
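
A quick illustration at the REPL (the variable name is arbitrary):

(defvar *a* (make-array nil :initial-element 42))
(array-dimensions *a*)   ; => NIL
(array-total-size *a*)   ; => 1
(aref *a*)               ; => 42
(setf (aref *a*) 7)
(aref *a*)               ; => 7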

> Of course, Lisp makes it easy for me to define it any way I like and
> implement it in about half a dozen lines of code, so it doesn't matter
> in that respect. I was thinking, though, that I was going to learn how
> I *ought* to define it, mistakenly thinking that all of the Lisp
> builtins, except where explicitly noted, had evolved to very rigidly
> defined algorithms years ago.

No algorithm for INTERSECTION that can be called ``evolved'' would
support the preservation of duplicates, regardless of the order of the
parameters. It would be inefficient compared to algorithms that don't
care about preserving duplicates.

Requiring implementations to do inefficient things to support
braindamaged inputs is rarely a good idea, unless there is evidence of
a significant body of existing code which relies on it.

The semantics of INTERSECTION are very rigid when you give it lists
that behave like sets, and therefore have no duplicates. Feed no
duplicates, get no duplicates.

Lastly, consider that it's *impossible* to require INTERSECTION to
support duplicates from both sets, commutatively, without
simultaneously requiring it to *generate* spurious duplicates when
neither list has duplicates.

If you compute

(intersection '("A") '("A") :test #'equal)

you must get the list

'("A")

So in other words, one of the two "A" objects, which are not
necessarily the same object, has to be rejected so that duplicates are
not generated in the output.

You could require INTERSECTION to always favor the left side when
choosing what to reject, but then it would no longer satisfy your
expectation of commutativity. You would end up with some ridiculous,
contorted definition like:

``Keep all the elements from the right list (all duplicates)
that appear one or more times in the left list. Additionally, for
all such elements, keep any duplicates* from the left list.''
----
* So that if there is just one occurrence of a matched item in the
left list, then it does not appear in the output. If the item
is duplicated twice, then just the second occurrence appears in
the resulting list, and so on.

Ugh.

Paul F. Dietz

Jan 28, 2004, 8:19:55 PM
Erik Naggum wrote:

> I have no idea how long «long enough»
> is, though. You wouldn't have any empirical data on the threshold?

Hmm, good question...

Trying it on CMUCL 18e+ (on an Athlon XP+, under linux, with safety 0,
debug 0, and speed 3), the crossover is around length 30 when
a list of unique symbols is intersected with its reverse. Intersecting
two disjoint sets of the same size pushes the breakeven point lower,
to around 20.

The function (only a binary version):

(defun hashed-intersection (set1 set2)
  (declare (optimize speed (safety 0) (debug 0)))
  (let ((table (make-hash-table :size (length set2)))
        (result nil))
    (dolist (e set2) (setf (gethash e table) t))
    (dolist (e set1) (when (gethash e table) (push e result)))
    result))

Paul

Thomas F. Burdick

Jan 28, 2004, 10:29:35 PM
"Paul F. Dietz" <di...@dls.net> writes:

> Erik Naggum wrote:
>
> > I have no idea how long «long enough»
> > is, though. You wouldn't have any empirical data on the threshold?
>
> Hmm, good question...
>
> Trying it on CMUCL 18e+ (on an Athlon XP+, under linux, with safety 0,
> debug 0, and speed 3), the crossover is around length 30 when
> a list of unique symbols is intersected with its reverse. Intersecting
> two disjoint sets of the same size pushes the breakeven point lower,
> to around 20.

Just checked this on my iBook with SBCL 0.8.4, from a different angle.
The last time I heavily used INTERSECTION in an application, I was
repeatedly getting the intersection of new input, and my old results
set. Something like:

(loop with result = (next-set)
      for set = (next-set)
      while (not (eq set +stop+))
      do (setf result (intersection result set))
      while result
      finally (return result))

where NEXT-SET is somewhat expensive to compute, and gives very long
lists. Hacking away quickly, I originally put the long list on the
left hand side in the call to INTERSECTION, because that's how it
worked in my brain. When looking at the code the next day I spotted
that, and swapped their positions, giving a ~2x speedup for the
function.

With that little anecdote in mind, I tested the case of one small set
(two items), and one long one (1000), where both elements of the small
set occur at the end of the long one. Repeating 10000 times, I got:

INTERSECTION-SIMPLE long short - ~15s
INTERSECTION-SIMPLE short long - ~0.81s
INTERSECTION-HASHED long short - ~3.2s
INTERSECTION-HASHED short long - ~6.8s

Which goes to show that this is a multidimensional space we're talking
about -- we all knew that, but the dimensions really matter. I also
suspect that in most cases where the efficiency of INTERSECTION
matters, the sets are probably fairly different in size, along the
lines of my example above. In that case, assuming you know which is
smaller, the non-hashed version is the better of the two, for fairly
large ranges. The break-even for simple-short-long v hashed-long-short,
with long being 1000 items, is ~15-150, depending on where in long the
intersection occurs (beginning or end).

--
/|_ .-----------------------.
,' .\ / | No to Imperialist war |
,--' _,' | Wage class war! |
/ / `-----------------------'
( -. |
| ) |
(`-. '--.)
`. )----'

Kenny Tilton

Jan 29, 2004, 2:54:56 AM
Tuang wrote:
> I did understand that intersection (and other set operations) had
> several possible definitions in the presence of duplicates, but I was
> thinking that the Lisp world had probably chosen the one that was most
> useful as a building block for higher abstractions ....

you have received a number of carefully reasoned, delicately phrased
responses. i would like to try a different tack:

Bullshit. "several possible definitions"?! Actually the phrase about
duplicates being possible means (duh) /no/zero/zilch/nada/rien/ result
is defined. Right? Think about it. In which case Lisp language lawyers
can evolve away until the Sun supernovas and no "rigidly defined
algorithm" will emerge. It would be as productive as evolving a rigid
definition for the sum of two anonymous functions.

A perfectly valid complaint is with the spec, which should have
explicitly stated "undefined", but I find it astonishing that the spec
even stopped to point out what happens if you give non-sets to set
operations (hey, maybe a runtime error would be appropriate? but at a
wicked performance cost?) so it is a tad hard working up the energy for
that complaint.

kenny

--

http://www.tilton-technology.com/
---------------------------------------------------------------
"[If anyone really has healing powers,] I would like to call
them about my knees."
-- Tenzin Gyatso, the Fourteenth Dalai Lama

Barry Margolin

Jan 29, 2004, 3:45:33 AM
In article <4018BCFA...@nyc.rr.com>,
Kenny Tilton <kti...@nyc.rr.com> wrote:

> A perfectly valid complaint is with the spec, which should have
> explicitly stated "undefined", but I find it astonishing that the spec
> even stopped to point out what happens if you give non-sets to set
> operations

We couldn't imagine any reasonable implementation of the set operations
that would have drastically weird results. The only variation we could
think of is whether the duplicates would reappear in the result or not.
So we didn't see the need to allow for anything else in the spec.

--
Barry Margolin, bar...@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***

Marco Antoniotti

Jan 29, 2004, 10:44:43 AM

Tuang wrote:

>
> I did understand that intersection (and other set operations) had
> several possible definitions in the presence of duplicates, but I was
> thinking that the Lisp world had probably chosen the one that was most
> useful as a building block for higher abstractions long ago -- like
> defining multiplication of no arguments to equal 1 instead of making
> it implementation dependent.
>
> Of course, Lisp makes it easy for me to define it any way I like and
> implement it in about half a dozen lines of code, so it doesn't matter
> in that respect. I was thinking, though, that I was going to learn how
> I *ought* to define it, mistakenly thinking that all of the Lisp
> builtins, except where explicitly noted, had evolved to very rigidly
> defined algorithms years ago.

Well. Nobody is perfect. Some like it hot. :)

I have no fear of being contradicted when I say that the Common Lisp spec
is not perfect and has holes. The set interpretation of lists is a
"convenience" that the spec has. But by no means is it to be intended
as a "good and completely mathematically sound" implementation of
set-theoretical notions. As said elsewhere, you have no guarantee that
INTERSECTION will be performed efficiently and you have no way to add an
element to a set represented as a list with the guarantee that
duplicates will be taken care of.

These are limitations of the spec. They are not the only ones. At least
(1) we have a spec, and (2) it is not the whole language that is limited :)

Cheers
--
Marco

Paul Dietz

Jan 29, 2004, 11:32:31 AM
"Thomas F. Burdick" wrote:

> The break-even for simple-short-long v hashed-long-short,
> with long being 1000 items, is ~15-150, depending on where in long the
> intersection occurs (beginning or end).

The sbcl intersection routine doesn't special-case :test 'eq
(where it could inline the EQ call), and the general case
doesn't check for whether the element being searched for really
needs EQL. If it did that, it would be three or four times
faster on the tests I ran (judging by the time of special-purpose
routines I wrote).

I've been told by an sbcl developer that he wants to put in
a general mechanism for these special-case algorithms before
they start adding them.

ACL has some special cases like this, but not (AFAICT) for
INTERSECTION.

One extension to the list set routines I'd like to see is
a :MARK keyword argument. It would take a function of one or two
arguments: a list element and an optional value. If called
with one argument it would return a mark value stored in the
object. If called with two, it would store the second value
as the mark value in the object. This would enable list operations
to be implemented in linear time without hash tables.
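
A hedged sketch of how that might look; the :MARK protocol and the name
MARKED-INTERSECTION are hypothetical, not anything an implementation
currently provides:

(defun marked-intersection (set1 set2 &key mark)
  ;; Pass 1: set the mark on every element of SET2 (linear).
  (dolist (e set2) (funcall mark e t))
  ;; Pass 2: collect marked elements of SET1, then clear the marks.
  (prog1 (loop for e in set1
               when (funcall mark e) collect e)
    (dolist (e set2) (funcall mark e nil))))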

Paul

Pascal Bourguignon

Jan 29, 2004, 12:03:09 PM
Henrik Motakef <usenet...@henrik-motakef.de> writes:

Actually, when implementing a Common Lisp, I'd like to give random
results, i.e. choose non-deterministically among the results
permitted by the standard.

If the standard says somewhere that such a function MAY be destructive,
randomly destroy the argument or not, etc.

That would be a great implementation for testing the conformity of programs.
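
A toy sketch of the idea for one such function (purely hypothetical, not
a recommended implementation strategy):

(defun maybe-destructive-nintersection (list-1 list-2 &rest keyword-args)
  ;; The standard only says NINTERSECTION *may* modify LIST-1, so flip
  ;; a coin: sometimes hand back a destructive result, sometimes not.
  (if (zerop (random 2))
      (apply #'nintersection list-1 list-2 keyword-args)
      (apply #'intersection list-1 list-2 keyword-args)))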

Thomas A. Russ

Jan 29, 2004, 1:37:16 PM
"Paul F. Dietz" <di...@dls.net> writes:

> The function (only a binary version):
>
> (defun hashed-intersection (set1 set2)
>   (declare (optimize speed (safety 0) (debug 0)))
>   (let ((table (make-hash-table :size (length set2)))
>         (result nil))
>     (dolist (e set2) (setf (gethash e table) t))
>     (dolist (e set1) (when (gethash e table) (push e result)))
>     result))
>
> Paul

Interestingly enough, by using this algorithm, a small change to the
second DOLIST form will give a version that is guaranteed to not have
duplicates as well:

(dolist (e set1)
  (when (gethash e table)
    (push e result)
    (setf (gethash e table) nil)))

-Tom.

--
Thomas A. Russ, USC/Information Sciences Institute

Kaz Kylheku

Jan 29, 2004, 4:58:51 PM
Marco Antoniotti <mar...@cs.nyu.edu> wrote in message news:<LT9Sb.535$Nq.1...@typhoon.nyu.edu>...

> theoretical notions. As said elsewhere, you have no guarantee that
> INTERSECTION will be performed efficiently and you have no way to add an
> element to a set represented to a list with the guarantee that
> duplicates will be taken care of.

You mean other than:

(adjoin 3 '(1 2 3)) -> (1 2 3)
(adjoin 4 '(1 2 3)) -> (4 1 2 3)

(defvar *list* '(1 2 3))
(pushnew 3 *list*) -> (1 2 3)
(pushnew 4 *list*) -> (4 1 2 3)

(union '(1 2 3) '(2 3 4)) -> (1 2 3 4)

:)

Marco Antoniotti

Jan 30, 2004, 12:28:51 PM

Kaz Kylheku wrote:
> Marco Antoniotti <mar...@cs.nyu.edu> wrote in message news:<LT9Sb.535$Nq.1...@typhoon.nyu.edu>...
>
>>theoretical notions. As said elsewhere, you have no guarantee that
>>INTERSECTION will be performed efficiently and you have no way to add an
>>element to a set represented to a list with the guarantee that
>>duplicates will be taken care of.
>
>
> You mean other than:
>
> (adjoin 3 '(1 2 3)) -> (1 2 3)
> (adjoin 4 '(1 2 3)) -> (4 1 2 3)

Oooops. Sorry. I stand corrected. I had forgotten ADJOIN. However,
it is defined on LISTs, not sets.

I guess my point is that (1) you *can* interpret LISTs as sets, but (2)
you cannot expect to use LIST operations on the lists you interpreted as
sets without being aware of the cross-over. This is ammunition for the
static-type aficionado :)

>
> (defvar *list* '(1 2 3))
> (pushnew 3 *list*) -> (1 2 3)
> (pushnew 4 *list*) -> (4 1 2 3)
>
> (union '(1 2 3) '(2 3 4)) -> (1 2 3 4)

On LWM

(union '(a a a 2 3) '(a s d 2))
==> (D S A A A 2 3)

Cheers
--
Marco
