How to efficiently compare related persistent collections (maps, sets)?

661 views
Skip to first unread message

Dragan Djuric

unread,
Dec 16, 2009, 8:53:31 AM12/16/09
to Clojure
Hi,

Here's the example of what I meant in the topic title:

Let's say we have a set s1 that have 3 elements: #{obj1, obj2, obj3}
I add a few elements to it and get s2 #{obj1, obj2, obj3, obj4, obj5}
It is important to notice that, because s2 has been created by
"modifying" s1, it reuses its structure, so these sets are related by
the implementation with the persistent structures.

Now, I can use difference function to get the difference (obj4, obj5),
but it seems that this function needs to traverse all elements, which
can be heavy if the collection stores thousands of elements.

I have a hunch that, If the collections are somehow related by the
fact that one collection is used in building the other one, that can
be used as a hint in retrieving the difference. Is something like this
possible in the current implementation and how to do it? I don't mind
accessing persistent collections-related implementation-specific code.

Sean Devlin

unread,
Dec 16, 2009, 9:45:11 AM12/16/09
to Clojure
Set equality requires the complete traversal of the set, and will
always be O(n). However, there are shortcuts for determining if they
are not equal. You could use the following test

(if (first (remove set1 set2))
:not-equal
(if (first (remove set2 set1))
:not-equal
:equal))

remove is lazy, so this will return quickly if it is not equal. I
would recommend making set1 smaller than set2

Hope this helps,
Sean

Meikel Brandmeyer

unread,
Dec 16, 2009, 9:58:27 AM12/16/09
to Clojure
Hi,

On Dec 16, 3:45 pm, Sean Devlin <francoisdev...@gmail.com> wrote:

> Set equality requires the complete traversal of the set, and will
> always be O(n).

I think, what Dragan was refering to is the shared structure in a set.
Let's say you have to two sets A' and A'' which evolved from set A by
adding elements. So the internal structure is partially shared. So
when comparing A' with A'' and you hit in both sets the same shared
structure you can short-circuit, knowing that they are equal and you
have to only compare the rest.

Whether this is possible, feasible, etc... Dunno. No clue here.

Sincerely
Meikel

Stuart Sierra

unread,
Dec 16, 2009, 10:24:27 AM12/16/09
to Clojure
In general, straight equality is efficient for Clojure data
structures. For example, the equals() implementation for sets checks
type, size, and hash code before examining the set elements.
Determining that two sets are equal is still O(n), but determining
that they are NOT equal is usually O(1).

As for accessing the shared structure, that's definitely not trivial,
and would require digging into the Java sources.

-SS


On Dec 16, 8:53 am, Dragan Djuric <draga...@gmail.com> wrote:

Sean Devlin

unread,
Dec 16, 2009, 10:28:15 AM12/16/09
to Clojure
Oh, I get it. Thanks Meikel.

I imagine this is possible if you drill into the guts of
PersistentHaspMap, but I would strongly discourage the behavior in
user code. Perhaps as an upgrade to the object itself? There is a 1%
chance that this could be a language upgrade, assuming it works across
the board. I would tread cautiously.

Sean

Richard Newman

unread,
Dec 16, 2009, 2:38:28 PM12/16/09
to clo...@googlegroups.com
> I imagine this is possible if you drill into the guts of
> PersistentHaspMap, but I would strongly discourage the behavior in
> user code. Perhaps as an upgrade to the object itself? There is a 1%
> chance that this could be a language upgrade, assuming it works across
> the board. I would tread cautiously.

I could easily imagine an implementation of equality for two tree sets
checking for intermediate node reference equality as a shortcut... but
as you say, not in user code.

Right now, APersistentSet.equals(Object o) only casts to Set, and does
equality checking by iterating over the elements of the input set.

I'd certainly benchmark this before doing the work, though -- I'd
guess that most equal sets will hash-compare as equal, most non-equal
sets will fail early on in a per-element comparison, and it's probably
a minority of compared sets that will share structure. It's unlikely
that adding an additional object identity comparison is worth the
machine instructions.

My hunch (from examining the performance of Clojure's sets in the
past) is that it's not worth doing the work...

Dragan Djuric

unread,
Dec 16, 2009, 2:59:03 PM12/16/09
to Clojure
Yes, the true/false for equality is not a problem.

I am looking for a shortcut that finds different elements more
efficiently. So, the sets are different, but I want to get hold of
elements that are in s2 but not in s1.

ajay gopalakrishnan

unread,
Dec 16, 2009, 12:51:24 PM12/16/09
to clo...@googlegroups.com
Your argument is right and it is a good idea to take advantage of the shared structure to calculate differences. However, it is important to remember that is is just a special case and I don't expect that whenever you want to calculate a difference between two sets, you always compare between the older and newer versions only. In general, I would guess that one would need to compare 2 sets that were never shared in the first place and were created separately. In that case, there is not sharing and this idea wont be applicable. So, if the language must support set-difference then it needs to have two types of implementations anyway, which seems to indicate that it is better not to add such a feature to the language.

Perhaps somebody can write another library for set operations (because it is very frequent these days) that implements this paper (although, I believe this will be easier to implement it at the Java level & not Clojure level)
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.36.9963

Ajay

--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

ajay gopalakrishnan

unread,
Dec 16, 2009, 3:04:02 PM12/16/09
to clo...@googlegroups.com
If the sets data structure is also not shared, then the paper I mentioned (link provided earlier) is one of the fastest to date. And it is very small & easy to implement.

Rich Hickey

unread,
Dec 28, 2009, 2:19:32 PM12/28/09
to clo...@googlegroups.com
On Wed, Dec 16, 2009 at 3:04 PM, ajay gopalakrishnan <ajgo...@gmail.com> wrote:
> If the sets data structure is also not shared, then the paper I mentioned
> (link provided earlier) is one of the fastest to date. And it is very small
> & easy to implement.
>

I think an implementation of persistent sets based upon treaps would
make a useful contribution:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.54.5678

Rich

Reply all
Reply to author
Forward
0 new messages