Efficiency of sets vs vectors vs dicts

270 views
Skip to the first unread message

catc...@bromberger.com

unread,
21 Feb 2015, 11:43:25 am21/2/15
to juli...@googlegroups.com
Hi all,

I ran into some performance degradation in one of my packages yesterday moving from vectors to sets and after some experimentation and research I think I've figured it out. I'd appreciate an extra set of eyes on this; it might also be a good candidate for the docs.

(Possibly flawed) assertions:
A) vectors: O(1) insertion, O(n) search (though this can be optimized to O(logN) in some cases?), O(1) delete (by index), O(n) traversal/iteration
B) sets: O(logN) insertion, O(1) search, O(1) delete (by value), O(?) traversal/iteration [q1]
C) dicts: O(1) insertion, O(n) search [q2], O(n) delete (by value), O(1) delete by key, O(?) traversal/iteration [q3]
D) Sets are just dicts where the key is some (automatic) hash of the value.

[q1] - This is where my original performance issues cropped up, in iterating over a set. We spent time in dict.jl, in skip_deleted().
[q2] - not sure whether this is correct.
[q3] - iteration over a dict seems expensive relative to a vector. This is related to [q1] as well because of assertion D).


Thank you for any guidance/feedback.

Stefan Karpinski

unread,
21 Feb 2015, 12:43:41 pm21/2/15
to juli...@googlegroups.com
On Sat, Feb 21, 2015 at 11:43 AM, <catc...@bromberger.com> wrote:
A) vectors: O(1) insertion, O(n) search (though this can be optimized to O(logN) in some cases?), O(1) delete (by index), O(n) traversal/iteration

Insertion and deletion are more like O(log n). Search is O(log n) if the array is sorted.

C) dicts: O(1) insertion, O(n) search [q2], O(n) delete (by value), O(1) delete by key, O(?) traversal/iteration [q3]

Insertion is O(log n) since it may require rehashing the entire data structure; ditto for delete by key. Searching for a key is O(1); searching for a value is O(n). Deleting a value is also O(n). Traversal/iteration is O(n).

B) sets: O(logN) insertion, O(1) search, O(1) delete (by value), O(?) traversal/iteration [q1]

D) Sets are just dicts where the key is some (automatic) hash of the value.

Sets are actually just dicts where the value type is Void – this is how they are implemented. So insertion and deletion are what they would be for dict keys: i.e. O(log n). Search is O(1), traversal is O(n).

[q1] - This is where my original performance issues cropped up, in iterating over a set. We spent time in dict.jl, in skip_deleted().

Technically, this is O(n) but as you've noticed, skipping over deleted elements is slow. Jeff was recently trying out Dict implementation that preserves insertion order and incidentally makes traversal 10x faster: https://github.com/JuliaLang/julia/pull/10116.
 
[q2] - not sure whether this is correct.

Key "search" is O(1) for dicts. This is pretty much what Dicts exist for – fast lookup.
 
[q3] - iteration over a dict seems expensive relative to a vector. This is related to [q1] as well because of assertion D).

It is – there may be up to 4x as many empty Dict slots as actual elements, and there's some cost to checking for deleted values as well, so 8-10x slower is entirely plausible, which seems borne out by Jeff's ordered Dict implementation that has 10x faster iteration.

It sounds like you might want to use a sorted vector instead of a dict-based Set.

catc...@bromberger.com

unread,
22 Feb 2015, 1:36:49 am22/2/15
to juli...@googlegroups.com
Stefan,

Thank you. This was exactly what I was looking for.

Seth.

Kevin Squire

unread,
22 Feb 2015, 3:06:55 am22/2/15
to juli...@googlegroups.com
On Sat, Feb 21, 2015 at 9:42 AM, Stefan Karpinski <ste...@karpinski.org> wrote:
On Sat, Feb 21, 2015 at 11:43 AM, <catc...@bromberger.com> wrote:
A) vectors: O(1) insertion, O(n) search (though this can be optimized to O(logN) in some cases?), O(1) delete (by index), O(n) traversal/iteration

Insertion and deletion are more like O(log n). Search is O(log n) if the array is sorted.

Aren't insertion and deletion (at an arbitrary index) O(n)?  Expected data movement should be n/4 (from the front or end of the array).

Viral Shah

unread,
22 Feb 2015, 3:33:46 am22/2/15
to juli...@googlegroups.com
Operations on sorted vectors come up so frequently, that I wonder if we should have a SortedVector type.

-viral

Stefan Karpinski

unread,
22 Feb 2015, 11:59:50 am22/2/15
to juli...@googlegroups.com
Ah, yes, insert and delete at an random point is O(n) – push and pop are O(log n).

Simon Byrne

unread,
22 Feb 2015, 12:27:48 pm22/2/15
to juli...@googlegroups.com
All this information would be great to have the docs.

Stefan Karpinski

unread,
22 Feb 2015, 12:38:36 pm22/2/15
to juli...@googlegroups.com
Jeff and I were talking about this too since there are a lot of much faster algorithms that apply when you know the arguments are sorted.
Reply all
Reply to author
Forward
0 new messages