Go Performance for Sets (in this case map[string]bool)

2,286 views
Skip to first unread message

Mirko Dziadzka

unread,
May 5, 2014, 2:33:03 PM5/5/14
to golan...@googlegroups.com
Hi

I'm new to Go and coming from Python. Im starting to port some of my programs to Go. 

I'm trying to implement a Set of strings as 

type set map[strings]bool

and adding a value to the set  as

res := make(set)
res[value] = true

Unfortunately, my Go implementation[1] is significant slower[3] than the corresponding python implementation[2]. 
I'm probably doing something wrong here, is there a better way to to implement sets of strings in Go?

Any pointers are welcome

Thanks

    Mirko

You can see the full code of the Go function and the corresponding Python function on 

Hotei

unread,
May 5, 2014, 2:47:29 PM5/5/14
to golan...@googlegroups.com
Are you comparing apples and oranges?  

A go map is the same as a python dictionary.  I've not used python actively for a few years but it seems like a set would be more like a slice in go ie []string.  A map and a slice can both be used to do reverse lookups ( "is x in set" ) but a map does it a lot faster than "for _,v in range set { if x==v{ etc  Depends a lot on how you want to use it.

Caleb Spare

unread,
May 5, 2014, 2:50:33 PM5/5/14
to Mirko Dziadzka, golang-nuts
If you use a map[string]struct{} instead, you won't need to use any
space for the values and you won't have the extra unwanted semantics
of a bool.

If you're inserting a large, known number of items (or even if you can
estimate the number), you can presize the map which may speed things
up: make(map[string]struct{}, 12345)

But really, you should provide a reproducible test that doesn't rely
on a large file that we don't have. It would be better if the tests
generated the data themselves; then we could run the same code on our
machines, reproduce the results, and provide better suggestions.
> --
> You received this message because you are subscribed to the Google Groups
> "golang-nuts" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to golang-nuts...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Matthew Kane

unread,
May 5, 2014, 2:58:04 PM5/5/14
to Hotei, golang-nuts
No, python's sets are akin to map[string]struct{}:
http://stackoverflow.com/questions/3949310/how-is-set-implemented

Since you're reading the entire file in one shot, you can give the
number of lines in the file as a hint to make() for the final size of
the map. I get a significant speedup by doing this.
http://play.golang.org/p/_uyRsN0KwY
> --
> You received this message because you are subscribed to the Google Groups
> "golang-nuts" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to golang-nuts...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.



--
matt kane
twitter: the_real_mkb / nynexrepublic
http://hydrogenproject.com

Mirko Dziadzka

unread,
May 5, 2014, 3:15:36 PM5/5/14
to golan...@googlegroups.com, Hotei
Ah ... The size hint to make() really helped to bring this to the same speed as the python implementation.

Thanks for your help

    Mirko

Rui Ueyama

unread,
May 5, 2014, 3:21:44 PM5/5/14
to Mirko Dziadzka, golang-nuts, Hotei
Dave Cheney happened to do some research on map[string]bool and map[string]struct{} very recently. You may enjoy it. https://gist.github.com/davecheney/3be245c92b61e5045f75

Shawn Milochik

unread,
May 5, 2014, 3:59:12 PM5/5/14
to golan...@googlegroups.com
I've been searching and can't seem to find the reference, but a recent conversation on this list pointed the OP to use a map of empty interfaces. I don't recall anyone advising they should use a map of empty structs.

However, this thread suggests empty structs, and I saw it mentioned in a book also. The argument for using an empty interface seems to apply equally to an empty struct -- it's explicit that you don't care about the value (as opposed to just using a boolean).

Is there a difference? Is there a "best practice" for this? Does it make a difference in any way as far as memory consumption?

Kyle Lemons

unread,
May 5, 2014, 4:01:37 PM5/5/14
to Sh...@milochik.com, golang-nuts
Theoretically an interface{} still occupies space, a struct{}{} doesn't.  With the interface you can set it with "m[k] = nil" and with the struct case you have to "m[k] = struct{}{}".  I don't see a reason to use an interface{} over bool, but I see the argument in favor of the struct{}{} though I don't find it compelling. 

Dan Kortschak

unread,
May 5, 2014, 4:08:48 PM5/5/14
to <Shawn@Milochik.com>, golan...@googlegroups.com
Empty struct are zero size and cannot store things, empty interface are two uintptr in size and can store (possibly indirected) anything. I you really want a set use struct{} as the values.

You can make a generic set which is map[interface{}]struct{}, but they are so simple to define, why not jus make a type-specified set for each type.

Hotei

unread,
May 5, 2014, 4:41:18 PM5/5/14
to golan...@googlegroups.com, Hotei
No - actually they're not akin to maps - at least according to python.org's definition of a set at https://docs.python.org/2/library/sets.html  Set's implement a very restricted subset - to quote python.org:

Like other collections, sets support x in setlen(set), and for x in set. Being an unordered collection, sets do not record element position or order of insertion. Accordingly, sets do not support indexing, slicing, or other sequence-like behavior.

Caleb Spare

unread,
May 5, 2014, 4:41:39 PM5/5/14
to Shawn Milochik, golang-nuts
On Mon, May 5, 2014 at 12:59 PM, Shawn Milochik <shawn...@gmail.com> wrote:
> I've been searching and can't seem to find the reference, but a recent
> conversation on this list pointed the OP to use a map of empty interfaces. I
> don't recall anyone advising they should use a map of empty structs.
>
> However, this thread suggests empty structs, and I saw it mentioned in a
> book also. The argument for using an empty interface seems to apply equally
> to an empty struct -- it's explicit that you don't care about the value (as
> opposed to just using a boolean).

This is just not true. Empty interfaces *have* a value of value of
indeterminate concrete type. Only struct{} makes it clear you don't
care about the value.

>
> Is there a difference? Is there a "best practice" for this? Does it make a
> difference in any way as far as memory consumption?
>

Shawn Milochik

unread,
May 5, 2014, 4:47:14 PM5/5/14
to golan...@googlegroups.com
I believe that when a Python programmer comes to a Go list asking how to make something like a Python set, the primary concern is not duplicating any values. The OP may want to clarify if I'm wrong. Slices are much further from a set than a map in this regard. 

Set methods are a different story. But once you have the uniqueness you can deal with those separately. In my experience, the primary use of sets in Python is avoiding duplicates without having to check "if x in y" every iteration of a loop.

Mirko Dziadzka

unread,
May 5, 2014, 5:17:02 PM5/5/14
to golan...@googlegroups.com, Sh...@milochik.com

Shawn Milochik:
I believe that when a Python programmer comes to a Go list asking how to make something like a Python set, the primary concern is not duplicating any values. The OP may want to clarify if I'm wrong. Slices are much further from a set than a map in this regard. 

Ack

My intention is to have a 'set' data structure with (in this case) 2 methods

 * add an element to a set (with the special case for bulk insert at the beginning and occasional inserts later)
 * fast(!) check if an element is in the set

In my case, I will have about 10'000'000 elements, where every element is a short string (well a short byte array) - the result of an SHA1 hash of other date. As Go does not have a set type, I started with

type set map[string]bool
mySet := make(set)
// add a value v
mySet[v] = true
// check if v is in mySet
if mySet[v] { ...


I have updated https://gist.github.com/MirkoDziadzka/a941b46e0b66035f1129 with the improvements I got  from this discussion, mainly setting a size hint for the make(set) to speed up the bulk insert.

If I could solve this problem better with other standard Go data structures, I would love to hear about this. 


Greetings

    Mirko


Hotei

unread,
May 5, 2014, 5:19:39 PM5/5/14
to golan...@googlegroups.com, Sh...@milochik.com
I'm not arguing that sets are not implemented in python using a map-like capability (called a dictionary in python-speak) to prevent duplicates in the set. However, if they are that's a implementation detail, not part of the Set specification.

If the set is small enough a linear search can prevent duplicates.  Moderately sized sets can be de-duped with a binary search if the set is kept sorted.  The implementation details don't REQUIRE maps. That's my only point. 

Shawn, for the record I am not saying slice == set either so I agree with you on that.  

Rui Ueyama

unread,
May 5, 2014, 6:42:21 PM5/5/14
to Mirko Dziadzka, golang-nuts, Sh...@milochik.com
On Mon, May 5, 2014 at 2:17 PM, Mirko Dziadzka <mirko.d...@gmail.com> wrote:

Shawn Milochik:
I believe that when a Python programmer comes to a Go list asking how to make something like a Python set, the primary concern is not duplicating any values. The OP may want to clarify if I'm wrong. Slices are much further from a set than a map in this regard. 

Ack

My intention is to have a 'set' data structure with (in this case) 2 methods

 * add an element to a set (with the special case for bulk insert at the beginning and occasional inserts later)
 * fast(!) check if an element is in the set

In my case, I will have about 10'000'000 elements, where every element is a short string (well a short byte array) - the result of an SHA1 hash of other date. As Go does not have a set type, I started with

It's a little bit off topic, but if you are going to load so many items, you may not want to load the entire file into memory at once. Instead I'd use bytes.Bufio to read one line at a time.
 

type set map[string]bool
mySet := make(set)
// add a value v
mySet[v] = true
// check if v is in mySet
if mySet[v] { ...


I have updated https://gist.github.com/MirkoDziadzka/a941b46e0b66035f1129 with the improvements I got  from this discussion, mainly setting a size hint for the make(set) to speed up the bulk insert.

If I could solve this problem better with other standard Go data structures, I would love to hear about this. 


Greetings

    Mirko


egon

unread,
May 6, 2014, 1:09:58 AM5/6/14
to golan...@googlegroups.com
Here's how I would implement it http://play.golang.org/p/fc6dQaucvp
I didn't compare the performance.

+ egon

Michael Jones

unread,
May 6, 2014, 4:13:33 AM5/6/14
to egon, golang-nuts
By the way, egon's code has this (nice!) performance characteristic:

ORIGINAL:

mtj-macbookpro:scan mtj$ go build scan.go
mtj-macbookpro:scan mtj$ time ./scan < /usr/share/dict/words
read 235886 entries in 104.328141ms

real 0m0.112s
user 0m0.097s
sys 0m0.012s

MODIFIED (preallocate map at 256*1024 entries):

mtj-macbookpro:scan mtj$ vi scan.go 
mtj-macbookpro:scan mtj$ go build scan.go
mtj-macbookpro:scan mtj$ time ./scan < /usr/share/dict/words
read 235886 entries in 57.845774ms

real 0m0.064s
user 0m0.053s
sys 0m0.009s


--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Michael T. Jones | Chief Technology Advocate  | m...@google.com |  +1 650-335-5765

Michael Jones

unread,
May 6, 2014, 4:24:38 AM5/6/14
to egon, golang-nuts
Oh, and Matthew Kane's version that reads the whole file, counts the lines, and then builds a map of the right size times in the middle of the prior two. It can't be as fast since it had to count what I set the other modified code to know by deus ex machina, but it is faster then the constant regrowth approach.

mtj-macbookpro:scan mtj$ go build read.go
mtj-macbookpro:scan mtj$ ./read
read 235886 entries in 82.824087ms

Michael Jones

unread,
May 6, 2014, 4:27:51 AM5/6/14
to egon, golang-nuts
Here's the original version from Mirko for comparison:

mtj-macbookpro:scan mtj$ go build cache.go 
mtj-macbookpro:scan mtj$ time ./cache
read 235886 entries in 115.656609ms

real 0m0.123s
user 0m0.106s
sys 0m0.014s

keith....@gmail.com

unread,
May 6, 2014, 12:05:29 PM5/6/14
to golan...@googlegroups.com, egon
If you're going to be looking up SHA1 hashes, don't use a map[string]struct{}, use a map[[sha1.Size]byte]struct{}.  It will be a lot more efficient because you're not allocating a string for each key.

Michael Jones

unread,
May 6, 2014, 4:15:32 PM5/6/14
to keith....@gmail.com, golang-nuts, egon
HASH:

mtj-macbookpro:scan mtj$ time ./hash < /usr/share/dict/words
read 235886 entries in 124.23074ms

real 0m0.131s
user 0m0.120
sys 0m0.022s

Keith Randall

unread,
May 6, 2014, 4:35:46 PM5/6/14
to golan...@googlegroups.com, keith....@gmail.com, egon
Keep in mind more than half of that time is the SHA1 itself.

Kevin Gillette

unread,
May 6, 2014, 10:16:50 PM5/6/14
to golan...@googlegroups.com, Hotei
Certainly you can do things to tweak map performance in Go, and although the Go map implementation is rather advanced and well engineered, note that a lot more effort has been put toward optimizing dicts (and by extension sets) in Python. This is, in part, due to the two extra decades Python has had to work on the problem, but more importantly it is because dicts are integral to virtually every aspect of CPython. Additionally, the refcounting mechanism in Python is more forgiving of wasteful code, and there are several optimizations for small dicts/sets. In contrast, Go maps are comparatively rarely used, since structs cover a majority of the common use cases.

In Go, small sets are probably more efficiently stored and processed as sorted slices; for membership testing, I suspect (though currently have no supporting evidence) that a few thousand elements is the threshold in which sorted slices become less efficient, and maps become more efficient; sorted slices will still be rather optimal for set logic operations, such as intersection and union.
Reply all
Reply to author
Forward
0 new messages