Data Structure for String Interning?

518 views
Skip to first unread message

jlfo...@berkeley.edu

unread,
Jan 9, 2022, 5:52:24 PM1/9/22
to golang-nuts
I'm aware of Artem Krylysov's idea for string interning published on

If I understand it correctly, each string is stored twice in a map, once as
a key and once as a value. That means that words that only appear once in
his example of string interning the words in the novel 1984 are 100% inefficient
in terms of storage. Words that appear twice are neutral. The advantage of his
approach only appears for words that appear three or more times.

I'm wondering if there's a map-like data structure that would store a string
as the key, and the address of the key as the value. I'm aware that a standard
Go map can't be used for this because its components might be moved around
while a program is running so taking the address of the key would be dangerous,
which is why it isn't allowed.

This came up because I profiled a run of an app I'm working on that
processed 12518 strings, of which 9810 appeared 2 or fewer times. Krylysov's
approach would only be marginally useful for this app.

The data structure I'm looking for would always be at least neutral, and would start
showing a benefit when a word appears twice.

Does anybody know of such a data structure?

Cordially,
Jon Forrest

burak serdar

unread,
Jan 9, 2022, 6:07:18 PM1/9/22
to jlfo...@berkeley.edu, golang-nuts
Note that a with a map[string]string, the code:

m[s]=s

The contents of the string s are not duplicated, only the string header s is. 

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/golang-nuts/36684584-11eb-4bc0-a436-4906d523c8ban%40googlegroups.com.

jlfo...@berkeley.edu

unread,
Jan 9, 2022, 6:30:12 PM1/9/22
to golang-nuts
On Sunday, January 9, 2022 at 3:07:18 PM UTC-8 bse...@computer.org wrote:
Note that a with a map[string]string, the code:

m[s]=s

The contents of the string s are not duplicated, only the string header s is. 

I didn't know this. That's very good to know.

What about this:

package main

type word struct {
  s string
  refcnt int
}

var w1 = word {"moby", 1}

func main() {
        m := make(map[string]word)
        m["moby"] = w1
}

This is a little closer to what I'd like to do.
How many times would the string "moby" be stored?

Thanks,
Jon

Axel Wagner

unread,
Jan 9, 2022, 6:34:48 PM1/9/22
to golang-nuts
You might be interested in this package:
It is probably your best choice for a long-term maintained implementation of this concept. Matt Layher explained the rationale and design here:
and Brad Fitzpatrick wrote a blog post about how Tailscale uses it here:

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.

burak serdar

unread,
Jan 9, 2022, 6:42:58 PM1/9/22
to jlfo...@berkeley.edu, golang-nuts
If you write it as follows, the string "moby" will have only one copy:

m[w1.s]=w1

This will only duplicate the string header `s`, not the contents of 's'.
 

Thanks,
Jon

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.

Bakul Shah

unread,
Jan 9, 2022, 7:35:46 PM1/9/22
to burak serdar, jlfo...@berkeley.edu, golang-nuts
The string header will be 16 bytes on 64bit word size machines.
If most of the words are much shorter, interning won’t buy you
much. For applications where you *know* all the words are short,
and the total string space won’t exceed 4GB, you can try other
alternatives. For instance if the max length of any word is 16 bytes,
create 16 string tables. Then the ‘interned string’ type is a 4 byte
index, with the bottom 4 bits indicating the length as well as which
table to index! A table for N byte words stores one copy of each
unique word and in consecutive slots on N bytes. Given that on modern
machines even though memory can be plentiful, memory access is slow
& computing is fast, this can win. You will also need a possibly unsafe
String func. that doesn’t allocate string data space. You can even allow
an overflow table if longer words are not common. And you can allow
many more unique words by extending the interned string type to 8
bytes.

On Jan 9, 2022, at 3:07 PM, burak serdar <bse...@computer.org> wrote:



burak serdar

unread,
Jan 9, 2022, 9:16:14 PM1/9/22
to Bakul Shah, jlfo...@berkeley.edu, golang-nuts
I think the question is "why do you want to intern strings". The solution can be different for different use cases.

I work with large Json files where keys are repeated, and the keys are long strings (URLs) so using a map[string]string to intern the keys makes a lot of sense. For serialization, the same idea is used to compress large data using string tables: a []string holds all unique strings, and references to those strings are represented using uint32 indexes. For the example where you intern words, you might even consider a single []byte containing all the words concatenated, with a custom string representation containing uint32 offset and length. This may or may not be faster than the map implementation, but it will occupy less space, but with a single large object in memory.

Bakul Shah

unread,
Jan 9, 2022, 10:50:33 PM1/9/22
to burak serdar, jlfo...@berkeley.edu, golang-nuts

On Jan 9, 2022, at 6:15 PM, burak serdar <bse...@computer.org> wrote:

I think the question is "why do you want to intern strings". The solution can be different for different use cases.

Indeed. This is a given! Such tricks should only be tried if the benefits
outweigh the cost, as per profiling expected use cases for your program.

[Aside:
Though it is unfortunate that most (all?) programming languages
do not make it easy to try out alternative data structures. As an
example, try replacing processing an array of structs with a struct
of arrays, with each struct member stored in a different array!
Row oriented vs Column oriented processing. 
]

Jesper Louis Andersen

unread,
Jan 26, 2022, 8:36:14 PM1/26/22
to jlfo...@berkeley.edu, golang-nuts
On Sun, Jan 9, 2022 at 11:52 PM jlfo...@berkeley.edu <jlfo...@berkeley.edu> wrote:

I'm wondering if there's a map-like data structure that would store a string
as the key, and the address of the key as the value. I'm aware that a standard
Go map can't be used for this because its components might be moved around
while a program is running so taking the address of the key would be dangerous,
which is why it isn't allowed.


Somewhat late answer:

If you want to squeeze the CPU/Memory trade-off even more, and you have strings where many of them are prefixes of each other, then you can look into Trie-structures, where each internal node has some compression scheme depending on its density.

Even more squeezing can be had by using Level-Compressed Tries, Radix/Crit-bit/Patricia trees, and so on. Also, depending on what you want to do with the data, generalized suffix trees, and Aho-Corasick comes to mind.

Tries are map-like in the following way: they are bootstrapped by means of an underlying map. Each internal node is a map-like structure, generalizing from "rank-0" maps to "rank-1" maps.

Of course, depending on your actual data, these data structures might end up taking more memory than simpler interning schemes. But one can often compress them further in memory if space is paramount.

--
J.
Reply all
Reply to author
Forward
0 new messages