ANN: an off-heap hash-table for go

696 views
Skip to first unread message

Jason E. Aten

unread,
Jan 4, 2015, 9:26:28 PM1/4/15
to golan...@googlegroups.com
Given the persistent issues with big hash tables requiring long GC pause times that have lead to discussion[1][2][3] without a standard solution, I've implemented[4] an entirely off-heap hash-table in Go. It uses a simple Malloc()/Free() implementation based on mmap (gommap) to obtain memory directly from the OS.  It could just as easily use malloc from the C runtime in you want use CGO, but the current implementation avoids that in favor of an entirely Go-based solution for ease of "go get  github.com/glycerine/go-offheap-hashtable".

This off-heap hash table is based on a simple open-addressing with linear-probing and the speedy xxhash64 hash function[5]. The implementation was inspired by Jeff Preshing's blog article where he shows that such a strategy can best even the highly cache-tuned Judy array[6].

See here:


-Jason

Nick Craig-Wood

unread,
Jan 5, 2015, 3:51:32 AM1/5/15
to golan...@googlegroups.com, j.e....@gmail.com
On 05/01/15 02:26, Jason E. Aten wrote:
> Given the persistent issues with big hash tables requiring long GC pause
> times that have lead to discussion[1][2][3] without a standard solution,
> I've implemented[4] an entirely off-heap hash-table in Go. It uses a
> simple Malloc()/Free() implementation based on mmap (gommap) to obtain
> memory directly from the OS. It could just as easily use malloc from
> the C runtime in you want use CGO, but the current implementation avoids
> that in favor of an entirely Go-based solution for ease of "go get
> github.com/glycerine/go-offheap-hashtable".
>
> This off-heap hash table is based on a simple open-addressing with
> linear-probing and the speedy xxhash64 hash function[5]. The
> implementation was inspired by Jeff Preshing's blog article where he
> shows that such a strategy can best even the highly cache-tuned Judy
> array[6].
>
> See here:
>
> https://github.com/glycerine/go-offheap-hashtable

Interesting idea!

I had a look at the godoc for the package

http://godoc.org/github.com/glycerine/go-offheap-hashtable

It wasn't immediately obvious how to use the hash table - it desperately
needs an example.

There seem to be too many implementation details which are public.

IMHO all the public types, functions and methods should have a doc string.

--
Nick Craig-Wood <ni...@craig-wood.com> -- http://www.craig-wood.com/nick

Jason E. Aten

unread,
Jan 5, 2015, 4:08:32 AM1/5/15
to Nick Craig-Wood, golang-nuts
I'll see about making the godoc look better, thanks for pointing that out.

There are usage examples in most of the test files. bytekey_test.go, insert_test.go, string_test.go, and rand_test.go demonstrate using various different types of keys, such as string, []byte, and int.


Nate Finch

unread,
Jan 5, 2015, 7:26:49 AM1/5/15
to golan...@googlegroups.com
Totally agree about not making implementation details public. For example, IsDir is not a function that a package that is about HashTables should expose. Only expose what you absolutely must for the package to be useful.

Egon

unread,
Jan 5, 2015, 8:10:08 AM1/5/15
to golan...@googlegroups.com
As everyone else has commented, make the public API significantly smaller.

I would probably start with something like: http://play.golang.org/p/-8FueAXbo1
Remove everything else (except those loading/saving to disk with mmap).

Any other opinions how to improve the API / what to keep and leave out from the API.

I also considered making separate table for each key/value type. Although it would make things more confusing.

+ Egon

Egon

unread,
Jan 5, 2015, 2:30:12 PM1/5/15
to golan...@googlegroups.com, egon...@gmail.com
On Monday, 5 January 2015 20:44:14 UTC+2, Jason E. Aten wrote:
Thanks for the suggestions guys. I've simplified the exposed methods and added docs to them.

You also might want to make the public import as "github.com/glycerine/offheap".

Also, there might be a better name for it - using the negation form of "not on the heap" is probably not that informative.

Is it necessary to have it as "offheap.HashTable" and not "offheap.Table"? I'm just thinking whether the "Hash" part adds any clarity to the code that uses it? To me it really doesn't add any useful information.

Also try to get rid of as much stutter as possible, e.g.
func (t *HashTable) DestroyHashTable()
-->
func (t *HashTable) Destroy()

Diagnostic stuff probably shouldn't be part of public API
func (*HashTable) Dump

That probably could be cleaned up or made clearer what it is... e.g. name it:
TableFile or something like that.

Similarly Malloc -> NewFileBackedTable() or NewFileTable()

For these don't add the suffixes, because the type already contains "ByteKey" it should be obvious that Insert/Delete/Lookup take "[]byte" as argument. (Similarly for the StringHashTable).

Rename that to something simple like "Key".
Rename that to something simple like "Value".

I would say put the Iterator constructor on the hashtable and not as a separate constructor.

If you intend some of the parameters be modified by the package user, make them constants e.g.
const (
  KeySize = 64
  CellSize = 128
)
and place them in a separate file - that way you can point to it in the documentation.


--
You received this message because you are subscribed to a topic in the Google Groups "golang-nuts" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/golang-nuts/o7ABs3i8lHM/unsubscribe.
To unsubscribe from this group and all its topics, send an email to golang-nuts...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--

Jason E. Aten

unread,
Jan 5, 2015, 4:11:42 PM1/5/15
to Egon, golang-nuts
On Mon, Jan 5, 2015 at 11:30 AM, Egon <egon...@gmail.com> wrote: 
You also might want to make the public import as "github.com/glycerine/offheap".

Thank you, Egon. I like this shorter form. I've renamed it just "offheap" and put it here:

 
 I'll take a look at your other suggestions too. Some of them are just stylistic, and I tend to prefer clarity over brevity, but others are worth doing. May take me a bit to get to it though.

Best,
-Jason

Egon

unread,
Jan 5, 2015, 4:45:31 PM1/5/15
to golan...@googlegroups.com, egon...@gmail.com


On Monday, 5 January 2015 23:11:42 UTC+2, Jason E. Aten wrote:
On Mon, Jan 5, 2015 at 11:30 AM, Egon <egon...@gmail.com> wrote: 
You also might want to make the public import as "github.com/glycerine/offheap".

Thank you, Egon. I like this shorter form. I've renamed it just "offheap" and put it here:

 
 I'll take a look at your other suggestions too. Some of them are just stylistic, and I tend to prefer clarity over brevity,

Take usability as the primary goal - not clarity nor brevity. Clarity and brevity are often the side-effects of usability.

e.g.
"i" is very concise, but still very usable
for i := 0; i < 10; i += 1 {

There is probably no value in making it longer:
for iteratorindex := 0; iteratorindex < 10; iteratorindex += 1 {

This is clearer in its meaning - but it is not more usable.

In a similar situation some global
var F map[string]string
Is concise, but it's not really usable.

My general rules for names is:
1. does this name part provide any useful distinction from other things?
2. can it be still clearly understood without it?

If answers to 1 and 2 are "yes" I will remove that part - because it probably adds nothing useful to the name.

+ Egon
Reply all
Reply to author
Forward
Message has been deleted
0 new messages