How best to implement a hash table in Mathematica

515 views
Skip to first unread message

Joseph Gwinn

unread,
Feb 18, 2012, 6:28:51 AM2/18/12
to
I have an application that resembles dictionary lookup, with significant
rates of collision.

The current approach is to initialize the hash table, a 10^4 element
list of empty sublists, and then build the table by inserting document
object id numbers into the correct sublists as indexed by some important
property of the object. The choice of property varies with application.
The property may be quantized geographic location, or some attribute of
the object itself.

When done building the hash table, query consists of reading the sublist
using an index derived in the same way as for the original building of
the list. Each sublist may have many entries, and the entries need not
be numbers.

Anyway, I have a workable approach using

HashList[[index]]=
Union[Flatten[Join[HashList[[index]],{new object ID}]]]]

as the core update function, but I'm wondering how well this will scale
to say 10^6 sublists (which may be needed to keep the collisions under
control with large datasets), and of course if there is a better way in
Mathematica.

I did look at SparseArrays, but they don't support elements that are
sublists.

Although this hash table is one dimensional, I have uses for 2D and
maybe 3D tables, which are very common when one is hashing physical
location.

Thanks in advance,

Joe Gwinn

Szabolcs Horvát

unread,
Feb 19, 2012, 6:29:47 AM2/19/12
to
If you have an application similar to dictionary lookup, have you
considered using the DownValues of a symbol?

dict[_] = Null
dict["key1"] = "value1"
dict["key2"] = "value2"

This is a mutable data structure which might not play very well with
Mathematica's native paradigms, but it is often very useful.

You can also consider using a list of rules:

rules = {"key1" -> "value1", "key2" -> "value2"}

You can build a dispatch table using Dispatch[rules] to make
replacements very fast, but every time you add an element, it will be
necessary to re-build the Dispatch object from the rules, which might
take a while.

--
Szabolcs Horvát
Visit Mathematica.SE: http://mathematica.stackexchange.com/

David Bailey

unread,
Feb 19, 2012, 6:36:27 AM2/19/12
to
Before constructing a hash table (other than for pedagogical reasons), I
would try the built in mechanisms of Mathematica. For example, here is a
trivial loop which creates a 'hash table' called hashtable, with values
f[key]:

kk=1;
While[kk < 100000,
hashtable[kk]=f[kk];
kk++;
];

Of course, in this example the keys to the table are just the integers,
but the built-in mechanism will work with keys of any structure.

When a function is defined in this way as lots of individual values,
Mathematica uses techniques (probably hash) to access the values. Try it
and see how efficient it is!

David Bailey
http://www.dbaileyconsultancy.co.uk


Richard Fateman

unread,
Feb 19, 2012, 6:36:58 AM2/19/12
to
On 2/18/2012 3:28 AM, Joseph Gwinn wrote:
> I have an application that resembles dictionary lookup, with significant
> rates of collision.
>
>...

Mathematica presumably already uses hashing for storage of stuff like

f[{a,b,c}] = value1
f[{1,2,c}] = value2

etc.

Initializing a hashtable with 10^4 (identical?) empty keys makes no
sense. Did I misunderstand your message?



Joseph Gwinn

unread,
Feb 20, 2012, 2:48:47 AM2/20/12
to
In article <jhqmrr$fbq$1...@smc.vnet.net>,
The table has to be updated more or less at random as the dataset is
digested. What I'm doing is essentially your above method, but not all
elements will ever receive a value. In fact, most will not.

The reason to initialize the hashtable in advance is so a later probe to
an empty table element yields {}, versus repeating the question back.


> When a function is defined in this way as lots of individual values,
> Mathematica uses techniques (probably hash) to access the values. Try it
> and see how efficient it is!

Mathematica does uses hash tables internally, to good effect. There is
a discussion of this with respect to internal implementation of sparse
arrays, but I assume that hash tables are far more widely used inside
Mathematica.


For the record, here is my core hashtable algorithm, with transient
details omitted:

HashIndexModulus = 10000;

HashTable=Table[{},{HashIndexModulus}];

HashValuesList = Map[HashingFunction, dataset];

Do[
Do[HashTable[[i]] = Join[HashTable[[i]], {j}] // Union, {i,
HashValuesList[[j]]}
],
{j, Length[HashValuesList]}
];

The dataset in the above case is a list containing thousands of strings,
the strings being on average 100 characters long.

The code takes about 3 seconds on a 3.2 GHz intel iMac, processing about
1200 strings, and the computation time does not much vary with the size
of the hashtable. I would expect it to instead vary directly with the
total number of characters in the dataset.

Above 10^4 elements, creating the empty hashtable is linear in the
length of the hashtable, costing on average 53 to 54 nanoseconds per
element. Below 10^4, elements, one can see evidence of internal
gearshifting within the kernel. In any event, it takes 55 milliseconds
to init a 10^6 element hashtable, so this isn't a problem.

Joe Gwinn

Joseph Gwinn

unread,
Feb 20, 2012, 2:49:48 AM2/20/12
to
In article <jhqmsq$fbt$1...@smc.vnet.net>,
No, you understood correctly. Please see my response to David Bailey
for the rationale.

Joe Gwinn

Joseph Gwinn

unread,
Feb 20, 2012, 2:50:49 AM2/20/12
to
In article <jhqmfb$f7i$1...@smc.vnet.net>,
Each key value would need to support a perhaps empty perhaps large list
of values.


> You can build a dispatch table using Dispatch[rules] to make
> replacements very fast, but every time you add an element, it will be
> necessary to re-build the Dispatch object from the rules, which might
> take a while.

I did consider rules and Dispatch, but they seemed mismatched and thus
overly complex (and slow) compared to the direct approach with lists.

Please see my response to David Bailey for the details of my approach.

Joe Gwinn

Richard Fateman

unread,
Feb 21, 2012, 6:11:40 AM2/21/12
to
Again, I may misunderstand, but I think you can accomplish the same
thing by typing

f[x_]:= {}

if {} is the default empty value.

RJF



A Retey

unread,
Feb 21, 2012, 6:20:22 AM2/21/12
to
Hi,

> The table has to be updated more or less at random as the dataset is
> digested. What I'm doing is essentially your above method, but not all
> elements will ever receive a value. In fact, most will not.
>
> The reason to initialize the hashtable in advance is so a later probe to
> an empty table element yields {}, versus repeating the question back.

Honestly, despite your additional explanations I don't feel I fully
understand what you are trying to do, but the following assumes that it
is essential that you use your own hash function instead of what
Mathematica would internally use.

You should note that David's suggestion to use DownValues to store the
key/value pairs will allow for returning empty lists for the keys that
have not been met yet, even without preallocating entries or a priori
knowledge of the hash index modulus. You would simply "initialize" with
a blank pattern (see second line):

ClearAll[hashtable];
hashtable[_] = {};
Module[{index},
Scan[(
index = hashingfunction[#];
hashtable[index] = Union[Join[hashtable[index], {#}]]
) &,
dataset
];
]

It is difficult to make a statement about runtime performance compared
to your approach, since it will depend on the hash function you are
using. But I see some advantages beside runtime performance: the
approach will work without knowledge of the hash index modulus and it
will be more memory efficient if the modulus is very large and thus
there are many empty values (in general I think that a large modulus is
what you usually are after when looking for a good hash function).

You might want to try the above with your hash function, with

hashingfunction = Hash;
dataset = DictionaryLookup["*t*"];

the above will take 0.343 Seconds for about 41000 strings on my machine,
which is presumably much slower than your machine. Using StringLength as
hashfunction slows things down considerable, most probably because of
the need to reallocate memory for the growing value arrays. I think that
this is also dominating the runtime of your approach, and because
everything is in one large array even more so. For my example with
hashfunction=StringLength building the hash table using DownValues seems
to be a lot faster...

hth,

albert

Szabolcs Horvát

unread,
Feb 21, 2012, 6:21:54 AM2/21/12
to
On 2/20/2012 9:48 AM, Joseph Gwinn wrote:
>
> The reason to initialize the hashtable in advance is so a later probe to
> an empty table element yields {}, versus repeating the question back.
>
>

I probably don't understand your question completely, but why can't you
just make definition like

hashtable[_] = {}

? This will ensure that {} is returned for "undefined" keys.

Also, please avoid using symbols that start with a capital letter.
HashTable is a built-in symbol and if you define it (as you did in the
example code you sent), it is likely to break stuff.

Leonid Shifrin

unread,
Feb 22, 2012, 5:34:48 AM2/22/12
to
Hi Joe,

I suspect that most of the time is spent it adding to sub-lists, rather
than hashing proper. I suggest to use a DownValues-based hash-table with
linked lists, as follows.

Here is a linked list "API":


ClearAll[linkedList, toLinkedList, fromLinkedList, addToList, pop,
emptyList];
SetAttributes[linkedList, HoldAllComplete];
toLinkedList[data_List] := Fold[linkedList, linkedList[], data];
fromLinkedList[ll_linkedList] :=
List @@ Flatten[ll, Infinity, linkedList];
addToList[ll_, value_] := linkedList[ll, value];
pop[ll_] := Last@ll;
emptyList[] := linkedList[];

Here is a simplistic hash table constructor (I exploit the fact that you
don't have to delete
key-value pairs):


ClearAll[makeHash];
makeHash[] :=
Module[{hash},
hash[_] := emptyList[];
hash /: add[hash, key_, value_] :=
hash[key] = addToList[hash[key], value];
hash /: get[hash, key_] := fromLinkedList@hash[key];
Return[hash]];


Let's create a new hash table:

In[10]:= hash = makeHash[]
Out[10]= hash$777

Now, I will populate it with the words of built-in English dictionary, the
keys being first two letters of a word:

values = DictionaryLookup["*"];
keys = StringTake[#, Min[2, StringLength[#]]] & /@ values;

It takes only half a second to fill the hash table, on my machine:

In[14]:= (add[hash,##]&@@@Transpose[{keys,values}]);//Timing
Out[14]= {0.515,Null}

for 92518 words.

Here is how you make a lookup:

In[19]:= get[hash,"pr"]//Short
Out[19]//Short=
{practicabilities,practicability,practicable,practicably,<<1696>>,prussic,pry,prying}

You can also use the hash table from System`Utilities`. The following is an
analogous constructor based on that:

ClearAll[makeHashAlt];
makeHashAlt[] :=
Module[{hash, innerHash},
innerHash = System`Utilities`HashTable[];
hash /: add[hash, key_, value_] :=
System`Utilities`HashTableAdd[innerHash, key,
addToList[
If[System`Utilities`HashTableContainsQ[innerHash, key],
With[{res = System`Utilities`HashTableGet[innerHash, key]},
System`Utilities`HashTableRemove[innerHash, key];
res],
(* else *)
emptyList[]
],
value]];
hash /:
get[hash, key_] /; !
System`Utilities`HashTableContainsQ[innerHash, key] := {};
hash /: get[hash, key_] :=
fromLinkedList@System`Utilities`HashTableGet[innerHash, key];
Return[hash]];

Create a new hash:

In[48]:= newhash = makeHashAlt[]
Out[48]= hash$1814

The performance will be slightly worse:

In[49]:= (add[newhash,##]&@@@Transpose[{keys,values}]);//Timing
Out[49]= {0.671,Null}

You can use it in the same way as the first version.

In[50]:= get[newhash,"pr"]//Short
Out[50]//Short=
{practicabilities,practicability,practicable,practicably,<<1696>>,prussic,pry,prying}


The whole point here is that this should scale really well, since the
insertion is always constant-time.
For hashes based on plain lists, this will be worse as the lists for a
given key grow larger (you will have
an average insertion time be proportional roughly to the average length of
the accumulated sublist).

Hope this helps. Consider cross-posting to
http://mathematica.stackexchange.com, to have some
more eyes on your problems of interest.


Cheers,
Leonid





On Mon, Feb 20, 2012 at 10:46 AM, Joseph Gwinn <joeg...@comcast.net> wrote:

> In article <jhqmrr$fbq$1...@smc.vnet.net>,
> David Bailey <da...@removedbailey.co.uk> wrote:
>
> > On 18/02/2012 11:28, Joseph Gwinn wrote:
> > Before constructing a hash table (other than for pedagogical reasons), I
> > would try the built in mechanisms of Mathematica. For example, here is a
> > trivial loop which creates a 'hash table' called hashtable, with values
> > f[key]:
> >
> > kk=1;
> > While[kk < 100000,
> > hashtable[kk]=f[kk];
> > kk++;
> > ];
> >
> > Of course, in this example the keys to the table are just the integers,
> > but the built-in mechanism will work with keys of any structure.
>
> The table has to be updated more or less at random as the dataset is
> digested. What I'm doing is essentially your above method, but not all
> elements will ever receive a value. In fact, most will not.
>
> The reason to initialize the hashtable in advance is so a later probe to
> an empty table element yields {}, versus repeating the question back.
>
>

Joseph Gwinn

unread,
Feb 22, 2012, 5:35:50 AM2/22/12
to
In article <jhvuoi$891$1...@smc.vnet.net>,
Szabolcs Horvát <szho...@gmail.com> wrote:

> On 2/20/2012 9:48 AM, Joseph Gwinn wrote:
> >
> > The reason to initialize the hashtable in advance is so a later probe to
> > an empty table element yields {}, versus repeating the question back.
> >
> >
>
> I probably don't understand your question completely, but why can't you
> just make definition like
>
> hashtable[_] = {}
>
> ? This will ensure that {} is returned for "undefined" keys.

I'll have to try this and related approaches, but it's not the entire
hashtable that's empty, it's this or that element of said hashtable.


> Also, please avoid using symbols that start with a capital letter.
> HashTable is a built-in symbol and if you define it (as you did in the
> example code you sent), it is likely to break stuff.

Well, I use initial (and embedded) caps all the time, for readable
symbols, and Mathematica does sometimes balk, so I make the symbol a word
longer.

As for HashTable, Mathematica complained, so I changed it to BigHashTable, and
peace was restored.

Joe Gwinn

Joseph Gwinn

unread,
Feb 22, 2012, 5:36:20 AM2/22/12
to
In article <jhvulm$88e$1...@smc.vnet.net>, A Retey <aw...@gmx-topmail.de>
wrote:

> Hi,
>
> > The table has to be updated more or less at random as the dataset is
> > digested. What I'm doing is essentially your above method, but not all
> > elements will ever receive a value. In fact, most will not.
> >
> > The reason to initialize the hashtable in advance is so a later probe to
> > an empty table element yields {}, versus repeating the question back.
>
> Honestly, despite your additional explanations I don't feel I fully
> understand what you are trying to do, but the following assumes that it
> is essential that you use your own hash function instead of what
> Mathematica would internally use.

Two reasons. First, the nature of the hash function should not affect
how the hash table is built and subsequently probed. Second, hash
functions have mathematical properties, and not all hash functions are
suited to any given hash function. For instance, for handling many
kinds of data, like natural language text, order matters, so the
permutations of "abc" must all be distinct. Well, three reasons: The
hash functions like md5 are far too heavy for the current purposes.
I have an adequate solution for my present purposes, but wanted to know
if there were better ways. I've gotten some good suggestions. I plan
to try them all and publish the results on this newsgroup.

Joe Gwinn

Joseph Gwinn

unread,
Feb 22, 2012, 5:36:51 AM2/22/12
to
In article <jhvu5c$80d$1...@smc.vnet.net>,
Richard Fateman <fat...@cs.berkeley.edu> wrote:

> On 2/19/2012 11:49 PM, Joseph Gwinn wrote:
> > In article<jhqmsq$fbt$1...@smc.vnet.net>,
> > Richard Fateman<fat...@cs.berkeley.edu> wrote:
>
> >> Initializing a hashtable with 10^4 (identical?) empty keys makes no
> >> sense. Did I misunderstand your message?
> >
> > No, you understood correctly. Please see my response to David Bailey
> > for the rationale.
> >
> > Joe Gwinn
> >
> Again, I may misunderstand, but I think you can accomplish the same
> thing by typing
>
> f[x_]:= {}
>
> if {} is the default empty value.

This creates the hashtable, and gives it a name. In a hashtable, it is
not the table that's empty, it's this or that specific cell that's
empty, in an unpredictable and changing pattern. So one would be
attempting to update and access f[x], where x is a random integer. So,
I just tried this, and it works just fine, including handling variable
lists.

So, this and related approaches may be workable. I will have to
implement them in a real application, and check performance and
scalability.

Joe Gwinn

Nasser M. Abbasi

unread,
Feb 23, 2012, 5:51:22 AM2/23/12
to
On 2/22/2012 4:34 AM, Leonid Shifrin wrote:

>
> You can also use the hash table from System`Utilities`.

How can one obtain documentation on these functions?
Very Hard to use something, and even program for an API that
is not documented.

I tried

?? System`Utilities`HashTable

and it all returned is

Attributes[System`Utilities`HashTable]={Protected}

googling did not help. Any other place to obtain documentation?
And why are these not in the help system for Mathematica? I am
using V 8.04

thanks,
--Nasser

Leonid Shifrin

unread,
Feb 24, 2012, 1:00:01 AM2/24/12
to
Nasser,

Sorry, can't add more here. I myself learned about these functions very
recently from this post by Oleksandr Rasputinov:

http://mathematica.stackexchange.com/questions/990/struct-equivalent-in-mathematica/1029#1029

And that's pretty much all publicly available information on this, I
believe. These functions probably serve internal purposes, and so are undocumented. This is a normal practice, anyone who is willing to dig out and use undocumented functionality should accept that it is, well, undocumented.

Cheers,
Leonid
Reply all
Reply to author
Forward
0 new messages