Hi Joe,
I suspect that most of the time is spent it adding to sub-lists, rather
than hashing proper. I suggest to use a DownValues-based hash-table with
linked lists, as follows.
Here is a linked list "API":
ClearAll[linkedList, toLinkedList, fromLinkedList, addToList, pop,
emptyList];
SetAttributes[linkedList, HoldAllComplete];
toLinkedList[data_List] := Fold[linkedList, linkedList[], data];
fromLinkedList[ll_linkedList] :=
List @@ Flatten[ll, Infinity, linkedList];
addToList[ll_, value_] := linkedList[ll, value];
pop[ll_] := Last@ll;
emptyList[] := linkedList[];
Here is a simplistic hash table constructor (I exploit the fact that you
don't have to delete
key-value pairs):
ClearAll[makeHash];
makeHash[] :=
Module[{hash},
hash[_] := emptyList[];
hash /: add[hash, key_, value_] :=
hash[key] = addToList[hash[key], value];
hash /: get[hash, key_] := fromLinkedList@hash[key];
Return[hash]];
Let's create a new hash table:
In[10]:= hash = makeHash[]
Out[10]= hash$777
Now, I will populate it with the words of built-in English dictionary, the
keys being first two letters of a word:
values = DictionaryLookup["*"];
keys = StringTake[#, Min[2, StringLength[#]]] & /@ values;
It takes only half a second to fill the hash table, on my machine:
In[14]:= (add[hash,##]&@@@Transpose[{keys,values}]);//Timing
Out[14]= {0.515,Null}
for 92518 words.
Here is how you make a lookup:
In[19]:= get[hash,"pr"]//Short
Out[19]//Short=
{practicabilities,practicability,practicable,practicably,<<1696>>,prussic,pry,prying}
You can also use the hash table from System`Utilities`. The following is an
analogous constructor based on that:
ClearAll[makeHashAlt];
makeHashAlt[] :=
Module[{hash, innerHash},
innerHash = System`Utilities`HashTable[];
hash /: add[hash, key_, value_] :=
System`Utilities`HashTableAdd[innerHash, key,
addToList[
If[System`Utilities`HashTableContainsQ[innerHash, key],
With[{res = System`Utilities`HashTableGet[innerHash, key]},
System`Utilities`HashTableRemove[innerHash, key];
res],
(* else *)
emptyList[]
],
value]];
hash /:
get[hash, key_] /; !
System`Utilities`HashTableContainsQ[innerHash, key] := {};
hash /: get[hash, key_] :=
fromLinkedList@System`Utilities`HashTableGet[innerHash, key];
Return[hash]];
Create a new hash:
In[48]:= newhash = makeHashAlt[]
Out[48]= hash$1814
The performance will be slightly worse:
In[49]:= (add[newhash,##]&@@@Transpose[{keys,values}]);//Timing
Out[49]= {0.671,Null}
You can use it in the same way as the first version.
In[50]:= get[newhash,"pr"]//Short
Out[50]//Short=
{practicabilities,practicability,practicable,practicably,<<1696>>,prussic,pry,prying}
The whole point here is that this should scale really well, since the
insertion is always constant-time.
For hashes based on plain lists, this will be worse as the lists for a
given key grow larger (you will have
an average insertion time be proportional roughly to the average length of
the accumulated sublist).
Hope this helps. Consider cross-posting to
http://mathematica.stackexchange.com, to have some
more eyes on your problems of interest.
Cheers,
Leonid
> > Before constructing a hash table (other than for pedagogical reasons), I
> > would try the built in mechanisms of Mathematica. For example, here is a
> > trivial loop which creates a 'hash table' called hashtable, with values
> > f[key]:
> >
> > kk=1;
> > While[kk < 100000,
> > hashtable[kk]=f[kk];
> > kk++;
> > ];
> >
> > Of course, in this example the keys to the table are just the integers,
> > but the built-in mechanism will work with keys of any structure.
>
> The table has to be updated more or less at random as the dataset is
> digested. What I'm doing is essentially your above method, but not all
> elements will ever receive a value. In fact, most will not.
>
> The reason to initialize the hashtable in advance is so a later probe to
> an empty table element yields {}, versus repeating the question back.
>
>