I was curious to know whether a hash table is a good choice of data structure for an object considered as a plain set of name:value symbols (slots, attributes), or for its symbol table if kept separate. This is the choice made by Python and Lua; I have no idea about other languages.
The point is that a "symbol table" is very different from a generic collection of key:value pairs. The keys are all strings, rather short, and morphologically similar (words from natural languages). Also, the number of items is systematically small, compared to the extreme variety of collection sizes; 99% of such structures are hand-written by programmers anyway, types included.
So I ran a kind of test, in Python, with 5 types of data structures, all implemented by hand:
* SymbolTable: a plain sequence of (name,value) pairs
* DoubleTable: two parallel sequences for names & values
* HashTable: uses the FNV algorithm for string hashing (I tried Python's builtin hash, which performed worse)
* CharTrie: a basic char-per-char trie
* RadixTrie: see http://en.wikipedia.org/wiki/Radix_trie
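For concreteness, the two sequential tables can be sketched like this (a minimal illustration, not the actual test code; method names are mine):

```python
# Minimal sketches of the two sequential tables (illustrative only):
# lookup is a linear scan of the names in both cases.

class SymbolTable:
    """A plain sequence of (name, value) pairs."""
    def __init__(self):
        self.pairs = []                 # flexible array of 2-tuples

    def set(self, name, value):
        for i, (n, _) in enumerate(self.pairs):
            if n == name:
                self.pairs[i] = (name, value)
                return
        self.pairs.append((name, value))

    def get(self, name):
        for n, v in self.pairs:         # traversal: linear in table size
            if n == name:
                return v
        raise KeyError(name)


class DoubleTable:
    """Two parallel sequences for names & values."""
    def __init__(self):
        self.names = []
        self.values = []

    def set(self, name, value):
        try:
            i = self.names.index(name)  # traversal of the name sequence only
            self.values[i] = value
        except ValueError:
            self.names.append(name)
            self.values.append(value)

    def get(self, name):
        i = self.names.index(name)      # raises ValueError if absent
        return self.values[i]
```

In SymbolTable the name comparison goes through the enclosing tuple (double indexing), while DoubleTable scans a flat name sequence and only indexes the value sequence once.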
I thought simple tables, for which access requires traversal (meaning linear time), could outperform the sophisticated algorithms anyway, due to the small number of items. This holds true up to somewhat over 100 symbols. I was surprised that SymbolTable is the fastest, since names are nested in pairs, so that lookup should take longer than for DoubleTable. I have no explanation. DoubleTable might perform better if I had used a linked list for the name sequence (I used Python lists, which are flexible arrays), since it is only traversed (but I would keep a flexible array for the values, which are only indexed).
I also thought tries could run faster than hash tables, because of the string nature of the keys. This turns out to be wrong, except for small numbers of items, which are indeed deadly for hash tables; but the differences are not huge.
Unsurprisingly, hash table performance falls off as name size grows. It could also run faster with linked lists as clusters (I used Python lists here, too). I fixed a load factor of 1 (meaning the number of clusters equals the number of items): I don't know whether that is a good choice.
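The hashing scheme described above can be sketched as follows (assumed details: the 32-bit FNV-1a variant, Python lists as clusters; not the actual test code):

```python
# Sketch of a hash table with FNV string hashing and list clusters.
# With a load factor of 1, the cluster count equals the item count.

FNV_OFFSET = 2166136261   # 32-bit FNV offset basis
FNV_PRIME = 16777619      # 32-bit FNV prime

def fnv1a(name):
    """32-bit FNV-1a hash of an ASCII string."""
    h = FNV_OFFSET
    for char in name:
        h ^= ord(char)
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h

class HashTable:
    def __init__(self, n_clusters):
        self.clusters = [[] for _ in range(n_clusters)]

    def set(self, name, value):
        cluster = self.clusters[fnv1a(name) % len(self.clusters)]
        for i, (n, _) in enumerate(cluster):
            if n == name:
                cluster[i] = (name, value)
                return
        cluster.append((name, value))

    def get(self, name):
        # hash once, then scan the (hopefully tiny) cluster
        for n, v in self.clusters[fnv1a(name) % len(self.clusters)]:
            if n == name:
                return v
        raise KeyError(name)
```

Note that the whole name is always consumed by the hash function, which is consistent with performance degrading as name size grows.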
Tries are disadvantaged because the other types use builtin Python lists. RadixTrie is especially disadvantaged because its performance relies on common name radices (prefixes), which are very common in real use (because of natural language, but also semantics: "foo_size", "foo_count", "foo_names"), while the test uses random data. A way to compensate would be to restrict the charset, but I did not try it since I have no idea how much to compensate: I would instead need a real name corpus from real code. Even so, RadixTrie systematically runs faster than the simple trie. Tries have O(s) access time, where s is the size of the key: a measure of tree depth.
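A char-per-char trie lookup can be sketched like this (illustrative only; for simplicity the sketch uses a dict per node, whereas the test code presumably uses another child structure):

```python
# Minimal char-per-char trie: each node maps one character to a child,
# so lookup walks one node per character -- O(s) in key size s.

class CharTrie:
    def __init__(self):
        self.children = {}      # char -> CharTrie
        self.value = None
        self.has_value = False  # distinguishes "no entry" from value None

    def set(self, name, value):
        node = self
        for char in name:
            node = node.children.setdefault(char, CharTrie())
        node.value = value
        node.has_value = True

    def get(self, name):
        node = self
        for char in name:       # one step down per character
            if char not in node.children:
                raise KeyError(name)
            node = node.children[char]
        if not node.has_value:
            raise KeyError(name)
        return node.value
```

A radix trie collapses chains of single-child nodes into one node labelled with the whole common prefix, which is why it depends on real names sharing radices.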
The test data were randomly generated, the same set for each type of course ;-) Names were generated from ASCII letters only (there may be duplicates); values are all 1. After some reflection, I decided to test only value reading, since it's by far the most common operation, I guess, and it requires performing the lower-level "access" operation (name finding), which is needed for the other operations as well.
I guessed that a key size & a table size of 9 each might be a typical case: this is the first test. Then I varied key size & table size independently: 3 --> 9 --> 33. I added a test for bigger table sizes to look for the "inversion threshold" where the sophisticated data structures start to perform better than simple tables. I keep the code, if ever needed (I can post it on request). Below are the results, all normalised to 100 for SymbolTable.
Denis
=== typical access time for table size 9 & name size 9
SymbolTable:100 DoubleTable:122 HashTable:439 CharTrie:584 RadixTrie:229
=== access time for various table sizes & name size 9
table size: 3
SymbolTable:100 DoubleTable:119 HashTable:573 CharTrie:730 RadixTrie:223
table size: 9
SymbolTable:100 DoubleTable:115 HashTable:442 CharTrie:585 RadixTrie:242
table size: 33
SymbolTable:100 DoubleTable:110 HashTable:216 CharTrie:302 RadixTrie:187
*** sum / 3 ***
SymbolTable:100 DoubleTable:114 HashTable:410 CharTrie:539 RadixTrie:217
=== access time for various name sizes & table size 9
name size: 3
SymbolTable:100 DoubleTable:128 HashTable:297 CharTrie:273 RadixTrie:259
name size: 9
SymbolTable:100 DoubleTable:133 HashTable:450 CharTrie:605 RadixTrie:239
name size: 33
SymbolTable:100 DoubleTable:104 HashTable:979 CharTrie:1743 RadixTrie:215
*** sum / 3 ***
SymbolTable:100 DoubleTable:121 HashTable:575 CharTrie:873 RadixTrie:237
=== access time for big table sizes & name size 9
table size: 99
SymbolTable:100 DoubleTable:121 HashTable:110 CharTrie:145 RadixTrie:109
table size: 333
SymbolTable:100 DoubleTable:124 HashTable:038 CharTrie:054 RadixTrie:044
table size: 999
SymbolTable:100 DoubleTable:126 HashTable:014 CharTrie:022 RadixTrie:018
*** sum / 3 ***
SymbolTable:100 DoubleTable:123 HashTable:054 CharTrie:073 RadixTrie:057
________________________________
vit esse estrany ☣
These measures show clearly that in this case, the cache hit ratio matters
more than the asymptotic behavior. I think another factor is that more
complex structures have greater overhead (mainly heap allocation).
One conclusion is that one should keep both the names and the
namespace small. That might look easy at first glance, but importing
symbols may bring lots of symbols into scope, in particular if the
module system has weaknesses that encourage kitchen-sink-style
modules.
> One conclusion is that one should keep both the names and the
> namespace small. That might look easy at first glance, but importing
> symbols may bring lots of symbols into scope, in particular if the
> module system has weaknesses that encourage kitchen-sink-style
> modules.
Interesting remark. I was thinking of data structures for standard objects representing elements of the model. But indeed, those can be used for namespaces; and in my case namespaces (there are only two, the "world" and the current local one) are just standard objects.
A local scope is hardly ever "over-busy" (that would rather indicate some kind of bad programming, imo). But a simple table could be a really bad choice for the world. Again, we would need some statistics. Still, this is nothing compared to collections that can hold millions of items.
Let us imagine we can confirm the results showing that simple tables are much more efficient in the most common cases. I would then accept, for the exceptional case of heavy worlds, a ratio of about 1/3 the performance of a hash table or radix trie. The stats (*) seem to show that this means > 300 symbols in there. How often does that happen? (It is certainly not my programming style; also not what my lang would favor; but who knows?)
Your note about imports is really relevant. I'm not that worried for my project because by default what is imported is a single object/namespace. But there should be an option to "explode" it into the importing namespace. This is nearly necessary for modules acting as libraries of funcs or types (think of parsing). But the target is often another module/namespace (e.g. the parser), limited to that use, not the real global one.
Denis
(*) To be confirmed, certainly, by tests using data structures written in another, lower-level language. I'm a bit worried that 3 of them use builtin (C-speed) Python lists in the background. I don't know the negative impact on the performance of the tries.
> Hello,
>
> I was curious to know whether a hash table is a good choice of data structure for an object considered as a plain set of name:value symbols (slots, attributes), or for its symbol table if kept separate. This is the choice made by Python and Lua; I have no idea about other languages.
> [snip]
Well, this was actually helpful. I ran into this question a few weeks back and opted for a patricia trie for my symbol table, largely because of the flat topology of my language. There will potentially be many hundreds or thousands of actors, all defining very few elements (internal state, such as responses to messages), typically under 100. So I opted for the patricia trie for storing information about the actors themselves, but hadn't made up my mind for sure on what would be best for storing the internal state. I had strong leanings towards a hash table, just because of the simplicity. I also chose the trie so I could reuse the base functionality to construct Dictionaries within the language, meant to represent large amounts of key->value data.
Now, however, I'm not so sure about using the hash table for internal state. Thanks for this.
Regards,
Jeremy Tregunna
First, keep in mind that these results (reproduced below) may be biased because the data structures are implemented in a high-level language, namely Python -- even if written by hand. Even more so since the sequential tables use C-speed builtin Python lists in the background, as do the hash tables for their clusters (buckets, bins); so the tries may be disadvantaged.
I'm surprised RadixTrie performs so well compared to the hash table, except for high numbers of items, and even more so compared to the single-char trie. Using real corpus data instead of the random data I used, it should perform much better still, since common prefixes would be much more frequent. First enigma: I don't understand why the difference between CharTrie & RadixTrie disappears for higher numbers of items. On the contrary, with a constant charset & key size, more items should produce more common prefixes... There must be a flaw in my code; I'm going to investigate further. [Note: except for key lookup, both trie types share a common implementation (Trie supertype + double inheritance).]
But the real enigma for me was the comparison of SymbolTable & DoubleTable, the former systematically performing better. SymbolTable linearly stores pairs of name:value symbols (2-tuples), so that name lookup is in a way indirect (double indexing). DoubleTable, by contrast, has separate sequences for names & values; name lookup is thus direct.
I replayed the test suite this morning and now DoubleTable systematically performs better. I must have changed something (I have touched the code since then) but cannot find the relevant difference. Anyway, these new results are correct in the sense that the implementations of both types are strictly parallel, except for the additional tuple indexing in the case of SymbolTable.
=== typical access time for table size 9 & name size 9
SymbolTable:100 DoubleTable:089 HashTable:342 CharTrie:417 RadixTrie:178
=== access time for various table sizes & name size 9
table size: 3
SymbolTable:100 DoubleTable:093 HashTable:439 CharTrie:539 RadixTrie:160
table size: 9
SymbolTable:100 DoubleTable:084 HashTable:332 CharTrie:401 RadixTrie:185
table size: 33
SymbolTable:100 DoubleTable:080 HashTable:172 CharTrie:215 RadixTrie:145
*** sum / 3 ***
SymbolTable:100 DoubleTable:085 HashTable:314 CharTrie:385 RadixTrie:163
=== access time for various name sizes & table size 9
name size: 3
SymbolTable:100 DoubleTable:087 HashTable:229 CharTrie:191 RadixTrie:172
name size: 9
SymbolTable:100 DoubleTable:084 HashTable:333 CharTrie:399 RadixTrie:181
name size: 33
SymbolTable:100 DoubleTable:085 HashTable:761 CharTrie:1350 RadixTrie:163
*** sum / 3 ***
SymbolTable:100 DoubleTable:085 HashTable:441 CharTrie:646 RadixTrie:172
=== access time for big table sizes & name size 9
table size: 99
SymbolTable:100 DoubleTable:079 HashTable:075 CharTrie:093 RadixTrie:072
table size: 333
SymbolTable:100 DoubleTable:099 HashTable:038 CharTrie:049 RadixTrie:048
table size: 999
SymbolTable:100 DoubleTable:083 HashTable:010 CharTrie:012 RadixTrie:014
*** sum / 3 ***
SymbolTable:100 DoubleTable:087 HashTable:041 CharTrie:051 RadixTrie:044
> I'm surprised RadixTrie performs so well compared to the hash table, except for high numbers of items, and even more so compared to the single-char trie. Using real corpus data instead of the random data I used, it should perform much better still, since common prefixes would be much more frequent. First enigma: I don't understand why the difference between CharTrie & RadixTrie disappears for higher numbers of items. On the contrary, with a constant charset & key size, more items should produce more common prefixes... There must be a flaw in my code; I'm going to investigate further. [Note: except for key lookup, both trie types share a common implementation (Trie supertype + double inheritance).]
I removed a source of inefficiency in RadixTrie (a node now stores its radix's size). There is still a mystery about single-char tries, namely that they perform better and better, compared to the other data structures, as the table size grows. I added a test for very big table sizes, with only the hash table and the tries (the simple tables take an eternity). The single-char trie outperforms HashTable somewhere below 10000 items & RadixTrie somewhere above 10000 items. I cannot understand why.
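The optimisation just mentioned, and the radix lookup itself, can be sketched like this (illustrative only; the real node layout and child structure may differ):

```python
# Radix-trie lookup where each node stores its radix together with the
# radix's size, so the matching slice of the key is taken without
# recomputing len() at every step.

class RadixNode:
    def __init__(self, radix=''):
        self.radix = radix
        self.size = len(radix)   # stored once at node creation
        self.children = {}       # first char of child radix -> child node
        self.value = None
        self.has_value = False

def radix_get(root, key):
    node, pos = root, 0
    while pos < len(key):
        child = node.children.get(key[pos])
        # compare the whole radix in one slice, using the stored size
        if child is None or key[pos:pos + child.size] != child.radix:
            raise KeyError(key)
        pos += child.size
        node = child
    if not node.has_value:
        raise KeyError(key)
    return node.value
```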
Below are the new results (only for varying table size).
=== access time for various table sizes & name size 9
table size: 3
SymbolTable:100 DoubleTable:062 HashTable:305 CharTrie:372 RadixTrie:119
table size: 9
SymbolTable:100 DoubleTable:083 HashTable:301 CharTrie:387 RadixTrie:130
table size: 33
SymbolTable:100 DoubleTable:085 HashTable:164 CharTrie:217 RadixTrie:102
*** sum / 3 ***
SymbolTable:100 DoubleTable:076 HashTable:256 CharTrie:325 RadixTrie:117
=== access time for big table sizes & name size 9
table size: 99
SymbolTable:100 DoubleTable:081 HashTable:073 CharTrie:096 RadixTrie:053
table size: 333
SymbolTable:100 DoubleTable:079 HashTable:025 CharTrie:033 RadixTrie:021
table size: 999
SymbolTable:100 DoubleTable:081 HashTable:009 CharTrie:012 RadixTrie:009
*** sum / 3 ***
SymbolTable:100 DoubleTable:080 HashTable:035 CharTrie:047 RadixTrie:027
=== access time for very big table sizes & name size 9
table size: 3333
HashTable:237 CharTrie:283 RadixTrie:243
table size: 9999
HashTable:334 CharTrie:304 RadixTrie:298
table size: 33333
HashTable:485 CharTrie:304 RadixTrie:334
*** sum / 3 ***
HashTable:352 CharTrie:297 RadixTrie:291
> --
> send mail : pi...@googlegroups.com
> subscribe : pilud+s...@googlegroups.com
> unsubscribe : pilud+un...@googlegroups.com
> home page : http://groups.google.com/group/pilud
> Note: reply address is set to the list -- no need to "reply all"
A side-advantage of simple tables based on sequences shows when printing out the internals of a structured object, the way Io does:
spir@o:~$ io
Io 20080120
Io> red := Object clone do (H:=100 ; S:=0 ; L:=0)
==> Object_0x84198b0:
H = 100
L = 0
S = 0
Io> center := Object clone do (x:=0 ; y:=0 ; c:=red)
==> Object_0x838b248:
c = Object_0x84198b0
x = 0
y = 0
Sequential tables keep insertion order, which may be nice for the user.
Here is how the various types print data out (in flat view):
SymbolTable : ST{x:0 y:0 c:ST{H:100 L:100 S:100}}
DoubleTable : DT{x:0 y:0 c:ST{H:100 L:100 S:100}}
HashTable : HT{y:0 c:ST{H:100 L:100 S:100} x:0}
CharTrie : CT{x:0 y:0 c:ST{H:100 L:100 S:100}}
RadixTrie : RT{x:0 y:0 c:ST{H:100 L:100 S:100}}
Tries seem to preserve order, but it's only an illusion because, in the data above, there is no branch node. Another trial:
data = [('a',0),('b',0),('c',0),('aaaa',0),('aaab',0),('ccca',0),('cccb',0)]
==>
SymbolTable : ST{a:0 b:0 c:0 aaaa:0 aaab:0 ccca:0 cccb:0}
DoubleTable : DT{a:0 b:0 c:0 aaaa:0 aaab:0 ccca:0 cccb:0}
HashTable : HT{cccb:0 a:0 b:0 aaaa:0 aaab:0 ccca:0 c:0}
CharTrie : CT{a:0 aaaa:0 aaab:0 b:0 c:0 ccca:0 cccb:0}
RadixTrie : RT{a:0 aaaa:0 aaab:0 b:0 c:0 ccca:0 cccb:0}
(*)
Denis
(*) The reason is visible in the trie tree views:
CharTrie
<>
    <a> --> a:0
        <a>
            <a>
                <a> --> aaaa:0
                <b> --> aaab:0
    <b> --> b:0
    <c> --> c:0
        <c>
            <c>
                <a> --> ccca:0
                <b> --> cccb:0
RadixTrie
<>
    <a> --> a:0
        <aa>
            <a> --> aaaa:0
            <b> --> aaab:0
    <b> --> b:0
    <c> --> c:0
        <cc>
            <a> --> ccca:0
            <b> --> cccb:0