
symbol tables and search tries


Mr.E
Jan 28, 2006, 3:10:45 PM

I've done some homework and there is something that I don't understand. In
all the books I've read, everyone recommends hashing for symbol tables,
but I've read a number of articles that state that searching tries is
many times faster than hashing. I understand there can sometimes be a
sizeable amount of overhead in using tries, but with the different
trie algorithms available it would seem there would be a lot of clamor
for trie usage in compilers. Are there reasons why tries aren't
advocated?

Thanks,

W.
[You need to read better articles. For the strings you're likely to
find in a compiler symbol table, tries would be about the same speed,
maybe slower, and the code would probably be more complicated. Tries
win when the strings are long and don't all fit in main memory.
Moreover, any compiler where symbol lookups are a significant part of
the compile time has worse problems than which search algorithm it's
using. -John]

Hans Aberg
Jan 28, 2006, 7:18:53 PM

John Levine wrote:

>... any compiler where symbol lookups are a significant part of
> the compile time has worse problems than which search algorithm it's
> using.

Why not use a balanced tree, like for example the std::map of C++? It
gives guaranteed logarithmic search/insertion times, without one having
to bother about tuning the table size as with a hash map. Is the
overhead too big?

--
Hans Aberg
[I don't understand what you're proposing. Balanced trees are O(log N)
to search and require rebalancing whenever you insert something. Hash
tables of the sort we usually use for compiler symbol tables are O(1)
and balancing isn't an issue. -John]
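
For concreteness, a minimal sketch of the std::map approach being
suggested here (the SymbolInfo payload and all names are invented for
illustration, not taken from any poster's code):

    #include <map>
    #include <string>

    struct SymbolInfo {            // whatever the compiler records per symbol
        int type  = 0;
        int scope = 0;
    };

    using SymbolTable = std::map<std::string, SymbolInfo>;

    // Insert on first sight, look up on later uses; both cost O(log N)
    // string comparisons, each of which is O(m) in the name length.
    SymbolInfo *lookup_or_insert(SymbolTable &table, const std::string &name) {
        return &table[name];       // operator[] default-constructs a missing entry
    }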

glen herrmannsfeldt
Jan 30, 2006, 1:58:29 AM

(snip)

> [I don't understand what you're proposing. Balanced trees are O(log N)
> to search and require rebalancing whenever you insert something. Hash
> tables of the sort we usually use for compiler symbol tables are O(1)
> and balancing isn't an issue. -John]


Hash tables are O(1) until you have to resize them.

The IBM S/360 Fortran H compiler uses six balanced trees,
one for each possible symbol length.

In one manual IBM recommended distributing variable names
equally between one and six characters to speed compilation.

-- glen
[Well, yeah, but it's been a while since I've written a compiler that
had to run in a 128K byte machine. I understand why they wouldn't
have wanted to devote precious storage to a hash table in 1969, but
that was then. -John]

Hans Aberg
Jan 30, 2006, 2:05:00 AM

John Levine wrote:

> >Why not using a balanced tree, like for example the std::map of C++? I
> >gives the minimum logarithmic search/insertion times without having to
> >bother about balancing, as in a hash map. Is the overhead too big?

> I don't understand what you're proposing. Balanced trees are O(log N)
> to search and require rebalancing whenever you insert something. Hash
> tables of the sort we usually use for compiler symbol tables are O(1)
> and balancing isn't an issue.

The hash tables I know of use a linked list on each hash value, and if
there are k hash values (= number of linked lists), the average search
time is O(N/k) = O(N). By choosing the value k suitably relative to N,
one can achieve fast searches while keeping the overhead down. If k is
chosen poorly, the hash table must be rebuilt in order to achieve this
efficiency. Or are you using another type of hash table? And for
balanced trees, the insert time is logarithmic. If k is doubled when it
becomes too low, perhaps the insert time is logarithmic for hash tables
too.

--
Hans Aberg
[I make my hash tables big enough that the chains are short, i.e.,
design so that k>=N so it's O(1). Hash headers need only be one word
pointers, so pick a number larger than the number of symbols you
expect in a large program, say, 10,000, and use that. So it adds 40K
to the size of the program, that's down in the noise these days.
-John]
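
For concreteness, a sketch of the kind of chained table John describes:
pick a bucket count larger than the expected number of symbols, keep the
chain link inside each symbol entry, and lookups stay effectively O(1).
(The names, the hash function, and the 10007 bucket count are all
illustrative.)

    #include <cstddef>
    #include <string>
    #include <vector>

    struct Sym {
        std::string name;
        Sym        *next;                  // hash chain link kept inside the entry
    };

    const std::size_t NBUCKETS = 10007;    // larger than the expected symbol count

    std::size_t bucket_of(const std::string &s) {
        std::size_t h = 0;
        for (unsigned char c : s) h = h * 31 + c;
        return h % NBUCKETS;
    }

    std::vector<Sym*> buckets(NBUCKETS, nullptr);

    Sym *lookup(const std::string &name) {
        for (Sym *p = buckets[bucket_of(name)]; p != nullptr; p = p->next)
            if (p->name == name)           // O(m) compare, but chains stay short
                return p;
        return nullptr;
    }

    Sym *insert(const std::string &name) { // assumes lookup() already returned null
        std::size_t b = bucket_of(name);
        Sym *s = new Sym{name, buckets[b]};
        buckets[b] = s;
        return s;
    }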

Anton Ertl
Jan 31, 2006, 12:32:08 AM

glen herrmannsfeldt <g...@ugcs.caltech.edu> writes:
>Hash tables are O(1) until you have to resize them.

And with proper (i.e., exponential, e.g., doubling when a certain
density is reached) resizing, they are still O(1):

Resizing is an O(n) operation (where n is the number of entries in the
hash table at the time of resizing), but you resize only once every
O(n) elements, so the amortized cost of resizing per insertion is
O(n/n)=O(1); of course, there is no resizing cost when you are only
doing lookups.

- anton
--
M. Anton Ertl
an...@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/home.html
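
A sketch of the doubling policy Anton describes, assuming simple
chaining (names and the growth threshold are illustrative): the resize
rehashes every entry, but it is triggered only after the table's
population has doubled, which is where the amortized O(1) bound comes
from.

    #include <cstddef>
    #include <string>
    #include <vector>

    struct Entry {
        std::string key;
        Entry      *next;
    };

    struct Table {
        std::vector<Entry*> buckets = std::vector<Entry*>(8, nullptr);
        std::size_t         count   = 0;
    };

    std::size_t raw_hash(const std::string &s) {
        std::size_t h = 0;
        for (unsigned char c : s) h = h * 31 + c;
        return h;
    }

    // O(count) work, but done only once every O(count) inserts, so the
    // amortized cost per insert is O(1).
    void grow(Table &t) {
        std::vector<Entry*> bigger(t.buckets.size() * 2, nullptr);
        for (Entry *e : t.buckets)
            while (e != nullptr) {
                Entry *next = e->next;
                std::size_t b = raw_hash(e->key) % bigger.size();
                e->next   = bigger[b];
                bigger[b] = e;
                e = next;
            }
        t.buckets.swap(bigger);
    }

    void insert(Table &t, const std::string &key) {
        if (t.count >= t.buckets.size())   // load factor reached 1: double
            grow(t);
        std::size_t b = raw_hash(key) % t.buckets.size();
        t.buckets[b] = new Entry{key, t.buckets[b]};
        ++t.count;
    }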

Vladimir Makarov
Jan 31, 2006, 12:34:39 AM

Hans Aberg wrote:
> John Levine wrote:

> [I make my hash tables big enough that the chains are short, i.e.,
> design so that k>=N so it's O(1). Hash headers need only be one
> word pointers, so pick a number larger than the number of symbols
> you expect in a large program, say, 10,000, and use that. So it adds
> 40K to the size of the program, that's down in the noise these days.
> -John]

I agree that hash tables are more suitable in practice than balanced
trees.

If you look at GCC, it uses extendable hash tables in many places.
The hash tables are very compact (one element takes only one pointer in
the table) because no chains are used. Deletion of elements from the
hash table is also permitted. The hash table is rebuilt when there are
many collisions (after the rebuild the size can be bigger or smaller).
A large number of collisions might occur because there are many
elements in the table or because too many deletions were made.

So the hash tables in GCC are better than balanced trees in practically
all respects (speed and size). They permit deletion too, as balanced
trees do. The balanced trees do have one advantage, though, which is
the possibility of traversing elements in some order (GCC hash tables
only easily permit unordered traversal).

As far as I know, the hash tables from GCC are used in many other
compilers and interpreters.

A standalone version of the hash tables can be found at
http://cocom.sf.net. Its description is at
http://cocom.sourceforge.net/ammunition-5.html
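
As an illustration of the general scheme Vladimir describes (open
addressing, one pointer per slot, deletion via tombstones, periodic
rebuild), here is a sketch. It only illustrates the idea and is not
GCC's actual code; all names are invented.

    #include <cstddef>
    #include <string>
    #include <vector>

    // A deleted slot keeps a tombstone marker so probe sequences stay intact.
    static std::string tombstone_obj;
    static std::string *const TOMBSTONE = &tombstone_obj;

    struct OpenTable {
        // One pointer per slot, no chain links; size kept a power of two.
        // The rebuild policy (not shown) keeps some slots empty, so probes
        // below always terminate.
        std::vector<std::string*> slots = std::vector<std::string*>(64, nullptr);
        std::size_t live = 0, dead = 0;
    };

    std::size_t raw_hash(const std::string &s) {
        std::size_t h = 0;
        for (unsigned char c : s) h = h * 31 + c;
        return h;
    }

    std::string *lookup(OpenTable &t, const std::string &key) {
        std::size_t mask = t.slots.size() - 1;
        for (std::size_t i = raw_hash(key) & mask; ; i = (i + 1) & mask) {
            std::string *p = t.slots[i];
            if (p == nullptr) return nullptr;            // empty slot: not present
            if (p != TOMBSTONE && *p == key) return p;   // skip tombstones
        }
    }

    // Assumes the caller has checked the key is absent, and that the table is
    // rebuilt (resized up or down, dropping tombstones) whenever live + dead
    // exceeds some fraction of the slots.
    void insert(OpenTable &t, std::string *sym) {
        std::size_t mask = t.slots.size() - 1;
        std::size_t i = raw_hash(*sym) & mask;
        while (t.slots[i] != nullptr && t.slots[i] != TOMBSTONE)
            i = (i + 1) & mask;
        if (t.slots[i] == TOMBSTONE) --t.dead;
        t.slots[i] = sym;
        ++t.live;
    }

    void erase(OpenTable &t, const std::string &key) {
        std::size_t mask = t.slots.size() - 1;
        for (std::size_t i = raw_hash(key) & mask; t.slots[i] != nullptr;
             i = (i + 1) & mask)
            if (t.slots[i] != TOMBSTONE && *t.slots[i] == key) {
                t.slots[i] = TOMBSTONE;                  // mark, don't empty
                --t.live; ++t.dead;
                return;
            }
    }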

RL...@oxfam.org.uk
Jan 31, 2006, 12:33:59 AM

Hans Aberg wrote:

> The hash tables I know of use a linked list on each hash value,
> and if there are k hash values (= number of linked list),
> the average search time is O(N/k) = O(N).

But if k is chosen to be proportional to N, it's O(1).

> If k is doubled when
> too low, perhaps insert time is logarithmic for hash tables too.

A symbol table (presumably) only grows, so exponentially increasing
the size of the hash table is similar to exponentially increasing the
size of an expandable array; it is on average O(1). (Each element in
the hash table is rehashed an average of one time.)

In any event, it would seem that lookup time is more important
than insert time for a symbol table implementation, although
that depends on the language. In almost any program, a given
symbol will be used more than once, but that doesn't take into
account the myriad of otherwise unused symbols introduced by
libraries (C header files, standard preludes, or whatever).

Having said that, I don't think that a hash table is the best
data structure for an IDE, although it probably is for a
compiler. An IDE will want to be able to do things like
tab-completion, which would militate for a data structure
which facilitates prefix searches, such as a trie or
PATRICIA tree.
[In block structured languages the symbol table can shrink, typically
dumping a block's symbols at the end of the block. I suppose that if
you have one of those quaint compilers that makes more than one pass
over the source you might need to leave them all in and just mark them.
-John]
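
For illustration, a bare-bones trie with the prefix walk that makes
tab-completion cheap (a sketch only, restricted to lower-case letters to
keep it short; a real identifier trie needs a wider alphabet or a
PATRICIA-style compressed form):

    #include <memory>
    #include <string>
    #include <vector>

    struct TrieNode {
        bool is_word = false;
        std::unique_ptr<TrieNode> child[26];
    };

    void trie_insert(TrieNode &root, const std::string &s) {
        TrieNode *n = &root;
        for (char c : s) {
            auto &slot = n->child[c - 'a'];
            if (!slot) slot = std::make_unique<TrieNode>();
            n = slot.get();
        }
        n->is_word = true;
    }

    void collect(const TrieNode &n, std::string &cur, std::vector<std::string> &out) {
        if (n.is_word) out.push_back(cur);
        for (int i = 0; i < 26; ++i)
            if (n.child[i]) {
                cur.push_back(static_cast<char>('a' + i));
                collect(*n.child[i], cur, out);
                cur.pop_back();
            }
    }

    // Tab completion: walk down the prefix, then enumerate the whole subtree.
    std::vector<std::string> complete(const TrieNode &root, const std::string &prefix) {
        const TrieNode *n = &root;
        for (char c : prefix) {
            n = n->child[c - 'a'].get();
            if (n == nullptr) return {};   // nothing starts with this prefix
        }
        std::string cur = prefix;
        std::vector<std::string> out;
        collect(*n, cur, out);
        return out;
    }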

Matthias Blume
Jan 31, 2006, 12:38:33 AM

glen herrmannsfeldt <g...@ugcs.caltech.edu> writes:

> (snip)
>> [I don't understand what you're proposing. Balanced trees are O(log N)
>> to search and require rebalancing whenever you insert something. Hash
>> tables of the sort we usually use for compiler symbol tables are O(1)
>> and balancing isn't an issue. -John]
>
> Hash tables are O(1) until you have to resize them.

The cost of an insertion or an access of a hash table element is O(1)
amortized, taking the occasional need for resizing into account.
(This assumes that the resizing policy is chosen appropriately.
Example: Double the size when the load factor exceeds some fixed
threshold.)

Hans Aberg
Jan 31, 2006, 12:37:22 AM

John Levine wrote:

> I make my hash tables big enough that the chains are short, i.e.,
> design so that k>=N so it's O(1). Hash headers need only be one word
> pointers, so pick a number larger than the number of symbols you
> expect in a large program, say, 10,000, and use that. So it adds 40K
> to the size of the program, that's down in the noise these days.

Even though in a practical application one can choose k large enough,
hoping that nobody will choose a larger N, from the theoretical point
of view N is not bounded when computing the complexity order. So the
adjustment of k must be taken into account when computing the
complexity order. Then, by doubling k, one gets down to logarithmic
time. I also think one cannot do any better, because the best sorting
algorithm, over all choices, is O(N log N), so if a container had
better insertion complexity than O(log N), one could beat that bound
by merely inserting elements one by one.

But I am not interested in the best complexity; I want to figure out
why a hash map apparently has better value to compiler writers than a
balanced tree. The reason could be, as an email I got suggests, that a
compiler uses a lot of lookups and relatively few inserts. The lookup
time in a hash table is then proportional to the average length of the
linked lists, so if that is close to 1, one can probably get very fast
lookups.

--
Hans Aberg
[In the compilers I use, there's an insert the first time you see a
symbol in the source and a lookup every time you see it subsequently.
Unless you have a lot of symbols you only use once, I'd expect lookups
to dominate. -John]

Clint Olsen
Jan 31, 2006, 12:35:34 AM

On 2006-01-30, Hans Aberg <hab...@math.su.se> wrote:
> The hash tables I know of use a linked list on each hash value, and if
> there are k hash values (= number of linked list), the average search
> time is O(N/k) = O(N). By choosing the value k suitably relative to N,
> one can achieve fast searches by keeping the overhead down. If k is
> chosen poorly, the hash table must be rebuilt, in order to achieve this
> efficiency. Or are you using another type of hash tables? And for
> balanced trees, the insert time is logarithmic. If k is doubled when too
> low, perhaps insert time is logarithmic for hash tables too.

The average probe length for a hash table using chaining with N
elements distributed over M chains (buckets) is N/M. It is not O(N)
unless you do something naive like use a poor hash function or don't
maintain the load factor accordingly (for chains, allow the load
factor to vary between .5 and 2). Once the performance decays to
O(N), you are essentially using a linked list.

You don't need to choose M ahead of time. It's completely reasonable
to do a resize on the fly. This happens so infrequently that it's in
the noise. Doubling the number of chains for every resize is a
reasonable policy. This has been done in numerous free libraries and
works just fine (see Kaz's Kazlib or Vo's CDT from AT&T for complete
working examples).

To respond to the original poster's question, tries are only beneficial
if the keys are extremely long, and the hash function would dominate
the execution profile. Traditionally, symbols in compilers aren't all
that long (what was the original UNIX linker limit? John? [Eight]).
Tries also don't work well when the symbols have the same or similar
prefixes. You can fight this by changing the criteria for key
insertion into the trie. You can use the suffix of the key for
insertion if the prefixes are the same or similar. Or you can
alternate using characters from the prefix and suffix. Like a hash
function, a proper trie function is important, and you have to
consider the keys you're inserting in order to obtain good
performance.

Anyway, this is more of a data structure argument and has little to do
with compilers. See comp.programming for more discussion on this
topic.

-Clint

Henry Spencer
Jan 31, 2006, 12:37:47 AM

Hans Aberg <hab...@math.su.se> wrote:
>if there are k hash values (= number of linked list), the average search
>time is O(N/k) = O(N). By choosing the value k suitably relative to N,
>one can achieve fast searches by keeping the overhead down. If k is
>chosen poorly, the hash table must be rebuilt, in order to achieve
>this efficiency...

Hash tables can be extended incrementally to keep fill factor under
control, rather than requiring full rebuilds. See, e.g., the Larson
paper in CACM April 1988. Given a good hashing function, it's not
that hard to make a hash table keep its efficiency while scaling up
indefinitely, *without* requiring you to guess a maximum size in
advance.

You can combine the advantages by hanging a tree, rather than a list,
on each hash bucket. But it hardly seems worth the trouble.

The only good reason to use trees for symbol tables is if you also have a
requirement to traverse the table in sort order.
--
spsystems.net is temporarily off the air; | Henry Spencer
mail to henry at zoo.utoronto.ca instead. | he...@spsystems.net

Daniel C. Wang
Jan 31, 2006, 9:56:18 AM

Perhaps the suggestion that a search tree might be faster than a hash
table came from this paper.

http://www.cs.princeton.edu/~rs/strings/

It's worth pointing out that if you expect your lookup to fail most of
the time, a search tree can be more efficient than hashing, simply
because it doesn't need to examine the entire input string to
determine that it's not in the tree. Any reasonable hashing scheme has
to examine the entire string.

With respect to compilers, I suspect hashing is more
efficient. However, I think there are legitimate situations where a
tree search may be more efficient.

Dmitry A. Kazakov
Jan 31, 2006, 9:58:00 AM

On 31 Jan 2006 00:33:59 -0500, RL...@oxfam.org.uk wrote:

> Having said that, I don't think that a hash table is the best
> data structure for an IDE, although it probably is for a
> compiler. An IDE will want to be able to do things like
> tab-completion, which would militate for a data structure
> which facilitates prefix searches, such as a trie or
> PATRICIA tree.

I agree. The hash function evaluation time should be counted as well. The
penalty depends not only on n, the table size, but also on m, the token
length.

Further, what hash tables cannot do is parse the source when the right
bound of the token is unknown; i.e., trees don't need a tokenizer to be
run first. A tree (or other ordered structure) can itself act as a
tokenizer, allowing nice things like closest match, wild-cards, etc.
This all boils down to the "native" order of tokens, which hash tables
lack. It was already mentioned by another poster. So it depends...

--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de

Hans-Peter Diettrich
Jan 31, 2006, 10:37:47 AM

Anton Ertl wrote:
>
> glen herrmannsfeldt <g...@ugcs.caltech.edu> writes:
> >Hash tables are O(1) until you have to resize them.
>
> And with proper (i.e., exponential, e.g., doubling when a certain
> density is reached) resizing, they are still O(1):
>
> Resizing is an O(n) operation (where n is the number of entries in the
> hash table at the time of resizing), but you resize only once every
> O(n) elements, so the amortized cost of resizing per insertion is
> O(n/n)=O(1); of course, there is no resizing cost when you are only
> doing lookups.

That's fine :-)

I had expected resizing to cost an extra factor of O(log n), for the
roughly log(n) resizes that occur until the final n is reached.

Now I can continue to use my symbol tables, where the hash links are
already part of the symbol entries, to reduce memory management
operations and memory fragmentation.

DoDi

Henry Spencer
Jan 31, 2006, 9:18:30 PM

Dmitry A. Kazakov <mai...@dmitry-kazakov.de> wrote:
>...The hash function evaluation time should be counted as well. The
>penalty depends on not only n, the table size, but also on m, the token
>length.

A well-balanced tree will get you to the right place with O(log n) tests,
but if the place is occupied, you then need an O(m) comparison to decide
whether the occupant is the symbol being looked up. In a compiler,
successful lookups would be expected to be rather more frequent than
insertions, so most of the comparisons have to go all the way to the end
of both strings for a decision.

*Any* symbol table, no matter what data structure it uses, will have an
O(m) component in a successful lookup, for the same reason.

Matthias Blume
Jan 31, 2006, 9:22:30 PM

hab...@math.su.se (Hans Aberg) writes:

> John Levine wrote:
>
>> I make my hash tables big enough that the chains are short, i.e.,
>> design so that k>=N so it's O(1). Hash headers need only be one word
>> pointers, so pick a number larger than the number of symbols you
>> expect in a large program, say, 10,000, and use that. So it adds 40K
>> to the size of the program, that's down in the noise these days.
>
> Even though one in a practical application can choose k large enough,
> hoping that somebody will not choose a larger N, from the theoretical
> point, when computing the complexity order, N is not limited. So, the
> adjustment of k must be taken into account when computing the
> complexity order. Then, by doubling k, one gets down to logarithmic
> time.

Hans, please, do the calculation! It does /not/ give you logarithmic
time, it gives you constant time (in the amortized sense).

Suppose, for simplicity, you set the load threshold to 1 (i.e., you
double the size when there are as many elements as there are buckets).
Clearly, access is O(1) on average (assuming a reasonable hash function
without major collisions). So let's look at insertions:

In the worst case, the very last insert just pushes the load over the
edge and causes another doubling. Thus, every element in the table
has suffered through at least one rebuild. Half the elements have
suffered through at least one additional rebuild, one quarter has
suffered through at least a third rebuild, and so on. This gives rise
to a sum of a geometric series:

1 + 1/2 + 1/4 + ...

which is 2 in the limit. Thus, in the limit, each element has (on
average) suffered through 2 rebuilds. The time each rebuild takes is
constant per element that participates in that rebuild. Since each
element that is in the table must have been inserted at some point, we
charge the amortized constant amount of work due to rebuilds to that
insert operation. As a result, the time for one insert is constant on
average.

(In the best case, the last insert just fills the table without
causing another rebuild. In this case the series to be summed is

1/2 + 1/4 + ... = 1

i.e., the average cost of insertions due to rebuilds is only half as big.
In either case, it is constant.)
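
The same bound in compact form (assuming the load threshold of 1 used
above):

    \[
    \text{total rebuild work for } n \text{ inserts} \;\le\;
    n + \frac{n}{2} + \frac{n}{4} + \cdots \;<\; 2n
    \quad\Longrightarrow\quad
    \text{amortized cost per insert} \;<\; 2 \;=\; O(1).
    \]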

> I also think one cannot do any better, because the best sorting
> algorithm, over all choices is O(N log N), so if one has better
> inserting complexity in a container than O(log N), one can beat that
> by merely insert elements one by one.

A hash table does not sort, so this line of reasoning is not
applicable.

Matthias

Mr.E
Jan 31, 2006, 9:24:36 PM

Actually, that was one of the articles I started with, and I found
others on different versions of tries stating they were faster than
hashing, which led me to wonder: if they are so much faster, why aren't
tries advocated in any compiler book I've read?

W.

Daniel C. Wang wrote:
> Perhaps, the suggestion that a search tree might be faster than a hash
> tree came from this paper.
>
> http://www.cs.princeton.edu/~rs/strings/

[If you want to do sorting and partial matches on long strings, tries
are great. But most compilers do neither with their symbols. -John]

Dmitry A. Kazakov
Feb 2, 2006, 11:34:22 AM

On 31 Jan 2006 21:18:30 -0500, Henry Spencer wrote:

> In a compiler, successful lookups would be expected to be rather
> more frequent than insertions, so most of the comparisons have to go
> all the way to the end of both strings for a decision.

Yes, but as was already mentioned in the thread, there may be
considerably more than one table to look up in. This brings the rate of
failed lookups up.

> *Any* symbol table, no matter what data structure it uses, will have an
> O(m) component in a successful lookup, for the same reason.

True

glen herrmannsfeldt
Feb 2, 2006, 11:35:30 AM

Matthias Blume wrote:

> glen herrmannsfeldt <g...@ugcs.caltech.edu> writes:

>>(snip)

Assuming you can afford the extra memory, yes.

Note that O() notation assumes the large N limit, though most
compilers operate with more reasonable N.

The question, then, is: given the available memory and expected size,
which is the best choice? IBM chose a balanced tree for Fortran H.

The PL/I (F) compiler will move the symbol table to disk if it
runs out of memory. That might affect the choice, too.

Memory wasn't always as cheap as it is today.

-- glen
[Indeed, but does anyone see it getting more expensive? The Fortran compiler
on the IBM 1130 was about a dozen passes loaded from disk, and I sure hope
nobody suggests that as a way to build new compilers. -John]

Hans-Peter Diettrich
Feb 2, 2006, 11:46:03 AM

Henry Spencer wrote:

> >...The hash function evaluation time should be counted as well. The
> >penalty depends on not only n, the table size, but also on m, the token
> >length.

...


> *Any* symbol table, no matter what data structure it uses, will have an
> O(m) component in a successful lookup, for the same reason.

Since every symbol name is extracted from a text file, could we get
any improvement from doing the inevitable O(m) things already in the
lexer?

At least it would be possible to calculate a hash code once, for
subsequent lookup in multiple symbol tables. For tree structures the
penalty occurs with every table searched, so it might be a good idea
to use a single global symbol table, possibly with a scope list added
to every symbol entry?

DoDi
[Um, once you calculate the hash codes, what would you call the data
structure you use to remember the hash codes and the related data for
subsequent occurrences of the same string in the input text?

Re multiple symbol tables, it seems to me there are fairly
straightforward ways to avoid the cost of multiple lookups, starting
by using the same string hash in each one. -John]

Matthias Blume
Feb 3, 2006, 6:36:41 PM

glen herrmannsfeldt <g...@ugcs.caltech.edu> writes:

> Matthias Blume wrote:
>
>> glen herrmannsfeldt <g...@ugcs.caltech.edu> writes:
>
>>>(snip)
>
>>>>[I don't understand what you're proposing. Balanced trees are O(log N)
>>>>to search and require rebalancing whenever you insert something. Hash
>>>>tables of the sort we usually use for compiler symbol tables are O(1)
>>>>and balancing isn't an issue. -John]
>
>>>Hash tables are O(1) until you have to resize them.
>
>> The cost of an insertion or an access of a hash table element is O(1)
>> amortized, taking the occasional need for resizing into account.
>> (This assumes that the resizing policy is chosen appropriately.
>> Example: Double the size when the load factor exceeds some fixed
>> threshold.)
>
> Assuming you can afford the extra memory, yes.

If you double your hashtable when the load factor exceeds 1, then you
waste an expected number of buckets that is bounded by something very
close to the number of actual elements in the table. If you do not
double the size, then you have to use more chain links in each bucket.
In a typical implementation, the required number of extra pointers is
the same as the number of those wasted buckets. Thus, doubling the
number of buckets does not waste memory (in comparison with chaining).

> Note that O() notation assumes the large N limit, though most
> compilers operate with more reasonable N.

Yes. But that is exactly what the "double the size on demand"
mechanism achieves.

I expect a binary tree implementation of the table to use _more_
memory than (or at best an equal amount as ) such a hashtable.

Matthias

Karsten Nyblad
Feb 3, 2006, 6:38:30 PM

Hans-Peter Diettrich wrote:
> Since every symbol name is extracted from a text file, could we get
> any improvement from doing the inevitable O(m) things already in the
> lexer?
>
> At least it would be possible to calculate an hash code once, for
> subsequent lookup in multiple symbol tables. For tree structures the
> penalty occurs with every table to search, so it might be a good idea
> to use a single global STB, possibly with a scope list added to every
> symbol entry?
>
> DoDi
> [Um, once you calculate the hash codes, what would you call the data
> structure you use to remember the hash codes and the related data for
> subsequent occurrences of the same string in the input text? -John]

Whether or not Hans-Peter's idea can be used depends on how much
control you have over symbol tables for debugging, etc. Perhaps you
already have a table of all spellings of identifiers. In one compiler
I was working on, we used the offset into the table of identifier
spellings to represent the spelling.

> [Re multiple symbol tables, it seems to me there are fairly
> straightforward ways to avoid the cost of multiple lookups, starting
> by using the same string hash in each one. -John]

That would force you to have all tables of one fixed size, and that
size cannot be very big, considering that a compiler should be capable
of handling thousands of scopes.

A normal hash function consists of two parts: first, a pseudo-random
number generator that maps the string to a 32-bit number; second, the
remainder of that number divided by the size of the hash table. When
looking up in differently sized tables you may as well use the same
pseudo-random generator for all the hash functions. Then there is no
need to recalculate the pseudo-random number over and over again. All
you need to do is calculate it once; then you calculate the remainder
for each table. Many people make sure their table size is a prime
number, but there is little to gain from that if your pseudo-random
number is evenly distributed over the integers. You can use table
sizes that are powers of 2, and then the remainder operation is an
AND operation.

On insertion you can also store the pseudo-random number in the table,
and later, when looking up, start by comparing the random number of
the string being looked up with the random number in the entry in the
table. Of course that also speeds up resizing the table, because you
do not need to recalculate all the pseudo-random numbers.

Karsten Nyblad
[Yes, that's what I meant, same PRNG, then divide down to the table
size. -John]
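
A sketch of the scheme Karsten describes: compute one full-width hash
per identifier, store it in the entry, and reduce it with an AND against
each table's power-of-two size. (The FNV-style mixing function and all
names here are just placeholders for whatever generator one actually
uses.)

    #include <cstddef>
    #include <cstdint>
    #include <string>

    // One full-width hash per identifier, computed once (e.g. in the lexer).
    uint32_t full_hash(const std::string &s) {
        uint32_t h = 2166136261u;               // FNV-1a style mixing
        for (unsigned char c : s) { h ^= c; h *= 16777619u; }
        return h;
    }

    struct Entry {
        std::string name;
        uint32_t    hash;    // stored so resizing/rebuilding never recomputes it
        Entry      *next;
    };

    // Each scope's table can have its own power-of-two size; the reduction is
    // just an AND, so the same stored hash works for every table.
    std::size_t bucket_index(uint32_t hash, std::size_t table_size_pow2) {
        return hash & (table_size_pow2 - 1);
    }

    // Cheap pre-check on lookup: compare stored hashes before the O(m) compare.
    bool matches(const Entry &e, const std::string &name, uint32_t hash) {
        return e.hash == hash && e.name == name;
    }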

Henry Spencer
Feb 3, 2006, 6:40:07 PM

Dmitry A. Kazakov <mai...@dmitry-kazakov.de> wrote:
>> In a compiler, successful lookups would be expected to be rather
>> more frequent than insertions...

>
>Yes, but as it was already mentioned in the thread, there might be
>sufficiently more than one table to look up. This brings the rate of
>failures up.

If you've got a bunch of tables to deal with, but no requirement for
prefix search or sorted table dump, one way of getting rid of the O(m)
issues is to hash each input string as part of scanning, and put it into a
single master hash table, whose purpose is to store each distinct string
only once. That makes string equality comparisons O(1) rather than O(m):
you can just compare pointers and never look at the strings themselves.
All the other tables just hold pointers.
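
A sketch of the interning scheme Henry describes: the scanner funnels
every identifier through one master table that stores each distinct
spelling once, and everything downstream keys on the returned pointer.
(std::unordered_set is used here purely for brevity; any hash set with
stable element addresses would do.)

    #include <string>
    #include <unordered_set>

    // Master table: owns one copy of each distinct identifier spelling.
    std::unordered_set<std::string> string_pool;

    // Called once per token by the scanner; the O(m) hashing happens here.
    const std::string *intern(const std::string &spelling) {
        return &*string_pool.insert(spelling).first;
    }

    // Every other table (scopes, keywords, ...) stores these pointers, so
    // "same identifier?" is a pointer comparison, not a string comparison.
    bool same_symbol(const std::string *a, const std::string *b) {
        return a == b;
    }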

Hans-Peter Diettrich
Feb 3, 2006, 6:40:27 PM

Hans-Peter Diettrich wrote:

> At least it would be possible to calculate an hash code once, for
> subsequent lookup in multiple symbol tables. For tree structures the
> penalty occurs with every table to search, so it might be a good idea
> to use a single global STB, possibly with a scope list added to every
> symbol entry?
>
> DoDi
> [Um, once you calculate the hash codes, what would you call the data
> structure you use to remember the hash codes and the related data for
> subsequent occurrences of the same string in the input text?

E.g. I give the search routine both the string and the hash code, or a
special value (-1) if no hash code was calculated yet. Once calculated,
the hash code can be propagated to any table search procedure. Finally
the hash code can be stored in the symbol object, for use in the
reorganization of the hash tables.

>
> Re multiple symbol tables, it seems to me there are fairly
> straightforward ways to avoid the cost of multiple lookups, starting
> by using the same string hash in each one. -John]

This was also my idea, but when the bucket arrays of the various tables
have different sizes, reducing the full hash code modulo the table size
can result in false matches, detectable only by another string
comparison. A single, sufficiently large table size would waste memory
for the many small local scopes; hence my idea of postponing the
separation into "virtual" tables attached to every symbol in the common
symbol table. Then the symbol name must be matched only once, followed
by a less expensive match of the (first currently visible...) table in
the symbol's table list.

DoDi

Hans Aberg
Feb 3, 2006, 6:43:33 PM

Matthias Blume <bl...@tti-c.org> wrote:

> Hans, please, do the calculation!

I find it difficult to carry out any computations before first converting
the problem to a proper model within axiomatic set theory. :-)

> It does /not/ give you logarithmic
> time, it gives you constant time (in the amortized sense).

....


> A hash table does not sort, so this line of reasoning is not
> applicable.

There are apparently two data types in play: one with only equality (an
equivalence relation), and one sorted (with a total order). The
containers doing only the equality comparison (like the hash table) can
apparently (if one should believe the posters here :-)) do it in O(1)
time. The containers doing the sorting (like the balanced tree) can do
it in at best O(log n) time.

Now, a compiler only needing to identify the tokens would then be best
off with the first type. But a parser of some type that also needs to
compare the tokens would be best off with the second type.

Roughly.

--
Hans Aberg
[I think we're all in agreement here. For compiling, all you do with symbols
is look them up, and hashes do that well. In more complex environments where
you sort them or use them for name completion, tries have advantages. -John]

Anton Ertl
Feb 6, 2006, 12:05:25 AM

Matthias Blume <bl...@tti-c.org> writes:
>If you double your hashtable when the load factor exceeds 1, then you
>waste an expected number of buckets that is bounded by something very
>close to the number of actual elements in the table. If you do not
>double the size, then you have to use more chain links in each bucket.
>In a typical implementation, the required number of extra pointers is
>the same as the number of those wasted buckets.

If you are using chaining, the number of wasted pointers is the number
of buckets: Even if a bucket is used, there is an unused pointer at
the end of the chain. So you can ensure that no more than one pointer
per element is wasted by always keeping the load factor >=1 (e.g., by
doubling when the load factor reaches 2).

Now if we compare this to a binary tree, the tree always wastes one
pointer per element: there are two pointers in each element, and one
pointer to the element, so one pointer is wasted on average.

So, if you can afford the space for a tree, you can also afford the
space for a hash table with chaining that doubles when the load factor
reaches 2.

jjyynmb
Feb 24, 2006, 9:38:36 AM

Take a look at lex.c in the lcc project; the keyword lookup doesn't use
hashing.
