
symtab


Hugh Aguilar
Nov 24, 2009, 5:07:57 PM
On Nov 23, 11:59 pm, John Passaniti <john.passan...@gmail.com> wrote:
> On Nov 22, 11:23 pm, Hugh Aguilar <hugoagui...@rosycrew.com> wrote:
>
> > I recently ported my symtab.factor program to Forth (and fixed a bug
> > in it). You couldn't figure out symtab last time I posted it, but
> > maybe you'll get it this time. Here is the Factor version again:
>
> >http://www.rosycrew.com/symtab.factor
>
> I would prefer that you revert the code back to what I casually
> reviewed and *then* tell me precisely what I got wrong. Note the word
> "precisely"

Here is a Factor program:
www.rosycrew.com/symtab.factor
Here is a Forth program:
www.rosycrew.com/symtab.4th
www.rosycrew.com/novice.4th

The novice.4th program contains some support functions needed by
symtab.4th. The novice.4th program isn't complete; I am adding new
functions to it and will later make it available to novices to help
them get started. What we have here is just a step in that direction.

You had said that the nodes in the tree are sorted according to their
frequency of access. This isn't true; they are sorted alphabetically,
the same as in any binary tree. The symtab tree is *structured*
according to frequency of access. The most frequently accessed node is
the root, the next-most frequently accessed nodes are its children,
and so on down to the leaves, which are the least frequently accessed
nodes. The idea is to provide fast access to words such as DUP that
are used a lot, and slow access to words such as QUIT and BYE that are
used less. The result should be an overall improvement in speed.
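To make the idea concrete, here is a rough sketch in C of a count-ordered
binary search tree. This is not Hugh's actual symtab code (the names are
made up, and his DAMPER and smudge handling are omitted); it only
illustrates the general idea of keys kept in alphabetical order while a
node whose count grows larger than its parent's is rotated upward:

#include <string.h>

typedef struct node {
    const char  *key;
    unsigned     count;           /* how often this key has been looked up */
    struct node *left, *right;
} node;

/* Single rotations: lift a child above its parent, preserving key order. */
static node *rot_right(node *p) { node *c = p->left;  p->left  = c->right; c->right = p; return c; }
static node *rot_left (node *p) { node *c = p->right; p->right = c->left;  c->left  = p; return c; }

/* Find key, bump its count, and rotate it upward past any ancestor whose
   count is now smaller.  Returns the (possibly new) root of the subtree. */
static node *bst_access(node *root, const char *key)
{
    if (root == NULL) return NULL;
    int cmp = strcmp(key, root->key);
    if (cmp == 0) { root->count++; return root; }
    if (cmp < 0) {
        root->left = bst_access(root->left, key);
        if (root->left && root->left->count > root->count)
            root = rot_right(root);
    } else {
        root->right = bst_access(root->right, key);
        if (root->right && root->right->count > root->count)
            root = rot_left(root);
    }
    return root;
}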

Hash tables are the most popular data structure used for symbol tables
nowadays. These also have faster access for some nodes, and slower
access for other nodes (the nodes that have to be resolved through
linked lists). The problem is that *which* nodes are fast or slow is
pretty much a matter of chance, although the earlier-defined nodes
tend to be faster than the more recent ones. Bad luck might give you
slow access on DUP and fast access on BYE.

Hash tables are popular these days because computers have a lot of
memory. The traditional complaint against hash tables was that they
use too much memory, and therefore are a bad choice for 16-bit
computers. The memory problem is still an issue with 32-bit computers
though. Consider the case in which each wordlist has its own hash
table. Also consider the case in which the programmer uses associative
arrays (as done in Lua and AWK and various other high-level
languages). Every hash table is going to be gigantic, even if it only
has a few dozen elements in it. The compiler doesn't know how many
elements are going to be inserted into the hash table, and so it must
make each hash table large enough that it won't overflow in the worst-
case scenario that this will be the one that gets thousands of
elements. By comparison, my symtab data structure doesn't have any up-
front overhead. It does have more per-node overhead though, as each
node has a left and right pointer field, and a count field. In symtab,
the size of the data structure is linearly related to the number of
elements. The same amount of memory is going to be used whether you
have one symtab with X elements in it, or N symtabs with an average of
X/N elements in each.

No, I'm not going to reintroduce a bug that I have already fixed. Why
would I want to do that? It was a pretty obscure bug that only came up
when a node was inserted into the tree with the same key as an
existing node. The bug was that I smudged the new node rather than the
old node. In a Forth dictionary, the old node is supposed to be
smudged and the new node is supposed to be inserted into the
dictionary, typically with a "redefinition" warning. Fixing this bug
didn't change the symtab algorithm.

I stand by what I said previously, that the algorithm used in symtab
is my own invention. I have never seen any published algorithm that
structured the tree according to frequency of access.

All of the binary tree algorithms that I have read about restructure
the tree with the idea of balancing the tree. By comparison, symtab
trees aren't balanced and it is possible for them to be quite
lopsided. If there is no relationship between alphabetic ordering and
frequency of access however, then the tree will be pretty much
balanced. In regard to a Forth dictionary, the names of the words are
chosen for readability and so there shouldn't be any relation. In an
associative array, the names might be generated according to some
pattern ("1st", "2nd", "3rd", "4th", etc.) and so there would be a
relation. I'm not going to worry about it.

Andrew Haley
Nov 25, 2009, 5:10:07 AM
Hugh Aguilar <hugoa...@rosycrew.com> wrote:

> Hash tables are the most popular data structure used for symbol
> tables nowadays. These also have faster access for some nodes, and
> slower access for other nodes (the nodes that have to be resolved
> through linked lists).

Sure, but if those linked lists are long then the hash table is much
too small. Also, you can order those lists by frequency of access.

I would have thought that for symbol tables it makes much more sense
to use open addressing. I've always been a bit mystified by the
popularity of collision resolution by chaining.

> Every hash table is going to be gigantic, even if it only has a few
> dozen elements in it. The compiler doesn't know how many elements
> are going to be inserted into the hash table, and so it must make
> each hash table large enough that it won't overflow in the worst-
> case scenario that this will be the one that gets thousands of
> elements.

Hash tables can be resized.

> I have never seen any published algorithm that structured the tree
> according to frequency of access.

It's in Knuth, Section 6.2.2.

Andrew.

John Passaniti
Nov 25, 2009, 2:32:56 PM
On Nov 24, 5:07 pm, Hugh Aguilar <hugoagui...@rosycrew.com> wrote:
> You had said that the nodes in the tree are sorted according to their
> frequency of access. This isn't true; they are sorted alphabetically,
> the same as in any binary tree.

I had also written that (1) I am new to Factor, (2) I only casually
reviewed your code, (3) that I didn't find your code's comments clear,
and that as a result (4) I might not understand your code. I invited
you to explain what I got wrong. You didn't respond. This was 49
days ago.

> The symtab tree is *structured* according to frequency of access.

Oh, like a splay tree. Why didn't you say so? Probably because you
don't know what a splay tree is or the larger class of ordered data
structures that adjust themselves based on frequency of access. And
not knowing leads to statements from you like the following:

> I stand by what I said previously, that the algorithm used in symtab
> is my own invention.

That may be true. Every day, programmers who are unaware of the work
of others come up with sub-optimal implementations of well-known
algorithms.

Your algorithm wastes both space (storing) and time (comparing) an
explicit count in each node. Splay trees use the actual act of
searching the tree to identify which nodes to move; no explicit count
is necessary.
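For contrast, here is a similarly rough sketch of the move-to-root idea.
This is not the actual splay algorithm from the 1985 paper (which uses
paired zig-zig/zig-zag rotations for its amortized guarantee) and not
Factor's implementation; it only shows how the node just searched for can
be rotated to the root with no count stored anywhere:

#include <string.h>

typedef struct node {
    const char  *key;
    struct node *left, *right;    /* no count field */
} node;

static node *rot_right(node *p) { node *c = p->left;  p->left  = c->right; c->right = p; return c; }
static node *rot_left (node *p) { node *c = p->right; p->right = c->left;  c->left  = p; return c; }

/* Search for key and rotate the found node all the way up to the root.
   The "information" lives in the path followed, not in a stored count. */
static node *move_to_root(node *root, const char *key)
{
    if (root == NULL) return NULL;
    int cmp = strcmp(key, root->key);
    if (cmp < 0 && root->left) {
        root->left = move_to_root(root->left, key);
        if (strcmp(key, root->left->key) == 0) root = rot_right(root);
    } else if (cmp > 0 && root->right) {
        root->right = move_to_root(root->right, key);
        if (strcmp(key, root->right->key) == 0) root = rot_left(root);
    }
    return root;    /* if key was present, it is now at the root */
}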

> I have never seen any published algorithm that
> structured the tree according to frequency of access.

That may be true. You may not have access to Knuth's third volume of
"The Art of Computer Programming", or the title of the original 1985
article describing splay trees, "Self-Adjusting Binary Search Trees"
(Journal of the ACM), might not have piqued your interest. And
depending on what keywords you tossed at various search engines, the
entire range of academic research into the larger class of data
structures (including various kinds of trees) that adjust themselves
based on frequency of access could have been hidden from your view.

> All of the binary tree algorithms that I have read about restructure
> the tree with the idea of balancing the tree.

Again, that may be true. I have no idea how much effort you put into
a search.

John Passaniti
Nov 25, 2009, 3:32:57 PM
On Nov 25, 5:10 am, Andrew Haley <andre...@littlepinkcloud.invalid>
wrote:

> I would have thought that for symbol tables it makes much more sense
> to use open addressing.  I've always been a bit mystified by the
> popularity of collision resolution by chaining.

No real mystery: As the load factor of the hash table increases, the
cost of probing around to insert or find a key can get quite
expensive. In applications where hashes play a critical role (such as
many scripting languages that do dynamic lookup of functions and other
data), that expense is going to directly impact the speed of the
language. The usual resolution to this is to increase the size of the
hash when the load factor crosses some threshold and then rehash
everything. But that itself can be expensive.

The popularity of chaining probably comes from two things. First, as
the load factor of the hash increases, the cost of lookups tends to
grow linearly (as opposed to open addressing, where the cost can,
depending on the hash quality and probing strategy, grow much
faster). Second, by ordering the data in the chains, it's possible to
use faster techniques (like skip lists, binary searches, etc.) to
reduce the search cost within the chain.

In other words, the conceptual simplicity of open addressing comes at
the cost of unpredictable access times mixed with the waste of unused
space to keep the load factor low. Chaining is more complex, but that
complexity pays off in terms of predictability and (potentially) less
wasted space, which in some applications matters more.

Hugh Aguilar
Nov 26, 2009, 2:19:32 AM
On Nov 25, 12:32 pm, John Passaniti <john.passan...@gmail.com> wrote:
> On Nov 24, 5:07 pm, Hugh Aguilar <hugoagui...@rosycrew.com> wrote:
> > The symtab tree is *structured* according to frequency of access.
>
> Oh, like a splay tree.  Why didn't you say so?  Probably because you
> don't know what a splay tree is or the larger class of ordered data
> structures that adjust themselves based on frequency of access.  And
> not knowing leads to statements from you like the following:
>
> > I stand by what I said previously, that the algorithm used in symtab
> > is my own invention.
>
> That may be true.  Every day, programmers who are unaware of the work
> of others often come up with sub-optimal implementations of well-known
> algorithms.

Nobody here has shown me any evidence to indicate that my symtab is
sub-optimal.

> Your algorithm wastes both space (storing) and time (comparing) an
> explicit count in each node.  Splay trees use the actual act of
> searching the tree to identify which nodes to move; no explicit count
> is necessary.

If splay trees don't store a count the way that my symtab trees do,
then they aren't going to be as efficient, because they have less
information to work with. It seems unrealistic that any algorithm is
going to minimize both space and time. More likely, there is a trade-
off. These splay trees minimize space (no count field) at a cost in
time, whereas my symtab trees minimize time (they are optimal) at a
cost in space (a count field). Also, I don't think that comparing
integers takes up much time.

I'll do some research on these splay-trees and get back to you with a
comparison. I still feel confident that, whatever splay-trees are, my
symtab will be more efficient. If splay trees don't have a count
field, then they have only a vague guess of how many times each node
has been accessed. How much information can be stored in zero bits?
I'm not an expert on information theory, but I think the answer would
be: "None."

I don't consider the memory usage of the count field to be much of an
issue. Even with an extra word per node, symtab is still going to be
more memory efficient than hash tables with their big sparse array.
Also, a symtab tree is going to be more robust because it grows
dynamically. To resize a hash table you have to reinsert all of the
elements. That is very slow. It is true that I have my OPTIMIZE
function that traverses the tree and optimizes it, but OPTIMIZE is
quite fast. Also, OPTIMIZE is only necessary when the programmer is
using a large DAMPER value. If you set the DAMPER value low you get
more churning, which slows down the operation of the tree, but you
don't have to run OPTIMIZE as often (maybe not at all).

Andrew Haley
Nov 26, 2009, 5:42:29 AM
John Passaniti <john.pa...@gmail.com> wrote:
> On Nov 25, 5:10 am, Andrew Haley <andre...@littlepinkcloud.invalid>
> wrote:

> > I would have thought that for symbol tables it makes much more
> > sense to use open addressing. I've always been a bit mystified by
> > the popularity of collision resolution by chaining.

> No real mystery: As the load factor of the hash table increases, the
> cost of probing around to insert or find a key can get quite
> expensive.

Let me explain a little more about what I meant. Sometimes you see
simple applications of hash tables that are not really critical to
application performance, where the programmer has gone through the
rigmarole of resolution by chaining, for additional code complexity
and no real advantage. This contradicts the basic rule of "start by
doing something simple, measure the results, and only do anything more
complicated if performance really justifies it."

> In applications where hashes play a critical role (such as many
> scripting languages that do dynamic lookup of functions and other
> data), that expense is going to directly impact on the speed of the
> language. The usual resolution to this is to increase the size of
> the hash when the load factor crosses some threshold and then rehash
> everything. But that itself can be expensive.

Well, you have to do that for any hash table.

On the one hand you have the simplicity of open addressing, but its
downside is that you mustn't allow the hash table ever to become more
than about a third to half full. Even when a table is half full, there
are on average only 1.5 probes for a successful search and 2.5 probes
for an unsuccessful one.
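As an illustration of how little code open addressing needs, here is a
sketch under the assumptions above (a power-of-two table kept roughly
half full, and a hash that behaves reasonably). The names are made up;
this is not any particular Forth's implementation:

#include <stddef.h>
#include <string.h>

#define TABLE_SIZE 1024                /* power of two, kept roughly half full */

typedef struct { const char *key; void *value; } slot;
static slot table[TABLE_SIZE];         /* key == NULL marks an empty slot */

static unsigned long hash(const char *s)
{
    unsigned long h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Linear probing: start at the hashed slot, scan forward until the key or
   an empty slot turns up.  At a load factor of 1/2 this averages about
   1.5 probes for a successful search and 2.5 for an unsuccessful one. */
static void *lookup(const char *key)
{
    size_t i = hash(key) & (TABLE_SIZE - 1);
    while (table[i].key != NULL) {
        if (strcmp(table[i].key, key) == 0)
            return table[i].value;
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return NULL;
}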

> The popularity of chaining probably comes from two things. First,
> as the load factor of the hash increases, the cost of lookups tends
> to grow linearly (as opposed to open addressing, where the cost can,
> depending on the hash quality and probing strategy, grow much
> faster).

Well, I'm assuming here a hash that behaves like a random function.
If not, all bets are off, but I'm not at all convinced by the strategy
of collision resolution by chaining as a way to mitigate a bad hash
function!

> Second, by ordering the data in the chains, it's possible to use
> faster techniques (like skip lists, binary searches, etc.) to reduce
> the search cost within the chain.

> In other words, the conceptual simplicity of open addressing comes
> at the cost of unpredictable access times mixed with the waste of
> unused space to keep the load factor low. Chaining is more complex,
> but that complexity pays off in terms of predictability and
> (potentially) less wasted space, which in some applications matters
> more.

I don't agree about predictability: as long as the table is kept
reasonably empty, there's no difference in predictability between the
two approaches. I accept that if you want a really dense hash table
in memory and you're prepared to put up with the extra complexity,
there may be something to be said for chaining.

Andrew.

Anton Ertl
Nov 26, 2009, 6:33:30 AM
Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>Sometimes you see
>simple applications of hash tables that are not really critical to
>application performance, where the programmer has gone through the
>rigmarole of resolution by chaining, for additional code complexity
>and no real advantage.

Rigmarole? Additional code complexity? I find a hash table with
external chaining to be very simple to implement: it's a little
demuxing code tacked in front of a linked-list implementation.

In contrast, with open addressing the implementation is much more
involved: The text books say that I should compute another, different
hash function for dealing with the collisions, and I have to deal with
the possibility of having more entries than slots. With chaining, a
higher load factor results in degraded performance; with open
addressing it results in failure, so I have to implement table
overflow handling even in cases where a simple fixed-size table would
be good enough with chaining.
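For comparison, a minimal sketch of the "demuxing code tacked in front of
a linked-list implementation" that Anton describes. The names are
illustrative, the table size is fixed, and resizing is left out:

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 256                   /* a simple fixed-size table */

typedef struct entry {
    const char   *key;
    void         *value;
    struct entry *next;
} entry;

static entry *buckets[NBUCKETS];

static unsigned long hash(const char *s)
{
    unsigned long h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

/* The "demux" is one line; the rest is an ordinary linked-list search. */
static void *find(const char *key)
{
    for (entry *e = buckets[hash(key) % NBUCKETS]; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0)
            return e->value;
    return NULL;
}

/* Insert at the head of the chain, so newer entries shadow older ones. */
static void insert(const char *key, void *value)
{
    entry **head = &buckets[hash(key) % NBUCKETS];
    entry *e = malloc(sizeof *e);
    e->key = key; e->value = value; e->next = *head;
    *head = e;
}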

>This contradicts the basic rule of "start by
>doing something simple, measure the results, and only do anything more
>complicated if performance really justifies it."

Yes, that's why I have always stuck to chaining.

>I accept that if you want a really dense hash table
>in memory and you're prepared to put up with the extra complexity,
>there may be something to be said for chaining.

I would have said exactly the same thing about open addressing. With
chaining, I always waste as many pointers as the table contains. With
that many free slots, open addressing can do quite well; the
complexity is in the implementation.

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2009: http://www.euroforth.org/ef09/

Andrew Haley
Nov 26, 2009, 9:40:44 AM
Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> Andrew Haley <andr...@littlepinkcloud.invalid> writes:

> >Sometimes you see simple applications of hash tables that are not
> >really critical to application performance, where the programmer
> >has gone through the rigmarole of resolution by chaining, for
> >additional code complexity and no real advantage.

> Rigmarole? Additional code complexity? I find a hash table with
> external chaining to be very simple to implement: it's a little
> demuxing code tacked in front of a linked-list implementation.

Well, yeah, as opposed to a little demuxing code.

> In contrast, with open addressing the implementation is much more
> involved: The text books say that I should compute another, different
> hash function for dealing with the collisions,

Double hashing is only one option.

> and I have to deal with the possibility of having more entries than
> slots.

As I said previously, I assume that you're going to have to resize
eventually, no matter what kind of hash table you use. If that
assumption doesn't hold, then you will get into a situation where
every slot is full, and then resolution by chaining is the only
possibility. Clearly in that situation you can't possibly use open
addressing.

Andrew.

Anton Ertl
Nov 26, 2009, 11:35:18 AM
Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
>> Andrew Haley <andr...@littlepinkcloud.invalid> writes:
>
>> >Sometimes you see simple applications of hash tables that are not
>> >really critical to application performance, where the programmer
>> >has gone through the rigmarole of resolution by chaining, for
>> >additional code complexity and no real advantage.
>
>> Rigmarole? Additional code complexity? I find a hash table with
>> external chaining to be very simple to implement: it's a little
>> demuxing code tacked in front of a linked-list implementation.
>
>Well, yeah, as opposed to a little demuxing code.

Plus a little collision handling code (both on lookup and insert; only
on lookup with chaining); plus a little code for resizing.

Bernd Paysan
Nov 26, 2009, 12:18:52 PM
Anton Ertl wrote:
> In contrast, with open addressing the implementation is much more
> involved: The text books say that I should compute another, different
> hash function for dealing with the collisions, and I have to deal with
> the possibility of having more entries than slots. With chaining
> having a higher load factor results in degraded performance, with open
> addressing it results in failure; so I have to implement table
> overflow handling even in cases where a simple fixed-size table would
> be good enough with chaining.

Another note: Gforth's way to deal with multiple vocabularies in just one
shared hash table won't work with open addressing. And our method is much
more practical with the grow-your-hash-table-when-needed approach: Since we
have only one hash table, it's already large, and doesn't have to grow
often.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

Anton Ertl
Nov 26, 2009, 2:09:39 PM
Bernd Paysan <bernd....@gmx.de> writes:
>Another note: Gforth's way to deal with multiple vocabularies in just one
>shared hash table won't work with open addressing.

Why not? Because our method relies on separate buckets staying
separate, whereas in open addressing you jump to another bucket if the
first is full. One could work around that by storing the wordlist
number with each word, but that costs extra memory.

Another advantage of chaining in the context of Forth is that it
provides easy and natural shadowing of older definitions with the same
name, because it keeps this property of linked lists. However, IIRC
we don't really make use of that in Gforth, because we always rebuild
the hash table when executing a marker.

>And our method is much
>more practical with the grow-your-hash-table-when-needed approach: Since we
>have only one hash table, it's already large, and doesn't have to grow
>often.

But when it grows, that's much more expensive. Still, if we used
separate hash tables, when somebody does 10 wordlists, each with 100
words, 640 of these 1000 words will have gone through one doubling,
320 of these 640 through two doublings, etc.; in contrast, because we
supersized the hash table from the start (one doubling beyond what the
algorithm would do by itself), none of the words would have to go
through any doubling with our scheme.

Another advantage is that there is less meta-data overhead: no need to
store a pointer, table size and population size for every wordlist.

John Passaniti
Nov 27, 2009, 2:57:17 AM
On Nov 26, 2:19 am, Hugh Aguilar <hugoagui...@rosycrew.com> wrote:
> Nobody here has shown me any evidence to indicate that my symtab is
> sub-optimal.

At best, the only thing you're able to claim with regard to optimality
is that at any particular moment in time, the nodes are ideally
arranged given some *past* cumulative frequency of access. That isn't
a very useful measurement. Nobody cares about the optimality of the
tree *after* some number of symbols have been accessed. What matters
are the basic statistical measures (sum, mean, minimum, maximum,
standard deviation) of the access times leading up to that point (and
the same statistics afterward). If your measure of optimality is how
accurately the tree's shape reflects the frequency distribution of
symbols, then congratulations-- you've apparently solved a problem
that nobody cares about.

You're also putting a tremendous amount of faith that the cumulative
frequency distribution at any point in time actually matters. What
seems more likely to me is that one cares far less about the overall
frequency distribution than the temporal pattern of access. Say I
have a symbol accessed more than any other during a run, but only near
the start of the run. Your scheme will promote that symbol to the
root, where it will uselessly remain there during the remainder of the
run. How is that helpful? How is that adaptive?

> If splay trees don't store a count the way that my symtab trees do,
> then they aren't going to be as efficient, because they have less
> information to work with.

The insight that you're missing is that searching the tree is itself
information. It's the identification of a "hot" path of pointers to a
desired node. No count is necessary to store or later compare.

You are using a singular definition of efficiency based on the tree's
shape accurately reflecting the frequency distribution of symbols.
Ignoring for the moment that I question the very premise of that being
useful in the kinds of applications one typically wants a symbol table
for, it completely ignores the cost of restructuring the tree. Let's
compare:

1) When some threshold is reached, a symtab decides to stop the world
and restructure the tree. This requires iterating over the *entire*
tree, juggling nodes around. As the tree grows, that is an
increasingly expensive operation in terms of processor time.

2) A splay tree optimizes only along the "hot" path of pointers to a
node when searching. The rest of the tree is unaffected and retains
the shape that past patterns of access left in those untouched parts
of the tree. The end result is that splay trees do less work than
your symtab and that work done is amortized over time. In some
variations of splay trees, the work done is even less because the
searched node isn't moved to the root, but just promoted closer to it.

> It seems unrealistic that any algorithm is
> going to minimize both space and time. More likely, there is a trade-
> off. These splay trees minimize space (no count field) at a cost in
> time, whereas my symtab trees minimize time (they are optimal) at a
> cost in space (a count field).

Again, you've optimized for something nobody cares about. What good
is having an optimal tree structure in memory when you've just
finished searching the tree and the program is ending? What good is
having an optimal tree structure based on frequency if the frequency
of symbols no longer represents the temporal pattern of access? Splay
trees make no claim (and thus, put forth no effort) in ensuring that
the tree at any point in time is globally optimal. Instead, splay
trees are concerned with a looser local, temporal optimality. And
they do so amortizing the cost over time, making access times more
evenly distributed than your "hold on while I stop the world and
restructure the whole tree" strategy.

> Also, I don't think that comparing integers takes up much time.

Not comparing integers takes up even less time. And less code. And
less complexity. And less opportunity for error. And no need for
code that tests to see if count exceeds a fixnum and then iterates
over the entire tree, dividing the count by two.

> I'll do some research on these splay-trees and get back to you with a
> comparison. I still feel confident that, whatever splay-trees are, my
> symtab will be more efficient. If splay trees don't have a count
> field, then they have only a vague guess of how many times each node
> has been accessed. How much information can be stored in zero bits?

The information is stored in the path of pointers leading to the node.

> I'm not an expert on information theory, but I think the answer would
> be: "None."

Yes, it's clear you aren't an expert on information theory. It's
probably because you have a rather limited definition of
"information."

> I don't consider the memory usage of the count field to be much of an
> issue. Even with an extra word per node, symtab is still going to be
> more memory efficient than hash tables with their big sparse array.

Most real-world implementations of hash tables start small and then
grow as necessary.

> Also, a symtab tree is going to be more robust because it grows
> dynamically. To resize a hash table you have to reinsert all of the
> elements.

False. Many hash table implementations do indeed reinsert all the
elements because that is simple. But it isn't necessary. Linear
hashing is a technique that allows for growing a hash table one node
at a time, and it only requires rehashing the elements in a single
collision chain. Other techniques, like "consistent hashing", work by
partitioning the storage space and only require reinsertion of the
elements that fall in that space.

> That is very slow.

False. Hash tables are used in real-time situations where consistent
speed matters (and where your symtab would fail). This is done by
amortizing the cost of reinserting nodes over time. One common
technique to resizing a hash table that has real-time constraints is
to allocate a new empty table, and then use it for primary searches,
falling back to the original table when searches fail. New insertions
occur in the new table, and nodes are incrementally moved from the
original to the new table, bound to an upper limit (such as number of
nodes or an interval of time). When finished, you delete the original
table.
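A rough sketch of that incremental-resize idea: two chained tables, with a
bounded number of entries migrated at a time. Allocation, the trigger for
starting a resize, and the final switch-over are left out, and the names
are made up:

#include <stddef.h>
#include <string.h>

typedef struct entry { const char *key; void *value; struct entry *next; } entry;
typedef struct { entry **buckets; size_t nbuckets; } table;

static table  old_tab, new_tab;       /* new_tab is, say, twice old_tab's size */
static size_t migrate_cursor;         /* next old bucket to drain */

static unsigned long hash(const char *s)
{
    unsigned long h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

static entry *chain_find(entry *e, const char *key)
{
    for (; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0) return e;
    return NULL;
}

/* Look in the new table first, fall back to the not-yet-migrated old one. */
static void *lookup(const char *key)
{
    entry *e = chain_find(new_tab.buckets[hash(key) % new_tab.nbuckets], key);
    if (e == NULL)
        e = chain_find(old_tab.buckets[hash(key) % old_tab.nbuckets], key);
    return e ? e->value : NULL;
}

/* Move at most `limit` entries across, e.g. once per insert, so no single
   operation ever pays for a full rehash. */
static void migrate_some(size_t limit)
{
    while (limit > 0 && migrate_cursor < old_tab.nbuckets) {
        entry *e = old_tab.buckets[migrate_cursor];
        if (e == NULL) { migrate_cursor++; continue; }
        old_tab.buckets[migrate_cursor] = e->next;
        entry **head = &new_tab.buckets[hash(e->key) % new_tab.nbuckets];
        e->next = *head;
        *head = e;
        limit--;
    }
}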

> It is true that I have my OPTIMIZE
> function that traverses the tree and optimizes it, but OPTIMIZE is
> quite fast. Also, OPTIMIZE is only necessary when the programmer is
> using a large DAMPER value. If you set the DAMPER value low you get
> more churning, which slows down the operation of the tree, but you
> don't have to run OPTIMIZE as often (maybe not at all).

Quite fast? I'm sorry, I'm unfamiliar with that unit of measure.
Help me convert that to processor cycles per node so that I can
meaningfully decide if what you think is "quite fast" is suitable for
any particular application? Thanks.

Here's your homework. It has six parts. This really shouldn't take
much time:

1) Your data set at the end of your code is cute. Let's instead pick
something more useful and representative. You call this a symbol
table, so let's throw real-world symbols at it. Let's take the
symbols from source code. Factor is kind of cool, so let's use it.
When I look at the current distribution for Linux, I see there are
3177 files named "*.factor". I want you to iterate through those
files and put the symbols you find in those files in your symtab.
(I'll define a symbol as whitespace-separated tokens for ease of
parsing.) Ideally, you'll eliminate comments first, but maybe that
doesn't matter. Regardless, let's roughly estimate how many symbols
there are:

$ find -name "*.factor" | xargs cat | awk '{ for(i=1;i<NF;i++) print $i;}' | wc -w
1228623

Cool, roughly 1.2M symbols. I wonder how many of those are unique?

$ find -name "*.factor" | xargs cat | awk '{for(i=1;i<=NF;i++) print $i;}' | sort -u | wc -w
106714

So we're talking in the neighborhood of 100k unique symbols to store
in your symtab. That should tell us something.

2) Instrument your code so that you record the time taken to insert/
access symbols in the table. I don't know Factor, but it probably has
a library routine to look at the system's high-resolution timer (or
better, the performance counters in your CPU). Use the best timer you
have to measure the insert/access time. Collect simple statistics;
the sum, mean, minimum, and maximum.

3) Comment-out the timer additions from the previous step and replace
it with a count of the number of probes it took to find the symbol.
So if a symbol was found at the root, that's 1. If it was found down
one level, that's 2. And so on. Again, collect simple statistics.

4) Report your numbers. Include the value(s) you used for DAMPER.

Okay, that will be interesting. And now it's time for the *really*
fun part:

5) Change the code to replace the increment of "count" on insert/
access with storing the current time (hopefully with millisecond
resolution, but okay if not). Keep the rest of the code the same. In
other words, we're going to completely ignore frequency and instead
reorder the tree based in terms of when a field was last used.

6) Repeat steps 2 through 4, above.

And here's my fearless prediction: Getting rid of the frequency
counting will reduce the average number of probes to find a symbol,
probably dramatically. That's because you want more of a cache and
less of a statistical model of your symbols. I don't know if this
will result in a reduction of the average access time to find symbols;
that time could potentially be dominated by the OPTIMIZE routine
stopping the world and restructuring nodes.

Have fun.

Bernd Paysan
Nov 24, 2009, 5:34:35 PM
Hugh Aguilar wrote:
> Hash tables are popular these days because computers have a lot of
> memory. The traditional complaint against hash tables was that they
> use too much memory, and therefore are a bad choice for 16-bit
> computers. The memory problem is still an issue with 32-bit computers
> though. Consider the case in which each wordlist has its own hast
> table.

In Gforth, we use a single hash table for all wordlists. Different
wordlists have a different "seed" into the hash table, i.e. when DUP is
in wordlist 0 (the first one), it would go to slot 316 (just an
example), if it's in wordlist 3, it would go to slot 319, you get the
basic idea (we use XOR). The size of the hash table is the next power
of two larger than the total number of entries. That way, you won't
have too many collisions, and not waste too much space. If the number
of entries reaches the critical value, the hash-table size is doubled
and the table re-populated.
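The bucket computation Bernd describes can be sketched in a few lines.
This is not Gforth's actual source, only the idea of XORing a per-wordlist
seed into the name hash and masking with the power-of-two table size:

#include <stddef.h>

static unsigned long hash(const char *s)
{
    unsigned long h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

/* One shared table: the same name lands in different slots depending on
   which wordlist it belongs to. */
static size_t bucket_for(const char *name, unsigned long wordlist_seed,
                         size_t table_size /* power of two */)
{
    return (size_t)(hash(name) ^ wordlist_seed) & (table_size - 1);
}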

In case of collisions, later entries are going to be faster, since they
get to the head of the chain (that's Forth's search semantics). To
avoid slowdowns by too many redefinitions (all of them will end up in
the same bucket, and would make access to the tail of the chain slower),
we could eliminate duplicates, but the current implementation just
doesn't worry about that.

I've experimented with AVL trees (balanced binary trees), and the
benchmark conclusion was that

a) AVL trees are harder to implement
b) they are significantly slower than the hashes as implemented above
for larger vocabularies (O(log(N)) instead of O(1))

Hugh Aguilar
Nov 27, 2009, 10:40:23 PM
On Nov 27, 12:57 am, John Passaniti <john.passan...@gmail.com> wrote:
> You're also putting a tremendous amount of faith that the cumulative
> frequency distribution at any point in time actually matters. What
> seems more likely to me is that one cares far less about the overall
> frequency distribution than the temporal pattern of access. Say I
> have a symbol accessed more than any other during a run, but only near
> the start of the run. Your scheme will promote that symbol to the
> root, where it will uselessly remain there during the remainder of the
> run. How is that helpful? How is that adaptive?

My symtab tree restructures itself to be optimal for a program that is
getting recompiled repeatedly. There is some churning in the beginning
of the development of that program. Pretty soon though, the tree
settles down into an optimal state and there is very little churning,
which mostly occurs as the programmer writes new words and these start
out at a leaf and work their way up to an appropriate place in the
tree.

This "temporal pattern of access" doesn't make sense to me. You seem
to imply that the entire tree gets restructured during the course of
compilation of a program. That is a lot of churning!

> You are using a singular definition of efficiency based on the tree's
> shape accurately reflecting the frequency distribution of symbols.
> Ignoring for the moment that I question the very premise of that being
> useful in the kinds of applications one typically wants a symbol table
> for, it completely ignores the cost of restructuring the tree.

I'm not "completely ignoring the cost of restructuring the tree." As I
said above, the tree gradually settles down into an optimum state and
there is less and less churning needed.

> 1) When some threshold is reached, a symtab decides to stop the world
> and restructure the tree. This requires iterating over the *entire*
> tree, juggling nodes around. As the tree grows, that is a
> increasingly expensive operation in terms of processor time.

I have no idea what you are talking about here; there is no threshold
that triggers anything to happen. My symtab never "stops the world"
and restructures the entire tree. I do have my OPTIMIZE function that
restructures the entire tree, but its use is optional. It is possible
to set DAMPER to a value > 0 (the default is 30), in which case the
tree is sub-optimal. The higher the DAMPER value, the further it is
from optimality, and the lower the DAMPER value the closer the tree is
to optimality, with a DAMPER value of zero providing perfect
optimality. The use of OPTIMIZE is always optional though. I recommend
that if you have a high DAMPER value, you should call OPTIMIZE
occasionally. A good time to do this is at the end of the day when you
are shutting down the system. It will "stop the world" and restructure
the entire tree, but you have already gone home at this time.

I do have my PRUNE function that removes all of the smudged nodes from
the leaves. This is also optional. PRUNE is more important than
OPTIMIZE though, as it saves memory by getting rid of all of those
smudged nodes. It also speeds up the tree because searches through the
tree won't have to go through all of those smudged nodes. PRUNE also
halves all of the count values in the tree if they have gotten too
big. This is important. If you never call PRUNE, the count values will
eventually overflow the single-precision integers that they are stored
in. This may take several months to happen, but it will happen
eventually, so you do want to call PRUNE occasionally --- when you are
going home at the end of the day is a good time. Fortunately, PRUNE is
extremely fast because it doesn't move nodes around at all --- it just
traverses the tree.
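The count-halving part of what Hugh describes is simple enough to sketch
(illustrative only; the smudged-node pruning is left out): a plain
traversal that halves every count, which roughly preserves their relative
order while keeping them well away from overflow:

typedef struct node { unsigned count; struct node *left, *right; } node;

static void halve_counts(node *n)
{
    if (n == NULL) return;
    n->count = (n->count + 1) / 2;    /* never drops below 1 */
    halve_counts(n->left);
    halve_counts(n->right);
}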

Your comments above, about stopping the world to restructure the
entire tree, and about some kind of threshold, imply that you don't
understand how symtab works. Study the code --- you'll figure it out
eventually.

> 1) Your data set at the end of your code is cute. Let's instead pick
> something more useful and representative. You call this a symbol
> table, so let's throw real-world symbols at it. Let's take the
> symbols from source code. Factor is kind of cool, so let's use it.

Symtab normally initializes each node with a count value of 1. If you
have a-priori knowledge of what symbols are going to be used, you can
initialize each node with a count value that you provide. Assuming
that your a-priori knowledge is a good predictor of the future, this
will get the symtab tree into an optimal structure a lot sooner than
if every node starts out at 1. In either case though, the tree will
eventually become optimal --- it just gets there sooner with some help
up front.

I read in Forth Dimensions a long time ago that somebody did some
statistics over a large number of Forth programs to determine which
words are the most commonly used. This was way back in the days of linked
lists, and the idea was to structure the linked list by order of
frequency, rather than by the order that the words happened to be
defined. If symtab were to be used for a Forth dictionary, collecting
statistics like this would be a good idea so that the dictionary would
be pretty close to optimal out-of-the-box, assuming that the user's
programs will be similar to the programs written by other Forth
programmers.

The programmer might be different than everybody else. He might, for
example, write a lot of programs involving floating-point arithmetic,
whereas the programs that we collected statistics on might not have
used floating-point very much. In this case, there will be some
churning in the early days of the guy's usage of the compiler as the
floating-point words burble upwards in the tree. Pretty soon though,
the tree will have adapted itself to its new owner's programming style
and will be optimal for him (assuming that his programming style
doesn't change over time).

Hugh Aguilar
Nov 27, 2009, 11:27:27 PM
On Nov 24, 3:34 pm, Bernd Paysan <bernd.pay...@gmx.de> wrote:

> Hugh Aguilar wrote:
> > Consider the case in which each wordlist has its own hast
> > table.
>
> In Gforth, we use a single hash table for all wordlists.  Different
> wordlists have a different "seed" into the hash table, i.e. when DUP is
> in wordlist 0 (the first one), it would go to slot 316 (just an
> example), if it's in wordlist 3, it would go to slot 319, you get the
> basic idea (we use XOR).  The size of the hash table is the next power
> of two larger than the total number of entries.  That way, you won't
> have too many collisions, and not waste too much space.  If the number
> of entries reaches the critical value, the hash-table size is doubled
> and the table re-populated.

UR/Forth did the same thing in its single hash table; it prepended the
voc# on the front of each word's name. It is a lot more efficient to
put each word-list in its own data structure however. In the case of
my symtab, this means that we avoid comparing our search string to a
large number of nodes that are in other word-lists and therefore
aren't going to match. In the case of a hash table, it means a lot
fewer collisions. UR/Forth pretty much had to use a single hash table
because the 16-bit 8086 uses segmented memory and so there was only
one realistic size for a hash table (64K).

That business of resizing and repopulating the hash table is very slow.
Also, this happens at unpredictable times and can cause the system to
freeze up unexpectedly. All in all, I think that my symtab is much
more robust than any hash table.

> I've experimented with AVL tress (balanced binary trees), and the
> benchmark conclusion was that
>
> a) AVL trees are harder to implement

I agree that AVL trees are hard to implement, and my experience is
that symtab trees are even harder to implement. Fortunately, I've
already done the heavy lifting, so it should be fairly easy for
anybody to integrate my symtab code into his compiler.

> b) they are significantly slower than the hashes as implemented above
> for larger vocabularies (O(log(N)) instead of O(1))

I agree that AVL trees are slow, but this doesn't have anything to do
with symtab trees.

John Passaniti
Nov 29, 2009, 12:31:34 PM
On Nov 27, 10:40 pm, Hugh Aguilar <hugoagui...@rosycrew.com> wrote:
> My symtab tree restructures itself to be optimal for a program that is
> getting recompiled repeatedly.

Fascinating use case! So your symtab *isn't* optimal for:

1) The symbol table in on-demand interpreted languages (such as
scripting languages) which are only loaded for the lifetime of a
program.

2) The symbol table for compiled languages that make only a single
pass over the source code (which is pretty much every language
designed since the 70's).

3) The symbol table for languages like Forth where the simplified
memory management and fast recompilation makes it simpler to just
restart.

There's not a whole lot left.

Instead, you've got an "optimal" symbol table for a language which
remains resident in memory across repeated recompilations of the same
program. A language that (presumably) has such high start-up costs
that it makes sense to keep it in memory at all times. A language
that apparently doesn't offer any means to save and later reload the
environment, avoiding those high start-up costs and rewinding to a
checkpoint. Wow. You've apparently invented a symbol table for a
language that I can't imagine most programmers would want to use.
Congratulations.

But really, none of that matters. Even if you can cite a real-world,
non-contrived use case where your odd take on a symbol table actually
makes sense, you are *still* missing the fundamental flaw in your
design: Your notion of optimality is based on the *cumulative*
frequency of symbols up to any particular point in time. The problem
with that should be (hopefully) obvious:

You're assuming the *complete* past (that's the cumulative part)
matters. If at the start of a run, I have a symbol that occurs far
more than any other, your symtab puts it at the top of the tree. But
how does this help if from that point on, the frequency of that symbol
drops (or is never seen again)? Yes, if another symbol later becomes
dominate, you'll eventually adjust for it-- assuming its frequency is
higher than the previous highest-frequency symbol. In other words,
you're *ignoring* that accesses to a symbol have peaks and valleys and
that they often occur in clusters. You're effectively only handling
peaks. If you were truly adapting to the symbols, you would act more
as a cache, pushing symbols that were recently accessed up higher
since it is more likely that they will be accessed again. In other
words, you're optimizing for the wrong thing.

> This "temporal pattern of access" doesn't make sense to me.

Evidently, or you would have thrown away your symtab long ago. Allow
me to demonstrate:

Let's take the symbols from every "*.factor" file in the current
Factor distribution for Linux. For simplicity, I'm going to call a
symbol anything separated by whitespace. That means we get some non-
symbols (like comments), but that's just noise and it's good enough to
illustrate the point. As I wrote before, there are roughly 1.2
million symbols in that code, of which 100k are unique. I took those
symbols and counted their frequency. Here's a URL with the top 768
symbols:

http://JapanIsShinto.com/stuff/top768

If we graph these we get a frequency distribution that looks like
this:

http://JapanIsShinto.com/stuff/top768.png

Please note the Y axis is logarithmic. So that's the frequency
distribution your symtab will adjust toward. But is that really what
you want? Frequency doesn't tell you the whole story about the
symbols. For that, we need to look at the symbols over time-- the
temporal pattern of how they are accessed. Here's a picture:

http://JapanIsShinto.com/stuff/vis.png

The horizontal axis is, roughly, time. The vertical axis is symbols,
sorted from highest frequency (top row) to the lowest (bottom row).
My screen is big, but it isn't 1.2 million pixels wide. So I'm
scaling the image-- each pixel represents the lookups of 1200
symbols. So, if a pixel is lit, it means that there was at least one
access of the symbol during that group of 1200 lookups. The point of
all this? So we can get an idea not just of the frequencies of the
symbols, but how that frequency is distributed over time. That's the
part you've been missing.

So there are a couple obvious things to note. First, the top few
symbols have rows that are mostly white. That should be intuitive--
you would expect that in any group of 1200 lookups, there is a high
probability that a high-frequency symbol will be looked up. The
second thing to note is that as the frequency decreases, the rows get
progressively darker. That also should be intuitive-- the frequency
is lower, so you would expect there would be a lower probability of a
lookup in any group of 1200. So far, so good.

Now look a bit closer. One thing you'll note is that there are some
dark vertical bands of varying thickness. These dark bands usually
extend right up into the high frequency symbols. This means there are
periods of time when the symbols with the highest frequency simply
aren't looked up. Yet, your symtab would have by those times put
those high-frequency symbols at or near the top of the tree, making
the access to the lower-frequency symbols far more expensive.

You'll also note if you look at the rows closely that even for the
high-frequency symbols, there are times when those symbols simply
aren't accessed. And that's the next point-- what good is it to have
a tree that has symbols that you're not accessing at the time stuck
near the top? That's not some Zen koan-- you can see for yourself
that there are some symbols that are used quite a lot, but only in
small clusters.

But it gets worse. Let's take the same picture, but now, I'm going to
vary the brightness of the pixel according to how many times each
symbol is looked up in each group:

http://JapanIsShinto.com/stuff/vis-scaled.png

Same story, but now with more detail. Now we can see the clusters
better-- they show up as brighter pixels. As you look at the picture,
you'll see all sorts of interesting things. Symbols with relatively
low frequencies but which are accessed fairly regularly (many rows of
the same brightness). Short periods of time when groups of relatively
low-frequency symbols are accessed frequently. Even more vertical
banding, indicating that lower-frequency symbols are being accessed.
And so on.

The source code I used is real-world. It isn't a handful of symbols
like the ones you tossed at the bottom of your symtab code as a test.
It represents real-world frequencies for code in a real language. If
I repeated this experiment with other languages, I fearlessly predict
you would see the exact same thing. You would see that in the real-
world, independent of language, symbols are clustered together, and
that being "optimal" in terms of just frequency doesn't address that
clustering.

> Your comments above, about stopping the world to restructure the
> entire tree, and about some kind of threshold, imply that you don't
> understand how symtab works. Study the code --- you'll figure it out
> eventually.

Sorry, but I'm not going to invest time to fully understand badly
documented code in a language I am new to. And I'm especially not
going to do so when I can easily show that the premise your symtab is
based on is flawed. Why exactly should I invest time to fully
understand something that is on its face inferior to other
solutions? Ordering the tree by just frequency may be "optimal" in
the narrow sense that the tree models the frequency distribution, but
it is certainly not "optimal" in terms of what one actually wants from
a symbol table.

Your comments suggest that a motivation for writing symtab was (in
part) to avoid the excess memory of a hash table and the presumption
of random access times. But there too, you have failed.

Memory: Every node in your symtab has a count and two pointers of
overhead. So in your scheme, you're using up N*3 words on top of
whatever storage you're using to store the node's contents. Most hash
tables are designed so that collisions are minimized up to a certain
load factor-- usually around 75%. So they typically "waste" around
25% of their allocated space. So going back to the real-world symbols
I used above, with 100k unique symbols, your symtab is going to be
wasting 100k*3 = 300k words in overhead. That's 1.2 megabytes of
overhead wasted for a highly dubious definition of "optimal." The
same symbols in a 75%-full hash table would "waste" about 25k words or
about 100k bytes. So using an open-addressed hash table with the
symbols I described previously results in a net savings of 1.1
megabytes over your symtab. So much for hash tables not being memory
efficient. So your symtab sucks if one cares about the amount of
memory used-- a hash table clearly wins because your overhead grows
faster than than a hash table's. More to the point, if I used the
same amount of memory you use in overhead, I can have a hash table
where the probability of collisions is nearly zero.

Incidentally, when talking about hash tables that use chaining to
handle collisions, most people think linked lists. That's in my
experience is rare. Most "chains" are actually arrays. So in hash
tables that use chaining, the only overhead for a chain is 1+M words;
one word for the length of the array, M words for the array. That's
*still* less overhead than your symtab.

Time: Your symtab is too complicated for me to characterize the
amount of time it takes to look up symbols, and you were too lazy to do
as I suggested and *measure* it. A straight balanced binary tree
would do lookups in O(log2 N) time, but your symtab isn't balanced.
Your hope is that most lookups will be in O(M) time for some small M,
but that (wrongly) assumes the frequency distribution matches the
temporal pattern of access, which I've shown isn't the case. On top
of this, you're shifting nodes around based on frequency, so coming up
with a big-O measure of your symtab is going to be hard (maybe one of
the academics here can work that out). In contrast, most real-world
hash tables have hash functions that have very good performance (that
is, produce few collisions). That means that an open addressed hash
table will typically give you lookups in constant time close to O(1),
up to a load factor of about 75%. A bad hash function can make that
worse, but even then, most real-world hash tables order collisions by
putting the most-recently used at the head of the chain. The end
result is that even when hash tables have bad hash functions, they
still exploit the temporal pattern of access that your symtab does
not. So practically, most hash table accesses typically will do
lookups in O(1) time, or O(2) time if the hash function is bad. So
your symtab sucks if one cares about the efficiency of lookups.

So tell me again under what highly-specific conditions your symtab is
beneficial? It's slower and takes up more memory than a hash. Hell,
I'd be willing to bet that the performance of symtab is dramatically
worse than a balanced binary tree.

> I read in Forth Dimensions a long time ago that somebody did some
> statistics over a large number of Forth programs to determine which
> are the most commonly used. This was way back in the days of linked
> lists, and the idea was to structure the linked list by order of
> frequency, rather than by the order that the words happened to be
> defined.

Yeah, you'll get some savings in lookup time, but how much? Using the
previous symbols I was working with, let's first determine the cost of
keeping the linked list in order of definition. So I'll feed the
symbols to this Perl script:

while (<>) {
    chomp;
    foreach my $sym (split) {
        $order{$sym} = ++$last if !exists $order{$sym};
        $sum += $order{$sym};
    }
}
print "$sum probes\n";

The result is 17835382742 probes for 1.2 million symbols. So our
baseline is 17.8 billion probes for this set of symbols, in order of
definition. That's a mean of 14516 probes per symbol. Now let's see
how much sorting the list helps. That's actually pretty simple; we
just take the sum of each symbol's frequency and multiply by its rank:

find -name "*.factor" | xargs cat | awk '{ for(i=1;i<NF;i++) print $i;}' | sort | uniq -c | sort -nr | awk '{sum+=NR*$1} END{print(sum)}'
6556487746

That's clearly better. Sorting the list takes us down to 6.5 billion
probes for a mean of 5336 probes per symbol. But still, even on a
fast processor, that's a huge amount of time flying through the linked
list. And yes, I would imagine that a symtab would save some more.
How much is hard to say-- and you don't seem too keen on actually
testing your symtab against real-world data. But we can note that a
balanced binary tree (or a binary search over a sorted array) will
probably out-perform your symtab. Yes, the cost of frequent symbols
is low, but the cost of less-frequent symbols is higher-- potentially
much higher in degenerate cases.

So let's see how many probes a binary search would take. We know that
the worst-case for a binary search is going to be log2(N), and we know
we have about 100k unique symbols. So that means at worst, a binary
search will take 17 probes per symbol. Let's run that against the
same set of symbols; we do the same thing as before, but now multiply
by 17 instead of the rank:

find -name "*.factor" | xargs cat | awk '{ for(i=1;i<NF;i++) print $i;}' | sort | uniq -c | sort -nr | awk '{sum+=17*$1} END{print(sum)}'
20886591

That's dramatically better-- and again, this number is actually
artificially inflated because I'm not using the actual number of
probes, but the worst case of 17. And it's easy to implement too--
store the Forth dictionary in two parts. The user's words are a
standard linked list. If the word you're looking for isn't found, you
fallback to the sorted array and do a binary search. It's more
complex than a straight linked list, but not tremendously so, and for
a system with a large dictionary, it's a *huge* win over storing the
list in order of frequency. So let's review, ranked in order of
number of probes:

1) 17.8 billion probes for a linked list in order of definition.
2) 6.5 billion probes for a linked list ordered by frequency.
3) Some unknown number of probes for your symtab (feel free to measure
it). My educated guesstimate would be around 500 million probes.
4) 20 million probes-- and that's artificially large-- for a simple
binary search.
5) Roughly 1.2 million probes for a hash table.
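The two-part dictionary suggested above is also easy to sketch. The names
and data below are illustrative only: the user's recent definitions sit in
a small linked list, and everything else is found by binary search over a
sorted array:

#include <stddef.h>
#include <string.h>

typedef struct word { const char *name; struct word *next; } word;

static word *user_words = NULL;                                   /* newest first */
static const char *system_names[] = { "!", "+", "DUP", "SWAP" };  /* sorted */
static const size_t n_system = sizeof system_names / sizeof *system_names;

/* Linked-list search for the user's words, binary-search fallback for the
   rest: at worst about log2(n_system) probes for a system word. */
static const char *find_name(const char *name)
{
    for (word *w = user_words; w != NULL; w = w->next)
        if (strcmp(w->name, name) == 0) return w->name;

    size_t lo = 0, hi = n_system;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = strcmp(name, system_names[mid]);
        if (cmp == 0) return system_names[mid];
        if (cmp < 0) hi = mid; else lo = mid + 1;
    }
    return NULL;
}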

If memory was a premium, I would clearly choose the simple binary
search. It's likely to be an order of magnitude faster than your
symtab and far easier to implement. If speed mattered more than
space, I'd go for the hash table which will be an order of magnitude
faster than the simple binary search.

> If symtab were to be used for a Forth dictionary, collecting
> statistics like this would be a good idea so that the dictionary would
> be pretty close to optimal out-of-the-box, assuming that the user's
> programs will be similar to the programs written by other Forth
> programmers.

If by now you still don't get why this isn't true, then I *strongly*
suggest that instead of pulling a (bad) definition of "optimal" out of
your ass, that you actually *prove* it. I've given you a real-world
data set that you probably have on your computer right now. I've
given you visualizations to show that symbols naturally cluster, and
noted that your symtab doesn't exploit this natural property. I've
shown you that your symtab uses more memory and will have far worse
performance than a hash or even just a simple balanced binary tree.

If you're too lazy to prove your case by following the six steps I
previously detailed, just say so. Afraid to see how your symtab works
against real-world data? I can see why. You've made the classic
mistake a lot of programmers make. You've confused the optimality of
a data structure with it being optimal for a specific application.

P.S.: While rummaging through the Factor source code, I see they have
an implementation of splay trees. So with very little effort on your
part, you can test your symtab against a splay tree.

Hugh Aguilar

unread,
Nov 29, 2009, 8:58:39 PM11/29/09
to
I ignored your lengthy post because of your vulgar mindset. Here are
some examples:

On Nov 29, 10:31 am, John Passaniti <john.passan...@gmail.com> wrote:
> Sorry, but I'm not going to invest time to fully understand badly
> documented code in a language I am new to.

> So your symtab sucks if one cares about the amount of
> memory used--

> So


> your symtab sucks if one cares about the efficiency of lookups.

> If by now you still don't get why this isn't true, then I *strongly*


> suggest that instead of pulling a (bad) definition of "optimal" out of
> your ass, that you actually *prove* it.

Your use of the terms "sucks" and "ass" indicates that you are a member
of a community of people that don't belong on comp.lang.forth. You
should already be well aware of my opinion of this community of
people, and you should not have used your community's terminology in a
post directed at a decent person such as myself. I suggest that you go
back to your own community, where such language is appreciated, and
stop posting vulgar messages on comp.lang.forth.

Elizabeth D Rather

unread,
Nov 29, 2009, 9:07:39 PM11/29/09
to

Ok, Hugh, that's the last straw. You're in my "kill file" now. I have
no interest in any further postings from you on any subject.

Not because I have any great personal affinity for John -- I've never
met him, and know him only through his posts here, which are unfailingly
interesting, perceptive, and technically sound (even when we disagree).
But I am *personally* offended by your homophobic bias and use of
personal insults. I do not choose to converse with people of your
intolerant mindset. John has made far more technical contributions to
comp.lang.forth than you have.

Bye.

Elizabeth

--
==================================================
Elizabeth D. Rather (US & Canada) 800-55-FORTH
FORTH Inc. +1 310.999.6784
5959 West Century Blvd. Suite 700
Los Angeles, CA 90045
http://www.forth.com

"Forth-based products and Services for real-time
applications since 1973."
==================================================

Hugh Aguilar

unread,
Nov 29, 2009, 10:42:14 PM11/29/09
to
On Nov 29, 7:07 pm, Elizabeth D Rather <erat...@forth.com> wrote:
> > Your use of the terms "sucks" and "ass" indicates that you are a member
> > of a community of people that don't belong on comp.lang.forth. You
> > should already be well aware of my opinion of this community of
> > people, and you should not have used your community's terminology in a
> > post directed at a decent person such as myself. I suggest that you go
> > back to your own community, where such language is appreciated, and
> > stop posting vulgar messages on comp.lang.forth.
>
> Ok, Hugh, that's the last straw.  You're in my "kill file" now.  I have
> no interest in any further postings from you on any subject.
>
> Not because I have any great personal affinity for John -- I've never
> met him, and know him only through his posts here, which are unfailingly
> interesting, perceptive, and technically sound (even when we disagree).
>   But I am *personally* offended by your homophobic bias and use of
> personal insults.  I do not choose to converse with people of your
> intolerant mindset.  John has made far more technical contributions to
> comp.lang.forth than you have.

I didn't say anything homophobic. The "community" that I was referring
to was those people who use vulgarity in polite company.

Bernd Paysan

unread,
Nov 30, 2009, 5:08:10 AM11/30/09
to
John Passaniti wrote:
[lengthy analysis of a sub-optimal proposal]

Wow, what amount of effort you put in that. John, it's not worth the time.
Hugh will dismiss it, because analysis contains the word "anal", and he's
not responsive to critics, anyways. Other people like me have already done
that analysis (tree vs. hash) 20 years ago, and the result hasn't changed -
a properly implemented hash wins hands down.

Bernd Paysan

unread,
Nov 30, 2009, 5:33:31 AM11/30/09
to
Hugh Aguilar wrote:
> UR/Forth did the same thing in its single hash tree; it prepended the
> voc# on the front of each word's name. It is a lot more efficient to
> put each word-list in its own data structure however.

I've done that, too, and the conclusion was that the unified hash table was
faster and used up less memory. Please use a quantitative approach:
Implement the options, and *measure*.

> In the case of
> my symtab, this means that we avoid comparing our search string to a
> large number of nodes that are in other word-lists and therefore
> aren't going to match.

We are not searching, we are looking up inside a hash table, which is not
fully populated, so in all likelihood we will only find one element
to compare to, and that one will be either hit or miss.

> In the case of a hash table, it means a lot
> fewer collisions.

The number of collisions just depends on how large the hash table is vs. how
many entries there are.

> That business of resizing and repopulating the hash tree is very slow.

Well, 3 milliseconds on my machine for a hashdouble in Gforth. I've
implemented the algorithm on a machine which was 1000 times slower, and
didn't feel any pain with the repopulating delay there.

> Also, this happens at unpredictable times and can cause the system to
> freeze up unexpectedly.

Yes, it's not real-time, at least not as implemented. However, the freeze
up of 3 milliseconds is fairly tolerable. If you really need it, you can
use an incremental repopulation algorithm.

>> a) AVL trees are harder to implement
>
> I agree that AVL trees are hard to implement, and my experience is
> that symtab trees are even harder to implement. Fortunately, I've
> already done the heavy lifting, so it should be fairly easy for
> anybody to integrate my symtab code into his compiler.

I've also done the heavy lifting, got the numbers, and ditched it again.
It's not worth the time.

>> b) they are significantly slower than the hashes as implemented above
>> for larger vocabularies (O(log(N)) instead of O(1))
>
> I agree that AVL trees are slow, but this doesn't have anything to do
> with symtab trees.

Trees are trees. Even when you order the tree so that frequently accessed
words are closer to the root, you still have more to do. Just read John's
long analysis; it's worth reading.

Hugh Aguilar

unread,
Nov 30, 2009, 9:29:37 PM11/30/09
to
On Nov 30, 3:33 am, Bernd Paysan <bernd.pay...@gmx.de> wrote:
> Hugh Aguilar wrote:
> > UR/Forth did the same thing in its single hash tree; it prepended the
> > voc# on the front of each word's name. It is a lot more efficient to
> > put each word-list in its own data structure however.
>
> I've done that, too, and the conclusion was that the unified hash table was
> faster and used up less memory.  Please use a quantitative approach:
> Implement the options, and *measure*.

I don't have my own Forth compiler, so implementing the two algorithms
and measuring their performance in the compilation of a test suite is
not realistic for me. I very likely will end up writing a Forth
compiler, and I will do that then.

> > In the case of a hash table, it means a lot
> > fewer collisions.
>
> The number of collisions just depends on how large the hash table is vs. how
> many entries there are.

I've said from the beginning that hash tables are a good choice in a
large-memory system.

> > That business of resizing and repopulating the hash tree is very slow.
>
> Well, 3 milliseconds on my machine for a hashdouble in Gforth.  I've
> implemented the algorithm on a machine which was 1000 times slower, and
> didn't feel any pain with the repopulating delay there.
>
> > Also, this happens at unpredictable times and can cause the system to
> > freeze up unexpectedly.
>
> Yes, it's not real-time, at least not as implemented.  However, the freeze
> up of 3 milliseconds is fairly tolerable.  If you really need it, you can
> use an incremental repopulation algorithm.

I really see the advantage of symtab to be robustness. I don't think
that symtab compares very well with hash tables in regard to speed if
you ignore the consideration of the time needed to rebuild the hash
table when it fills up, or the time needed to resolve collisions in
the event that you don't rebuild the hash table but just allow it to
bog down. The advantage of symtab is that it expands and (with the
help of PRUNE) contracts dynamically. You don't have to wonder if it
is getting too "full" and needs to be rebuilt, because the concept of
fullness doesn't exist. Symtab is maintenance free --- other than
calling PRUNE occasionally, symtab is "fire and forget." As I said
before, OPTIMIZE is entirely optional; if you set DAMPER to a low
value you can ignore OPTIMIZE entirely, for a pretty-much maintenance-
free system.

Hugh Aguilar

unread,
Nov 30, 2009, 10:32:29 PM11/30/09
to
On Nov 30, 3:08 am, Bernd Paysan <bernd.pay...@gmx.de> wrote:
> John Passaniti wrote:
>
> [lengthy analysis of a sub-optimal proposal]
>
> Wow, what amount of effort you put in that.  John, it's not worth the time.  
> Hugh will dismiss it, because analysis contains the word "anal", and he's
> not responsive to critics, anyways.  

I actually did read Mr. Passaniti's lengthy post. He was ranting.
Telling me that my program "sucks" isn't an analysis --- I'm not going
to dignify a post like that with a response. Also, he repeatedly
denounced me as being "lazy." That really irks me. Writing symtab
wasn't the work of one day. I put a lot of effort into inventing that
algorithm and getting it to work, and then I get called lazy by
somebody who AFAIK has never contributed *any* Forth code to c.l.f..
If he thinks that Splay Trees are so wonderful, why hasn't he written
an implementation and contributed it to c.l.f.? Maybe it is easier to
belittle other people's efforts than to put forth any effort oneself.
I've actually never heard of *anybody* implementing Splay Trees. Could
this be because they don't work very well?

I remember when I was in H.S. that there was a certain group of people
who attended every athletic event for the purpose of hooting at the
athletes and belittling their efforts. They considered themselves to
be quite the experts on athletics. They weren't fooling anybody ---
everybody had noticed that they *never* engaged in any athletics
themselves. I see quite a lot of that same mindset here on c.l.f..
I've only been visiting c.l.f. for a couple of months, and yet I have
contributed more Forth code in that time than most of the long-time
c.l.f. members (including Elizabeth Rather) have *ever* contributed.
Why is that?

On that subject btw, here is an updated version of novice.4th:
www.rosycrew.org/novice.4th

Forth programming is one of the few things that I have to take pride
in. What else have I got? I work as a cab driver after all --- that is
nothing to be proud of. Now Elizabeth Rather says that *nothing* I
contribute to c.l.f. will ever be worth a response. She is a
salesperson though. If I am not planning on spending money at Forth
Inc., then why would I want to communicate with Forth Inc.'s marketing
department?

> Other people like me have already done
> that analysis (tree vs. hash) 20 years ago, and the result hasn't changed -
> a properly implemented hash wins hands down.

I don't think that there is one *best* algorithm. As I said before,
the advantage of symtab over hash tables is robustness, and the
advantage of hash tables over symtab is speed --- but only assuming
that they have abundant memory available. The choice depends upon the
circumstances. It is simplistic to blithely announce that symtab
"sucks" and deserves to be "trashed."

If you believe that there is necessarily one *best* algorithm, then
you likely also believe that there is necessarily one *best* language.
If popularity is your only guide, then this would imply that everybody
*should* use C and ignore Forth --- that the analysis of C versus
Forth was concluded 20 years ago and C won "hands down."

Jonah Thomas

unread,
Nov 30, 2009, 10:47:07 PM11/30/09
to
Hugh Aguilar <hugoa...@rosycrew.com> wrote:
> Bernd Paysan <bernd.pay...@gmx.de> wrote:
> > Hugh Aguilar wrote:
> > > UR/Forth did the same thing in its single hash tree; it prepended
> > > the voc# on the front of each word's name. It is a lot more
> > > efficient to put each word-list in its own data structure however.
> >
> > I've done that, too, and the conclusion was that the unified hash
> > table was faster and used up less memory.  Please use a quantitative
> > approach: Implement the options, and *measure*.
>
> I don't have my own Forth compiler, so implementing the two algorithms
> and measuring their performance in the compilation of a test suite is
> not realistic for me. I very likely will end up writing a Forth
> compiler, and I will do that then.

GForth is free. If you use Windows or maybe Linux, SwiftForth and
VXForth both have free evaluation versions that you can measure
performance on. The things the evaluation versions lack will almost
certainly turn out to be things you don't need right now.

John Passaniti

unread,
Dec 1, 2009, 3:12:02 AM12/1/09
to
On Nov 29, 8:58 pm, Hugh Aguilar <hugoagui...@rosycrew.com> wrote:
> I ignored your lengthy post because of your vulgar mindset.

Really? Is *that* the excuse you're going to use to back out of a
technical defense of symtab?! Really? You're honestly going to
pretend that your delicate sensibilities were so offended by two
common expressions in use by every English-speaking country for at
least the past 45 years I've been alive?! Really? This is a joke,
right?

> Your use of the terms "sucks" and "ass" indicates that you are a member
> of a community of people that don't belong on comp.lang.forth. You
> should already be well aware of my opinion of this community of
> people, and you should not have used your community's terminology in a
> post directed at a decent person such as myself. I suggest that you go
> back to your own community, where such language is appreciated, and
> stop posting vulgar messages on comp.lang.forth.

Ohmygod! It ISN'T a joke! You really are going to try to suggest
that when I wrote that your "symtab sucks" and that you're pulling a
definition of optimality "out of your ass" that both were really
vulgar, secret-coded terminology of the "community" I'm in and not
common English idioms used by... everyone! Wow! Leave it to you and
your stunning perception and insight to be able to see what nobody
else does... including myself.

Your response to Elizabeth was priceless:

> I didn't say anything homophobic. The "community" that I was
> referring to was those people who use vulgarity in polite company.

Awww, isn't that cute? You think you're being clever.

John Passaniti

unread,
Dec 1, 2009, 3:42:21 AM12/1/09
to
On Nov 30, 9:29 pm, Hugh Aguilar <hugoagui...@rosycrew.com> wrote:
> I don't have my own Forth compiler, so implementing the two algorithms
> and measuring their performance in the compilation of a test suite is
> not realistic for me. I very likely will end up writing a Forth
> compiler, and I will do that then.

Funny, you didn't implement Factor, you just wrote an algorithm that
runs under it. So why exactly do you need to implement a Forth before
you write an algorithm that runs under it? You've written that you're
an experienced Forth programmer. Does that experience not include the
wide variety of free Forth systems available?

Excuses. You need nothing more than a representative data set and a
small driver program to feed that data into your algorithm. You need
only to drizzle some minor instrumentation in the code to do
measurements. You're making excuses for doing the basic work any
serious programmer would do.

> I've said from the beginning that hash tables are a good choice in a
> large-memory system.

They're fine in small-memory systems too.

> I really see the advantage of symtab to be robustness. I don't think
> that symtab compares very well with hash tables in regard to speed if
> you ignore the consideration of the time needed to rebuild the hash
> table when it fills up, or the time needed to resolve collisions in
> the event that you don't rebuild the hash table but just allow it to
> bog down. The advantage of symtab is that it expands and (with the
> help of PRUNE) contracts dynamically. You don't have to wonder if it
> is getting too "full" and needs to be rebuilt, because the concept of
> fullness doesn't exist. Symtab is maintenance free --- other than
> calling PRUNE occasionally, symtab is "fire and forget." As I said
> before, OPTIMIZE is entirely optional; if you set DAMPER to a low
> value you can ignore OPTIMIZE entirely, for a pretty-much maintenance-
> free system.

Define "robustness." It's probably about as weak as your definition
of "optimal."

Since it appears you don't own a serious book on data structures and
you're too lazy to go to your local library and check one out, then I
strongly suggest you at the very least go to Wikipedia and read the
section on hash tables. There, you'll learn that there is no single
"hash table." There are a wide variety of flavors and implementation
strategies that more than address your unfounded concerns about
collisions and resizing. And when you're done, invest the time to
learn what big-O notation is all about.

John Passaniti

unread,
Dec 1, 2009, 4:35:24 AM12/1/09
to
On Nov 30, 10:32 pm, Hugh Aguilar <hugoagui...@rosycrew.com> wrote:
> I actually did read Mr. Passaniti's lengthy post. He was ranting.
> Telling me that my program "sucks" isn't an analysis --- I'm not going
> to dignify a post like that with a response. Also, he repeatedly
> denounced me as being "lazy." That really irks me.

What other word other than "lazy" applies? Your claims of "optimal"
and "robust" are at best casual definitions, not based on any
demonstrated proof or measurement. You have presented no timings, no
analysis of the number of probes *actually* taken to find symbols, no
statistics regarding memory use over time. You have provided
*nothing* more than your word that symtab is way-cool-awesome.

> Writing symtab
> wasn't the work of one day. I put a lot of effort into inventing that
> algorithm and getting it to work, and then I get called lazy by
> somebody who AFAIK has never contributed *any* Forth code to c.l.f..

Note what you wrote-- you spent a lot of effort into inventing the
algorithm and getting it to work. Missing is the work on proving it
actually provides a benefit. That requires measurement, which you
seem terrified to spend any time on, probably because once you do,
you're going to see it not only takes up more memory than even a naive
hash table or balanced binary tree, but has worse performance--
possibly dramatically so. Your fragile ego apparently can't deal with
that.

And sure, while I freely admit to being abrasive towards you in my
message, the statistics and visualizations I provided you aren't
rants. I've shown why your symtab isn't worth the effort. You just
refuse to accept criticism.

> If he thinks that Splay Trees are so wonderful, why hasn't he written
> an implementation and contributed it to c.l.f.? Maybe it is easier to
> belittle other people's efforts than to put forth any effort oneself.
> I've actually never heard of *anybody* implementing Splay Trees. Could
> this be because they don't work very well?

I don't think splay trees are wonderful. I would normally use a hash
for a symbol table. But if someone is fixated on trees as you
apparently are, a splay tree is a better choice than your symtab.

If you had actually read my message, you would have seen at the very
bottom that Factor has splay trees already implemented for you. So
you have absolutely no reason to not measure their performance against
symtab.

> Forth programming is one of the few things that I have to take pride
> in. What else have I got? I work as a cab driver after all --- that is
> nothing to be proud of.

Excuses.

In the building I work in is an art studio that has classes. Every
day, non-professional artists come to learn about technique, tools,
technology, history, and perspectives. Few who go there actually
think they're going to change their career and become professional
artists. Instead they spend time in the classes to become better at
their hobby, and because taking pride in their work means not just
sitting in their home alone, putting paint to canvas, and thinking
they have a masterpiece. Taking pride in their work means improving
themselves and dropping self-important arrogance over their own
talent. And it means measuring your talent. It can be a brutal blow
when an artist with more experience points to a flawed technique or a
naive sense of art history that leads to trite or redundant work.

The parallel to hobbyist programmers should be apparent. I'm employed
as a software engineer working primarily in digital audio and
networking systems. I've been doing professionally what you've doing
as a hobby for the past twenty something years. There are others in
this newsgroup who have more experience than me as professionals and/
or academics. It is arrogant for you to assume that because you
locked yourself in a room and "invented" a data structure that the
rest of us who actually write code for a living are missing the
stunning genius that lurks in your code.

> Now Elizabeth Rather says that *nothing* I
> contribute to c.l.f. will ever be worth a response. She is a
> salesperson though. If I am not planning on spending money at Forth
> Inc., then why would I want to communicate with Forth Inc.'s marketing
> department?

Elizabeth is often cited as being the second Forth programmer after
Charles Moore. She's got more real-world experience with Forth
implementation and applications than most everyone in this newsgroup
and certainly you. To dismiss her as a salesperson is offensive.

> I don't think that there is one *best* algorithm. As I said before,
> the advantage of symtab over hash tables is robustness, and the
> advantage of hash tables over symtab is speed --- but only assuming
> that they have abundant memory available. The choice depends upon the
> circumstances. It is simplistic to blithely announce that symtab
> "sucks" and deserves to be "trashed."

I didn't write that symtab deserves to be trashed. It may have some
useful property for some application-- just not as a symbol table.

What I did write (and you apparently ignored) is that memory use of
symtab grows faster than memory use of hash tables. If you don't see
why, then your limited understanding of real-world hash tables is the
problem. May I suggest you start here: http://en.wikipedia.org/wiki/Hash_table

Bernd Paysan

unread,
Dec 1, 2009, 4:59:13 AM12/1/09
to
Hugh Aguilar wrote:
> I don't have my own Forth compiler, so implementing the two algorithms
> and measuring their performance in the compilation of a test suite is
> not realistic for me. I very likely will end up writing a Forth
> compiler, and I will do that then.

That's a very lame excuse. Just take a free Forth like Gforth, where you
can "own" the compiler just like anybody else, replace the already existing
hash table, and do the checks. Gforth has a nearly-OO vocabulary style, so
you can even have symtab and hash tables side by side in the same system, if
you need that for comparison.

You also miss some other points entirely: A tree has an overhead of 3 cells
per node (pointer to left/right/leaf). A chained hash has an overhead of 2
cells per node (leaf+chain). So if the hash is 2/3 filled, it takes as much
space as the tree. A hash that's growing by a factor 2 per growth iteration
is filled about 2/3 on average. Result: Size overhead is about the same.

Doug Hoffman

unread,
Dec 1, 2009, 5:03:26 AM12/1/09
to
On Nov 30, 10:32 pm, Hugh Aguilar <hugoagui...@rosycrew.com> wrote:

...

> I've only been visiting c.l.f. for a couple of months, and yet I have
> contributed more Forth code in that time than most of the long-time
> c.l.f. members (including Elizabeth Rather) have *ever* contributed.

With close ties to Forth Inc. she may be constrained in posting code.
Regardless, the value of Elizabeth's contributions to this newsgroup
is immense. Often she will leave a comment to someone's question that
helps guide them to the right answer, rather than simply giving the
answer itself. Much like what a good instructor would/should do. It
is a better learning experience for the person with the question.

-Doug

Anton Ertl

unread,
Dec 2, 2009, 11:05:29 AM12/2/09
to
Bernd Paysan <bernd....@gmx.de> writes:
>You also miss some other points entirely: A tree has an overhead of 3 cells
>per node (pointer to left/right/leaf). A chained hash has an overhead of 2
>cells per node (leaf+chain). So if the hash is 2/3 filled, it takes as much
>space as the tree. A hash that's growing by a factor 2 per growth iteration
>is filled about 2/3 on average. Result: Size overhead is about the same.

A little more analysis on this:

I'll ignore the "leaf" pointer; if it is used, it's the same for both
structures; moreover, the tree or list pointers can be in-line with
the other data, then no leaf pointer is needed. Also, I think it
should not be called "leaf pointer", because that has another meaning
when dealing with trees. "Data pointer" might be more appropriate.

So, for a binary tree (any kind, balanced or not) we need two pointers
per node; in a tree with N nodes, there is space for 2N+1 pointers (2N
pointers inside the N nodes plus one pointer to the root node); N
pointers point to nodes, and N+1 pointers contain NULL.

For a hash table with external chaining with N nodes and M buckets,
there are N pointers to nodes and M pointers that contain NULL. So if
we want to use at most as much space as a tree, we have to ensure that
M<=N+1. If we double the table size on resizing (resulting in O(N)
total resizing overhead), we must not resize before 2M<=N+1. Gforth
resizes when 2M=N, so it never[1] uses more space than a binary tree,
and it uses up to N/2 pointer slots less space. IIRC that was the
reason for setting the doubling limit as it is.

As a result, the average chain length (load factor L=N/M) in Gforth is
between 1 and 2, and the average number of probes on a successful
search is a little worse than (L+1)/2 given a good hash function.
Compared to that, the average number of probes on a successful search
in a perfectly balanced binary tree with 3 elements is already 5/3;
more than three elements, and the hash table will not only win on
space consumption, but also on the average number of probes.
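
As a quick back-of-the-envelope check of those two averages, here is a
rough Python sketch (the sizes are arbitrary; this is arithmetic on the
formulas above, not a measurement of any real system):

def avg_probes_balanced_bst(n):
    # Average probes for a successful search in a perfectly balanced
    # binary tree: sum of node depths (root counts as 1) divided by n.
    total, depth, filled = 0, 1, 0
    while filled < n:
        level = min(2 ** (depth - 1), n - filled)  # nodes on this level
        total += depth * level
        filled += level
        depth += 1
    return total / n

def avg_probes_chained_hash(load):
    # Rough average for external chaining with a good hash function.
    return (load + 1) / 2

for n in (3, 15, 1000, 100000):
    print(n, round(avg_probes_balanced_bst(n), 2),
          "vs. chained hash at L=2:", avg_probes_chained_hash(2.0))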

And the hash table allows adjusting the resizing boundary and policy
to the size and speed needs of the application. Also, a hash table
with external chaining can be more easily used to reflect Forth's
shadowing than a binary tree.

Looking at a hash table with open addressing, where we put pointers to
the actual data in the hash table, we only need M pointers, but M has
to satisfy M>=N. The number of NULL pointers is M-N. If we never
want to use more space than a binary tree, we need M-N<=N+1, i.e.,
M<=2N+1. With doubling for resizing, that would mean that just before
doubling 2M<=2N+1, which is so close to having a full hash table that
the performance becomes really bad. So, with open addressing, one
will typically use more space than a binary tree at least some of the
time (of course, one could use some other resizing strategy than
doubling, but that also has its costs).

E.g., Andrew Haley suggested doubling when L=1/2; then this kind of
hash table uses as much space as a binary tree in the best case.
Another option would be doubling when L=2/3. Then L varies between
1/3 and 2/3, with an average of L=1/2 (assuming a uniform
distribution), and this kind of hash table will use as much space as a
binary tree on average.

A binary search tree may be a good data structure if sorted processing
is needed frequently, but if that is not needed, it loses.

Hugh Aguilar

unread,
Dec 2, 2009, 5:21:46 PM12/2/09
to
On Dec 1, 2:59 am, Bernd Paysan <bernd.pay...@gmx.de> wrote:
> You also miss some other points entirely: A tree has an overhead of 3 cells
> per node (pointer to left/right/leaf).  A chained hash has an overhead of 2
> cells per node (leaf+chain).  So if the hash is 2/3 filled, it takes as much
> space as the tree.  A hash that's growing by a factor 2 per growth iteration
> is filled about 2/3 on average.  Result: Size overhead is about the same.

Your calculation doesn't make sense to me. For one thing, I don't know
what a "leaf" is. If this refers to the pointer to the key string,
then it can be ignored. The string can be placed in-line with the rest
of the node's data. Also, whether you use a pointer or not has
nothing to do with the data structure, hash table or symtab, that you
are working with.

My own calculation is that a hash table uses the same amount of memory
as a symtab when the hash table is 1/2 full. This calculation helps
the argument for hash tables because hash tables are a lot more
efficient (fewer collisions) when 1/2 full rather than 2/3 full. Your
calculation of 2/3 not only didn't make sense, but it hurt your
argument anyway. According to the wikipedia article
(http://en.wikipedia.org/wiki/Hash_table), the load factor should not
exceed .7 because the cost of the collisions becomes excessive. A load
factor of .67 (2/3) is pretty close to .7. I've never implemented a
hash table, so I don't have any personal experience with this. AFAIK
however, almost everybody resizes the hash table when the load factor
passes the 1/2 threshold.

My calculation works like this: The total memory consumed by a hash
table is N+N/L words, with N being the number of nodes and L being
the load factor. I am assuming the use of a linked list for collision
resolution. Because of this, each node contains a forward link pointer
to the next node in the linked list, or NIL if there are no more nodes
in the linked list. This gives us N words of memory (one word per
node). The load factor L is the fraction of the table that contains
pointers to linked lists, and 1-L is the fraction of the table that
contains NIL pointers. N/L is the size of the table given N nodes
(assuming no collisions at all). I add N to N/L to obtain the total
amount of memory used, which works out to 1+1/L words per node. If L
is 1/2, that is 3 words per node. My
symtab also takes 3 words per node (left, rite and count), so a load
factor of 1/2 is where hash tables equal symtabs in memory overhead.
This, coincidentally, is also the load factor where most people resize
their hash tables. When the load factor is < 1/2, the hash table uses
more memory per node than a symtab does. When the load factor reaches
the 1/2 threshold, the hash table is resized. The result is that the
hash table *always* uses more memory per node than a symtab does. If
you resize the hash table by doubling its memory usage, the hash
table's load factor becomes 1/4. Now the hash table uses 5 words
(1+1/.25) per node. This is 167% of symtab's 3 words per node, which
is pretty bad.
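
To put numbers on that model, here is a small Python sketch of the same
arithmetic (the load factors are just sample values):

# One chain pointer per node plus one table slot per 1/L nodes,
# i.e. 1 + 1/L words of overhead per node under the model above.
for load in (0.25, 0.5, 2.0 / 3.0):
    print("load factor %.2f: %.2f words per node (symtab: 3)"
          % (load, 1 + 1 / load))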

My calculation above assumes that there are no collisions at all. This
isn't realistic. Even in a sparsely populated hash table you are
likely to have a few collisions. When the number of nodes is around
1/2 the hash table size however, there aren't too many collisions and
it is reasonable to assume that there aren't any collisions. As the
number of nodes increases to 2/3 or more, there are a lot more
collisions and my equation becomes increasingly inaccurate. This is
nothing to worry about because, as I said, it is typical to resize the
hash table at 1/2 anyway.

Message has been deleted

John Passaniti

unread,
Dec 3, 2009, 2:34:04 AM12/3/09
to
On Dec 2, 7:10 pm, Hugh Aguilar <hugoagui...@rosycrew.com> wrote:
> For me, seeing the beauty is what life is all about ---
> and I think there is beauty in symtab, no matter what other
> people may say about it. That is my analysis!

This is sad in a way. There may indeed be some useful properties of
symtab-- applications where it makes sense. Symbol tables aren't one
of them. If you would take the time to *measure*, you would see that,
but you refuse. And it's great if you as a hobbyist got some
enjoyment out of writing symtab. But when you present symtab as a
tool that others might want to use, you should be open to criticism.
Most people learn from criticism, and improve themselves. I certainly
do. You should try it.

> I think that John Passaniti should work with his video-and-audio
> files in private, and he should stop posting messages on
> comp.lang.forth. He represents corruption and meanness --- he is
> apposite to everything that Forth represents. People need to stop
> responding to his posts. I have.

You poor thing. Do you need a hug?

Hugh Aguilar

unread,
Dec 6, 2009, 5:19:16 PM12/6/09
to
On Dec 2, 9:05 am, an...@mips.complang.tuwien.ac.at (Anton Ertl)
wrote:

> For a hash table with external chaining with N nodes and M buckets,
> there are N pointers to nodes and M pointers that contain NULL. So if
> we want to use at most as much space as a tree, we have to ensure that
> M<=N+1. If we double the table size on resizing (resulting in O(N)
> total resizing overhead), we must not resize before 2M<=N+1. Gforth
> resizes when 2M=N, so it never[1] uses more space than a binary tree,
> and it uses up to N/2 pointer slots less space. IIRC that was the
> reason for setting the doubling limit as it is.

You seem to be saying that hash tables are resized when the load
factor passes the 1/2 threshold, which corresponds to what I said in
my previous post.

I have repeatedly said that the advantage of symtab is robustness. The
problem with hash tables is that they have to "stop the world" and
resize themselves when their load factor passes the 1/2 threshold. A
symtab never has to rebuild itself like this --- there is no threshold
that we have to worry about --- a symtab grows smoothly and *always*
has 3 words per node of overhead. If hash tables go beyond the
threshold without resizing, their efficiency degrades rapidly and they
run too slowly. After they are doubled up in memory they run fast
again, but now they use too much memory. Hash tables are *always*
inefficient either in speed or memory.


> And the hash table allows to adjust the resizing boundary and policy
> to the size and speed needs of the application. Also, a hash table
> with external chaining can be more easily used to reflect Forth's
> shadowing than a binary tree.

With symtab the programmer doesn't have to figure out a "policy" in
regard to what load factor threshold he resizes on. That is the point
that I'm trying to make. Also, what you said about shadowing doesn't
make sense to me either. I don't see any problem with doing this in
binary trees.


> A binary search tree may be a good data structure if sorted processing
> is needed frequently, but if that is not needed, it loses.

My symtab is a good choice in applications where there is a large
variance in the frequency that the nodes are being accessed. Imagine
the case of an associative array in which the nodes are pretty much
equally likely to be accessed. If symtab were used, all of the nodes'
count fields would be pretty close to the same value. Symtab would be
a grossly inappropriate data structure. One node would get promoted
above another node because it has a higher count value, but then that
other node would soon get promoted back into the higher position
again. There would be a lot of churning of the nodes like this, which
would kill the speed, and there would be no efficiency advantage
gained because all of the nodes are pretty much equally likely to be
accessed --- none of the nodes deserve to be at the root any more than
any of the other nodes do.

My symtab is intended to be used for symbol tables; that is why I call
it "symtab." This is an application that has a lot of variance in the
frequency that the nodes are being accessed. It is abundantly obvious
that some Forth words (DUP, +, etc.) are used a lot more than other
Forth words (IF, THEN, etc.), that are in turn used a lot more than
other Forth words (BEGIN, ELSE, etc.), and so forth. There is a huge
amount of variance in how frequently the words are used, and it is a
pretty smooth continuum from the popular to the unpopular. This is not
just true of the core word set, but also of user-written libraries.
Consider my novice.4th package:
http://www.rosycrew.com/novice.4th BIT-FIELD is not likely to be used
very much. 2ARRAY is going to be used less than 1ARRAY --- 1ARRAY is
going to be used less than FIELD --- FIELD is going to be used less
than HSTR --- and so forth. There is a lot of variance in how often
words are used. Also, the popularity of word usage depends upon the
application (for example: games might use RND a lot, but other
applications will never use RND). Because of this application
specificity, the symbol table has to be able to dynamically adjust
itself to the user's application. The user will likely load the
libraries that he needs for his application (novice.4th for almost
everybody) and will then recompile his application repeatedly (with
TRY) as he develops his code. As he is repeatedly recompiling his
application, symtab is quietly restructuring itself to be increasingly
efficient at compiling this particular application. It might sound
somewhat grandiose to describe this as "artificial intelligence," but
that is the effect. Symtab adapts itself to its human.

I'm pretty proud of symtab --- I think that it is way cool. As a
practical matter though, symbol table efficiency isn't an important
issue given modern computers with their gigantic memory and incredible
speed. I showed symtab to Slava with the idea of him using it in
Factor (symtab was actually my first-ever Factor program), but he
wasn't interested. He said that the multiple code-optimization passes
were what was making Factor compilation slow, and that the symbol
lookup took such a tiny amount of time as to be inconsequential. By
far the easiest solution to writing a symbol table is to just use a
hash table and give it a gigantic size so that the load factor will
*never* get anywhere near 1/2. My own computer has 766 megabytes, and
it is just a laptop. Symtab is really a throwback to the bad-old days
of MS-DOS when we were fitting data structures into 64K segments. For
the most part, I just wrote symtab for fun --- it might have been
important if I had introduced it 20 years ago, but nobody is going to
care today.

Symtab is borderline A.I., which I think most Forth programmers
consider to be cool. Also, symtab was written for fun, which I think
most Forth programmers can also appreciate. Nobody considers Forth to
be a career move --- Forth is mostly popular because it is fun, fun,
fun. This is why I considered John Passaniti's grossly vulgar and
obnoxious rant to be inappropriate. Where I come from (Colorado), this
is not done. [note: I deleted this part where I discussed working on a
cattle ranch. I don't want to mention that lady's name because this
makes her a target for vulgar attacks by c.l.f. members, which would
not be fair to her.]... when I worked as a Forth programmer I
only made $10 per hour, and I got laid off after I completed my
program. I got to write my own cross-compiler though, which makes it
worthwhile. Most C programmers will *never* write a compiler. For that
matter, most programmers of any language will *never* invent something
like symtab. For me, seeing the beauty is what life is all about ---
and I think there is beauty in symtab, no matter what other people may
say about it. That is my analysis!

Anton Ertl

unread,
Dec 7, 2009, 12:59:54 PM12/7/09
to
Hugh Aguilar <hugoa...@rosycrew.com> writes:
>On Dec 2, 9:05 am, an...@mips.complang.tuwien.ac.at (Anton Ertl)
>wrote:
>> For a hash table with external chaining with N nodes and M buckets,
>> there are N pointers to nodes and M pointers that contain NULL. So if
>> we want to use at most as much space as a tree, we have to ensure that
>> M<=N+1. If we double the table size on resizing (resulting in O(N)
>> total resizing overhead), we must not resize before 2M<=N+1. Gforth
>> resizes when 2M=N, so it never[1] uses more space than a binary tree,
>> and it uses up to N/2 pointer slots less space. IIRC that was the
>> reason for setting the doubling limit as it is.
>
>You seem to be saying that hash tables are resized when the load
>factor passes the 1/2 threshold

The load factor is L=N/M (that's also mentioned in the posting where
my text above comes from). A little elementary mathematics will let
you transform the threshold from 2M=N into N/M=2, i.e., L=2. So:
Gforth resizes the hash table when the load factor reaches 2.

>The
>problem with hash tables is that they have to "stop the world" and
>resize themselves

No, if you want to avoid stopping the world, you can always keep a
hash table A with size M and a hash table B with size 2M around, and
search them both. Whenever you insert an element, you insert it into
B, and also delete an element from A and insert it into B. When the
load factor in B reaches the threshold, A will be empty and can be
freed, B becomes A, and you can create a hash table with size 4M that
becomes the new B. In this way the cost of resizing is distributed
smoothly across all the insertions. Of course, the disadvantage is
that the lookup becomes more expensive.
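
A rough sketch of that scheme in Python (arbitrary table sizes and an
arbitrary threshold; this is the shape of the idea, not Gforth's
actual code):

class IncrementalHash:
    # Lookups search both tables; every insert goes into the larger
    # table B and also migrates one leftover entry from the old table A,
    # so the rehashing cost is spread over the insertions.
    def __init__(self, size=8):
        self.a = [[] for _ in range(size)]       # old table, being drained
        self.b = [[] for _ in range(size * 2)]   # new table, being filled
        self.items_in_b = 0

    def _bucket(self, table, key):
        return table[hash(key) % len(table)]

    def _migrate_one(self):
        for chain in self.a:
            if chain:
                k, v = chain.pop()
                self._bucket(self.b, k).append((k, v))
                self.items_in_b += 1
                return

    def insert(self, key, value):
        self._bucket(self.b, key).append((key, value))
        self.items_in_b += 1
        self._migrate_one()
        # When B reaches its threshold, A has been drained; B becomes
        # the new A and an empty table twice as big becomes the new B.
        if self.items_in_b >= len(self.b) and not any(self.a):
            self.a = self.b
            self.b = [[] for _ in range(len(self.a) * 2)]
            self.items_in_b = 0

    def find(self, key):
        for table in (self.b, self.a):           # search both tables
            for k, v in self._bucket(table, key):
                if k == key:
                    return v
        return None

h = IncrementalHash()
for i in range(100):
    h.insert("word%d" % i, i)
print(h.find("word7"), h.find("word99"), h.find("missing"))

From the outside it behaves like a single table; the price, as noted,
is that a lookup may probe two tables.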

The problem with hash tables in hard real-time is not the resizing,
but the possibility of worst-case collisions in the hash function.
Even if your hash function distributes as good as it gets in the
typical case (how do you ensure that), are you willing to rely on the
improbability of bad things happening rather than a mathematical proof
that they won't happen?

John Passaniti

unread,
Dec 7, 2009, 6:02:49 PM12/7/09
to
On Dec 7, 12:59 pm, an...@mips.complang.tuwien.ac.at (Anton Ertl)
wrote:

> No, if you want to avoid stopping the world, you can always keep a
> hash table A with size M and a hash table B with size 2M around, and
> search them both.  Whenever you insert an element, you insert it into
> B, and also delete an element from A and insert it into B.  When the
> load factor in B reaches the threshold, A will be empty and can be
> freed, B becomes A, and you can create a hash table with size 4M that
> becomes the new B.  In this way the cost of resizing is distributed
> smoothly across all the insertions.  Of course, the disadvantage is
> that the lookup becomes more expensive.

Don't contradict Hugh. He's an expert who knows everything about hash
tables, despite making a stream of statements that are demonstrably
false or at the very least not based on experience. So take it back--
just because you described one real-world technique that amortizes the
cost of resizes without stopping the world doesn't mean it's true
because Hugh never heard of it. And just because there are other
techniques to allow incrementally resizing hash tables (such as linear
hashing and consistent hashing) without stopping to rehash doesn't
mean that's true either. Hugh could spend a few minutes researching
hash tables on Wikipedia and other resources and learn about what he
doesn't know, but ignorance is bliss. And Hugh seems like a very
happy fellow.

Amortizing rehashing over time does indeed make the lookup more
expensive. Lookups go from O(1-ish) to O(2-ish), which is still
vastly superior to the performance of his symtab. But Hugh doesn't
understand big-O notation, so let's ignore that junk and just tell
Hugh that all hash tables always degrade to worse-case behavior,
turning into an expensive linear search. That will make him feel good
about symtab, and that's all that matters.

> The problem with hash tables in hard real-time is not the resizing,
> but the possibility of worst-case collisions in the hash function.
> Even if your hash function distributes as good as it gets in the
> typical case (how do you ensure that), are you willing to rely on the
> improbability of bad things happening rather than a mathematical proof
> that they won't happen?

I'm sure Hugh will agree, but will miss that his symtab is also
completely unsuited for hard real-time applications. He won't believe
that, because his analysis says that symtab is "quite fast." And he
doesn't need timing measurements to know what he *feels*.

Regardless, there is nothing that prevents using hash tables in hard
real-time systems. Hard real-time in this context just means that you
have predictable maximum lookup times. And for most common hash table
implementations, that's easy. Ignoring collision resolution by open
addressing for the moment, using a simple linked list means that the
worst case lookups will take as long as needed to fly down the list.
I've used a hash table in a hard real-time system and had zero fear--
that's because I knew my hash function would distribute the keys well
(because the keys were known and structured, and I modeled entire key
space and measured the distribution). But even if I didn't know my
key space and I didn't know how well my hash function distributed
them, I still had no fear because I knew the maximum number of
elements possible in the hash table. And after verifying that the
worst case was still within the constraints of the timing requirements
of the system, I went ahead and used it and benefited greatly from the
more typical O(1) behavior instead of the worst case O(N) behavior.
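
A sketch of that kind of up-front check (Python, with an invented key
format and table size; the real system's keys and sizes were of course
different):

def chain_lengths(keys, m):
    # Hash every key into m buckets and return the bucket populations;
    # the longest chain is the worst case a timing budget has to cover.
    counts = [0] * m
    for k in keys:
        counts[hash(k) % m] += 1
    return counts

keys = ["ch%02d:param%03d" % (ch, p) for ch in range(32) for p in range(200)]
counts = chain_lengths(keys, 8192)
print("keys:", len(keys), "max chain:", max(counts),
      "empty buckets:", counts.count(0))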

Hugh Aguilar

unread,
Dec 7, 2009, 6:16:33 PM12/7/09
to
On Dec 7, 10:59 am, an...@mips.complang.tuwien.ac.at (Anton Ertl)
wrote:

> Hugh Aguilar <hugoagui...@rosycrew.com> writes:
> >You seem to be saying that hash tables are resized when the load
> >factor passes the 1/2 threshold
>
> The load factor is L=N/M (that's also mentioned in the posting where
> my text above comes from).  A little elementary mathematics will let
> you transform the threshold from 2M=N into N/M=2, i.e., L=2.  So:
> Gforth resizes the hash table when the load factor reaches 2.

My definition of "load factor" seems to be the inverse of yours. Your
2 is my 1/2. With my definition, the load factor can never be greater
than 1. My definition is the one being used in the Wikipedia article,
and it is also the definition that I have always been familiar with.

> >The
> >problem with hash tables is that they have to "stop the world" and
> >resize themselves
>
> No, if you want to avoid stopping the world, you can always keep a
> hash table A with size M and a hash table B with size 2M around, and
> search them both.  Whenever you insert an element, you insert it into
> B, and also delete an element from A and insert it into B.  When the
> load factor in B reaches the threshold, A will be empty and can be
> freed, B becomes A, and you can create a hash table with size 4M that
> becomes the new B.  In this way the cost of resizing is distributed
> smoothly across all the insertions.  Of course, the disadvantage is
> that the lookup becomes more expensive.

That is a good method for avoiding the "stop the world" scenario. I
hadn't been aware of this. The downside is that you now have to do two
insertions and one deletion for every word that gets inserted. Also,
you have to do one or two (average 1.5) lookups for every word that
gets looked up. It seems unlikely that this is going to be faster than
symtab. Also, you have two hash tables in use at the same time, which
is taking up a lot of memory. Previously your hash-table had been
equal to symtab in memory usage when your load factor reached 1/2 and
worse than symtab when your load factor was still less than 1/2. Now
you are *always* worse than symtab. I think that symtab is better than
your solution in regard to both speed and memory. I think that hash
tables are only a good choice if you just throw a lot of memory at
them from the get-go, so that your load factor never approaches the
1/2 threshold.

> The problem with hash tables in hard real-time is not the resizing,
> but the possibility of worst-case collisions in the hash function.
> Even if your hash function distributes as good as it gets in the
> typical case (how do you ensure that), are you willing to rely on the
> improbability of bad things happening rather than a mathematical proof
> that they won't happen?

My symtab relies on a couple of assumptions:
1.) I assume that there is a lot of variance in the frequency that the
words are accessed, and that this variance is fairly smooth all the
way through the word set.
2.) I assume that there is no relation between the frequency that the
words are accessed and their alphabetical ordering.

There is no mathematical proof that guarantees these assumptions. With
associative arrays these assumptions might not hold up, as we have no
idea what kind of data the programmer might be collecting. With symbol
tables these assumptions should hold up pretty well.

My #2 assumption is similar to your problem with worst-case
collisions. If all of the words follow some pattern, then they all
might hash out to only a very few hash-table index values. For
example, I wrote a 65C02 cross-compiler a long time ago, and it did
single-step debugging. I stored all of the breakpoints in the
dictionary (I had a breakpoint between each Forth word compiled). I
had hundreds (sometimes thousands) of words in the dictionary all with
the name: "Lxxxx" (with xxxx being a hexadecimal address on the target
machine). This was under UR/Forth that uses a hash table. It was
somewhat slow because a lot of those similar names collided with each
other. Symtab wouldn't have worked very well either though, as there
would have been a relation between alphabetic ordering and frequency,
which would have resulted in a very lopsided tree.

Bernd Paysan

unread,
Dec 8, 2009, 4:35:05 AM12/8/09
to
Hugh Aguilar wrote:
>> The load factor is L=N/M (that's also mentioned in the posting where
>> my text above comes from). A little elementary mathematics will let
>> you transform the threshold from 2M=N into N/M=2, i.e., L=2. So:
>> Gforth resizes the hash table when the load factor reaches 2.
>
> My definition of "load factor" seems to be the inverse of yours. Your
> 2 is my 1/2. With my definition, the load factor can never be greater
> than 1. My definition is the one being used in the Wikipedia article,
> and it is also the definition that I have always been familiar with.

Your definition and Anton's are the same. You just assume that all hash
tables are open addressed. However, if you managed to look up the English
Wikipedia article about "Load factor", you should just read the next
section: Separate chaining. That's what we do, and there is no limit to the
load factor in this scenario. We limit it to 2, i.e. our size/speed
tradeoff goes into size.

We still could improve the size/speed ratio of our hash table if we accepted
a bit more complex code. E.g. since we align all entries, we could use the
LSB to indicate if the slot just has one element or if it needs chaining.
We could store all single-entry buckets straight into the table, and only
chain the ones with collisions (e.g. into an array, which is both more
cache-friendly and memory-friendly than a list).
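
Roughly, in Python terms (a sketch only; the LSB tag becomes a type
check here, since Python has no pointer tagging):

def insert(table, key, value):
    i = hash(key) % len(table)
    slot = table[i]
    if slot is None:
        table[i] = (key, value)          # single entry, stored in-place
    elif isinstance(slot, list):
        slot.append((key, value))        # already chained
    else:
        table[i] = [slot, (key, value)]  # second entry: start a chain

def find(table, key):
    slot = table[hash(key) % len(table)]
    if slot is None:
        return None
    if isinstance(slot, list):
        for k, v in slot:
            if k == key:
                return v
        return None
    return slot[1] if slot[0] == key else None

table = [None] * 16
insert(table, "DUP", 1); insert(table, "SWAP", 2); insert(table, "OVER", 3)
print(find(table, "DUP"), find(table, "ROT"))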

The article in Wikipedia about hashing contains all points discussed here
and more, so *please* *read* it.

hughag...@gmail.com

unread,
Dec 11, 2018, 1:05:14 AM12/11/18
to
Here are two quotes one decade later supporting John Passaniti:

On Sunday, December 9, 2018 at 10:58:35 AM UTC-7, Gerry Jackson wrote:
> On 09/12/2018 14:42, foxaudio...@gmail.com wrote:
> > http://www.forth.org/novice.html
>
> Yes but it's an old version (dated 2010) that was criticised for various
> reasons e.g. John Passaniti proved to everyone but Hugh that Hugh's
> symtab wasn't as good as hashing which led to a torrent of abuse
> (homophobic amongst other things) and which drove John P away from
> c.l.f. - a real loss to the group. Since then for several years Hugh has
> claimed that novice.fth has been extended e.g. rquotations, SWITCH,
> SYNONYM saying a new version will be released but it hasn't happened.
> I've just remembered he was going to include a package of mine at one
> time - that would be ironic given the vitriol he routinely directs at me
> and which will undoubtedly follow this post.

On Monday, December 10, 2018 at 8:24:09 AM UTC-7, a...@littlepinkcloud.invalid wrote:
> Alex McDonald <al...@rivadpm.com> wrote:
>
> > Frankly, this looks like you approve of Aguilar driving [John
> > Passaniti] off. Is it what you mean? I hope not.
> >
> > And yes, my mileage varies. I like educated and accurate, and his
> > pointed with a dash of acid wit posts made entertaining reading. In
> > my not so humble opinion (and yes, John also lacked humility) he
> > added a great deal to the conversations here. I miss him.
>
> I agree. I think that you can get away with a lot of attitude if
> you're right.
>
> Andrew.

John Passaniti, of course, didn't know what he was talking about.
The gay Perl programmer says:
"When some threshold is reached, a symtab decides to stop the world and restructure the tree.
This requires iterating over the *entire* tree, juggling nodes around."

That isn't true. A binary tree doesn't have any threshold at which time it has to restructure the tree.
This is a problem with hash-tables. You have to determine ahead of time how big the hash-table will be.
When you reach a threshold (typically half full) you have to "stop the world" and rebuild the hash-table at a larger size.

Note that my symtab was a work in progress at the time that John Passaniti attacked it.
I said above:
"...the programmer writes new words and these start out at a leaf and work their way up to an appropriate place in the tree."

I changed this later on. New words are now inserted at the root and then work their way down to an appropriate place in the tree.
This is better because it assumes that shortly after a word is defined, it will be used, but over time it will be used less (possibly not at all).
This change made symtab similar to a Splay Tree.

All in all, it seems bizarre to me that, in the year 2018, Gerry Jackson, Alex McDonald and Andrew Haley want to support John Passaniti
in opposition to me. The faggot troll left a long time ago! If he no longer cares, then why do they?

I think John Passaniti was getting paid by Elizabeth Rather to attack Jeff Fox, myself, and anybody else who criticized ANS-Forth.
After Jeff Fox died, he continued to attack me for a short while, then he disappeared.
Most likely, Elizabeth Rather was primarily paying him to attack Jeff Fox, but didn't think that I was important enough to pay him just to attack me.
She stopped paying him, so he left --- he wasn't going to work for free --- faggot trolls want to be paid for their work, after all.

It is also possible that John Passaniti only really enjoyed attacking Jeff Fox, so after Jeff Fox died he got bored and left.
I was mostly ignoring his attacks, so I wasn't as much fun for him as Jeff Fox had been, as Jeff Fox always felt obliged to correct his lies.
Jeff Fox was a much bigger target than I was, because Jeff Fox had written a lot more Forth code than I had written.

hughag...@gmail.com

unread,
Dec 11, 2018, 11:24:50 PM12/11/18
to
Here is yet another John Passaniti fan:

On Tuesday, December 11, 2018 at 6:27:21 PM UTC-7, Brad Eckert wrote:
> John Passaniti was a good poster who knew his stuff. Hugh's abuse made him go away. I don't remember anything salacious.

Note that John Passaniti never posted any Forth code. When he posted code, it was in Perl, which is most likely the only language he knew.
So, when Brad Eckert says that he "knew his stuff," his "stuff" didn't include Forth programming.

Who else will pitch in and support John Passaniti and his "acid wit"?
Elizabeth Rather? Stephen Pelc? Bernd Paysan? All of you provided him with 100% support when he was here.
Don't back down now! Continue to support him 100% even now that he is gone!

I want ANS-Forth to be totally discredited, John Passaniti's affiliation with ANS-Forth is a big step in that direction.
Just because John Passaniti is gone from comp.lang.forth doesn't mean that he can't continue to discredit ANS-Forth.
To a large extent, John Passaniti is the poster boy for ANS-Forth and Forth-200x.
Whoever embraces ANS-Forth and Forth-200x must also embrace John Passaniti and his "acid wit."

Kerr-Mudd,John

unread,
Dec 12, 2018, 10:00:46 AM12/12/18
to
On Wed, 12 Dec 2018 04:24:48 GMT, hughag...@gmail.com wrote:

> On Monday, December 10, 2018 at 11:05:14 PM UTC-7, hughag...@gmail.com
> wrote:
>> Here are two quotes one decade later supporting John Passaniti:


Tedious personal attacks omitted.

Why oh why?

--
Bah, and indeed, Humbug.

minf...@arcor.de

unread,
Dec 12, 2018, 10:42:08 AM12/12/18
to
Figuratively: verbal entropy with a whiff of H2S added

c.l.f. is an unmoderated group about Forth, so whoever...
not that it matters, but most followers ignore that mad hatter