As a part of an ongoing interpreter project I had cause to develop a
symbol table in 6502 assembly. Since symbol lookup typically occurs
mostly during tokenisation and not runtime, I chose the following
implementation strategy:
- Symbols are stored sequentially in memory as Pascal String, with the
symbols address immediately following.
- At the oposing end of "symbol memory" a vector keeps track of the
start of each symbol and is "insertion sorted" when a new symbol is
created.
- Symbol lookup works as a binary search via the sorted vector
Being conscious of memory usage and management complexity I avoided
using a hashing algorithm for this project. Of course, as the symbol
table gets big, this approach becomes slower...
I'm interested to know (without having to do a lot of disassembly) how
others have handled this in their projects? Dave, how do you handle
symbols in vm02? Have others disassembled or written assemblers on the
Apple II or other small architectures?
Thanks,
Matt
The SYMASS assembler, published in source form in the Transactor magazine,
used an unsorted list of fixed-size 8-byte records (six bytes name and 16
bits of value). New entries grow the end of the list, searches start at the
beginning and go 'til they find the entry sought or reach the end of the
list.
I believe Merlin does something very similar.
The BASIC interpreter in the C64 stores scalar variables this way as well,
making each entry seven bytes even though only floating point variables use
all seven.
Given that arrays followed scalars in memory, making a block memory move
necessary in order to make room to add a new scalar, it doesn't seem to me
like it would be that much trouble to place each new scalar in alphabetical
sequence - just set the block move to adjust the following scalars as well
as the arrays. Even a linear search would on average double in speed, and
the fixed size would make binary searches feasible. Oh well - in practice I
just made sure the scalars I used most often appeared at or near the start
of the table (by making an explicit assignment to them at the start of
execution).
So this seems fairly common. Are more complicated schemes worthwhile? Well,
assume it takes 100 clock cycles to discover a mismatch and move to the next
candidate in the list. Then 100 mismatches take ten thousand cycles, and
1000 mismatches take a hundred thousand, or about a tenth of a second on a
1Mhz machine. How many entries in your table, and how often are you going to
search it?
If you expect a lot of "locality of reference", in that symbols used
recently are quite likely to be used again soon, you might implement some
sort of "move to front" algorithm for matched symbols, or keep a small cache
of recently used symbols that you look at before you search the main table.
- Anton Treuenfels
> Given that arrays followed scalars in memory, making a block memory move
> necessary in order to make room to add a new scalar, it doesn't seem to me
> like it would be that much trouble to place each new scalar in alphabetical
> sequence - just set the block move to adjust the following scalars as well
> as the arrays. Even a linear search would on average double in speed, and
> the fixed size would make binary searches feasible. Oh well - in practice I
> just made sure the scalars I used most often appeared at or near the start
> of the table (by making an explicit assignment to them at the start of
> execution).
That was the thinking that prompted the design I followed, except I
mitigated the cost of the block move with a vector (which of course
introduces indirection, but smc gets this back)
> So this seems fairly common. Are more complicated schemes worthwhile? Well,
> assume it takes 100 clock cycles to discover a mismatch and move to the next
> candidate in the list. Then 100 mismatches take ten thousand cycles, and
> 1000 mismatches take a hundred thousand, or about a tenth of a second on a
> 1Mhz machine. How many entries in your table, and how often are you going to
> search it?
Well, that's the problem - there's no way to tell what the gain is
without implementing the simpler method as well.....
> If you expect a lot of "locality of reference", in that symbols used
> recently are quite likely to be used again soon, you might implement some
> sort of "move to front" algorithm for matched symbols, or keep a small cache
> of recently used symbols that you look at before you search the main table.
That's a further optimisation I've considered. At least handling
locals in some sort of stack fashion is probably going to be a win.
Matt
> Hi,
>
> As a part of an ongoing interpreter project I had cause to develop a
> symbol table in 6502 assembly. Since symbol lookup typically occurs
> mostly during tokenisation and not runtime, I chose the following
> implementation strategy:
>
> - Symbols are stored sequentially in memory as Pascal String, with the
> symbols address immediately following.
> - At the oposing end of "symbol memory" a vector keeps track of the
> start of each symbol and is "insertion sorted" when a new symbol is
> created.
> - Symbol lookup works as a binary search via the sorted vector
If I understand your approach, it looks particularly economical. You
never have to move a symbol. Lookup is O(log<sub>2</sub>n), compared to
O(n) for a linear search. Storage averages n * (name-length + 5),
amortizing the cost of long names. You can perhaps use a mark and
release scheme to handle local symbols.
> Being conscious of memory usage and management complexity I avoided
> using a hashing algorithm for this project. Of course, as the symbol
> table gets big, this approach becomes slower...
But the growth is quite slow: eight comparisons for 256 symbols and only
10 for 1024 symbols. Inserts will have to move, on average, half the
vector entries; but that's better than having to move the symbols.
> I'm interested to know (without having to do a lot of disassembly) how
> others have handled this in their projects? Dave, how do you handle
> symbols in vm02? Have others disassembled or written assemblers on the
> Apple II or other small architectures?
Sorry, limited direct knowledge. For reference, I see that ProForth
(EDASM) runs 4314 lines and contains 1063 symbols.
I started with Lisa and BigMac/Merlin, both fast, in-memory assemblers.
Later I switched to Apple's EDASM and Kyan's AS, two disk-based
assemblers. A hard drive became essential, but symbol table space was no
longer a limitation. Which way are you leaning?
--
John B. Matthews
trashgod at gmail dot com
home dot woh dot rr dot com slash jbmatthews
That's it - although you forgot the search cost on insert. However,
you're already ahead over a naive implementation after a couple of
lookups...
> > Being conscious of memory usage and management complexity I avoided
> > using a hashing algorithm for this project. Of course, as the symbol
> > table gets big, this approach becomes slower...
>
> But the growth is quite slow: eight comparisons for 256 symbols and only
> 10 for 1024 symbols. Inserts will have to move, on average, half the
> vector entries; but that's better than having to move the symbols.
That's the big win. What's interesting is whether this will beat some
form of MRU cache. I suspect in this scenario it will, but I think
that assemblers tend to have a lot of locality; common symbols tend to
be at the top of the table (EQU's), branches are either local or close
to the bottom. Only subroutine and data table labels would tend to be
costly, but an MRU approach would seem to be the best there.
> > I'm interested to know (without having to do a lot of disassembly) how
> > others have handled this in their projects? Dave, how do you handle
> > symbols in vm02? Have others disassembled or written assemblers on the
> > Apple II or other small architectures?
>
> Sorry, limited direct knowledge. For reference, I see that ProForth
> (EDASM) runs 4314 lines and contains 1063 symbols.
How long does it take to build on your system?
> I started with Lisa and BigMac/Merlin, both fast, in-memory assemblers.
> Later I switched to Apple's EDASM and Kyan's AS, two disk-based
> assemblers. A hard drive became essential, but symbol table space was no
> longer a limitation. Which way are you leaning?
I started with an assembler of my own design, which provided symbol
table support for the mini-assembler via a small BASIC program. After
that, I did most of my assembler as optimisations for Apple Pascal
programs, so I used the p-System portable assembler which I rather
liked, but it was slow.
When I got into ProDOS stuff later on I bought myself a copy of Merlin
Pro since that was the assembler of choice of Sandy Mossberg who wrote
the disassembly lines column for Nibble Magazine. After that (and I
had decent mass storage on my IIe), I used the ProDOS EDASM which
remains my favourite, but I've tended to go back to Merlin since it's
both readily available and something of a defacto standard. I must
admit I find the user interface rather cumbersome these days, and
would rather move back to EDASM, but the lack of public availability
of documentation for it (I can't even find my manual anymore) makes me
uneasy, and would be a hassle for many I think, if I ever released
anything :-)
Matt
Hi Matt-
I used a multi-tiered approach with VM02 based on what Java class files
look like and Java bytecode executes. All symbol names in the Java
class file are strings and collected along with other sorts of runtime
data into something called the Constant Pool. Strings, strings,
everywhere. So, I have a hash table for all the strings in the system.
Along with a reference counted, handle based memory manager, I can
quickly add a string into the table and return a handle to it's memory.
If the same string is already resident in the table, I just return
it's handle and increment the reference count. String comparisons for
equality just become a 16 bit handle comparison.
Later in the runtime when the symbol is first referenced by a method
call or field load/store I have to traverse some object's superclass
hierarchy to resolve the handle to the code or data. I keep a full 32
bits per symbol reference in the Constant Pool so I can have 16 bits for
the string handle and 16 bits for the resolved code/data.
This ends up being a lot of book-keeping, but the late dynamic binding
nature of Java doesn't give you much option. VM02 also doesn't have a
very large global namespace to maintain (especially when you have to fit
everything in 64K!). There is the global class table that doesn't get
very large for a given program so it can be searched linearly - but
again, the results are saved in the symbol reference so it only happens
once per symbol. My approach also chews up memory pretty fast as
everything in the system is a 32 bit value. The upside is it runs
pretty fast once the initial binding has occurred - i.e. no more lookups.
The only part of this that is probably interesting to you is the string
hash table. My table is 256 entries of unordered linked lists. The
hash function is simple enough too - just rotate and xor through all the
characters in the string. I ran quite a few class files through the
hashing function and got a nice smooth distribution.
I don't know what your project is targeting, so I'm not sure any of
these ideas are useful. It always seems to end up being the memory vs
speed trade off.
Dave...
Sorry if I sound like I complain alot, but, here's a few
experienced programmers discussing a topic that could be
interesting to a less-experienced programmer, but, due to
you guys knowing what you mean, no examples are given.
That is the purpose of these groups, to educate those who
are less knowledgeable. That said, could you please post
some samples of what you're discussing, or give us links
to source code so we can see it for ourselves?
Thank you,
Bill Garber from GS-Electronics
http://www.garberstreet.com
That's interesting. How do you handle the case of iteration over a
collection? Is there some way to tell that the iterator is still
pointing at the same class type, hence no lookup, or does it have to
rehash every time a reference changes?
I don't bother trying to fit things into 64k. Well, sort of, but I
tend to work almost completely in the AUX bank and overflow into main.
> The only part of this that is probably interesting to you is the string
> hash table. My table is 256 entries of unordered linked lists. The
> hash function is simple enough too - just rotate and xor through all the
> characters in the string. I ran quite a few class files through the
> hashing function and got a nice smooth distribution.
So that's, per class, there's a 512 byte vector to unordered lists
containing the symbols? Impressive. Your memory manager must be pretty
sophisticated.
> I don't know what your project is targeting, so I'm not sure any of
> these ideas are useful. It always seems to end up being the memory vs
> speed trade off.
Oh, just another Lisp interpreter. Lisp programs tend to have lots of
symbols and strings. I'll have to do some runtime profiling and see
how it goes. Every now and then I get the functional programming bug,
which is probably why I still use Emacs :-) I also have the character
trait that it's more fun to write tools than to use tools...
Matt
> Sorry if I sound like I complain alot,
Its okay, we're used to it ;-)
> but, here's a few
> experienced programmers discussing a topic that could be
> interesting to a less-experienced programmer, but, due to
> you guys knowing what you mean, no examples are given.
> That is the purpose of these groups, to educate those who
> are less knowledgeable. That said, could you please post
> some samples of what you're discussing, or give us links
> to source code so we can see it for ourselves?
I fairly sure if the nomenclature we're using doesn't make sense, then
the source would be worse. That said, if there's specific concepts you
don't understand, I'll try to explain them succinctly. However this is
a discussion forum; it has to be a two way street. I'd suggest running
some of the terminology through Wikipedia, and posting again when you
have a more specific question.
Matt
>> Later in the runtime when the symbol is first referenced by a method
>> call or field load/store I have to traverse some object's superclass
>> hierarchy to resolve the handle to the code or data. I keep a full 32
>> bits per symbol reference in the Constant Pool so I can have 16 bits for
>> the string handle and 16 bits for the resolved code/data.
>>
>> This ends up being a lot of book-keeping, but the late dynamic binding
>> nature of Java doesn't give you much option. VM02 also doesn't have a
>> very large global namespace to maintain (especially when you have to fit
>> everything in 64K!). There is the global class table that doesn't get
>> very large for a given program so it can be searched linearly - but
>> again, the results are saved in the symbol reference so it only happens
>> once per symbol. My approach also chews up memory pretty fast as
>> everything in the system is a 32 bit value. The upside is it runs
>> pretty fast once the initial binding has occurred - i.e. no more lookups.
>
> That's interesting. How do you handle the case of iteration over a
> collection? Is there some way to tell that the iterator is still
> pointing at the same class type, hence no lookup, or does it have to
> rehash every time a reference changes?
>
Pretty much. When I load the class file and convert to an internal
representation is when I add the string to the string table and convert
it to a handle. That is really the only time hashing occurs (string
insertion). I save the string handle in the Constant Pool. Using 32
bits offers lots of options for encoding information in the object
reference itself. My actual object references are broken up into 16
bits of handle, 8 bits class index (type), and 8 bits type specific
value (arrays use this for dimension, strings for hash, etc). The
global class table is small; I started out at 2556 entries but now have
it at 64. It contains the handles to all the object's class data
structures. A single program (that will fit in an Apple II) doesn't use
a huge number of classes. I can always expand this back up if I find it
getting full - especially when I start using the AUX bank for bytecode.
So the jist of this is that the object reference is actually a small
data structure encoding enough data that no further lookups are required.
> I don't bother trying to fit things into 64k. Well, sort of, but I
> tend to work almost completely in the AUX bank and overflow into main.
>
>> The only part of this that is probably interesting to you is the string
>> hash table. My table is 256 entries of unordered linked lists. The
>> hash function is simple enough too - just rotate and xor through all the
>> characters in the string. I ran quite a few class files through the
>> hashing function and got a nice smooth distribution.
>
> So that's, per class, there's a 512 byte vector to unordered lists
> containing the symbols? Impressive. Your memory manager must be pretty
> sophisticated.
>
No, the string pool and class table are global. Symbols are stored in
each class's runtime Constant Pool and resolved by traversing the class
structure hierarchy (which contains the field's or method's name and
data). I wasn't sure when I started out how often I would actually have
to do string comparisons. String are immutable in Java, so once they're
there, they're there (unless their reference count goes to zero and they
can be deleted). The memory manager only gets complicated when garbage
collection or (soon to be) swapping happens. Otherwise the handles
could just be pointers.
>> I don't know what your project is targeting, so I'm not sure any of
>> these ideas are useful. It always seems to end up being the memory vs
>> speed trade off.
>
> Oh, just another Lisp interpreter. Lisp programs tend to have lots of
> symbols and strings. I'll have to do some runtime profiling and see
> how it goes. Every now and then I get the functional programming bug,
> which is probably why I still use Emacs :-) I also have the character
> trait that it's more fun to write tools than to use tools...
>
> Matt
Nice. Yours probably has a larger global symbol space than VM02.
Depending on how dynamic the strings are could help decide how to look
them up. As for tools, I hear you. I'm probably the worst Java
programmer on the planet. I still have no clue what the different
versions are or what generics do.
Dave...
Hi Bill-
I'm not sure I can be too helpful here. As Matt said, you're going to
have to do some legwork. However, there is nothing more educational
than trying an example yourself. Pencil and paper are the preferred
tools. Make a list of words and try to organize them for efficient
searching. What happens if the words change on the fly? What of a lot
more words are added or deleted? That's all we're really talking about.
As for source code, I did put the link to vm02 source awhile ago. If
you can make sense of my algorithms from cryptic 6502 assembly, you're a
stud. I still have to go back and study what the heck I was doing. It
isn't always immediately obvious - and I wrote it!
Dave...
> Nice. Yours probably has a larger global symbol space than VM02.
> Depending on how dynamic the strings are could help decide how to look
> them up. As for tools, I hear you. I'm probably the worst Java
> programmer on the planet. I still have no clue what the different
> versions are or what generics do.
Generics are just compiler sugar to save you the bother of casting
things in and out of collections. So instead of
Stack s = new Stack();
String top = (String) s.pop();
You can write:
Stack<String> s = new Stack<String>();
String top = s.pop();
It provides a slightly better type check and less messy code, but it's
nothing like the Templating mechanism in C++ or the Generics facility
in Ada. The same goes for all the extra pieces in the later versions;
the bytecode remains the same but you get Generics, Autoboxing,
foreach style iterators, and variable argument lists. A variable
argument list is also syntactic sugar, so while you can have:
void someMethod(int arg, int otherarg, ...) {}
In reality it' just:
void someMethod(int arg, int otherarg, Object[] varargs);
The calling convention looks nicer this way, but in reality you're
just doing:
Object args[] = {this, that, and, the, other};
someMethod(x,y, args);
Nothing fancy, but it did mean that Java finally got
System.out.printf() :-)
Matt
> Nice. Yours probably has a larger global symbol space than VM02.
> Depending on how dynamic the strings are could help decide how to look
> them up. As for tools, I hear you. I'm probably the worst Java
> programmer on the planet. I still have no clue what the different
> versions are or what generics do.
Generics are just compiler sugar to save you the bother of casting
> Nice. Yours probably has a larger global symbol space than VM02.
> Depending on how dynamic the strings are could help decide how to look
> them up. As for tools, I hear you. I'm probably the worst Java
> programmer on the planet. I still have no clue what the different
> versions are or what generics do.
Generics are just compiler sugar to save you the bother of casting
Now why the hell did google groups go and do that? Grr. I thought
they'd fixed that particular issue.
As long as it doesn't change the VM, it's cool with me :-)
Thanks for the description - you actually answered all my questions in
less than a page that I couldn't easily discover pouring through tons of
Sun documents.
Dave...
I thought maybe you wanted to make sure I got it through my thick skull
by repeating it three times ;-)
Lines 4314; Symbols 651; Time 88 s. [1.02 MHz, KEGSMAC]
Sorry I miscounted symbols before.
> > I started with Lisa and BigMac/Merlin, both fast, in-memory assemblers.
> > Later I switched to Apple's EDASM and Kyan's AS, two disk-based
> > assemblers. A hard drive became essential, but symbol table space was no
> > longer a limitation. Which way are you leaning?
>
> I started with an assembler of my own design, which provided symbol
> table support for the mini-assembler via a small BASIC program. After
> that, I did most of my assembler as optimisations for Apple Pascal
> programs, so I used the p-System portable assembler which I rather
> liked, but it was slow.
Disk-based. I was lucky to get a hard drive, even if I had to write the
driver:-)
> Symbols are stored sequentially in memory as Pascal String
Looking at the EDASM symbol table, I see they store variable length
strings with the high bit clear on the last character, a sort of
inverted DCI. It saves a byte per symbol. More here:
<http://home.woh.rr.com/jbmatthews/a2/rel.html#stform>
> As long as it doesn't change the VM, it's cool with me :-)
The only bytecode change was in 1.1 to introduce anonymous inner
classes so writing event listeners was less ugly.
1.2 bought Swing into the standard framework (This is what's known as
Java 2)
1.3 bought in the JIT compiler and a few more APIs
1.4 bought in the HotSpot JIT and more APIs
1.5 provides the syntactic compiler sugar and disk based JIT caching
for faster startups
1.6 is a performance oriented rewrite, and saw the rerelease under the
GNU GPL
> Thanks for the description - you actually answered all my questions in
> less than a page that I couldn't easily discover pouring through tons of
> Sun documents.
Glad to help!
Matt
Thank you Dr. Matthews, finally something tangible
in the way of reference material. ;-)
You're always helpful to me Dave. And I appreciate it.
So, from your English description, I gather it is
similar to what I learned the first year at Vo-Tech,
uh, 1969? OMG!, that long ago? Anyway, our computer
class was responsible for keeping track of the students
and their information, including activities, grades,
etc..... We took turns at different activities, IE;
typing in the data onto punch cards, yuck!, or collating
them, and this was done by programming it by jumper wires,
and here is where what you're talking about comes in,
each day we'd collate a little differently, by name,
or grade, or class. The front office determined what
data we would use for this.
So, I know what you're talking about, I just like to
see how each programmer handles it. Every one is different.
Thanks for clarifying this for me. ;-)
See, sometimes legwork doesn't always help everytime now, huh? ;-)
Hee-hee! ;-)
Oops, forgot the starting address:
<http://home.woh.rr.com/jbmatthews/a2/rel.html#stform>
> Thank you Dr. Matthews, finally something tangible
> in the way of reference material. ;-)
I thank the gentleman from the great state of PA for his kind remarks!
Near as I can tell, EDASM uses a linear search for existing entries when
it builds it's symbol table. The more unique symbols (named addresses),
the slower it gets:
<http://en.wikipedia.org/wiki/Linear_search>
Matt proposes to use a binary search to speed things up. I still recall
when my daughter figured out how to use it for guessing numbers:
<http://en.wikipedia.org/wiki/Binary_search>
I would expect though that in an assembler, you'd search alternately
from the start and end of the table, and your expected performance
would be in that case, significantly better than O(n). It's still
going to be inferior to O(Log n), but it might be a net win given the
low insertion cost.
However it does seem silly in the case of a REL module to not write
the module out sorted ...
> Matt proposes to use a binary search to speed things up. I still recall
> when my daughter figured out how to use it for guessing numbers:
We all do quasi-binary searching when flipping through a phone
book :-)
Matt
Ah, I'd forgotten that optimization. EDASM uses variable length strings,
but it has no apparent structure corresponding to your vector. I don't
see how it could scan backward from a known entry using the name's sign
bit without stumbling over the intervening flag/value bytes.
> However it does seem silly in the case of a REL module to not write
> the module out sorted.
Use the LST directive, adding either [NO]asym (alpha symbol table) or
[NO]vsym (value symbol table):
<http://home.woh.rr.com/jbmatthews/a2/rel.html#tblb1>
> > Matt proposes to use a binary search to speed things up. I still recall
> > when my daughter figured out how to use it for guessing numbers:
>
> We all do quasi-binary searching when flipping through a phone book :-)
But not table of logarithms :-)
<http://en.wikipedia.org/wiki/Benford's_law>
Ok, so, using the phone book example, most of the listing
includes the entire name, <Last><First><Middle/Initial>,
but in our's if there are more than 10 people with the
same last name, they print the last name, and a sub-list
of first names, etc.....
How would a program search a listing set up in this way?
* Understandable that 'I' wouldn't set it up like this. *
With just 10 names that match the last name, a linear search would
suffice. However, if you live in the Bay Area, that would break down
searching Nguyen.
Java matches on name and signature. The signature is a derived string
based on class, type, parameters, etc. The combination of the two must
be unique, but the typical case is only a few unique signatures for a
matching name. So I simply match the name first, then check the
signature for a complete match.
Every situation will be different, so should your solution. That's why
there are still jobs for software designers and not just cookie cutter
solutions developed 40 years ago that solve all problems.
Having said that, you should pick up Knuth's "The Art of Computer
Programming" series, a 40 year old classic ;-)
http://en.wikipedia.org/wiki/The_Art_of_Computer_Programming
Dave...
> Ok, so, using the phone book example, most of the listing includes the
> entire name, <Last><First><Middle/Initial>, but in our's if there are
> more than 10 people with the same last name, they print the last name,
> and a sub-list of first names, etc.....
> How would a program search a listing set up in this way?
> * Understandable that 'I' wouldn't set it up like this. *
You'd do a two-stage search--first finding the matching last name,
then searching for the matching first name.
If this organization were used in a form suitable for a computer
search (not a human one), there would be a link associated with
each last name that would point to the next last name, so the
last names could be searched without bothering with the first
names.
A related method that actually is used in computer search data
structures is a "radix" tree, where each node can branch up to
128 ways (in the case of 7-bit ASCII), with each branch corresponding
to the next letter having the value of the number of the branch.
This is like the "last name" factoring--but on each successive letter.
Or the common "first letter" factoring (all the A's, then all the B's,
etc.), with the factoring continued recursively at each successive
letter position.
The time to look up a word in such a table is of the order of the
length of the word, independent of the size of the table. Of course,
in practice, there would be serious time penalties if the part of the
table being traversed resided in slower memory. ;-)
A closely-related method which represents the branching links sparsely,
with only those actually used present, labeled with their respective
character, is known as a "trie"--a tree organized to expedite
re*trie*val.
-michael
AppleCrate II: An Apple II "blade server"!
Home page: http://members.aol.com/MJMahon/
"The wastebasket is our most important design
tool--and it's seriously underused."
My 2nd edition is only 35 years old; I'm such a newbie! :-)
There are 4 volumns, do I need them all?
Serious confession - I don't have any of them. One of those, I will get
around to it when I have time. I looked at MIX, the sample assembly
language used for the examples many years ago. Don't remember a thing.
So your assignment is to pick up Vol 1 and give us a critique after
you finish it :-)
Dave...
> There are 4 volumns, do I need them all?
Actually, there's 7. Only the first 3 are published in their entirety
at this point.
Knuth's work is the "Principia Mathematica" of computer science;
brilliant, but very hardcore.
A more accessible work would be "Data Structures using C" by
Tenenbaum. Assuming you're familar with C that is. You can probably
pick up a 2nd hand copy from Amazon for next to nothing.
Matt
Volume 2 is Sorting and Searching.
But also the only work that actually analyzes the speed of algorithms
as they would execute *on a machine*--MIX until recently, and now (or
soon!) in a more RISC-like pseudo-machine. The part that is left out
is all the complexity of today's multi-threaded, cached, superscalar,
out-of-order machine implementations, and the significant implications
that complexity has for actual performance.
The rules for computer performance have changed radically from the
days of cycle counting, just as circuit speeds are now limited not
by transistors and gate delays, but by wiring delays and, ultimately,
power densities.
It's pretty clear that a major reason that the later "chapters" have
not been published is that Don continues to spend considerable effort
on correcting and updating the published parts. He is a real stickler
for accuracy and completeness! (Does this remind you of any programmers
you have known? ;-)
Michael J. Mahon wrote:
> > There are 4 volumns, do I need them all?
>
> Volume 2 is Sorting and Searching.
Not in my copy; Volume 2 is Seminumerical Algorithms, and Volume 3 is
Sorting and Searching.
IMO, you can't read the others without starting out on Volume 1 though
(being Fundamental Algorithms)
Matt
> A related method that actually is used in computer search data
> structures is a "radix" tree, where each node can branch up to
> 128 ways (in the case of 7-bit ASCII), with each branch corresponding
> to the next letter having the value of the number of the branch.
>
> This is like the "last name" factoring--but on each successive letter.
> Or the common "first letter" factoring (all the A's, then all the B's,
> etc.), with the factoring continued recursively at each successive
> letter position.
>
> The time to look up a word in such a table is of the order of the
> length of the word, independent of the size of the table. Of course,
> in practice, there would be serious time penalties if the part of the
> table being traversed resided in slower memory. ;-)
Radix trees are an example of a 'computationally optimal' algorithm,
meaning space is traded for time to the point where the time spent
cannot be any shorter. In practice, you'd rarely use this algorithm
since as you point out, part of the table would have to end up in
secondary storage. In this case, it's often faster to use for example
a hash table which would be more space efficient with somewhat higher
time cost, as you might avoid having to move to slower memory.
> A closely-related method which represents the branching links sparsely,
> with only those actually used present, labeled with their respective
> character, is known as a "trie"--a tree organized to expedite
> re*trie*val.
Anyone who uses regular expressions regularly should think about such
reductions to FSMs... Very cool stuff..
Matt
> It's pretty clear that a major reason that the later "chapters" have
> not been published is that Don continues to spend considerable effort
> on correcting and updating the published parts. He is a real stickler
> for accuracy and completeness! (Does this remind you of any programmers
> you have known? ;-)
Now that you mention it ... :-)
I tend to find that the for-pay work I do, where many compromises must
be made to achieve timely outcomes makes me desire "best I can do"
quality in my personal projects. With regards to Apple II stuff, I
tend to do things purely for the academic pleasure of it; I enjoy that
there are no deliverables, timeframes or other such pragmatic
considerations to distract me from idealism :-)
I've just spent 3.5 months away from home on a contract, and am very
much looking forward to being able to immerse myself in my Apple II's
again in a few weeks. I started this thread because I was musing one
evening about a project that had been on ice for about 2 years!
There's a few things that have been "on the cooker" so once I get home
and settled back in I'll have to *focus* on actually delivering
something useful ;-)
Matt
Alibris has a boxed Vol.1-3 hardbound set for $150.87,
and Vol.4 softbound for $15.78, in case you're interested.
That's a bit much, don't you think? $50 per book.
Bill Garber from GS-Electronics
http://www.garberstreet.com
"If you wish to forget anything on the spot,
make a note that this thing is to be remembered."
(Edgar Allen Poe)
I understand exactly. ;-) Don is much the same, I think, with a
liberal dose of OCD on top of it--like many of us. ;-)
In 1967, I used mimeographed (!) drafts of parts of chapter 7 in a
class he taught, so it's been a long time coming... (This was the
same time frame in which he said jokingly: "There's no problem in
computer science that can't be fixed by another level of indirection!")
> I've just spent 3.5 months away from home on a contract, and am very
> much looking forward to being able to immerse myself in my Apple II's
> again in a few weeks. I started this thread because I was musing one
> evening about a project that had been on ice for about 2 years!
Such musings about interesting projects are, for me, one of the
most pleasurable activities related to the Apple II--almost as
good as settling on alternatives and actually making something! ;-)
> There's a few things that have been "on the cooker" so once I get home
> and settled back in I'll have to *focus* on actually delivering
> something useful ;-)
You've got *my* interest... ;-)
Aaack! Of course! (Slapping forehead...)
I still remember waiting for it to come out...
> IMO, you can't read the others without starting out on Volume 1 though
> (being Fundamental Algorithms)
I would agree. The foundation is required, and it's a very enjoyable
read.
> Alibris has a boxed Vol.1-3 hardbound set for $150.87,
> and Vol.4 softbound for $15.78, in case you're interested.
>
> That's a bit much, don't you think? $50 per book.
It's a bargain. By the way, Don pays a finders fee of $2.56 (one
hexadecimal dollar :-) ) For each error. So in the unlikely event that
there's 60 errors left (those checks are probably the most sought
after geek prize) you could come out in front ;-)
Also with regards to volume 4, there's actually several 'fascicles'
published (from Wikipedia)
Volume 4, Fascicle 0: Introduction to Combinatorial Algorithms and
Boolean Functions. 2008. ISBN 0-321-53496-4
Volume 4, Fascicle 1: in preparation. (hopefully soon)
Volume 4, Fascicle 2: Generating All Tuples and Permutations, 2005.
ISBN 0-201-85393-0
Volume 4, Fascicle 3: Generating All Combinations and Partitions,
2005. ISBN 0-201-85394-9
Volume 4, Fascicle 4: Generating All Trees -- History of Combinatorial
Generation, 2006. ISBN 0-321-33570-8
Additionally, there is:
Volume 1, Fascicle 1: MMIX — A RISC Computer for the New Millennium,
2005. ISBN 0-201-85392-2
So a "complete set" (with more to come!) will set you back quite a
bit...
Matt