In article <wkr9ss1obg....@mit.edu>, David Bakhash <ca...@mit.edu> wrote:
>This, somehow, never was overlooked. I cannot thank you enough for >clearing up this issue. it was bothering me. I never thought about >using the package system + CLOS for hiding data. I guess I'll just >_NOT_ export the slot names, and be done with that. That makes perfect >sense, though doing it might open up some issues.
Be careful, though. I've seen some people try to use the package system to implement the same granularity of data hiding as they can get with some other OO languages' public/private distinctions. It may be possible to do this, but I would recommend against trying. The package system is tricky, even for expert CL programmers (it's no coincidence that the package system is described in Chapter 11 of both CltL editions and the ANSI spec). It's best used for coarse levels of sharing -- typically an application or library would define one or two packages. You'll likely go crazy if you try to implement per-class packages.
-- Barry Margolin, bar...@bbnplanet.com GTE Internetworking, Powered by BBN, Burlington, MA *** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups. Don't bother cc'ing followups to me.
>this sounds really interesting. I must admit that I had not known of >TRIE, and still know nothing.
>I highly appreciate this information, though I must admit that I'm >skeptical at how it can _possibley_ be faster than a hashtable.
It's because to compute a hash value you have to process all the characters of your string.
in a hash table you 1 compute the hash value in O(length(key)) 2 compare the found item to the key also in O(lengthKey) 3 you compare more keys if they are collisions
On the other hand, TRIES and their look alike process character by character. so if the key is not there you can know if faster.
All this depends on the statistical repartition of your keys. You can imagine a data set with 100 characters keys on where the first 5 are usually enough to differentiate them. In that case It's likely that hashtables will be slower.
In article <7F6861BC8A65CC8B.F0645AE687214FDC.64E9F9773E5A7...@library-proxy.airnews.net>,
Marc Battyani <Marc_Batty...@csi.com> wrote: >You can imagine a data set with 100 characters keys on where the first 5 are >usually enough to differentiate them. >In that case It's likely that hashtables will be slower.
If you have control over the hashing function, you could simply hash on only the first 5 characters. Items where the difference is only in the remaining characters would collide, and be differentiated by the collision resolution routine.
However, tries and their ilk don't require you to know where the cutoff is a priori. They'll get faster all by themselves depending on the actual distribution of the data.
-- Barry Margolin, bar...@bbnplanet.com GTE Internetworking, Powered by BBN, Burlington, MA *** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups. Don't bother cc'ing followups to me.
* "Marc Battyani" <Marc_Batty...@csi.com> | It's because to compute a hash value you have to process all the | characters of your string. : | All this depends on the statistical repartition of your keys. You can | imagine a data set with 100 characters keys on where the first 5 are | usually enough to differentiate them. In that case It's likely that | hashtables will be slower.
in all likelihood, a hashing function is able to take advantage of the same property of the input, and but then somewhat more for collisions.
the real difference, as I see it, is that tries will be able to take advantage of _any_ smallest set of differences, so if you have a bunch of people names starting with "Smith", the trie will effectively just move past the first 5 and use the next 5 characters, instead. a hash function would have to be excessively intelligent to adapt the same way.
#:Erik -- SIGTHTBABW: a signal sent from Unix to its programmers at random intervals to make them remember that There Has To Be A Better Way.
In article <3125682171162...@naggum.no>, Erik Naggum <e...@naggum.no> wrote:
> in all likelihood, a hashing function is able to take advantage of the > same property of the input, and but then somewhat more for collisions.
> the real difference, as I see it, is that tries will be able to take > advantage of _any_ smallest set of differences, so if you have a bunch of > people names starting with "Smith", the trie will effectively just move > past the first 5 and use the next 5 characters, instead. a hash function > would have to be excessively intelligent to adapt the same way.
Well, everyone can stop worrying about the Y2K problem. Erik and I just posted almost the same comment, and I think agreement between us is one of the signs of the Apocalypse. :-)
-- Barry Margolin, bar...@bbnplanet.com GTE Internetworking, Powered by BBN, Burlington, MA *** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups. Don't bother cc'ing followups to me.
In article <AXLo2.319$oD6.26...@burlma1-snr1.gtei.net>, Barry Margolin <bar...@bbnplanet.com> wrote:
>In article <wkr9ss1obg....@mit.edu>, David Bakhash <ca...@mit.edu> wrote: .... >Be careful, though. I've seen some people try to use the package system to >implement the same granularity of data hiding as they can get with some >other OO languages' public/private distinctions.
Yes. I'd recommend more of Modula3 type of approach with an interface file which did the namespace definition and the import/export business and a definition file which did the "defining". [ Even though keeping the interface and definition synchronized involves lots of manual grunt work.]
Long chains of "public" hierarchies can all be in the same file since they, for the most part, share namespace. Even "private" hierarchies as long at the author interally respects the abstraction boundaries. You do have to be more "disciplined" when using CLOS+packages then when depending upon the C++ compiler as an abstraction boundary "enforcer".
This gives you "libraries" (or a framework) of classes.
Probably a closer analogy would be to Java (and its packages) than to C++ (namespaces). You can develop quite a maze of namespaces with names space control comingled with the object class definitions quite a bit more easily with public/protected/private than with namespace and class definition mechanisms seperated.
> The package system is tricky, >even for expert CL programmers
Unfortunately this is true. :-( Not so much that the mechanisms' syntax is tricky, but it is the "landmines" you may stumble across if certain aspects of the language interact in nonintuitive ways. And the fact that package system changed significantly so there is legacy issues to deal with.
The "defsystem" , "modules" , and "package" solution is one of those things CL doesn't quite do in an "inspired" fashion. You can make it work. However, it usually seems more painful than it "should be".
I think some people get "bit" by some "package problem" and then just shy away from it in the future. And I'm sure there are also more than few for whom in their standard "modus operandi" it all works well.
--
Lyman S. Taylor "The Borg -- party poopers of the Galaxy. " (ly...@cc.gatech.edu) EMH Doctor Star Trek Voyager.
* Barry Margolin <bar...@bbnplanet.com> | Well, everyone can stop worrying about the Y2K problem. Erik and I just | posted almost the same comment, and I think agreement between us is one | of the signs of the Apocalypse. :-)
problem is, we might have caused the OSCE incident in Kosovo. the sum total of hostilities in the world has to remain constant, so what was the world to do? things just _had_ to get ugly somewhere else, instead.
#:Erik -- SIGTHTBABW: a signal sent from Unix to its programmers at random intervals to make them remember that There Has To Be A Better Way.
Erik Naggum <e...@naggum.no> writes: > * Barry Margolin <bar...@bbnplanet.com> > | Well, everyone can stop worrying about the Y2K problem. Erik and I just > | posted almost the same comment, and I think agreement between us is one > | of the signs of the Apocalypse. :-)
> problem is, we might have caused the OSCE incident in Kosovo. the sum > total of hostilities in the world has to remain constant, so what was the > world to do? things just _had_ to get ugly somewhere else, instead.
On top of that I just got (again) the Internet Joke stating that
Erik Naggum <e...@naggum.no> wrote: +--------------- | the real difference, as I see it, is that tries will be able to take | advantage of _any_ smallest set of differences, so if you have a bunch of | people names starting with "Smith", the trie will effectively just move | past the first 5 and use the next 5 characters, instead. +---------------
Exactly. Plus, if you're doing typical symbol-table or other plaintext stuff, you can *significantly* cut the amount of storage used with tries by various forms of "compression" on each node (much like the way most C compilers generate different styles of code for "switch" statements depending on the pattern of "case" values), tagging each node with its "style" or kind:
- As Erik notes, you can collapse a run of fanout==1 nodes into a single "match"-only node consisting of a string and one further node pointer.
- If the degree of fanout is >1 but still very low, the node might be encoded as a small string that's searched sequentially, together with a vector of the same length (indexed by the searched-for character's position in the string). While this is a good deal slower than a pure "radix-sort" style index, it's also *much* smaller.
- With higher fanout, a node might contain "min" & "max" character codes, and a vector as large as "max - min + 1", indexed by "this_char - min".
- With still higher fanout, revert to a full max-char-code size vector. In most practical applications, you'll only need a few nodes like this (maybe even only the root node).
+--------------- | a hash function would have to be excessively intelligent to adapt | the same way. +---------------
That reminds me of a trick that can help hash tables: If you use a really good hash, one for which nearly every bit position contains nearly a full bit of information (a CRC-32 is such a hash), then if you store the *full* hash value with the entry, when you want grow the hash table because the per-bucket chains get too long, you don't have to re-hash the keys -- you can just use more bits of the original hash value.
Finally, there's an interesting hybrid of hash tables and tries -- a multi-level hash table that uses different bits of the hash at each level of the tree/trie (that is, it's a binary tree if you only use one bit of the hash per level). When you hit a leaf node you do a comparison on the full hash first before wasting time with a full string compare. If the hashes are the same but the strings differ, then the strings are probably *very* different (assuming a "good" hash, such as a CRC), so hanging a "compressed trie" (as above) off that node should be a cheap way to resolve the conflict (that is, the strings will very likely differ in the first character!).
[Hmmm... It would be interesting to compare these three methods for space & speed to implement "intern"...]
-Rob
----- Rob Warnock, 8L-855 r...@sgi.com Applied Networking http://reality.sgi.com/rpw3/ Silicon Graphics, Inc. Phone: 650-933-1673 2011 N. Shoreline Blvd. FAX: 650-964-0811 Mountain View, CA 94043 PP-ASEL-IA
r...@rigden.engr.sgi.com (Rob Warnock) writes: > That reminds me of a trick that can help hash tables: If you use a really > good hash, one for which nearly every bit position contains nearly a full > bit of information (a CRC-32 is such a hash), then if you store the *full* > hash value with the entry, when you want grow the hash table because the > per-bucket chains get too long, you don't have to re-hash the keys -- you > can just use more bits of the original hash value.
It is not really a 'trick' it is the base for some fancy indexing schemes used in data bases in alternative to BTree based data structures.
> Finally, there's an interesting hybrid of hash tables and tries -- a > multi-level hash table that uses different bits of the hash at each level > of the tree/trie (that is, it's a binary tree if you only use one bit of > the hash per level). When you hit a leaf node you do a comparison on > the full hash first before wasting time with a full string compare. > If the hashes are the same but the strings differ, then the strings are > probably *very* different (assuming a "good" hash, such as a CRC), so > hanging a "compressed trie" (as above) off that node should be a cheap > way to resolve the conflict (that is, the strings will very likely differ > in the first character!).
> [Hmmm... It would be interesting to compare these three methods for > space & speed to implement "intern"...]
On Mon, 18 Jan 1999 19:28:32 GMT, Barry Margolin <bar...@bbnplanet.com> wrote: >In article <wkr9ss1obg....@mit.edu>, David Bakhash <ca...@mit.edu> wrote:
>Be careful, though. I've seen some people try to use the package system to >implement the same granularity of data hiding as they can get with some >other OO languages' public/private distinctions. It may be possible to do >this, but I would recommend against trying. The package system is tricky, >even for expert CL programmers (it's no coincidence that the package system >is described in Chapter 11 of both CltL editions and the ANSI spec). It's >best used for coarse levels of sharing -- typically an application or >library would define one or two packages. You'll likely go crazy if you >try to implement per-class packages.
This is exactly what I'm doing in an application I'm currently working on. Why would per-class packages be so difficult. They seem to implement the exact behavior I desire without any great complexity.
Could you please expand on the difficulties that one might encounter using this method of data encapsulation?
Anyone who challenges the prevailing orthodoxy finds himself silenced with surprising effectiveness. A genuinely unfashionable opinion is almost never given a fair hearing. -- George Orwell
>> If he does much CLOS, he'll get so fed up of calling *everything* >> virtual that he'll stop bothering. Then there'll be no problem. >> (Until he goes back to C++ and starts asking about generic functions >> in C++. (-: )
> He wasn't calling everything virtual, he was making a point in a discussion > about programming methodologies. The fact that CLOS methods have the > properties associated with virtual member functions in another language I > won't name was germaine to his point. He expressed it very clearly, IMHO.
He said "all methods in CLOS are virtual". I have no quarrel with that statement; I agree that it's a sensible thing for someone whose background is in C++ to use; I have (so far as I can see) no disagreement with you here. (I also understand why Erik is bothered by people using foreign terminology to describe Lisp, though I think he's wrong. That's why I made the observation that anyone who spends a lot of time doing CLOS will get used to all methods being "virtual" in CLOS, and therefore stop using the word except when trying to relate CLOS and C++; for which reason, I don't think his concern that you were encouraging someone in a bad habit was justified.)
-- Gareth McCaughan Dept. of Pure Mathematics & Mathematical Statistics, gj...@dpmms.cam.ac.uk Cambridge University, England.
> In article <AXLo2.319$oD6.26...@burlma1-snr1.gtei.net>, > Barry Margolin <bar...@bbnplanet.com> wrote: > >In article <wkr9ss1obg....@mit.edu>, David Bakhash <ca...@mit.edu> wrote: > .... > >Be careful, though. I've seen some people try to use the package system to > >implement the same granularity of data hiding as they can get with some > >other OO languages' public/private distinctions.
> Yes. I'd recommend more of Modula3 type of approach with an > interface file which did the namespace definition and the import/export > business and a definition file which did the "defining". [ Even though > keeping the interface and definition synchronized involves lots of > manual grunt work.]
An other alternative I've seen is having an interface file just defining the package and a macro DEFEXPORT to define exported functions (and analoguous macros for exported macros, constants etc.). This solves the syncronisation problem.
Does anyone have experience with this style?
I find it hard to understand what exactly about the CL package mechanism hinders people to create many small (fine grained) name spaces. Perhaps the fact that for easy interactive use they all would tend to get used in COMMON-LISP-USER anyway. Garnet seems to be the exception as their recommended style advises against using the package and for an explicit package:symbol style.
-- Lieven Marchand <m...@bewoner.dma.be> --------------------------------------------------------------------------- --- Few people have a talent for constructive laziness. -- Lazarus Long
> The package system is tricky, > even for expert CL programmers (it's no coincidence that the package system > is described in Chapter 11 of both CltL editions and the ANSI spec). It's > best used for coarse levels of sharing -- typically an application or > library would define one or two packages.
Choosing an appropriate size for packages, epecially in large applications, is something that is difficult to call and, I guess, there's no "right" answer.
What do people tend to do? How large a logical unit should a package be? How large in terms of lines of code or exported symbols should a package be? What other measures might be useful when considering the size of packages? Are mutually dependent (either directly or indirectly) packages always a Bad Thing? Are there any style guides that address these issues in detail?
In article <787nse$ni...@nickel.uunet.be>, Lieven Marchand <m...@bewoner.dma.be> wrote:
>I find it hard to understand what exactly about the CL package mechanism >hinders people to create many small (fine grained) name spaces. Perhaps >the fact that for easy interactive use they all would tend to get used >in COMMON-LISP-USER anyway. Garnet seems to be the exception as their >recommended style advises against using the package and for an explicit >package:symbol style.
Unlike most other things in Lisp, it can be tricky to make package changes incrementally. Things work best if all the package settings (imports, exports, shadowing, etc.) are made before any other uses of the package, because of the way they pervade. If you have lots of little packages, the web of relationships gets more complicated.
-- Barry Margolin, bar...@bbnplanet.com GTE Internetworking, Powered by BBN, Burlington, MA *** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups. Don't bother cc'ing followups to me.
Barry Margolin <bar...@bbnplanet.com> writes: > In article <787nse$ni...@nickel.uunet.be>, > Lieven Marchand <m...@bewoner.dma.be> wrote: > >I find it hard to understand what exactly about the CL package mechanism > >hinders people to create many small (fine grained) name spaces.
> Unlike most other things in Lisp, it can be tricky to make package changes > incrementally. Things work best if all the package settings (imports, > exports, shadowing, etc.) are made before any other uses of the package, > because of the way they pervade. If you have lots of little packages, the > web of relationships gets more complicated.
Good point. The languages in which I most like the module/package system, Ada and Modula-3, are both statically compiled and come with a build system that figures out the package dependencies automagically which takes care of the web of relationships. In such a language, small packages are beneficial to avoid long compilation times for localized changes, but in Lisp you already have a better way for this.
-- Lieven Marchand <m...@bewoner.dma.be> --------------------------------------------------------------------------- --- Few people have a talent for constructive laziness. -- Lazarus Long
>>>>> "David" == David Bakhash <ca...@mit.edu> writes:
David> this sounds really interesting. I must admit that I had David> not known of TRIE, and still know nothing.
David> I highly appreciate this information, though I must admit David> that I'm skeptical at how it can _possibley_ be faster than David> a hashtable.
I'll use the regular trie here, because I know it better. Roughly speaking:
Searching a trie and coming up with an element if one exists is roughtly an O(n) operation, with n being the length of the indexing string. So (trie-ref trie "foo") takes three steps. But if the element does not exist (suppose there are no strings starting with #\f in the trie), then it can take less than n steps to reject the reference.
A hashing function on strings takes some number of characters and combines them to come up with an index into an array. This is an O(n) operation with n being the length of the indexing string (although some implementations arbitrarily cut the hashing function off at say the 10th character). This hashing function must be performed regardless of success or failure, and so a hashing function always requires n steps.
In practice, this isn't really very interesting, and tries may or may not have poor cache behaviour etc., so a simple version just says hash tables and tries are about as fast unless you deal with ridiculously long strings with the same, say, first 10 characters as a prefix.
Tries have another advantage; they grow better than hash tables (lookup *always* takes about O(n) in length of string, regardless of how big the trie is; closed hash tables often are worse than O(n) in the length of the string for large tables and open hash tables often need resizing, so they're only O(n) in the length of the string amortized), but they're also nice from a functional point of view because you can add nodes to a trie and retain the old one.
It's an interesting data structure to be sure, but not particularly well known, when it is better than a hash table it is often not much better, implementing them without wasting a great deal of space can be very annoying, and various other caveats. It is very useful for a programmer in a purely functional language like Haskell because Haskell doesn't *get* hash tables (copying the array around to update one element is ridiculous but necessary), and I can see the use for a symbol table in a compiler, but for the most part it's not really worth bothering with, like splay trees[0].
[0] Splay trees are a famous amortized data structure that have advantages but are often not *so* much better than e.g. red black trees that it's worth the effort to implement them. -- Graham Hughes <ghug...@cs.ucsb.edu> PGP Fingerprint: 36 15 AD 83 6D 2F D8 DE EC 87 86 8A A2 79 E7 E6
It turns out that these data structures -- tries -- are exactly what I was looking for. I guess it's not important to explain why, because what I'm doing is so specfic that it's not too interesting. But since you guys have been so helpful, I may as well try to thank you by describing what I was doing.
I am making a huge string table. There are times when I'll look up the string "the" and, right afterwards, I might look up the string "them". In this case, it might help to retain the node that the "the" got me to, and just keep going from there. Either way, the point is that there's a lot of shared data, and so the potential for these tries in my code is great. I was still hoping that someone would lend me some code that implements tries (in CL, of course). but otherwise, I'll simply try to find out where the algorithms are documented and redo the work.
but I have, very successfully, implemented keyless hash-tables (I call them `array-tables'), and they perform comparably to ACL's built-in hash-tables, and (in my opinion) are extremely easily integrated into programs which use hash-tables. I implementing a macro called `array-table-loop', which lets you do things like:
(array-table-loop for entry being the array-table-values of atable [do|maximize|..etc..] ...[etc.])
so you can still get your hash-table iteration working. CLOS made making this very elegant, and it seems to run pretty fast. Oh, and one nice thing about the keyless array-tables: they arn't limited in the :test that they can use. so, for example, you can use #'string-equal instead of #'equalp. I think this is a big plus, and I'm still wondering why ANSI CL limits the :test for hash-tables to (member #'eql #'eq #'equal #'equalp). I guess it's because hash-tables in CL can store _any_ type of object, and #'string-equal, #'char=, and friends don't operate on just any Lisp object.
David Bakhash <ca...@engc.bu.edu> writes: > I think this is a big plus, and I'm still wondering why ANSI CL > limits the :test for hash-tables to (member #'eql #'eq #'equal > #'equalp). I guess it's because hash-tables in CL can store _any_ > type of object, and #'string-equal, > #'char=, and friends don't operate on just any Lisp object.
I don't think this is the real reason the test functions are limited to those you enumerated.
The choice of test function is limited because the test function must always correspond to a hash function (specifically, for any two keys A and B, it must be that `A equals B' implies `hash(A) == hash(B)'; the reverse needn't be true.)
So if the user specifies a test function of his own, he should also specify a hash function of his own, and that hash function should be able to cope with all the Lisp types that could be a part of the key. Implementing such a hash function without delving into object allocation internals does not sound feasible.
In article <cxjk8ydoqh7....@engc.bu.edu>, David Bakhash <ca...@engc.bu.edu> wrote: ...
>I'm still wondering why ANSI CL limits the :test for hash-tables to >(member #'eql #'eq #'equal #'equalp). I guess it's because
If your key is the object identity EQ ( or to a lesser extent EQL ) think about what happens when you do a garbage collection. Namely you have a nonconservative collector that moves objects.
When those objects "move" their identity changes. The gc may change all of the references to the "new" identity, but the hash table was indirectly dependent upon those "old" ones. Namely your object will likely hash to different spot now.
That's OK. you can simply rehash the whole table before the next lookup. Or follow some other amortized resolution scheme.
EQUAL and EQUALP should be uninpacted by this. They don't care because they're based on a different sense of identity. So you need not only a test but also something that "says" what sort of policy you do post GC. I think common lisp designers chose to let only the implementors worry about these issues.
P.S. this may mean that EQ is probably _not_ a good :TEST argument to your new construct. You just may not have been bitten by it yet. [ if you exhaustively search the table until all keys are examined then fine. However, you can't really say that it has O(1) lookup characteristics. ]
You may have used SXHASH in all cases regardless of resolution predicate. The negates the problem of starting to "look" in a different place. I'm not sure the designeres wanted to constrain the implementors to this or if this doesn't have hash code distrubution problems where there are more collisons than "necessary". [ For instance I think I posted an example where an implementation use the size of a vector as its SXHASH code, irregardless of contents. A hash table of same size vectors would have horrific ( OK, linear) performance in this case. ]
--
Lyman S. Taylor "The Borg -- party poopers of the Galaxy. " (ly...@cc.gatech.edu) EMH Doctor Star Trek Voyager.
< hey, < < It turns out that these data structures -- tries -- are exactly what I < was looking for. I guess it's not important to explain why, because < what I'm doing is so specfic that it's not too interesting. But since < you guys have been so helpful, I may as well try to thank you by < describing what I was doing. < < I am making a huge string table. There are times when I'll look up < the string "the" and, right afterwards, I might look up the string < "them". In this case, it might help to retain the node that the "the" < got me to, and just keep going from there. Either way, the point is < that there's a lot of shared data, and so the potential for these < tries in my code is great. I was still hoping that someone would lend < me some code that implements tries (in CL, of course). but otherwise, < I'll simply try to find out where the algorithms are documented and < redo the work.
I wrote some tries just recently... I think I got the main idea from AICP - I've read about them in many algorithm books, but they usually make it seem much more complicated than it needs to be. These aren't too `fancy', but that's what I like about them. Ideally you would be able to specify the equality function and other such-nots...
(defun find-trie (key trie) (loop for code across key for newtrie = (follow-trie-no-extend code trie) then (follow-trie-no-extend code newtrie) while newtrie do (setq trie newtrie) finally (return trie)))