Newsgroups: comp.soft-sys.sas
From: Paul.Dorf...@FCSO.COM ("Dorfman, Paul")
Date: Mon, 17 Apr 2006 11:55:08 -0400
Local: Mon, Apr 17 2006 11:55 am
Subject: Re: [not quite]OT: Binary Tree Lookups for Dummies
Paul,
First, I do not think this is OT at all. Second, your idea is sensible and sound; just a few comments. It appears that you are searching an ordered [SAS] table rather than a binary search tree. Although the mechanism is conceptually precisely the same, a binary search tree as a data structure is not exactly equivalent to a mere ordered table, and one has to build it first in order to be able to search. You are using a rather clever technique called [by some] *uniform* binary search using the powers of 2. This has the potential advantage of ridding the standard binary search of the necessity to divide by 2. Fine, but as Richard has pointed out, you introduce a bigger inefficiency by computing the powers on the fly inside the inner loop. However, the latter is unnecessary: simply *precompute* all foreseeable powers of 2 beforehand and store them in a small temporary array, then merely point the that array with your variable TRIES. For instance, you can declare Array bin [0:32] _temporary_ ( and use BIN[TRIES] instead of 2**TRIES thereafter. You should not be surprised to observe TRIES never exceeding a 2-digit integer because the binary search never exceeds 1+log2(N) probes, hit or miss. If you have to search a file having more than 4294967296 rows, increase the size of the array above accordingly. The interpolation search you mention is blazingly fast with its 1+log2(log2(n)) maximal number of probes - however, the latter only holds if the interpolation functions follows the known distribution of keys (unknown beforehand) or the keys are known to be distributed uniformly, so the linear interpolation can be used, and if the interpolation function can be computed sufficiently rapidly (i.e. not orders of magnitude slower than the division by 2). Now the idea is of course not unheard of; in fact, it was outlined in the old SAS efficiency tips book of yore, although methought the implementation itself was a tad maladroit. And it is still the *best* way of fetching a few records out of an *already sorted* SAS data file, for then there is no reason to build an index or search sequentially (i.e. potentially read the entire file at least once). Also, it is a simple matter to use the binary search in order to collect all the rows belonging to duplicate keys - the job requiring a certain amount of trickery when going against a SAS index, especially when duplicates are present both among the lookup and search-for keys. This topic has been discussed in depth on sas-l in the past: just try searching the archives for binary and/or interpolation search and you will come up with tons of links. I have just narrowed it by adding the 'dorfman' constraint and could not believe my eyes when it fetched me my own post circa 1997. As was said in a well-circulated anecdote, humans do not live that long! Lastly, the binary trees. In pre-V9 SAS, those had been present only outside of the DATA step [in the form of the AVL trees] working quietly and productively behind all those MEAN, FREQ, TABULATE, et al. Now that the hash object has them transplanted into the DATA step as well, we can have even a *single* binary search tree, should there be a purist demand for such, by giving the HASHEXP parameter the value of 0, i.e. coding HASHEXP:0. However, I do not see any practical reason for doing so, and always code HASHEXP:16. The memory difference between the declarations dcl hash h (hashexp:16) ; is but the whole 500k, which is much more likely to seep through the cracks in Windows than to present any real memory problem in SAS. Kind regards >Date: Fri, 14 Apr 2006 09:32:15 -0700 TIA >Reply-To: "Choate, Paul@DDS" <pcho...@DDS.CA.GOV> >Sender: "SAS(r) Discussion" <SA...@LISTSERV.UGA.EDU> >From: "Choate, Paul@DDS" <pcho...@DDS.CA.GOV> >Subject: OT: Binary Tree Lookups for Dummies >Content-Type: text/plain; charset="iso-8859-1" >One goals for the last SUGI was to learn about and start applying more optimal lookup and merging methods. Of course Paul >Dorfman's most excellent papers on hashing offering a most interesting place to start. >A couple weeks later, still not nearly caught up at work, but noodling a little about something heard - binary >So here it is - my home-brewed binary tree lookup. Sorry if this is too pre-school to any über geeks out there, >The distance between my target and sample could be interpolated rather than simply halved with each iteration. >Tests show that in a set of .6 million records it takes fewer than 20 lookups to find the target. >data found_it; > set big_data(keep=uci) point=sample nobs=nobs; > direction= sum(((uci>find)*(-1)),((uci<find)*1)); Paul Choate First Coast Service Options, Inc., and its affiliates are not responsible for errors or omissions in the transmission of this e-mail message. Any personal comments made in this e-mail do not reflect the views of First Coast Service Options, Inc., or its affiliates. The information contained in this document may be confidential and is intended solely for the use of the individual or entity to whom it is addressed. This document may also contain material that is privileged or protected from disclosure under applicable law. If you are not the intended recipient or the individual responsible for delivery to the intended recipient, please (1) be advised that any use, dissemination, forwarding, or copying of this document IS STRICTLY PROHIBITED; and (2) notify sender immediately by telephone and destroy the document. Thank you. You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
| ||||||||||||||