Newsgroups: comp.soft-sys.sas
From: pcho...@DDS.CA.GOV ("Choate, Paul@DDS")
Date: Fri, 14 Apr 2006 09:32:15 -0700
Local: Fri, Apr 14 2006 12:32 pm
Subject: OT: Binary Tree Lookups for Dummies
One goals for the last SUGI was to learn about and start applying more optimal lookup and merging methods. Of course Paul Dorfman's most excellent papers on hashing offering a most interesting place to start.
A couple weeks later, still not nearly caught up at work, but noodling a little about something heard - binary tree searching. Hadn't thought about it much before, but the light in my brain flashed on for a brief second. So here it is - my home-brewed binary tree lookup. Sorry if this is too pre-school to any über geeks out there, please have mercy. Any comments? Grossly missed efficiencies? How this is worse (or better?) than datastep lookups out there in practice? Does this always work? Does the ceil function keeps me out of trouble, hmmmmm? It worked on a few dozen tests. The distance between my target and sample could be interpolated rather than simply halved with each iteration. Attempting interpolation seemed inelegant, and this had already had become uglier than anticipated. Tests show that in a set of .6 million records it takes fewer than 20 lookups to find the target. The data are sorted on UCI, a seven digit character string. data found_it; set big_data(keep=uci) point=sample nobs=nobs; direction= sum(((uci>find)*(-1)),((uci<find)*1)); TIA Paul Choate You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
| ||||||||||||||
Newsgroups: comp.soft-sys.sas
From: Nathaniel_Wood...@DOM.COM (Nat Wooding)
Date: Fri, 14 Apr 2006 12:38:53 -0400
Local: Fri, Apr 14 2006 12:38 pm
Subject: Re: OT: Binary Tree Lookups for Dummies
Paul
I like the way you were able to stick an umlaut on the "u" of "uber" For grins, try you search using with your '0010524' in a format and if Nat Wooding "Choate, 04/14/2006 12:32 Please respond to One goals for the last SUGI was to learn about and start applying more A couple weeks later, still not nearly caught up at work, but noodling a So here it is - my home-brewed binary tree lookup. Sorry if this is too Any comments? Grossly missed efficiencies? How this is worse (or better?) Does this always work? Does the ceil function keeps me out of trouble, The distance between my target and sample could be interpolated rather than Tests show that in a set of .6 million records it takes fewer than 20 The data are sorted on UCI, a seven digit character string. data found_it; set big_data(keep=uci) point=sample nobs=nobs; direction= sum(((uci>find)*(-1)),((uci<find)*1)); TIA Paul Choate ----------------------------------------- You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
| ||||||||||||||
Newsgroups: comp.soft-sys.sas
From: "Richard A. DeVenezia" <rdevene...@wildblue.net>
Date: Fri, 14 Apr 2006 14:19:15 -0400
Local: Fri, Apr 14 2006 2:19 pm
Subject: Re: OT: Binary Tree Lookups for Dummies
A methodology using directions and power of 2 division certainly will work,
but the syntax is difficult to follow. Your use of 2**tries is sub-optimal, exponentiation function is more cpu cycle consuming than starting with size=1 and just multiplying by 2 in each iteration. Interpolation or any other predictive algorithm is only beneficial if you In a list of N sorted items, binary search can take no more than log2(N) An academic example: %let needle = 0010524; data found_it; seek: search: low_index = 1; high_index = nobs; do while (low_index <= high_index); * divide; index = mid_index; * and conquer; put low_index= high_index= mid_index= mid_value= needle=; if mid_value = needle then do; stop; drop Richard A. DeVenezia You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
| ||||||||||||||
Newsgroups: comp.soft-sys.sas
From: Paul.Dorf...@FCSO.COM ("Dorfman, Paul")
Date: Mon, 17 Apr 2006 11:55:08 -0400
Local: Mon, Apr 17 2006 11:55 am
Subject: Re: [not quite]OT: Binary Tree Lookups for Dummies
Paul,
First, I do not think this is OT at all. Second, your idea is sensible and sound; just a few comments. It appears that you are searching an ordered [SAS] table rather than a binary search tree. Although the mechanism is conceptually precisely the same, a binary search tree as a data structure is not exactly equivalent to a mere ordered table, and one has to build it first in order to be able to search. You are using a rather clever technique called [by some] *uniform* binary search using the powers of 2. This has the potential advantage of ridding the standard binary search of the necessity to divide by 2. Fine, but as Richard has pointed out, you introduce a bigger inefficiency by computing the powers on the fly inside the inner loop. However, the latter is unnecessary: simply *precompute* all foreseeable powers of 2 beforehand and store them in a small temporary array, then merely point the that array with your variable TRIES. For instance, you can declare Array bin [0:32] _temporary_ ( and use BIN[TRIES] instead of 2**TRIES thereafter. You should not be surprised to observe TRIES never exceeding a 2-digit integer because the binary search never exceeds 1+log2(N) probes, hit or miss. If you have to search a file having more than 4294967296 rows, increase the size of the array above accordingly. The interpolation search you mention is blazingly fast with its 1+log2(log2(n)) maximal number of probes - however, the latter only holds if the interpolation functions follows the known distribution of keys (unknown beforehand) or the keys are known to be distributed uniformly, so the linear interpolation can be used, and if the interpolation function can be computed sufficiently rapidly (i.e. not orders of magnitude slower than the division by 2). Now the idea is of course not unheard of; in fact, it was outlined in the old SAS efficiency tips book of yore, although methought the implementation itself was a tad maladroit. And it is still the *best* way of fetching a few records out of an *already sorted* SAS data file, for then there is no reason to build an index or search sequentially (i.e. potentially read the entire file at least once). Also, it is a simple matter to use the binary search in order to collect all the rows belonging to duplicate keys - the job requiring a certain amount of trickery when going against a SAS index, especially when duplicates are present both among the lookup and search-for keys. This topic has been discussed in depth on sas-l in the past: just try searching the archives for binary and/or interpolation search and you will come up with tons of links. I have just narrowed it by adding the 'dorfman' constraint and could not believe my eyes when it fetched me my own post circa 1997. As was said in a well-circulated anecdote, humans do not live that long! Lastly, the binary trees. In pre-V9 SAS, those had been present only outside of the DATA step [in the form of the AVL trees] working quietly and productively behind all those MEAN, FREQ, TABULATE, et al. Now that the hash object has them transplanted into the DATA step as well, we can have even a *single* binary search tree, should there be a purist demand for such, by giving the HASHEXP parameter the value of 0, i.e. coding HASHEXP:0. However, I do not see any practical reason for doing so, and always code HASHEXP:16. The memory difference between the declarations dcl hash h (hashexp:16) ; is but the whole 500k, which is much more likely to seep through the cracks in Windows than to present any real memory problem in SAS. Kind regards Paul Choate First Coast Service Options, Inc., and its affiliates are not responsible for errors or omissions in the transmission of this e-mail message. Any personal comments made in this e-mail do not reflect the views of First Coast Service Options, Inc., or its affiliates. The information contained in this document may be confidential and is intended solely for the use of the individual or entity to whom it is addressed. This document may also contain material that is privileged or protected from disclosure under applicable law. If you are not the intended recipient or the individual responsible for delivery to the intended recipient, please (1) be advised that any use, dissemination, forwarding, or copying of this document IS STRICTLY PROHIBITED; and (2) notify sender immediately by telephone and destroy the document. Thank you. You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
| ||||||||||||||
| Create a group - Google Groups - Google Home - Terms of Service - Privacy Policy |
| ©2009 Google |