OT: Binary Tree Lookups for Dummies
flag
4 messages - Collapse all
/groups/adfetch?adid=EdwBlRAAAACJZMCBhzAsPIcS8POfHUFA
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
1.  "Choate, Paul@DDS"  
View profile  
 More options Apr 14 2006, 12:32 pm
Newsgroups: comp.soft-sys.sas
From: pcho...@DDS.CA.GOV ("Choate, Paul@DDS")
Date: Fri, 14 Apr 2006 09:32:15 -0700
Local: Fri, Apr 14 2006 12:32 pm
Subject: OT: Binary Tree Lookups for Dummies
One goals for the last SUGI was to learn about and start applying more optimal lookup and merging methods.  Of course Paul Dorfman's most excellent papers on hashing offering a most interesting place to start.

A couple weeks later, still not nearly caught up at work, but noodling a little about something heard - binary tree searching.  Hadn't thought about it much before, but the light in my brain flashed on for a brief second.

So here it is - my home-brewed binary tree lookup.  Sorry if this is too pre-school to any über geeks out there, please have mercy.

Any comments? Grossly missed efficiencies?  How this is worse (or better?) than datastep lookups out there in practice?

Does this always work?  Does the ceil function keeps me out of trouble, hmmmmm?  It worked on a few dozen tests.

The distance between my target and sample could be interpolated rather than simply halved with each iteration. Attempting interpolation seemed inelegant, and this had already had become uglier than anticipated.

Tests show that in a set of .6 million records it takes fewer than 20 lookups to find the target.

The data are sorted on UCI, a seven digit character string.

data found_it;
     find='0010524';
     *using Zeno's arrow to hit a target - iterate up & down by nobs/2**n;
     tries+1;
     direction+0;
     sample+sum(direction=0,direction)*(ceil(nobs/(2**tries))<>1);
     sample=sample><nobs; *if outside 1<->nobs reset to min or max;
     sample=sample<>1;

     set big_data(keep=uci) point=sample nobs=nobs;

     direction= sum(((uci>find)*(-1)),((uci<find)*1));
     if uci=find or ceil(nobs/(2**tries))=1 then do;
          record=i;
          if uci=find then output;
                        else put / / 'Not Found:' (_all_)(=);
          stop;
     end;
run;

TIA

Paul Choate
DDS Data Extraction
(916) 654-2160


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
2.  Nat Wooding  
View profile  
 More options Apr 14 2006, 12:38 pm
Newsgroups: comp.soft-sys.sas
From: Nathaniel_Wood...@DOM.COM (Nat Wooding)
Date: Fri, 14 Apr 2006 12:38:53 -0400
Local: Fri, Apr 14 2006 12:38 pm
Subject: Re: OT: Binary Tree Lookups for Dummies
Paul

I like the way you were able to stick an umlaut on the "u" of "uber"
(sadly, sans umlaut).

For grins, try you search using with your '0010524' in a format and if
statement. I seem to recall that SAS uses a binary tree search for looking
up format values.

Nat Wooding
the unter geek

             "Choate,
             Paul@DDS"
             <pcho...@DDS.CA.G                                          To
             OV>                       SA...@LISTSERV.UGA.EDU
             Sent by: "SAS(r)                                           cc
             Discussion"
             <SA...@LISTSERV.U                                     Subject
             GA.EDU>                   OT: Binary Tree Lookups for Dummies

             04/14/2006 12:32
             PM

             Please respond to
                 "Choate,
                 Paul@DDS"
             <pcho...@DDS.CA.G
                    OV>

One goals for the last SUGI was to learn about and start applying more
optimal lookup and merging methods.  Of course Paul Dorfman's most
excellent papers on hashing offering a most interesting place to start.

A couple weeks later, still not nearly caught up at work, but noodling a
little about something heard - binary tree searching.  Hadn't thought about
it much before, but the light in my brain flashed on for a brief second..

So here it is - my home-brewed binary tree lookup.  Sorry if this is too
pre-school to any über geeks out there, please have mercy.

Any comments? Grossly missed efficiencies?  How this is worse (or better?)
than datastep lookups out there in practice?

Does this always work?  Does the ceil function keeps me out of trouble,
hmmmmm?  It worked on a few dozen tests.

The distance between my target and sample could be interpolated rather than
simply halved with each iteration. Attempting interpolation seemed
inelegant, and this had already had become uglier than anticipated.

Tests show that in a set of .6 million records it takes fewer than 20
lookups to find the target.

The data are sorted on UCI, a seven digit character string.

data found_it;
     find='0010524';
     *using Zeno's arrow to hit a target - iterate up & down by nobs/2**n;
     tries+1;
     direction+0;
     sample+sum(direction=0,direction)*(ceil(nobs/(2**tries))<>1);
     sample=sample><nobs; *if outside 1<->nobs reset to min or max;
     sample=sample<>1;

     set big_data(keep=uci) point=sample nobs=nobs;

     direction= sum(((uci>find)*(-1)),((uci<find)*1));
     if uci=find or ceil(nobs/(2**tries))=1 then do;
          record=i;
          if uci=find then output;
                        else put / / 'Not Found:' (_all_)(=);
          stop;
     end;
run;

TIA

Paul Choate
DDS Data Extraction
(916) 654-2160

-----------------------------------------
CONFIDENTIALITY NOTICE:  This electronic message contains
information which may be legally confidential and/or privileged and
does not in any case represent a firm ENERGY COMMODITY bid or offer
relating thereto which binds the sender without an additional
express written confirmation to that effect.  The information is
intended solely for the individual or entity named above and access
by anyone else is unauthorized.  If you are not the intended
recipient, any disclosure,  copying, distribution, or use of the
contents of this information is prohibited and may be unlawful.  If
you have received this electronic  transmission in error, please
reply immediately to the sender that you have received the message
in error, and delete it.  Thank you.


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
3.  Richard A. DeVenezia  
View profile  
 More options Apr 14 2006, 2:19 pm
Newsgroups: comp.soft-sys.sas
From: "Richard A. DeVenezia" <rdevene...@wildblue.net>
Date: Fri, 14 Apr 2006 14:19:15 -0400
Local: Fri, Apr 14 2006 2:19 pm
Subject: Re: OT: Binary Tree Lookups for Dummies

A methodology using directions and power of 2 division certainly will work,
but the syntax is difficult to follow. Your use of 2**tries is sub-optimal,
exponentiation function is more cpu cycle consuming than starting with
size=1 and just multiplying by 2 in each iteration.

Interpolation or any other predictive algorithm is only beneficial if you
really know your data, or are willing to implement a system that records
aspects of the search behaviour (over many searches on some data of some
particular nature) and develop a predictor based on whatever aspects you
maintain.  Perhaps useful when performing millions of searches through
millions of lists having billions of items having some expected
distribution.

In a list of N sorted items, binary search can take no more than log2(N)
iterations to determine if a candidate value exists in the list.
I prefer to use indices to track the range being conquered.  The methodology
being implemented is very easy to follow.

An academic example:
--------------------------------------------
data haystack;
  do _n_ = 1 to 1000000;
*   array satellite[10];
    if ranuni(4242) < 0.84 then do;
      rownum + 1;
      uci = put(_n_,z7.);
      output;
    end;
  end;
run;

%let needle = 0010524;

data found_it;
  goto search;

seek:
    set haystack point=index nobs=nobs;
  return;

search:
  needle = '0010524';

  low_index = 1;
  index = low_index;
  link seek;
  low_value = uci;

  high_index = nobs;
  index = high_index;
  link seek;
  high_value = uci;

  do while (low_index <= high_index);

    * divide;
    mid_index = floor (( low_index + high_index ) / 2 );

    index = mid_index;
    link seek;
    mid_value = uci;

    * and conquer;
    if mid_value > needle then
      high_index = mid_index - 1;
    else
    if mid_value < needle then
      low_index = mid_index + 1;
    else
      leave;
  end;

  put low_index= high_index= mid_index= mid_value= needle=;

  if mid_value = needle then do;
    put needle= "found at " mid_index=;
    output;
  end;
  else
    put needle= "not found, search stopped at " mid_index=;

  stop;

  drop
    low_index mid_index high_index
    low_value mid_value high_value
    needle
  ;
run;
--------------------------------------------

Richard A. DeVenezia
http://www.devenezia.com/


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
4.  "Dorfman, Paul"  
View profile  
 More options Apr 17 2006, 11:55 am
Newsgroups: comp.soft-sys.sas
From: Paul.Dorf...@FCSO.COM ("Dorfman, Paul")
Date: Mon, 17 Apr 2006 11:55:08 -0400
Local: Mon, Apr 17 2006 11:55 am
Subject: Re: [not quite]OT: Binary Tree Lookups for Dummies
Paul,

First, I do not think this is OT at all. Second, your idea is sensible and sound; just a few comments.

It appears that you are searching an ordered [SAS] table rather than a binary search tree. Although the mechanism is conceptually precisely the same, a binary search tree as a data structure is not exactly equivalent to a mere ordered table, and one has to build it first in order to be able to search.

You are using a rather clever technique called [by some] *uniform* binary search using the powers of 2. This has the potential advantage of ridding the standard binary search of the necessity to divide by 2. Fine, but as Richard has pointed out, you introduce a bigger inefficiency by computing the powers on the fly inside the inner loop. However, the latter is unnecessary: simply *precompute* all foreseeable powers of 2 beforehand and store them in a small temporary array, then merely point the that array with your variable TRIES. For instance, you can declare

Array bin [0:32] _temporary_ (
          1            2            4
          8           16           32
         64          128          256
        512         1024         2048
       4096         8192        16384
      32768        65536       131072
     262144       524288      1048576
    2097152      4194304      8388608
   16777216     33554432     67108864
  134217728    268435456    536870912
 1073741824   2147483648   4294967296
   ) ;

and use BIN[TRIES] instead of 2**TRIES thereafter. You should not be surprised to observe TRIES never exceeding a 2-digit integer because the binary search never exceeds 1+log2(N) probes, hit or miss. If you have to search a file having more than 4294967296 rows, increase the size of the array above accordingly. The interpolation search you mention is blazingly fast with its 1+log2(log2(n)) maximal number of probes - however, the latter only holds if the interpolation functions follows the known distribution of keys (unknown beforehand) or the keys are known to be distributed uniformly, so the linear interpolation can be used, and  if the interpolation function can be computed sufficiently rapidly (i.e. not orders of magnitude slower than the division by 2).

Now the idea is of course not unheard of; in fact, it was outlined in the old SAS efficiency tips book of yore, although methought the implementation itself was a tad maladroit. And it is still the *best* way of fetching a few records out of an *already sorted* SAS data file, for then there is no reason to build an index or search sequentially (i.e. potentially read the entire file at least once). Also, it is a simple matter to use the binary search in order to collect all the rows belonging to duplicate keys - the job requiring a certain amount of trickery when going against a SAS index, especially when duplicates are present both among the lookup and search-for keys.

This topic has been discussed in depth on sas-l in the past: just try searching the archives for binary and/or interpolation search and you will come up with tons of links. I have just narrowed it by adding the 'dorfman' constraint and could not believe my eyes when it fetched me my own post circa 1997. As was said in a well-circulated anecdote, humans do not live that long!

Lastly, the binary trees. In pre-V9 SAS, those had been present only outside of the DATA step [in the form of the AVL trees] working quietly and productively behind all those MEAN, FREQ, TABULATE, et al. Now that the hash object has them transplanted into the DATA step as well, we can have even a *single* binary search tree, should there be a purist demand for such, by giving the HASHEXP parameter the value of 0, i.e. coding HASHEXP:0. However, I do not see any practical reason for doing so, and always code HASHEXP:16. The memory difference between the declarations

dcl hash h (hashexp:16) ;
dcl hash h (hashexp: 0) ;

is but the whole 500k, which is much more likely to seep through the cracks in Windows than to present any real memory problem in SAS.

Kind regards
------------
Paul Dorfman
Jax, FL
------------

TIA

Paul Choate
DDS Data Extraction
(916) 654-2160

First Coast Service Options, Inc., and its affiliates are not responsible for errors or omissions in the transmission of this e-mail message. Any personal comments made in this e-mail do not reflect the views of First Coast Service Options, Inc., or its affiliates. The information contained in this document may be confidential and is intended solely for the use of the individual or entity to whom it is addressed. This document may also contain material that is privileged or protected from disclosure under applicable law. If you are not the intended recipient or the individual responsible for delivery to the intended recipient, please (1) be advised that any use, dissemination, forwarding, or copying of this document IS STRICTLY PROHIBITED; and (2) notify sender immediately by telephone and destroy the document. Thank you.


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.

Create a group - Google Groups - Google Home - Terms of Service - Privacy Policy
©2009 Google