Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
Message from discussion OT: Binary Tree Lookups for Dummies
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Richard A. DeVenezia  
View profile  
 More options Apr 14 2006, 2:19 pm
Newsgroups: comp.soft-sys.sas
From: "Richard A. DeVenezia" <rdevene...@wildblue.net>
Date: Fri, 14 Apr 2006 14:19:15 -0400
Local: Fri, Apr 14 2006 2:19 pm
Subject: Re: OT: Binary Tree Lookups for Dummies

"Choate, Paul@DDS" wrote:
> One goals for the last SUGI was to learn about and start applying
> more optimal lookup and merging methods.  Of course Paul Dorfman's
> most excellent papers on hashing offering a most interesting place to
> start.

> A couple weeks later, still not nearly caught up at work, but
> noodling a little about something heard - binary tree searching.
> Hadn't thought about it much before, but the light in my brain
> flashed on for a brief second.

> So here it is - my home-brewed binary tree lookup.  Sorry if this is
> too pre-school to any über geeks out there, please have mercy.

> Any comments? Grossly missed efficiencies?  How this is worse (or
> better?) than datastep lookups out there in practice?

> Does this always work?  Does the ceil function keeps me out of
> trouble, hmmmmm?  It worked on a few dozen tests.

> The distance between my target and sample could be interpolated
> rather than simply halved with each iteration. Attempting
> interpolation seemed inelegant, and this had already had become
> uglier than anticipated.

> Tests show that in a set of .6 million records it takes fewer than 20
> lookups to find the target.

> The data are sorted on UCI, a seven digit character string.

> data found_it;
>      find='0010524';
>      *using Zeno's arrow to hit a target - iterate up & down by
>      nobs/2**n; tries+1;
>      direction+0;
>      sample+sum(direction=0,direction)*(ceil(nobs/(2**tries))<>1);
>      sample=sample><nobs; *if outside 1<->nobs reset to min or max;
>      sample=sample<>1;

>      set big_data(keep=uci) point=sample nobs=nobs;

>      direction= sum(((uci>find)*(-1)),((uci<find)*1));
>      if uci=find or ceil(nobs/(2**tries))=1 then do;
>           record=i;
>           if uci=find then output;
>                         else put / / 'Not Found:' (_all_)(=);
>           stop;
>      end;
> run;

A methodology using directions and power of 2 division certainly will work,
but the syntax is difficult to follow. Your use of 2**tries is sub-optimal,
exponentiation function is more cpu cycle consuming than starting with
size=1 and just multiplying by 2 in each iteration.

Interpolation or any other predictive algorithm is only beneficial if you
really know your data, or are willing to implement a system that records
aspects of the search behaviour (over many searches on some data of some
particular nature) and develop a predictor based on whatever aspects you
maintain.  Perhaps useful when performing millions of searches through
millions of lists having billions of items having some expected
distribution.

In a list of N sorted items, binary search can take no more than log2(N)
iterations to determine if a candidate value exists in the list.
I prefer to use indices to track the range being conquered.  The methodology
being implemented is very easy to follow.

An academic example:
--------------------------------------------
data haystack;
  do _n_ = 1 to 1000000;
*   array satellite[10];
    if ranuni(4242) < 0.84 then do;
      rownum + 1;
      uci = put(_n_,z7.);
      output;
    end;
  end;
run;

%let needle = 0010524;

data found_it;
  goto search;

seek:
    set haystack point=index nobs=nobs;
  return;

search:
  needle = '0010524';

  low_index = 1;
  index = low_index;
  link seek;
  low_value = uci;

  high_index = nobs;
  index = high_index;
  link seek;
  high_value = uci;

  do while (low_index <= high_index);

    * divide;
    mid_index = floor (( low_index + high_index ) / 2 );

    index = mid_index;
    link seek;
    mid_value = uci;

    * and conquer;
    if mid_value > needle then
      high_index = mid_index - 1;
    else
    if mid_value < needle then
      low_index = mid_index + 1;
    else
      leave;
  end;

  put low_index= high_index= mid_index= mid_value= needle=;

  if mid_value = needle then do;
    put needle= "found at " mid_index=;
    output;
  end;
  else
    put needle= "not found, search stopped at " mid_index=;

  stop;

  drop
    low_index mid_index high_index
    low_value mid_value high_value
    needle
  ;
run;
--------------------------------------------

Richard A. DeVenezia
http://www.devenezia.com/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.