"Alf P. Steinbach" <
alf.p.stein...@gmail.com> writes:
> Part 3. This tutorial, if it works (it's an experiment), is intended
> to work this way:
>
> * I post some working code.
> * Learner(s) study it and ask about things.
> * Others answer questions and post critique or get bogged down in long
> sub-threads about sausages and swearing.
>
> Part 2 added logic to maintain a height difference value in each
> node. Namely, height of right subtree minus height of left
> subtree. One nice thing about a difference, as opposed to just storing
> the heights directly, is that when balancing is introduced the diff
> can be encoded in just 2 bits, e.g., for joy-joy hacking delight, in
> the least significant bits of the pointers ? but I'm not going to do
> that. :)
The code still doesn't discard (or simply count) duplicate values.
I really recommend that you do that, especially before you get to
rebalancing. (I have no other specific comments on that version
of the code so I snipped it.)
> Tim Rentsch suggested an alternative way to do the updating, using
> O(1) memory instead of O(log n) memory as in the code I posted. That's
> one reason for this posting. Another reason is that I'd like comments
> on my thoughts about balancing, outlined at the end here, before I
> code it up.
>
> ----------------------------------------------------------------------
>
> An implementation of Tim's suggested O(1) memory updating, which as I
> understand it is a common implementation:
It corresponds to what is done in the AVL algorithm given in
Knuth. I don't know how common it is, only that I thought it
worth pointing out.
> // Called after a new node with value `new_value` has been inserted:
> static void update_height_differences_down_from(
> Node* const highest_node_with_change,
> Value const& new_value
> )
> {
> Node* node = highest_node_with_change;
> while( node )
> {
> bool const go_left = (new_value < node->value);
> bool const is_leaf_node = not(node->left or node->right);
> if( not is_leaf_node )
> {
> node->height_diff += (go_left? -1 : +1);
> }
> node = (go_left? node->left : node->right);
> }
> }
Another way the termination test could be done is to pass in the
address of the newly created leaf node, and compare for it. That
would make the while() loop be something like
while( node != new_leaf ) ...
and it would allow the updating of node->height_diff to be lifted
up a level rather than being inside an if().
> void add( Value const& value )
> {
> Node** p_ptr = &root_;
> Node* lowest_newheight_absorber = nullptr;
> while( *p_ptr != nullptr )
> {
> Node*& ref_ptr = *p_ptr;
> bool const go_left = (value < ref_ptr->value);
> bool const go_right = not go_left;
> bool const right_is_higher = (ref_ptr->height_diff > 0);
> bool const left_is_higher = (ref_ptr->height_diff < 0);
>
> if( go_left and right_is_higher or go_right and left_is_higher )
> {
> lowest_newheight_absorber = ref_ptr;
> }
>
> p_ptr = &(go_left? ref_ptr->left : ref_ptr->right);
> }
>
> *p_ptr = new Node{ nullptr, nullptr, value, 0 };
> update_height_differences_down_from(
> lowest_newheight_absorber? lowest_newheight_absorber : root_,
> value
> );
> }
>
> I haven't measured these but the O(1) memory scheme avoids any dynamic
> allocation even for the most naive direct implementation, so for the
> case of non-optimized code it should be faster.
A downside of this approach is that the value comparisons (or some
of them anyway) are done a second time. An intermediate scheme,
which I kind of like, is to have a "stack" of bits reflecting the
left/right choices made past the point of the tipping point node.
The "stack" of bits can be stored in 2 or 3 'unsigned long' words,
which is more than enough to handle any feasible tree.
Incidentally, kind of a side comment, I find the code composition
style shown to be somewhat cumbersome. I don't know if you have
changed your normal style for tutorial reasons, and it's really
tangential to the main discussion anyway, but I thought it might
be good to mention it.
> It brings more complexity to the `add` member function, and simplifies
> the `update_height_differences`... function.
It turns out some sort of tracking like this will be important
when we get to rebalancing, so it's probably better to introduce
it now.
Also, you may find it helpful to split off the special case of
adding a value to an empty tree. The general case code has an
easier time of it if that code doesn't have to deal with the tree
being completely empty.
> ----------------------------------------------------------------------
>
> My thinking about balancing.
>
> I've yet to read up on this in Wikipedia. [...]
For those who are interested I will try to walk through a general
outline. If anyone wants to figure it out on their own then stop
reading here.
Here is a canonical tree that may need rebalancing after a new
value is inserted (and obviously there are the corresponding
mirror image cases):
20
/ \
/ \
10 40
/ \
/ \
30 50
Note: the tree is shown as complete, with '20' at the top, but
that isn't actually necessary for what follows. The '20' node
could be in the middle of the tree, and there could be additional
subtrees below the '10', '30', and '50' nodes, as long as:
one: the height of those three leaf nodes is all the same; and
two: all nodes after '20' along the insertion path are "level",
ie, their left and right subtrees have equal heights.
So, if we're going to insert a new value, there are three cases:
somewhere under the 10 (eg, 5 or 15), somewhere under the 30 (eg,
25 or 35), or somewhere under the 50 (eg 45 or 55). Let's look at
the "under 10" case first. The new tree looks like this:
20
/ \
/ \
10 40
.... / \
/ \
30 50
where the .... means something got added somewhere in the left or
right subtree of the 10. The "balance" factor of the 10 node will
have been adjusted already, by virtue of being downstream from the
"tipping point" node, which is the 20 node in this case. The
height of 20's left side has grown by one, so the 20 node needs to
be marked as "level", but nothing else needs to happen.
Now let's look at the "under 50" case. The new tree looks like
this:
20
/ \
/ \
10 40
/ \
/ \
30 50
....
For sure this tree needs to be rebalanced. After inserting a new
value, rebalancing an AVL tree is a local operation, involving
just a few nodes in the interior of the tree. Here we have the
easier of the two cases where rebalancing needs to happen.
Typically the transformation done is described as a "rotation".
Conceptually it could be described thusly:
one: move the 20 node down and to the left;
two: move the 40 node up to where the 20 node used to be;
three: change the left link in the 40 node to point to the
20 node, and set the right link of the 20 node (which
used to point to the 40 node) to point to the 30 node.
Here is the post-rotation picture:
40
/ \
/ \
20 50
/ \ ....
/ \
10 30
To adjust the balance factors, the 20 node and the 40 node both
need to be set to "level". (The balance on the 50 node will have
been adjusted already, according to whether the insertion happened
on the left or the right.) Whatever parent pointer used to point
to the 20 node needs to be set to point to the 40 node. The
parent pointer could be a left- or right- pointer in what was 20's
parent node, or it could the root_ pointer if the 20 node was at
the top of the tree.
Now let's look at the "under 30" case. Here is what the tree
looks like after the new value is inserted:
20
/ \
/ \
10 40
/ \
/ \
30 50
....
Suppose we try the same rotation we did for the "under 50" case.
The resulting tree would be:
40
/ \
/ \
20 50
/ \
/ \
10 30
....
Obviously that didn't help - the tree shown has the same out-of-
balance problem as the original, except on the left instead of on
the right. We need to do something different, but what?
It's common to see the transformation needed here described as a
"double rotation", but I think it might be easier to understand if
described like this:
one: reach down, grab the 30 node, pull it up to the top;
two: set 30's left link to point to the 20 node, 30's right
link to point to the 40 node;
three: set 20's right link to point to the original value of
30's left link, and 40's left link to point to the
original value of 30's right link
Here is the resulting picture:
30
/ \
/ \
20 40
/ .? ?. \
/ \
10 50
Note the .... that was under the 30 got split, with half of it put
under the 20, and half of it put under the 40. If the 30 node was
previously a leaf, only one of those two pointers will be
non-null.
For balance factors: the 30 node should be set level; the 20
node should be set level if the insertion happened down 30's left
side, and "leftish" otherwise; the 40 node should be set level if
the insertion happened down 30's right side, and "rightish"
otherwise.
And of course whatever parent node was previously pointing to 20
needs to be set to point to the 30 node.
NOTE: The discussion above assumes all nodes along the insertion
path and downstream of the 20 node have had their balance factors
adjusted already. Nodes upstream of the 20 node all keep the same
balance factors they had before the insertion.