Hi Molly,
Yes, you would write a function like
split(the_node_address, the_feature, the_threshold)
that takes a node (must be unsplitted) and performs the split from there, on that feature and threshold. This function should not be recursive; rather an outside wrapper/main/other-function calls the splits.
For part C (extra credit) the biggest challenge is to try all the splits without actually modifying the tree until you know which split is the best. So here's how I would do this:
First modify the function split above to have the ability to "simulate" the split, rather that crate it like below:
split(the_node_address, the_feature, the_threshold, do_or_simulate)
- crete the children, filter the data to left or right
- if really do the split: write the feature and the threshold at the node, then make the linkage (children to the node, and the node's left and right to children)
- if only simulate, calculate the value of the split, then delete the children, and do not modify the node.
This way you can call this function as "simulate" for all possible features and thresholds just to figure out the best split, then do it for real for that split. You would also need a que for the current terminal nodes addresses to handle the nodes that are to be splitted - once a split is done the new children (now terminal nodes) are added to that queue/list.
--virgil