How do the pitfalls of dichotomization relate to the use of classification and regression trees?
There are some clear differences - both in terms of how splits are found and how they are utilized - but there seems to be a fundamental similarity.
Differences:
1. In tree analysis, the nodes from the first split are treated separately.
2. Tree analysis almost always involves some form of validation; regression often does not.
3. Tree analysis often involves looking at multiple models (bagging, boosting, forests, etc.); regression rarely does.
Similarities:
1. Both use splits of the data, often dichotomous splits.
2. Cutpoints in regression models MAY be chosen based on the data and the bivariate relationships; in tree analysis, this is always done.
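To make the second similarity concrete, here is a minimal simulation sketch of what can go wrong when a cutpoint is chosen from the data. It is not taken from any particular analysis: the function names (`t_stat`, `false_positive_rates`), the sample size, the 10%-90% search range, and the critical value 1.984 (roughly the two-sided 5% point of a t distribution with 98 degrees of freedom) are all assumptions made for illustration. The data are generated with no relationship between x and y, so every "significant" split is a false positive.

```python
import random

def t_stat(sum_l, sq_l, n_l, sum_r, sq_r, n_r):
    """Pooled-variance two-sample t statistic from group sums and sums of squares."""
    mean_l, mean_r = sum_l / n_l, sum_r / n_r
    ss = (sq_l - n_l * mean_l ** 2) + (sq_r - n_r * mean_r ** 2)
    pooled_var = ss / (n_l + n_r - 2)
    return (mean_l - mean_r) / (pooled_var * (1 / n_l + 1 / n_r)) ** 0.5

def false_positive_rates(n=100, n_sim=500, crit=1.984, seed=4):
    """Return (rate with a fixed median split, rate with the best data-chosen split)."""
    rng = random.Random(seed)
    hits_fixed = hits_best = 0
    lo, hi = n // 10, n - n // 10  # candidate splits leave at least 10% on each side
    for _ in range(n_sim):
        # Null data: y is generated independently of x, then sorted by x.
        pairs = sorted((rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n))
        ys = [y for _, y in pairs]
        total, total_sq = sum(ys), sum(y * y for y in ys)
        left = left_sq = 0.0
        t_by_size = {}  # |t| for "lowest i values of x" versus the rest
        for i, y in enumerate(ys, start=1):
            left += y
            left_sq += y * y
            if lo <= i <= hi:
                t_by_size[i] = abs(
                    t_stat(left, left_sq, i, total - left, total_sq - left_sq, n - i)
                )
        hits_fixed += t_by_size[n // 2] > crit           # cutpoint fixed in advance
        hits_best += max(t_by_size.values()) > crit      # cutpoint chosen to maximize |t|
    return hits_fixed / n_sim, hits_best / n_sim

rate_fixed, rate_best = false_positive_rates()
print(f"false positive rate, median split fixed in advance: {rate_fixed:.2f}")
print(f"false positive rate, best cutpoint chosen from data: {rate_best:.2f}")
```

The fixed split should reject at close to the nominal 5%, while the data-chosen split rejects far more often. A tree does this search at every node, which is presumably part of why validation is treated as routine there.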
Any thoughts or references or what-have-you appreciated.
Peter
Peter L. Flom, PhD
Statistical Consultant
www DOT peterflomconsulting DOT com
Could people please make an effort to revert to responding in
plain text (Google will add the HTML version anyway ...).
Peter Flom's response below: quoted as received.
(I spare you the earlier one, almost unreadable, from Karl Schlag).
Thanks,
Ted.
-----Original Message-----
>From: Ted.Hard...@manchester.ac.uk
>Sent: Jun 25, 2009 1:10 PM
>To: MedStats@googlegroups.com
>Subject: HTML (was: RE: Splits and Trees (was Re: {MEDSTATS} Re: categorising BMI))
>Could people please make an effort to revert to responding in
>plain text (Google will add the HTML version anyway ...).
>Peter Flom's response below: quoted as received.
>(I spare you the earlier one, almost unreadable, from Karl Schlag).
>Thanks,
>Ted.
> Sorry about that ...
> I will try to remember to check. I think the default on my system
> is to reply in the same mode as the message I am replying to.
> Peter
Thanks, Peter! (Not that I was trying to "make an example" of you particularly -- you just happened to be "on top of the stack").
The evils of dichotomization are one of the reasons that recursive
partitioning fails unless you have an incredibly large sample size to
make up for the loss of information (the other problem is allowing for
all possible interactions, i.e., not using any additivity
assumptions). Much has been written about this. See my
bibliographic database at http://biostat.mc.vanderbilt.edu/rms near
the bottom of the page, and look for recursive partitioning or CART.
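As a rough illustration of the information loss from dichotomizing, here is a minimal simulation sketch. It is not from the references above: the helper names (`pearson_r`, `r2_continuous_vs_split`), the model y = x + noise, and the sample size are assumptions chosen only to keep the example small. It compares the R^2 obtained from a continuous predictor with the R^2 obtained after a median split of that same predictor.

```python
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation, written out to avoid any dependencies."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def r2_continuous_vs_split(n=20000, seed=1):
    """Simulate y = x + noise; return R^2 using x itself and using a median split of x."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [xi + rng.gauss(0.0, 1.0) for xi in x]
    median = sorted(x)[n // 2]
    # Dichotomize: 1 if above the sample median, 0 otherwise.
    x_split = [1.0 if xi > median else 0.0 for xi in x]
    return pearson_r(x, y) ** 2, pearson_r(x_split, y) ** 2

r2_cont, r2_split = r2_continuous_vs_split()
print(f"R^2 with continuous x:    {r2_cont:.3f}")
print(f"R^2 with median-split x:  {r2_split:.3f}")
print(f"fraction of R^2 retained: {r2_split / r2_cont:.2f}")
```

For a normally distributed predictor, theory says a median split retains about 2/pi (roughly 64%) of the R^2, so the sample size would need to grow by more than half just to break even. A tree makes a split like this at every node.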
Recursive partitioning seems to work on sample sizes less than 20,000
but this is usually a mirage. Bootstrapping reveals that the tree
architecture is really blowing in the wind.
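The bootstrap check can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the procedure used in any of the references: the functions (`best_split`, `simulate`, `bootstrap_first_splits`) are hypothetical, the data are simulated with three equally important predictors, and only the first split of a CART-style regression tree is examined (chosen by minimizing within-node sum of squared errors).

```python
import random
from collections import Counter

def best_split(rows):
    """Return (variable index, cutpoint) of the single split minimizing within-node SSE.

    rows is a list of (predictor tuple, y) pairs.
    """
    n = len(rows)
    p = len(rows[0][0])
    best = (float("inf"), None, None)
    for j in range(p):
        ordered = sorted(rows, key=lambda r: r[0][j])
        total = sum(y for _, y in ordered)
        total_sq = sum(y * y for _, y in ordered)
        left_sum = left_sq = 0.0
        for i in range(n - 1):
            y = ordered[i][1]
            left_sum += y
            left_sq += y * y
            xa, xb = ordered[i][0][j], ordered[i + 1][0][j]
            if xa == xb:
                continue  # cannot split between tied values (common in resamples)
            nl, nr = i + 1, n - i - 1
            sse_left = left_sq - left_sum ** 2 / nl
            sse_right = (total_sq - left_sq) - (total - left_sum) ** 2 / nr
            if sse_left + sse_right < best[0]:
                best = (sse_left + sse_right, j, (xa + xb) / 2)
    return best[1], best[2]

def simulate(n=200, seed=2):
    """Simulated data: three predictors with equal linear effects plus noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = tuple(rng.gauss(0, 1) for _ in range(3))
        y = 0.5 * x[0] + 0.5 * x[1] + 0.5 * x[2] + rng.gauss(0, 1)
        rows.append((x, y))
    return rows

def bootstrap_first_splits(rows, n_boot=200, seed=3):
    """Refit the first split on n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    return [best_split([rows[rng.randrange(n)] for _ in range(n)])
            for _ in range(n_boot)]

rows = simulate()
splits = bootstrap_first_splits(rows)
cuts = [c for _, c in splits]
print("first-split variable counts:", dict(Counter(j for j, _ in splits)))
print(f"cutpoint range across resamples: {min(cuts):.2f} to {max(cuts):.2f}")
```

If the chosen variable and cutpoint change from resample to resample at the very first split, everything below that split changes too, which is the instability being described.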
Frank
On Jun 25, 11:44 am, Peter Flom <peterflomconsult...@mindspring.com>
wrote:
> Thinking some more about this -
> How do the pitfalls of dichotomization relate to the use of classification and regression trees?
> There are some clear differences - both in terms of how splits are found and how they are utilized - but
> there seems to be a fundamental similarity.
> Differences:
> 1. In tree analysis, the nodes from the first split are treated separately
> 2. Tree analysis almost always involves some form of validation - regression often does not
> 3. Tree analysis often involves looking at multiple models (bagging, boosting, forests, etc.); regression rarely does
> Similarities
> 1. Both use splits of the data - often dichotomous splits
> 2. Cutpoints in regression models MAY be chosen based on the data and the bivariate relationships; in tree analysis, this is always done
> Any thoughts or references or what-have-you appreciated.
> Peter
> Peter L. Flom, PhD
> Statistical Consultant
> www DOT peterflomconsulting DOT com
-----Original Message-----
>From: Frank <f.harr...@vanderbilt.edu>
>Sent: Jun 28, 2009 9:35 AM
>To: MedStats <MedStats@googlegroups.com>
>Subject: Re: Splits and Trees (was Re: {MEDSTATS} Re: categorising BMI)
Peter L. Flom, PhD
Statistical Consultant
www DOT peterflomconsulting DOT com