On Fri, 4 May 2012 06:15:58 -0700 (PDT), Mattias <mattia...@gmx.at> wrote:
>Dear Rich,
>I have looked at scatter plots and see what you are referring to. It
>shows, for example, that the 0-cases in a reciprocal transformation
>are very close to the transformed values. The distance is smaller than
>for the logged variables (i.e. logsquarefeet etc.). When I run
>separate regressions with the loglandsqft and the reciprocal
>transformed land variable together with zland, the coefficient for
>loglandsqft is .507 (sig. .001) and zland has a coefficient of 2.177
>(sig. .003).
To go back to your previous question, "What does this mean?"
The predictor equation has the same value when zland=1,
"0 area", as when loglandsqft = 2.177/.507 = 4.294.
That is, the equation, not counting the constant, is 2.177
both when it is y'= 0.507*4.294 + 2.177*0,
and when it is y'= 0.507*0 + 2.177*1.0 .
So, having '0' area is predictively equivalent to having
either 73 sq ft (if you used natural log) or 19,700 sq ft (if
you used log10). I would presume that the former is smaller
than your smallest actual area owned, where the latter is
the equivalent to quite a few city lots taken together.
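
The arithmetic above can be checked with a short Python sketch. The two coefficients are the ones quoted in the post; the variable names are only illustrative, and which log base was actually used is not known, so both back-transforms are shown.

```python
import math

b_log = 0.507    # coefficient for loglandsqft (from the post)
b_zero = 2.177   # coefficient for zland (from the post)

# Log value at which the log term alone contributes as much as zland = 1.
equiv_log = b_zero / b_log

# Back-transform under each possible log base.
area_ln = math.exp(equiv_log)    # if natural log was used
area_log10 = 10 ** equiv_log     # if log10 was used

print(f"equivalent log value: {equiv_log:.3f}")
print(f"natural log: {area_ln:.0f} sq ft")
print(f"log10:       {area_log10:.0f} sq ft")
```
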
- The meaning of the p-value for the 0/1 zland is an indicator
of how far from 2.177 the dummy coding for "0" places it.
That is, if you dummy-code the '0' as {choose 73 or 19,700},
the coefficient for zland will come out as ~0. To do that would
be an example of what is called "effect coding" the initial zeroes
while taking the log of the other values.
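
A minimal sketch of that recoding effect, in Python, under loud assumptions: the data are invented and noise-free, the intercept of 1.0 is made up (the post does not give the constant), and `ols` and `design` are hypothetical helpers written for this illustration, not anything from SPSS. It fits the same outcome twice, once with the zero-land cases given a log value of 0 and once with them given the equivalent value 2.177/.507.

```python
def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        p = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[p] = a[p], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

# Invented, noise-free data built from the coefficients quoted in the post.
logs = [3.0, 3.5, 4.0, 4.5, 5.0, 5.5]   # log areas of the cases that have land
n_zero = 3                               # number of zero-land cases
y = [1.0 + 0.507 * v for v in logs] + [1.0 + 2.177] * n_zero

def design(zero_value):
    """Rows are [constant, log area, zland]; zero-land cases get zero_value."""
    return [[1.0, v, 0.0] for v in logs] + [[1.0, zero_value, 1.0]] * n_zero

coded_as_0 = ols(design(0.0), y)
coded_as_equiv = ols(design(2.177 / 0.507), y)

print("zeros coded 0:    ", [round(c, 3) for c in coded_as_0])
print("zeros coded 4.294:", [round(c, 3) for c in coded_as_equiv])
```

In this toy setup the slope for the log term is the same in both fits; only the zland coefficient moves, from 2.177 to essentially 0.
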
> Regression with reciprocal land gives a coefficient of -.126
>(sig. at .001) and a coefficient of -,326 (sig. .033) for zland.
>
>What seems problematic though is that using a two variable solution
>regardless of with a log or a reciprocal transformation means that
>cases with zero land and cases with 1 acre land (2.3%) receive the
>same value on the transformed scale unless I change the unit and then
>also have a considerably larger distance between 0-cases and other
>cases. To what extent is this a problem?
I will try to be clearer here. "Started logs" -- adding a constant
before taking a log -- is *not* a highly recommended procedure.
In fact, it is a barely-tolerated procedure in the statistical
practices that I am familiar with. The same goes for "started
reciprocals" and so on. If you add 1 to your square feet, the
distorting effect is trivial and ignorable, and it is a computational
convenience. - For clarity of description, to preserve the
simple logic, I would always use the not-added version, and set
the values for "0" separately. If you add 1 to your Acres when
there are fractional values, the distortion is much more severe,
and undermines the basic rationale for the transformation.
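
To make the unit dependence concrete, here is a small Python sketch comparing how far apart two lots sit on the log scale with and without the added 1. The two lot sizes are hypothetical, `log_gap` is a helper invented for this example, and natural logs are assumed; the only outside fact used is that one acre is 43,560 sq ft.

```python
import math

SQFT_PER_ACRE = 43560

def log_gap(small, large, start=0.0):
    """Difference between two lots on the log scale after adding `start`."""
    return math.log(large + start) - math.log(small + start)

# Two hypothetical lots: a tenth of an acre and one acre.
small_acres, large_acres = 0.1, 1.0
small_sqft = small_acres * SQFT_PER_ACRE
large_sqft = large_acres * SQFT_PER_ACRE

plain = log_gap(small_acres, large_acres)                    # same in either unit
started_sqft = log_gap(small_sqft, large_sqft, start=1)      # log(sqft + 1)
started_acres = log_gap(small_acres, large_acres, start=1)   # log(acres + 1)

print(f"plain log gap:      {plain:.3f}")
print(f"started log, sq ft: {started_sqft:.3f}")
print(f"started log, acres: {started_acres:.3f}")
```

The plain gap and the started gap in square feet are both about 2.30, while the started gap in acres shrinks to about 0.60, so the tenfold difference between the lots is heavily compressed.
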
We prefer our units to be "natural" for an analysis. What is
measured is often natural, or a simple power transformation
is arguably "natural" for a particular case. An "added" version
is not natural unless you are fixing a bias in measurement.
(In biochemical assays, where 0 represents "undetected",
using the log is often "natural"; and replacing 0s with half the
lowest measurable value is often a viable solution.)
Transforming "to get a nicer distribution" is an argument that
I have used only as a last resort. - Least squares statistics
are overly affected by outliers, for instance, when the bulk
of available variance (sum of squares around the mean) is
attributable to just a few of the cases. Your raw data are
like that. But your problem as described so far is only with
the *large* scores, and you have made no argument for
screwing with the "natural" transformation of the smallest
numbers.
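
One way to look at "the bulk of the sum of squares comes from a few cases" is sketched below in Python. The lot sizes are invented for illustration and `ss_share` is a hypothetical helper, so this shows only the kind of check meant, not anything about the actual data.

```python
def ss_share(values, k):
    """Fraction of the sum of squares around the mean due to the k most extreme cases."""
    mean = sum(values) / len(values)
    squared = sorted(((v - mean) ** 2 for v in values), reverse=True)
    return sum(squared[:k]) / sum(squared)

# Invented lot sizes in sq ft: mostly ordinary lots plus two very large ones.
areas = [4000, 5000, 5500, 6000, 6500, 7000, 8000, 9000, 150000, 400000]

print(f"share from the 2 most extreme cases: {ss_share(areas, 2):.1%}")
```
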
>
>If it is a problem, as I believe it is, would it then be appropriate
>to ‘arbitrarily’ change the values of the original 0-cases to zero
>after the transformation? That also moves them away from the middle of
>the transformed scale (even though I understand that this is not
>really a problem since I can find its equivalent location by using two
>variables in the regression). I did the following to get the
>reciprocal transformation using two variables:
>
>RECODE land(0=1)(else=0) into zland.
>execute.
>compute recipland=1/(land+zland).
>execute.
As I say above - this terribly distorts the reciprocal-transform
when land is measured in acres. I won't say more about your
numbers for the reciprocal transform because I think you used
Acres.
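
For readers who want to see where the recode puts the zero-land cases in each unit, here is a Python mirror of the quoted SPSS steps. It is a sketch only: `recip_with_flag` is a name made up here, the lot sizes are hypothetical, and the post does not confirm that land was in acres.

```python
SQFT_PER_ACRE = 43560

def recip_with_flag(land):
    """Mirror of the quoted recode: flag zero-land cases, then take 1/(land + zland)."""
    zland = 1 if land == 0 else 0
    return 1.0 / (land + zland), zland

for acres in (0.0, 0.25, 1.0, 5.0):   # hypothetical lot sizes
    r_acres, z = recip_with_flag(acres)
    r_sqft, _ = recip_with_flag(acres * SQFT_PER_ACRE)
    print(f"{acres:5.2f} acres  zland={z}  1/acres={r_acres:.4f}  1/sqft={r_sqft:.6f}")
```

In acres the zero-land cases share the value 1.0 with a 1-acre lot and sit between the quarter-acre and 5-acre lots. In square feet they also get 1.0, but every real lot has a tiny reciprocal, so the zeroes sit at the far end of the scale instead of in the middle.
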
>
>In order to change the values of the 0-cases I tried the following
>‘correction’:
>
>if zland EQ 1 recipland=recipland-1.
>execute.
>
>An advantage is that 0-cases are then distinguished from cases with 1
>acre land in the transformed scale. However, when I look at the
>difference in scatter plots, I of course see that because I thereby
>change the cases with 1 on the reciprocal scale (original land values
>of zero) to zero, those cases are also moved further away from the
>transformed scale.
>
>I ran regressions with the two alternative definitions (with and
>without the ‘correction’). When zland is included, the p-values and
>the coefficients for the two versions is the same (sig. at .001,
>coeff. -.126) but the p-value for zland is .033 (coeff. -,326)
>together with the ‘uncorrected’ alternative and .008 (coeff. -,453)
>together with the ‘corrected’. When zland is not included the p-value
>is .005 (coeff. -,100) for the ‘uncorrected’ and .021 (coeff. -,074)
>for the ‘corrected’ variable.
>
>I think the reciprocal transformation – at least judging by the
>smaller distance between 0-cases and transformed values – is indeed
>the better option for me, but do I need to worry about the fact that
>values for 0-cases become confounded with 1 acre cases in the
>transformed scale? If so, am I correct in ‘correcting’ this the way I
>have?
If these were my data, I would be concerned with a couple of
questions about the transformed values, excluding zeroes. How
skewed are these sets? Are there outliers at the bottom as
well as the top? What fraction of cases do look like outliers, at
either end?
What is the frequency of the outcome variable? (Mainly, I'm
wondering whether it is something rare -- in which case, outliers
may be of particular importance in the prediction -- or if it is
something near 50%, in which case, Winsorizing (re: previous
note) might still provide a way to make the regression more robust.
- I have favored Winsorizing over "started" transforms, as being
easier to justify both in the subject matter and as a statistical
practice.
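
As a reminder of what Winsorizing does, here is a minimal Python sketch. It uses a count-based rule (pull in the k smallest and k largest values), which is one common variant and an assumption here; the post does not say which rule or cutoff was meant, and the lot sizes are invented.

```python
def winsorize(values, k):
    """Pull the k smallest and k largest values in to their nearest retained neighbours."""
    if k == 0:
        return list(values)
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k is too large for this many values")
    low, high = ordered[k], ordered[-k - 1]
    # Clamp every value into [low, high], keeping the original order.
    return [min(max(v, low), high) for v in values]

areas = [900, 4000, 5000, 5500, 6000, 6500, 7000, 8000, 150000, 400000]  # invented sq ft
print(winsorize(areas, 2))
```

Unlike trimming, no case is dropped: the extreme values are kept but moved in to the nearest retained value, so the sample size is unchanged.
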
Now, as our outcome, I have pointed to 73 or 19700 sq ft as the
equivalent predictor value for 0, from the logsqft value. It would
be interesting to compute the corresponding figure for the reciprocal
transformation, and it would be nice if it came out to the same value,
or near it. Given the availability of computing power these days, it
is harder to argue, "We never looked at it that way," and easier to
argue "We looked at it both ways (or, several ways) and the
results came out blah-blah-blah, which you see is the same."
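
A sketch of that computation for the reciprocal model, in Python, using the coefficients quoted earlier for the 'uncorrected' coding (-.126 for the reciprocal variable, -.326 for zland). Two assumptions are built in and may not hold: that the zero-land cases carry a reciprocal value of 1, as the quoted recode implies, and that land was measured in acres when converting to square feet.

```python
SQFT_PER_ACRE = 43560

b_recip = -0.126      # coefficient for the reciprocal land variable (quoted above)
b_zero = -0.326       # coefficient for zland with the 'uncorrected' coding
recip_at_zero = 1.0   # the recode gives zero-land cases 1/(0 + 1) = 1

# Contribution of the land terms for a zero-land case, constant excluded.
zero_case = b_recip * recip_at_zero + b_zero * 1

# Reciprocal value, and hence area, at which a case with land predicts the same.
equiv_recip = zero_case / b_recip
equiv_area = 1.0 / equiv_recip   # in whatever unit `land` was measured in

print(f"equivalent reciprocal value: {equiv_recip:.3f}")
print(f"equivalent area: {equiv_area:.3f} "
      f"(about {equiv_area * SQFT_PER_ACRE:.0f} sq ft if land is in acres)")
```

If the acres assumption is right, the equivalent area is a bit over a quarter acre, which lies between the 73 and 19,700 sq ft figures from the log model.
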
--
Rich Ulrich