
Transformation of a predictor with many zero values in logistic regression


Mattias

Apr 27, 2012, 7:24:08 AM
Dear group,
I need to transform one of my predictors in a logistic regression and
am wondering what the best solution could be. The predictor in
question measures landholding and is strongly positively skewed and
with a large number of zero values (52% of households own no land).
Having no land is of substantial importance.
I understand that one alternative is to add a constant to the
transformation, i.e.:

Compute logland=LG10(land+1).
execute.

I found another alternative suggested by someone in an internet
posting, which makes a lot of sense to me since zero landholding is of
importance, namely:

"to create two variables. One of them equals log(x) when x is nonzero
and otherwise is anything; it's convenient to let it default to zero.
The other, let's call it zx, is an indicator of whether x is zero: it
equals 1 when x=0 and is 0 otherwise. These terms contribute a sum

βlog(x)+β0zx

to the estimate. When x>0, zx=0 so the second term drops out leaving
just βlog(x). When x=0, "log(x)" has been set to zero while zx=1,
leaving just the value β0. Thus, β0 estimates the effect when x=0 and
otherwise β is the coefficient of log(x)."

I am wondering if this is an appropriate way to transform the variable
and if I have managed to get the syntax right. I set it up like this:

do if land EQ 0.
- compute zland=1.
- else.
- compute zland EQ 0.
- end if.
execute.
compute logland2=LG10(land+zland).
execute.

Both transformations of the predictor done in this way are highly
significant (logland at .004 and logland2 at .001) but the coefficient
for logland2 is somewhat smaller (logland is .628 and logland2 is
.517).

Furthermore, how do I interpret the coefficients of the two different
transformations?

Any help on this would be greatly appreciated.

Best,
Mattias

Rich Ulrich

Apr 28, 2012, 4:34:07 PM
On Fri, 27 Apr 2012 04:24:08 -0700 (PDT), Mattias
<mattia...@gmx.at> wrote:

>Dear group,
>I need to transform one of my predictors in a logistic regression and
>am wondering what the best solution could be. The predictor in
>question measures landholding and is strongly positively skewed and
>with a large number of zero values (52% of households own no land).
>Having no land is of substantial importance.
>I understand that one alternative is to add a constant to the
>transformation, i.e.:
>
>Compute logland=LG10(land+1).
>execute.

If the values of land are scaled far from zero -- numerically
in the thousands, say -- your results will be 0 for the zero
cases, and decimal values above 3 for the others, hardly
distinguishable from taking log(x). But if your "land values"
are themselves small integers, showing "thousands" implicitly,
you have a different scaling of your variable than taking the
simple log. If land-value of "500" is coded as "0.5", etc., and
some of these exist, then you definitely have changed the scaling
of the logged variable.

>
>I found another alternative suggested by someone in an internet
>posting, which makes a lot of sense to me since zero landholding is of
>importance, namely:
>
>"to create two variables. One of them equals log(x) when x is nonzero
>and otherwise is anything; it's convenient to let it default to zero.
>The other, let's call it zx, is an indicator of whether x is zero: it
>equals 1 when x=0 and is 0 otherwise. These terms contribute a sum
>
>βlog(x)+β0zx
>
>to the estimate. When x>0, zx=0 so the second term drops out leaving
>just βlog(x). When x=0, "log(x)" has been set to zero while zx=1,
>leaving just the value β0. Thus, β0 estimates the effect when x=0 and
>otherwise β is the coefficient of log(x)."
>
>I am wondering if this is an appropriate way to transform the variable
>and if I have managed to get the syntax right. I set it up like this:
>
>do if land EQ 0.
>- compute zland=1.
>- else.
>- compute zland EQ 0.

This is a hazardous way of computing zland=0.
What is evaluated is the logical Truth function,
"Is zland EQ 0?" If NO, the result is 0. If zland
were previously defined as 0, the result would be 1.

The simple way to establish zland values from land
values is RECODE land (0=1)(ELSE=0) INTO zland.
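
For anyone who wants to check the logic outside SPSS, here is a minimal Python sketch of the same two-variable construction. The function name `two_part` and the sample values are made up for illustration; it assumes land is non-negative and uses log10 to match the LG10 calls in the syntax above.

```python
import math

def two_part(land):
    """Return (zland, logland) for one non-negative land value.

    zland is 1 when land is zero, else 0. logland is log10(land) for
    positive land and is parked at 0 for zero land; any constant would
    do, because zland absorbs the difference in the regression.
    """
    if land < 0:
        raise ValueError("land must be non-negative")
    if land == 0:
        return 1, 0.0
    return 0, math.log10(land)

# Hypothetical land values, not the poster's data.
sample = [0, 0.1, 1, 10, 74.5]
rows = [two_part(x) for x in sample]
for x, (z, lg) in zip(sample, rows):
    print(f"land={x:>5}  zland={z}  logland={lg:.3f}")
```

Note that land = 0 and land = 1 both get logland = 0 here; they are told apart only by zland, which is why the indicator has to be in the model alongside the logged variable.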

>- end if.
>execute.
>compute logland2=LG10(land+zland).
>execute.
>
>Both transformations of the predictor done in this way are highly
>significant (logland at .004 and logland2 at .001) but the coefficient
>for logland2 is somewhat smaller (logland is .628 and logland2 is
>.517).

Is this the significance as tested in the logistic regression?
Did you run both examples with zland also as a predictor?
- If so, this much difference is surprising, unless your
log-scaling is changed in the way that I described above.

>
>Furthermore, how do I interpret the coefficients of the two different
>transformations?

I'm not sure that I follow the question. You can plug
in values and see what the predicted value is. "A 10-fold
increase in land value changes the log of the odds
by 0.517." That is, the odds are better/worse by a factor
of about 1.68 (e^0.517, since the logistic coefficient is on
the natural-log scale) for a 10-fold difference in Land.

I think I got that right. It's been a long time since I
fiddled with any of those actual numbers.
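
A quick check of this arithmetic in Python. It assumes the coefficient is on the natural-log-odds scale, which is how logistic regression output is normally reported (an assumption about the software, not something stated in the thread). Under that assumption the factor for a 10-fold difference is e^0.517 rather than 10^0.517.

```python
import math

b = 0.517  # coefficient reported in the thread for the log10-transformed predictor

# A 10-fold increase in land adds exactly 1 to log10(land),
# so the natural-log odds rise by b.
odds_ratio_10x = math.exp(b)

# A doubling of land adds log10(2) to the predictor.
odds_ratio_2x = math.exp(b * math.log10(2))

print(round(odds_ratio_10x, 2), round(odds_ratio_2x, 2))
```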

>
>Any help on this would be greatly appreciated.
>

--
Rich Ulrich

Mattias

May 2, 2012, 4:17:44 AM
Dear Rich,

Thank you for your reply. I realize that the second alternative – of
constructing two variables – is quite dodgy and I will stay away from
using it. It gives me negative values. I understand that your concern
that the addition of a constant 1 to the log transformation might have
changed the scaling is something I need to be careful with, but I am
not sure whether I have actually changed the scaling by adding a
constant. Perhaps you could help me understand.

The values for land range between 0 and 74.5 acres with a mean of 2.33
acres; skewness is 5.853, SD 7.313. The logland (i.e. Compute
logland=LG10(land+1)) variable ranges from 0 to 1.88, skewness 1.917,
SD .3872. It has the same frequency of zeroes (which is of importance
empirically) as the original land variable and the two variables
correlate perfectly.

Doesn’t this mean that I haven’t actually changed the scaling and that
I should keep the LG10(land+1) transformation, particularly since I
get the zero values in there?

Thanks
Mattias

Rich Ulrich

May 2, 2012, 2:34:52 PM
On Wed, 2 May 2012 01:17:44 -0700 (PDT), Mattias
<mattia...@gmx.at> wrote:

>Dear Rich,
>
>Thank you for your reply. I realize that the second alternative – of
>constructing two variables – is quite dodgy and I will stay away from
>using it.

? I pointed out that your SPSS syntax was dodgy. And I showed
you how to fix it. Using two variables is, almost assuredly, what
you want to use. You have no real notion how "0" scales, compared
to the other values.

> It gives me negative values.

[See below - convert units to Square Feet, and negative values -
which should not be a cause of concern in any case - will disappear.]


> I understand that your concern
>that the addition of a constant 1 to the log transformation might have
>changed the scaling is something I need to be careful with, but I am
>not sure whether I have actually changed the scaling by adding a
>constant. Perhaps you could help me understand.
>
>The values for land range between 0 and 74.5 acres with a mean of 2.33
>acres; skewness is 5.853, SD 7.313.

Okay, you have presented your units for the first time.
With a mean of 2.33 and a max of 74.5 -- and something
like (only) half zeroes -- you certainly have a lot of tiny
numbers to offset the influence of 74.5 on the mean.
Since they are "acres", I will assume here that many of
the values are written as decimal fractions.

Thus, if you add 1.0 to 0.1 and 0.4, you no longer have
values of logs consistent with the latter being 4 times as big
as the former. You are distorting the log-scale, which *should*
always show "4 times as big" as the same distance, whether
it is 0.1 to 0.4, or 1 to 4, or 10 to 40.

Adding 1 will make little difference if all the numbers are
already big. For instance, you could convert all your
"acres" to "square feet" or "square rods" -- and the small
sizes would then come out on a different scale than they do
when you add 1 to acres.
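
To see the distortion numerically, here is a small Python sketch (the helper name `gap` is made up for illustration). It measures the log10 distance for three "4 times as big" pairs: with plain logs in acres, with 1 added in acres, and with 1 added after converting to square feet.

```python
import math

SQFT_PER_ACRE = 43560  # standard conversion factor

def gap(a, b, scale=1.0, start=0.0):
    """Distance between a and b on the log10 scale, after converting
    units (scale) and optionally adding a 'start' constant before logging."""
    return math.log10(b * scale + start) - math.log10(a * scale + start)

pairs = [(0.1, 0.4), (1, 4), (10, 40)]  # each pair is "4 times as big"

plain_acres = [gap(a, b) for a, b in pairs]
started_acres = [gap(a, b, start=1.0) for a, b in pairs]
started_sqft = [gap(a, b, scale=SQFT_PER_ACRE, start=1.0) for a, b in pairs]

for row in zip(pairs, plain_acres, started_acres, started_sqft):
    print(row)
```

With plain logs every pair is the same distance apart, log10(4). Adding 1 to acres shrinks the 0.1-to-0.4 gap far more than the 10-to-40 gap, while adding 1 to square feet leaves all three gaps nearly untouched.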

Given that one outlier is more than 20 times the size of the
mean, I would be tempted to "Winsorize", after considering
whether or not all the interesting information was embodied in
that single case.


> The logland (i.e. Compute
>logland=LG10(land+1)) variable ranges from 0 to 1.88, skewness 1.917,
>SD .3872. It has the same frequency of zeroes (which is of importance
>empirically) as the original land variable and the two variables
>correlate perfectly.

Not "perfectly"; the Pearson correlation may be very high,
but this should serve as an object lesson in the fact that
"high" correlation is not the same as "perfect" correlation.

>
>Doesn’t this mean that I haven’t actually changed the scaling and that
>I should keep the LG10(land+1) transformation, particularly since I
>get the zero values in there?

--
Rich Ulrich

Mattias

May 3, 2012, 8:47:30 AM
Thanks again Rich,

I obviously completely misunderstood your first comment. I think, and
hope, that I now better understand after this second comment of yours.
Your comments are of great help. I have used the ‘two variables’
alternative with a zland variable and have tried rescaling into
different units before logging.

The transformations based on ‘larger’ units such as square feet or
square meter are highly significant when I also include the zland
variable in my models. The smaller the unit used the greater is the
p-value for zland, when both variables are included. When zland is not
included, only transformations using much smaller units such as square
rod and square chain are significant. Also, the smaller the unit used
in the transformation the stronger is the correlation with the
original land variable (for square feet = .480, for square meter =
.508 and for square rod = .575). If I understand correctly, I need to
stick with having both variables included as the only other
alternative is when a unit is used which is so small that I actually
change the scaling of the logged variable.

So, with sticking with a rescaling into square feet, what does it mean
that both the ‘loglandsqft’ and the zland variables are significant
and only when both are included? How do I interpret this (and I don’t
mean how to plug in values and get the predicted values)?

I think that ‘winsorizing’ would be inappropriate because even
within the 99th percentile there are cases with 74.5, 62, 60, 45 and
40 acres landholding. That also feels like I would be flubbing data.

Thanks
Mattias

Rich Ulrich

May 3, 2012, 4:31:07 PM
On Thu, 3 May 2012 05:47:30 -0700 (PDT), Mattias
<mattia...@gmx.at> wrote:

>Thanks again Rich,
>
>I obviously completely misunderstood your first comment. I think, and
>hope, that I now better understand after this second comment of yours.
>Your comments are of great help. I have used the ‘two variables’
>alternative with a zland variable and have tried rescaling into
>different units before logging.

Here is something extra that might further illuminate the
effect of transformations. Look at the simple scatterplots
between transformed versions of the area.

- When you are using square feet versus meters, the non-zero
scores will form a (visually) straight line, with the original "zero"
values falling off the line. For Square Feet versus Acres, there
will be visible curvature in the straight part; and the "zero"s
will be dislocated even more.

- In order to *eliminate* the artificial choice of scaling, while
preserving a convenient distinction between "zero" and the
rest of the scale (and to keep the values positive, just for
the sake of comfort), you could use the straight log(area)
for all positive areas, instead of (area+1); and 0 for zero.

>
>The transformations based on ‘larger’ units such as square feet or
>square meter are highly significant when I also include the zland
>variable in my models. The smaller the unit used the greater is the
>p-value for zland, when both variables are included. When zland is not
>included, only transformations using much smaller units such as square
>rod and square chain are significant.

(I keep stumbling when I read your "larger units" for square feet
and "smaller units" for square rod -- I know that you are pointing
to the size of the numbers that enter the regression, and not
1 sq ft. versus 1 sq. rod.)


If you try the scatter plots I suggested above, you will see that
the most prominent difference between these, used as predictors,
is the location of the original 0s. Your other results suggest to
me that the relative curvature favors sq. ft, if anything.

So, these results do suggest that the placement of 0 is more
"appropriate" as a predictor when it is closer to the logged
values, or even in the middle of them. - It often works out,
in my experience, that 0-cases are special like this.


> Also, the smaller the unit used
>in the transformation the stronger is the correlation with the
>original land variable (for square feet = .480, for square meter =
>.508 and for square rod = .575). If I understand correctly, I need to
>stick with having both variables included as the only other
>alternative is when a unit is used which is so small that I actually
>change the scaling of the logged variable.
>
>So, with sticking with a rescaling into square feet, what does it mean
>that both the ‘loglandsqft’ and the zland variables are significant
>and only when both are included? How do I interpret this (and I don’t
>mean how to plug in values and get the predicted values)?

"Having 0 land is equivalent, as a predictor, to having X amount
of land (when the areas are measured on a log-scale)."

Look at the coefficient for (No=0/ Yes=1). That's what you would
plug in for the log(area), since that is what the prediction
equation uses to achieve exactly equal predictions, since area
is recorded as 0 for those cases.

This is the reason for using the two variables: to locate where
the '0' seems to belong. If it is not simply, slightly beyond the
end of the observable values, that gives you something extra
to talk about.

>
>I think that ‘winsorizing’ would be inappropriate because even
>within the 99th percentile there are cases with 74.5, 62, 60, 45 and
>40 acres landholding. That also feels like I would be flubbing data.

Good thinking. You have more cases than I assumed. Yes, the
cases are more important when there are more of them. For
a linear regression, I would wonder how extreme a few outcomes
also may be, and whether these values have undue influence.

For this logistic regression, there are not any "extreme outcomes",
but there is still a potential problem of bias. If *all* the large
areas result in the same outcome, LR swallows up the problem;
overprediction becomes generally irrelevant. But relative
outliers like this (your skewness, even after logs, was still large)
otherwise have (potentially) undue influence on the prediction
equation. This concern with large outliers makes me wonder how
a stronger transformation would affect the results.

- The table of power transformations puts "log" at "0".
So, the power of 2 is squaring, and expands larger values;
the power of 1/2 is square root, and compacts larger values;
the power of 0 is log, and compacts them more; the power
of -1 is the reciprocal, and compacts them even more.

The reciprocal (or negative reciprocal, if you want to keep the
signs on the original correlations) is often used for simple
distances. I think I would try the reciprocal with your data.
Zero, again, would be set to an arbitrary value -- like "0",
even though that places it at the wrong end of the
transformed scale, but that does not matter when you
find its equivalent location by using two variables in the
regression.
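
The ladder described above can be tried out with a short Python sketch. The helper name `ladder` and the four sample areas are hypothetical; the sketch reports how much of the transformed range is taken up by the top step (10 to 74.5 acres) at each power.

```python
import math

def ladder(x, power):
    """Ladder-of-powers transform for x > 0; power 0 means log10."""
    if x <= 0:
        raise ValueError("x must be positive; handle zeros separately")
    return math.log10(x) if power == 0 else x ** power

# Hypothetical positive areas in acres spanning the range described in the thread.
areas = [0.1, 1.0, 10.0, 74.5]

shares = {}
for p in (2, 1, 0.5, 0, -1):
    values = [ladder(a, p) for a in areas]
    # Share of the total transformed range taken up by the step 10 -> 74.5.
    shares[p] = abs(values[-1] - values[-2]) / abs(values[-1] - values[0])
    print(f"power {p:>4}: top-step share of range = {shares[p]:.3f}")
```

The share falls at every step down the ladder, which is the sense in which each lower power compacts the large values more than the one before it.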

--
Rich Ulrich

David Marso

May 3, 2012, 6:40:29 PM
Poking in where I probably have no business ;-)
Since you are dealing with something in Square Acres/feet/whatever...
What about a Square Root Transform?
Then the whole business of 0's becomes completely irrelevant/inconsequential/moot/.....!!! ;-)))
------

Rich Ulrich

May 3, 2012, 9:39:14 PM
On Thu, 3 May 2012 15:40:29 -0700 (PDT), David Marso
<david...@gmail.com> wrote:

>Poking in where I probably have no business ;-)
>Since you are dealing with something in Square Acres/feet/whatever...
>What about a Square Root Transform?
>Then the whole business of 0's becomes completely irrelevant/inconsequential/moot/.....!!! ;-)))

[snip previous]

Mean 2.33, along with several points from 40 to 74.

The square root transform does not do much to temper that much
skew. Seeing those extremes is what led me to most recently
suggest reciprocal, in place of log.

--
Rich Ulrich

Mattias

May 4, 2012, 9:15:58 AM
Dear Rich,
I have looked at scatter plots and see what you are referring to. It
shows, for example, that the 0-cases in a reciprocal transformation
are very close to the transformed values. The distance is smaller than
for the logged variables (i.e. logsquarefeet etc.). When I run
separate regressions with the loglandsqft and the reciprocal
transformed land variable together with zland, the coefficient for
loglandsqft is .507 (sig. .001) and zland has a coefficient of 2.177
(sig. .003). Regression with reciprocal land gives a coefficient of
-.126 (sig. at .001) and a coefficient of -.326 (sig. .033) for zland.

What seems problematic though is that using a two variable solution
regardless of with a log or a reciprocal transformation means that
cases with zero land and cases with 1 acre land (2.3%) receive the
same value on the transformed scale unless I change the unit and then
also have a considerably larger distance between 0-cases and other
cases. To what extent is this a problem?

If it is a problem, as I believe it is, would it then be appropriate
to ‘arbitrarily’ change the values of the original 0-cases to zero
after the transformation? That also moves them away from the middle of
the transformed scale (even though I understand that this is not
really a problem since I can find its equivalent location by using two
variables in the regression). I did the following to get the
reciprocal transformation using two variables:

RECODE land(0=1)(else=0) into zland.
execute.
compute recipland=1/(land+zland).
execute.

In order to change the values of the 0-cases I tried the following
‘correction’:

if zland EQ 1 recipland=recipland-1.
execute.

An advantage is that 0-cases are then distinguished from cases with 1
acre land in the transformed scale. However, when I look at the
difference in scatter plots, I of course see that because I thereby
change the cases with 1 on the reciprocal scale (original land values
of zero) to zero, those cases are also moved further away from the
transformed scale.

I ran regressions with the two alternative definitions (with and
without the ‘correction’). When zland is included, the p-values and
the coefficients for the two versions are the same (sig. at .001,
coeff. -.126) but the p-value for zland is .033 (coeff. -.326)
together with the ‘uncorrected’ alternative and .008 (coeff. -.453)
together with the ‘corrected’. When zland is not included the p-value
is .005 (coeff. -.100) for the ‘uncorrected’ and .021 (coeff. -.074)
for the ‘corrected’ variable.

I think the reciprocal transformation – at least judging by the
smaller distance between 0-cases and transformed values – is indeed
the better option for me, but do I need to worry about the fact that
values for 0-cases become confounded with 1 acre cases in the
transformed scale? If so, am I correct in ‘correcting’ this the way I
have?

Thanks again,
Mattias


Rich Ulrich

May 4, 2012, 3:09:51 PM
On Fri, 4 May 2012 06:15:58 -0700 (PDT), Mattias
<mattia...@gmx.at> wrote:

>Dear Rich,
>I have looked at scatter plots and see what you are referring to. It
>shows, for example, that the 0-cases in a reciprocal transformation
>are very close to the transformed values. The distance is smaller than
>for the logged variables (i.e. logsquarefeet etc.). When I run
>separate regressions with the loglandsqft and the reciprocal
>transformed land variable together with zland, the coefficient for
>loglandsqft is .507 (sig. .001) and zland has a coefficient of 2.177
>(sig. .003).

To go back to your previous question, "What does this mean?"

The predictor equation has the same value when zland=1,
"0 area", as when loglandsqft = 2.177/.507 = 4.294.

That is, the equation, not counting the constant, is 2.177
both when it is y'= 0.507*4.294 + 2.177*0,
and when it is y'= 0.507*0 + 2.177*1.0 .

So, having '0' area is predictively equivalent to having
either 73 sq ft (if you used natural log) or 19,700 sq ft (if
you used log10). I would presume that the former is smaller
than your smallest actual area owned, where the latter is
the equivalent to quite a few city lots taken together.

- The meaning of the p-value for the 0/1 zland is an indicator
of how far from 2.177 the dummy coding for "0" places it.
That is, if you dummy-code the '0' as {choose 73 or 19,700},
the coefficient for zland will come out as ~0. To do that would
be an example of what is called "effect coding" the initial zeroes
while taking the log of the other values.
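
The arithmetic behind the "equivalent area" can be reproduced with a few lines of Python. The coefficients are the ones quoted above; the variable names are made up for illustration, and which base applies depends on which log was actually used.

```python
import math

b_log = 0.507    # coefficient for loglandsqft, as reported in the thread
b_zero = 2.177   # coefficient for zland, as reported in the thread

# Zero-land cases are coded loglandsqft = 0, zland = 1, so they contribute b_zero.
# A landholder (zland = 0) contributes the same when b_log * loglandsqft == b_zero.
equiv_log = b_zero / b_log

equiv_sqft_if_ln = math.exp(equiv_log)   # if the log was natural
equiv_sqft_if_log10 = 10 ** equiv_log    # if the log was base 10

print(round(equiv_log, 3), round(equiv_sqft_if_ln), round(equiv_sqft_if_log10))
```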

> Regression with reciprocal land gives a coefficient of
>-.126 (sig. at .001) and a coefficient of -.326 (sig. .033) for zland.
>
>What seems problematic though is that using a two variable solution
>regardless of with a log or a reciprocal transformation means that
>cases with zero land and cases with 1 acre land (2.3%) receive the
>same value on the transformed scale unless I change the unit and then
>also have a considerably larger distance between 0-cases and other
>cases. To what extent is this a problem?

I will try to be clearer here. "Started logs" -- adding a constant
before taking a log -- is *not* a highly recommended procedure.

In fact, it is a barely-tolerated procedure, for the statistical
practices that I am familiar with. The same goes for "started
reciprocal" and so on. If you add 1 to your square feet, the
distorting effect is trivial and ignorable, and it is a computational
convenience. - For clarity of description, to preserve the
simple logic, I would always use the not-added version, and set
the values for "0" separately. If you add 1 to your Acres when
there are fractional values, the distortion is much more severe,
and undermines the basic rationale for the transformation.

We prefer our units to be "natural" for an analysis. What is
measured is often natural, or a simple power transformation
is arguably "natural" for a particular case. An "added" version
is not natural unless you are fixing a bias in measurement.
(In biochemical assays, where 0 represents "undetected",
using the log is often "natural"; and replacing 0s with half the
lowest measurable value is often a viable solution.)

Transforming "to get a nicer distribution" is an argument that
I have used only as a last resort. - Least square statistics
are overly affected by outliers, for instance, when the bulk
of available variance (sum of squares around the mean) is
attributable to just a few of the cases. Your raw data are
like that. But your problem as described so far is only with
the *large* scores, and you have made no argument for
screwing with the "natural" transformation of the smallest
numbers.

>
>If it is a problem, as I believe it is, would it then be appropriate
>to ‘arbitrarily’ change the values of the original 0-cases to zero
>after the transformation? That also moves them away from the middle of
>the transformed scale (even though I understand that this is not
>really a problem since I can find its equivalent location by using two
>variables in the regression). I did the following to get the
>reciprocal transformation using two variables:
>
>RECODE land(0=1)(else=0) into zland.
>execute.
>compute recipland=1/(land+zland).
>execute.

As I say above - this terribly distorts the reciprocal-transform
when land is measured in acres. I won't say more about your
numbers for the reciprocal transform because I think you used
Acres.

>
>In order to change the values of the 0-cases I tried the following
>‘correction’:
>
>if zland EQ 1 recipland=recipland-1.
>execute.
>
>An advantage is that 0-cases are then distinguished from cases with 1
>acre land in the transformed scale. However, when I look at the
>difference in scatter plots, I of course see that because I thereby
>change the cases with 1 on the reciprocal scale (original land values
>of zero) to zero, those cases are also moved further away from the
>transformed scale.
>
>I ran regressions with the two alternative definitions (with and
>without the ‘correction’). When zland is included, the p-values and
>the coefficients for the two versions are the same (sig. at .001,
>coeff. -.126) but the p-value for zland is .033 (coeff. -.326)
>together with the ‘uncorrected’ alternative and .008 (coeff. -.453)
>together with the ‘corrected’. When zland is not included the p-value
>is .005 (coeff. -.100) for the ‘uncorrected’ and .021 (coeff. -.074)
>for the ‘corrected’ variable.
>
>I think the reciprocal transformation – at least judging by the
>smaller distance between 0-cases and transformed values – is indeed
>the better option for me, but do I need to worry about the fact that
>values for 0-cases become confounded with 1 acre cases in the
>transformed scale? If so, am I correct in ‘correcting’ this the way I
>have?


If these were my data, I would be concerned with a couple of
questions about the transformed values, excluding zeroes. How
skewed are these sets? Are there outliers at the bottom as
well as the top? What fraction of cases do look like outliers, at
either end?

What is the frequency of the outcome variable? (Mainly, I'm
wondering whether it is something rare -- in which case, outliers
may be of particular importance in the prediction -- or if it is
something near 50%, in which case, Winsorizing (re: previous
note) might still provide a way to make the regression more robust.)
- I have favored Winsorizing over "started" transforms, as being
easier to justify both in the subject matter and as a statistical
practice.

Now, as our outcome, I have pointed to 73 or 19700 sq ft as the
equivalent predictor value for 0, from the logsqft value. It would
be interesting to compute the corresponding figure for the reciprocal
transformation, and it would be nice if it came out to the same value,
or near it. Given the availability of computing power these days, it
is harder to argue, "We never looked at it that way," and easier to
argue "We looked at it both ways (or, several ways) and the
results came out blah-blah-blah, which you see is the same."
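
For what it is worth, the corresponding figure for the reciprocal can be sketched in Python from the coefficients reported earlier in the thread. This rests on several assumptions: land is in acres, landholders have recipland = 1/land, zero-land cases are coded recipland = 0 (the 'corrected' version), and the zland coefficient is the -.453 reported for that version.

```python
SQFT_PER_ACRE = 43560  # standard conversion factor

b_recip = -0.126   # coefficient for 1/land (acres), as reported earlier in the thread
b_zero = -0.453    # zland coefficient when zero-land cases are coded recipland = 0

# Same logic as for the log version: find the reciprocal value whose
# contribution b_recip * r equals the zero-land contribution b_zero.
equiv_recip = b_zero / b_recip
equiv_acres = 1.0 / equiv_recip
equiv_sqft = equiv_acres * SQFT_PER_ACRE

print(round(equiv_recip, 3), round(equiv_acres, 3), round(equiv_sqft))
```

Under those assumptions the equivalent holding is a bit over a quarter of an acre, the same order of magnitude as the 19,700 sq ft figure from the log10 version.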


--
Rich Ulrich

Mattias

May 8, 2012, 9:37:18 AM
Thanks again Rich, your comments are very helpful and I am learning a
lot.

I have tried taking the reciprocal transformation of land in square
feet, but end up getting warnings from SPSS (“Due to redundancies,
degrees of freedom have been reduced for one or more variables.”) and
the binary zland variable is dropped from the regressions. To my
understanding this is because the transformed variables are so close
to zero (largest landholding in acres = 74.5 = 3245220 square feet =
0.0000003081455 in reciprocal transformation). Either it does not work
taking the reciprocal transformation or I have not understood the
transformation procedure. Please help me clarify this.

I also tried taking the square root of land in square feet, based on
David’s suggestion. It gives me what I believe is an acceptable
distance between the 0-land cases and the first transformed values
based on the scatter plots and it saves me the problem of divisions
with zeroes and the more complicated interpretation with two
variables. I changed the unit to square feet also here to avoid
taking square roots of the many values below 1 when the unit is acre
(73% is below one acre). However, the sqrtlandsqft variable still has
a skew of 2.6 but a correlation of .896 with land in acres. It is also
significant (.009) with a coefficient of .001 in my models.

Am I correct in thinking that I therefore have two appropriate
alternative transformations to choose from, the square root
transformation and the two variable alternative with logged square
feet? How do I know which is more suited, apart from the lower level
of skew of the two variable alternative with logged square feet, and
the easier interpretation of the square root transformed variable?
What would be an appropriate argument for choice of transformation?

By the way, the frequency of my outcome is 54%. I have pasted in a
frequency table of my original land variable in acres.

Mattias


Land in acres

       Value  Frequency  Percent  Valid Percent  Cumulative Percent
Valid    ,00        824     56,3           56,3                56,3
         ,10         37      2,5            2,5                58,9
         ,20         63      4,3            4,3                63,2
         ,30         24      1,6            1,6                64,8
         ,40         35      2,4            2,4                67,2
         ,50         36      2,5            2,5                69,7
         ,60         16      1,1            1,1                70,7
         ,70          7       ,5             ,5                71,2
         ,80          8       ,5             ,5                71,8
         ,90         19      1,3            1,3                73,1
        1,00         33      2,3            2,3                75,3
        1,20         22      1,5            1,5                76,8
        1,30          8       ,5             ,5                77,4
        1,50          9       ,6             ,6                78,0
        1,80          6       ,4             ,4                78,4
        1,90          3       ,2             ,2                78,6
        2,00         45      3,1            3,1                81,7
        2,30          4       ,3             ,3                82,0
        2,50         10       ,7             ,7                82,6
        3,00         34      2,3            2,3                85,0
        3,10          8       ,5             ,5                85,5
        3,50          2       ,1             ,1                85,6
        4,00         27      1,8            1,8                87,5
        4,50          2       ,1             ,1                87,6
        5,00         44      3,0            3,0                90,6
        5,50          1       ,1             ,1                90,7
        6,00          9       ,6             ,6                91,3
        7,00          9       ,6             ,6                91,9
        8,00          6       ,4             ,4                92,3
        9,00          9       ,6             ,6                93,0
       10,00         16      1,1            1,1                94,1
       12,00         19      1,3            1,3                95,4
       13,00          6       ,4             ,4                95,8
       14,00          5       ,3             ,3                96,1
       15,00          2       ,1             ,1                96,2
       16,00          8       ,5             ,5                96,8
       17,00          5       ,3             ,3                97,1
       18,00          3       ,2             ,2                97,3
       20,00          5       ,3             ,3                97,7
       25,00          4       ,3             ,3                97,9
       26,00          3       ,2             ,2                98,2
       28,00          1       ,1             ,1                98,2
       30,00          2       ,1             ,1                98,4
       34,00          4       ,3             ,3                98,6
       35,00          1       ,1             ,1                98,7
       37,00          4       ,3             ,3                99,0
       40,00          3       ,2             ,2                99,2
       45,00          3       ,2             ,2                99,4
       60,00          2       ,1             ,1                99,5
       62,00          4       ,3             ,3                99,8
       74,50          3       ,2             ,2               100,0
Total              1463    100,0          100,0

Rich Ulrich

May 8, 2012, 10:33:46 PM
On Tue, 8 May 2012 06:37:18 -0700 (PDT), Mattias
<mattia...@gmx.at> wrote:

>Thanks again Rich, your comments are very helpful and I am learning a
>lot.

Before commenting on your notes here, I will say something
about the distribution that you show (below).

Taking the reciprocal does a *lot* to compress the whole
numeric range above 1.0, which is easiest to see when
looking directly at the transformation in acres. - The effect,
the shape, is going to be exactly the same when measured
by any unit, so long as you are not doing any add-on. From
the numbers that you show, I would think that the reciprocal
is far too strong for these data; but I don't know what it is
that you are trying to predict. If you imagine that symmetry
should characterize this prediction, the reciprocal of the square
root looks like it could be a candidate, since .1 maps into 3, the
median into about 1, and 74 into 0.1. That would be about
as skewed (I think - I haven't done them) as the log, but with
the opposite end being crunched.
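
Those three mappings are easy to verify with a Python sketch. The helper name `recip_sqrt` is made up, and 1.0 acre is used as a stand-in for "the median", following the rough figure above rather than a computed median.

```python
import math

def recip_sqrt(x):
    """Reciprocal of the square root, defined for x > 0 only."""
    return 1.0 / math.sqrt(x)

# Smallest nonzero holding, a stand-in for the median, and the largest holding.
for acres in (0.1, 1.0, 74.5):
    print(acres, round(recip_sqrt(acres), 2))
```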

>
>I have tried taking the reciprocal transformation of land in square
>feet, but end up getting warnings from SPSS (“Due to redundancies,
>degrees of freedom have been reduced for one or more variables.”) and
>the binary zland variable is dropped from the regressions. To my
>understanding this is because the transformed variables are so close
>to zero (largest landholding in acres = 74.5 = 3245220 square feet =

No. That is not what the error message means. "Redundancies"
says that you put the same variable into the equation twice.

Forget the add-on. Then you can readily use ACRES size.
What you should be computing for two variables goes like this.

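* zland flags the zero-land cases; their transformed value is arbitrary.
* Replace 1/land with whichever transformation you are testing.
DO IF (land EQ 0).
+ COMPUTE zland = 1.
+ COMPUTE transformed = 0.
ELSE.
+ COMPUTE zland = 0.
+ COMPUTE transformed = 1/land.
END IF.
EXECUTE.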
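
For readers outside SPSS, the same two-variable coding can be sketched in Python. The function name encode_land and its parameters are hypothetical, chosen only for this illustration; the reciprocal is used as the default transformation because it is the one under discussion.

```python
def encode_land(land, transform=lambda x: 1.0 / x, zero_value=0.0):
    """Return (transformed, zland) for one landholding value.

    zland is 1 for zero-land cases and 0 otherwise. The transformed value
    for zero-land cases is arbitrary; 0 is simply convenient.
    """
    if land == 0:
        return zero_value, 1
    return transform(land), 0

# Example with three hypothetical holdings in acres.
pairs = [encode_land(a) for a in [0, 0.5, 4.0]]
print(pairs)
```

Both columns then go into the model together, so the zero cases never pass through the transformation itself.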

>0,0000003081455 in reciprocal transformation). Either it does not work
>taking the reciprocal transformation or I have not understood the
>transformation procedure. Please help me clarify this.

"Redundancies" says you weren't doing what you thought.
- But if you want to try something else, you might do the
inverse of the square root.

>
>I also tried taking the square root of land in square feet, based on
>David’s suggestion. It gives me what I believe is an acceptable
>distance between the 0-land cases and the first transformed values
>based on the scatter plots and it saves me the problem of divisions
>with zeroes and the more complicated interpretation with two
>variables.

That means that you are *accepting* the position of 0 in
the transformed metric. With 0 being more than half the
values, I think I (as a reviewer) would insist that you try
the analysis with the second variable -- If nothing else, the
0/1 variable will be non-significant, showing that the scaling
happens to "work okay" and that 0/1 does *not* account
for much of what is observed.

> I changed the unit to square feet also here to avoid
>squaring the many values below 1 when the unit is acre (73% is below
>one acre). However, the sqrtlandsqft variable still has a skew of 2.6
>but a correlation of .896 with land in acres. It is also significant
>(.009) with a coefficient of .001 in my models.

Multiplying the values by a constant does not change the skew.
(I assume you stopped doing the add-on thing, because that
does mess things up.)
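
A quick numeric check of this point, as a Python sketch on made-up acreages (not the survey data): the skew of the square root is the same in acres and in square feet, while a log with a +1 add-on gives a different shape in each unit.

```python
import math

def skewness(values):
    """Population skewness: third central moment over the sd cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

SQFT_PER_ACRE = 43560
# Hypothetical holdings in acres, for illustration only.
acres = [0, 0, 0, 0.1, 0.3, 0.5, 1.0, 2.0, 5.0, 74.5]
sqft = [a * SQFT_PER_ACRE for a in acres]

# Square root: changing the unit only rescales the variable.
skew_sqrt_acres = skewness([math.sqrt(a) for a in acres])
skew_sqrt_sqft = skewness([math.sqrt(s) for s in sqft])

# Log with a +1 add-on: the result depends on the unit.
skew_log_acres = skewness([math.log10(a + 1) for a in acres])
skew_log_sqft = skewness([math.log10(s + 1) for s in sqft])

print(skew_sqrt_acres, skew_sqrt_sqft)
print(skew_log_acres, skew_log_sqft)
```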

>
>Am I correct in thinking that I therefore have two appropriate
>alternative transformations to choose from, the square root
>transformation and the two variable alternative with logged square
>feet? How do I know which is more suited, apart from the lower level
>of skew of the two variable alternative with logged square feet, and
>the easier interpretation of the square root transformed variable?
>What would be an appropriate argument for choice of transformation?

I have already said that you should test the 0/1, so the model is only
somewhat simpler if the 0/1 is not significant. If it *is*
significant, that does say that the scaling is not equal interval --
or else, it shows the more important fact that zeroes do not
belong in the same linear measure. There are 43,560 sq ft in
an acre. I recall that my previous figuring suggested two
numbers... and 19000 would be the equivalent to about 0.4 acres,
rather than some value less than your smallest observed value
of 0.1.


I mentioned in a previous note -- It would be good to "locate zero"
with two different scales, so that you can show how robust it is
(or whether it is robust), in scaling with the other values.
--
Rich Ulrich

Mattias

May 9, 2012, 8:28:29 AM
Dear Rich,

Yes, I definitely dropped the add-on thing. I understand that it
changes the scale which is something I wanted to avoid in the first
place.

My syntax for the reciprocal was incorrect, and when I use your
suggested syntax with acres as the unit I get good results. However, the
skewness of the reciprocal transformation is larger than that of the
square root transformation (3.5 compared to 2.6 for the square root).
The coefficient for recland is -.126 (sig. .001) and the coefficient
for zland is -.453 (sig. .008) when I include them in my model. I
don’t imagine that symmetry should characterize this prediction.

I tried to locate zero for the reciprocal scale, to compare with the
loglandsqft scale following your suggestion and example like this:

Loglandsqft scale: As before, the predictor equation has the same
value when zland=1 ("0 area") as when loglandsqft = 2.177/.507 = 4.294.

That is, the equation, not counting the constant, is 2.177
both when it is y' = 0.507*4.294 + 2.177*0,
and when it is y' = 0.507*0 + 2.177*1.0.

Reciprocal scale: The predictor equation has the same value when
zland=1 ("0 area") as when recland = -.453/-.126 = 3.595.

And the predictor equation has the value -0.453
both when it is y' = -0.126*3.595 - 0.453*0,
and when it is y' = -0.126*0 - 0.453*1.0.

Having ‘0’ acre would then be predictively equivalent to having 0.278
acres (1/3.595=0.278) on the reciprocal transformation scale, which is
not the same as the predicted ‘0’ area of 0.455 acres on the
loglandsqft scale.
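
The arithmetic for locating zero on both scales can be retraced with a short Python sketch. It uses only the rounded coefficients quoted above, plus 43,560 square feet per acre; the variable names are made up for the illustration.

```python
SQFT_PER_ACRE = 43560

# Rounded coefficients as quoted in the post.
b_log, b_zland_log = 0.507, 2.177      # model with loglandsqft and zland
b_rec, b_zland_rec = -0.126, -0.453    # model with recland (acres) and zland

# Value of the transformed predictor that contributes the same as zland = 1.
log_equiv = b_zland_log / b_log        # in log10(square feet)
rec_equiv = b_zland_rec / b_rec        # in 1/acres

# Back-transform each to acres.
zero_acres_log = 10 ** log_equiv / SQFT_PER_ACRE
zero_acres_rec = 1 / rec_equiv

print(f"log scale:        {log_equiv:.3f} -> {zero_acres_log:.3f} acres")
print(f"reciprocal scale: {rec_equiv:.3f} -> {zero_acres_rec:.3f} acres")
```

With these rounded inputs the log scale comes out at roughly 0.45 acres and the reciprocal scale at about 0.28 acres; small differences from the figures quoted above are to be expected from rounding of the coefficients.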

Since they differ, what is robust and what is not? Or if both are
robust, which is more robust? Both are larger than my smallest observed
value (0.1 acres) but 0.278 is obviously closer. From a substantive
standpoint 0.455 and 0.278 acres are similar in that both are very
small and smaller than the area needed for subsistence farming.

Furthermore, if I want to take the negative reciprocal, what do I do
with the zland variable? Do I reverse it as well? If not, the zland
coefficient has a negative sign. But if I do, the predicted value
equation gives me a negative .453 for the ‘0’ area (y' =
0.126*0 + 0.453*(-1.0)) and a positive .453 for the ‘not’ 0-area (y' =
0.126*3.595 + 0.453*0).
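
One way to look at this question, sketched in Python under the assumption that only the reciprocal is negated and zland is left as an ordinary 0/1 indicator: by ordinary regression algebra the recland coefficient should simply change sign, the zland coefficient should stay as it is, and every case should get the same contribution as before. This is a sketch from the quoted coefficients, not a rerun of the model.

```python
b_rec, b_zland = -0.126, -0.453   # coefficients quoted in the post

def eta_original(land):
    """Contribution to the linear predictor using recland = 1/land."""
    if land == 0:
        return b_rec * 0 + b_zland * 1
    return b_rec * (1 / land) + b_zland * 0

def eta_negated(land):
    """Same model with negrecland = -1/land; only that coefficient flips sign."""
    b_neg = -b_rec                # +0.126 (assumed, follows from negating the predictor)
    if land == 0:
        return b_neg * 0 + b_zland * 1
    return b_neg * (-1 / land) + b_zland * 0

# Compare both codings on a few hypothetical holdings (acres).
checks = [(a, eta_original(a), eta_negated(a)) for a in [0, 0.1, 0.278, 1.0, 74.5]]
for land, old, new in checks:
    print(f"{land:6.3f} acres: original {old:8.4f}, negated {new:8.4f}")
```

Under that assumption the negative sign on zland is nothing to correct: it is measured against the transformed value of 0, which under the reciprocal corresponds to an extremely large holding.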

With regard to the square root of land, when I include the square root
of land and zland (which I did not do earlier), the zland variable is
not significant.

I seem to have three very similar alternatives and I don’t know which
alternative is more appropriate for me.

Thanks again,
Mattias