Pls Help: Box-Cox Transformation?

John Gallaugher

Aug 1, 2000
I've found the Box-Cox transformation macro on the SPSS site at:
http://www.spss.com/tech/answer/result.cfm?tech_tan_id=100000143
and I'm trying to get their simple example to run. I've entered the code below (copied &
pasted from the web site; I removed the excessive leading spaces and added a single space
on lines where the line above did not end in a period). I entered the file into the
Syntax editor, saved it, exited SPSS, and ran Production, selecting the file I just saved.

There is output, but the output looks problematic. If I scroll quickly to the end I see
that the plot shows an approach to a maximum point, then a sharp drop off at zero, along
with an approach to a second peak around .60. My understanding is that the frontier of
maximum likelihood calculations is supposed to be a single-peaked curve. In looking at
the regressions for the root of the problem I see that all of the regressions up until
Lambda = 0 (the first 12 or so) begin with a warning stating:
"The argument for the natural log function is less than or equal to zero. The result has
been set to the system-missing value. An attempt was made to compute an exponential of the
form 0 ** Y where Y was less than or equal to zero. The result has been set to the
system-missing value."

The regression for lambda = zero doesn't have any warning; the regressions after this (for
positive lambda) have the warning:
"The argument for the natural log function is less than or equal to zero. The result has
been set to the system-missing value."

I've never used macros or the production facility prior to today. My guess is that once I
get this working properly I should be able to adapt the code for my own use, but the
code on the SPSS site doesn't seem to function as one would expect. Any ideas?
Thanks,
John

Here's the code.
--
* Macro to generate squared residuals for aggregation.
* Use WRITE to restructure lambda & likelihood to graph.
Title 'Box-Cox transforms - data from Draper & Smith 81 , p228 '.
data list / p 1-2 f 4-5 visc 7-9.
begin data.
0 0 26
0 12 38
0 24 50
0 36 76
0 48 108
0 60 157
10 0 17
10 12 26
10 24 37
10 36 53
10 48 83
10 60 124
20 0 13
20 12 20
20 24 27
20 36 37
20 48 57
20 60 87
30 12 15
30 24 22
30 36 27
30 48 41
30 60 63
end data.

* Select cases with no missing vars - this is important for correct calculation of sample
size by AGGREGATE command later in program .
count nmiss = p f visc (missing).
select if (nmiss=0).
compute dummy=1.
compute logvisc=ln(visc).

* The following macro reads the name of the dependent var, the value of lambda, predictor
vars and var name for squared residuals to be computed by the macro.

DEFINE boxcox (dep=!TOKENS(1)
/ lambda=!ENCLOSE('(',')')
/indep=!ENCLOSE('[',']')
/ressq=!TOKENS(1)).
+ compute ll = !lambda .
+ do if (ll = 0) .
+ compute yt = ln(!dep).
+ else.
+ compute yt = (!dep**ll - 1)/ll.
+ end if.
+ regression /variables= !indep yt
/ dependent=yt /method=enter /save=resid(!ressq).
+ compute !ressq = !ressq**2.
!ENDDEFINE.

set printback=none /results=none.
boxcox dep=visc lambda=(-1.0) indep=[ p f ] ressq = resn100 .
boxcox dep=visc lambda=(-0.8) indep=[ p f ] ressq = resn080 .
boxcox dep=visc lambda=(-0.6) indep=[ p f ] ressq = resn060 .
boxcox dep=visc lambda=(-0.4) indep=[ p f ] ressq = resn040 .
boxcox dep=visc lambda=(-0.2) indep=[ p f ] ressq = resn020 .
boxcox dep=visc lambda=(-0.15) indep=[ p f ] ressq = resn015 .
boxcox dep=visc lambda=(-0.1) indep=[ p f ] ressq = resn010 .
boxcox dep=visc lambda=(-0.08) indep=[ p f ] ressq = resn008 .
boxcox dep=visc lambda=(-0.06) indep=[ p f ] ressq = resn006 .
boxcox dep=visc lambda=(-0.05) indep=[ p f ] ressq = resn005 .
boxcox dep=visc lambda=(-0.04) indep=[ p f ] ressq = resn004 .
boxcox dep=visc lambda=(-0.02) indep=[ p f ] ressq = resn002 .
boxcox dep=visc lambda=(0) indep=[ p f ] ressq = res000 .
boxcox dep=visc lambda=(0.05) indep=[ p f ] ressq = resp005 .
boxcox dep=visc lambda=(0.1) indep=[ p f ] ressq = resp010 .
boxcox dep=visc lambda=(0.2) indep=[ p f ] ressq = resp020 .
boxcox dep=visc lambda=(0.4) indep=[ p f ] ressq = resp040 .
boxcox dep=visc lambda=(0.6) indep=[ p f ] ressq = resp060 .
boxcox dep=visc lambda=(0.8) indep=[ p f ] ressq = resp080 .
boxcox dep=visc lambda=(1.0) indep=[ p f ] ressq = resp100 .

* Use AGGREGATE to sum the squared residuals and ln(y).

aggregate outfile = * / break=dummy /logvisc=sum(logvisc)
/nsize=n(dummy)
/resn100 resn080 resn060 resn040 resn020 resn015 resn010 resn008
resn006 resn005 resn004 resn002 res000 resp005 resp010 resp020
resp040 resp060 resp080 resp100 =
sum(resn100 resn080 resn060 resn040 resn020 resn015 resn010 resn008
resn006 resn005 resn004 resn002 res000 resp005 resp010 resp020
resp040 resp060 resp080 resp100).

set printback=listing /results=listing.

compute lkmxn100 = (-1/2)*nsize*ln(resn100/nsize) + ((-1)-1)*logvisc.
compute lkmxn080 = (-1/2)*nsize*ln(resn080/nsize) + ((-0.80)-1)*logvisc.
compute lkmxn060 = (-1/2)*nsize*ln(resn060/nsize) + ((-0.60)-1)*logvisc.
compute lkmxn040 = (-1/2)*nsize*ln(resn040/nsize) + ((-0.40)-1)*logvisc.
compute lkmxn020 = (-1/2)*nsize*ln(resn020/nsize) + ((-0.20)-1)*logvisc.
compute lkmxn015 = (-1/2)*nsize*ln(resn015/nsize) + ((-0.15)-1)*logvisc.
compute lkmxn010 = (-1/2)*nsize*ln(resn010/nsize) + ((-0.10)-1)*logvisc.
compute lkmxn008 = (-1/2)*nsize*ln(resn008/nsize) + ((-0.08)-1)*logvisc.
compute lkmxn006 = (-1/2)*nsize*ln(resn006/nsize) + ((-0.06)-1)*logvisc.
compute lkmxn005 = (-1/2)*nsize*ln(resn005/nsize) + ((-0.05)-1)*logvisc.
compute lkmxn004 = (-1/2)*nsize*ln(resn004/nsize) + ((-0.04)-1)*logvisc.
compute lkmxn002 = (-1/2)*nsize*ln(resn002/nsize) + ((-0.02)-1)*logvisc.
compute lkmx000 = (-1/2)*nsize*ln(res000/nsize) + (0-1)*logvisc.
compute lkmxp005 = (-1/2)*nsize*ln(resp005/nsize) + (0.05-1)*logvisc.
compute lkmxp010 = (-1/2)*nsize*ln(resp010/nsize) + (0.10-1)*logvisc.
compute lkmxp020 = (-1/2)*nsize*ln(resp020/nsize) + (0.20-1)*logvisc.
compute lkmxp040 = (-1/2)*nsize*ln(resp040/nsize) + (0.40-1)*logvisc.
compute lkmxp060 = (-1/2)*nsize*ln(resp060/nsize) + (0.60-1)*logvisc.
compute lkmxp080 = (-1/2)*nsize*ln(resp080/nsize) + (0.80-1)*logvisc.
compute lkmxp100 = (-1/2)*nsize*ln(resp100/nsize) + (1-1)*logvisc.

* Restructure data to plot by writing lambda and corresponding likelihoods.
write outfile=dummy records=20
/ '-1.0' lkmxn100 / '-.8' lkmxn080 / '-.6' lkmxn060 / '-.4' lkmxn040
/ '-.2' lkmxn020 / '-.15' lkmxn015 / '-.1' lkmxn010 / '-.08' lkmxn008
/ '-.06' lkmxn006 / '-.05' lkmxn005 / '-.04' lkmxn004 / '-.02' lkmxn002
/'0' lkmx000 / '.05' lkmxp005 / '.1' lkmxp010 / '.2' lkmxp020
/ '.4' lkmxp040 / '.6' lkmxp060 / '.8' lkmxp080 / '1' lkmxp100.
execute.
data list free file=dummy / lambda likmx.
list.
plot /plot=likmx with lambda.
* The original data must be reread to perform the chosen transformation and run regression
with the output that was suppressed in the box-cox runs.
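
For readers who want to check the arithmetic outside SPSS, here is a minimal Python sketch (not SPSS, and not a replacement for the macro) of what the syntax above appears to compute: for each lambda, transform visc, regress it on p and f, and evaluate -(n/2)*ln(RSS/n) + (lambda-1)*sum(ln(visc)), as in the compute lines. The function names solve, residual_ss and boxcox_loglik are made up for this illustration, and the data are typed in directly, so no column alignment is involved.

```python
import math

# Viscosity data as posted: (p, f, visc), 23 cases.
DATA = [
    (0, 0, 26), (0, 12, 38), (0, 24, 50), (0, 36, 76), (0, 48, 108), (0, 60, 157),
    (10, 0, 17), (10, 12, 26), (10, 24, 37), (10, 36, 53), (10, 48, 83), (10, 60, 124),
    (20, 0, 13), (20, 12, 20), (20, 24, 27), (20, 36, 37), (20, 48, 57), (20, 60, 87),
    (30, 12, 15), (30, 24, 22), (30, 36, 27), (30, 48, 41), (30, 60, 63),
]

def solve(a, b):
    """Solve a small linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def residual_ss(rows, y):
    """Residual sum of squares from least squares of y on an intercept, p and f."""
    X = [[1.0, float(p), float(f)] for p, f, _ in rows]
    k = 3
    xtx = [[sum(xi[i] * xi[j] for xi in X) for j in range(k)] for i in range(k)]
    xty = [sum(xi[i] * yi for xi, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, xi))) ** 2
               for xi, yi in zip(X, y))

def boxcox_loglik(rows, lam):
    """Log-likelihood value for one lambda, following the compute lines in the post."""
    visc = [float(v) for _, _, v in rows]
    n = len(visc)
    if lam == 0:
        yt = [math.log(v) for v in visc]
    else:
        yt = [(v ** lam - 1.0) / lam for v in visc]
    rss = residual_ss(rows, yt)
    return -0.5 * n * math.log(rss / n) + (lam - 1.0) * sum(math.log(v) for v in visc)

# Same lambda grid as the boxcox macro calls.
LAMBDAS = [-1.0, -0.8, -0.6, -0.4, -0.2, -0.15, -0.1, -0.08, -0.06, -0.05,
           -0.04, -0.02, 0, 0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0]

for lam in LAMBDAS:
    print(f"{lam:6.2f} {boxcox_loglik(DATA, lam):10.3f}")
```

Since visc is strictly positive here, neither the log nor the power transform should produce missing values for any lambda on the grid, so warnings about ln of a non-positive argument would point to the data being read differently from what is listed.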


--
John Gallaugher, Ph.D.
Assistant Professor of Information Systems
Wallace E. Carroll School of Management - Boston College
Fulton 352B
Chestnut Hill, MA 02467
E-mail: john.ga...@bc.edu
WWW: http://www2.bc.edu/~gallaugh/
Phone: 617-552-2519 / Fax: 617-552-0433

Rich Ulrich

Aug 2, 2000
On Tue, 01 Aug 2000 17:28:45 -0400, John Gallaugher
<john.ga...@bc.edu> wrote:

> I've found the Box-Cox transformation macro on the SPSS site at:
> http://www.spss.com/tech/answer/result.cfm?tech_tan_id=100000143
> and I'm trying to get their simple example to run. I've entered the code below (copied &
> pasted from the web site, I removed the excessive leading spaces and added a single space
> on lines where the line above did not end in a period. I've entered the file into the
> Syntax editor, saved it, exited SPSS, and ran Production, selecting the file I just saved.
>
> There is output, but the output looks problematic. If I scroll quickly to the end I see
> that the plot shows an approach to a maximum point, then a sharp drop off at zero, along
> with an approach to a second peak around .60. My understanding is that the frontier of
> maximum likelihood calculations is supposed to be a single peaked curve. In looking at
> the regressions for the root of the problem I see that all of the regressions up until
> Lambda = 0 (the first 12 or so) begin with a warning stating:
> "The argument for the natural log function is less than or equal to zero. The result has
> been set to the system-missing value. An attempt was made to compute an exponential of the
> form 0 ** Y where Y was less than or equal to zero. The result has been set to the
> system-missing value."

< snip, some detail >

When I ran it, I got a plot just like what you described as the
correct one.

Okay, the warnings: those show a serious problem. I borrowed your
code, and touched it up in an obvious way, and I did not have that
problem. The obvious way?

> Here's the code.
> --
> * Macro to generate squared residuals for aggregation.
> * Use WRITE to restructure lambda & likelihood to graph.
> Title 'Box-Cox transforms - data from Draper & Smith 81 , p228 '.
> data list / p 1-2 f 4-5 visc 7-9.
> begin data.
> 0 0 26
> 0 12 38
> 0 24 50
> 0 36 76
> 0 48 108

- See, that "data list" is supposed to describe each line.
It says that these data are in FIXED format, as follows.
p is in col. 1-2,
f is in 4-5, and
visc is in 7-9 - that is what those numbers mean.
=========look at my version of your data with a FIXED font


 0  0  26
 0 12  38
 0 24  50
 0 36  76
 0 48 108

...
=========and notice the exact alignment that you need.

I did not check any further; this seems to account for the problem.
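
To illustrate the point about fixed columns, here is a minimal Python sketch (not SPSS) that slices a record by the column ranges given in the DATA LIST line. The helper read_fixed is hypothetical, and treating an empty or unparseable field as missing is an assumption; SPSS's own handling of blanks inside a field may differ. The sketch only shows that values are taken from column positions, so a line that has lost its leading space yields different numbers.

```python
# Sketch of fixed-column reading, mimicking "data list / p 1-2 f 4-5 visc 7-9".
# read_fixed is a hypothetical helper, not an SPSS function.

def read_fixed(line):
    """Slice one record by 1-based inclusive column ranges."""
    columns = {"p": (1, 2), "f": (4, 5), "visc": (7, 9)}
    record = {}
    for name, (start, end) in columns.items():
        field = line[start - 1:end].strip()
        # Assumption: an empty or non-numeric field counts as missing (None).
        try:
            record[name] = float(field)
        except ValueError:
            record[name] = None
    return record

aligned = " 0 48 108"    # values right-justified in columns 1-2, 4-5, 7-9
collapsed = "0 48 108"   # same values with the leading space removed

print(read_fixed(aligned))    # {'p': 0.0, 'f': 48.0, 'visc': 108.0}
print(read_fixed(collapsed))  # {'p': 0.0, 'f': 8.0, 'visc': 8.0}
```

With the collapsed line, f and visc both come out as 8 instead of 48 and 108, which is the kind of silent misreading that the exact alignment prevents.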

--
Rich Ulrich, wpi...@pitt.edu
http://www.pitt.edu/~wpilib/index.html
