Having looked through the manual, and the FAQ, I can find no simple way
of entering data in a matrix, having the class boundary in one column
and the frequency in the second, then calculate the standard deviation
of the data.
I can use the frequencies section to give Sum of xy (i.e. f.x), but not
Sum of f.x^2.
The only way I can see to do this is extract the x column from the
matrix, convert it to a list, square it, then convert the frequencies
into a list and multiply the two together. I can then calculate the
standard deviation.
Is there an easier way, with less opportunity for human error than
this? It is such a common statistical application that I feel I must be
missing something!
Thanks
--
Every accomplishment, great or small, begins with the right decision: "I'll Try"
Andrew Murray
*** http://www.ngmint.demon.co.uk ***
Sounds pretty good to me. Just write a program to do exactly what you said.
But, let the hp multiply and get the sum of f.x^2.
To do that:
Take out the x column using COL-, convert it to a list, square the list,
convert it to a vector, and put it back into [Sigma]DAT. Then, by computing
[Sigma]XY, you will find the sum of f times x^2.
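For anyone who wants to check the arithmetic of that route without a
calculator to hand, here is a minimal Python sketch of the same idea
(the mid-points and frequencies are hypothetical sample data, and the
variable names are my own, not HP48 commands):

```python
# Sketch of the "square the x column, multiply by the frequencies" route:
# compute Sum f, Sum f.x and Sum f.x^2, then the standard deviation.
mids  = [2.0, 4.0, 6.0]    # class mid-points (the x column)
freqs = [3, 5, 2]          # frequencies (the f column)

sum_f   = sum(freqs)
sum_fx  = sum(f * x for x, f in zip(mids, freqs))
sum_fx2 = sum(f * x * x for x, f in zip(mids, freqs))  # the missing Sum f.x^2

mean = sum_fx / sum_f
# population variance via the computational formula Sum f.x^2 / n - mean^2
var_pop = sum_fx2 / sum_f - mean ** 2
sd_pop = var_pop ** 0.5
```

Note that this computational formula can lose accuracy when the mean is
large relative to the spread; the updating method discussed later in the
thread avoids that.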
Okay, sorry, MID-POINT of the group. (standard stats frequency table)
And also, I have a HP-48G.
> Having looked through the manual, and the FAQ, I can find no simple way
> of entering data in a matrix, having the class boundary in one column
> and the frequency in the second, then calculate the standard deviation
> of the data. I can use the frequencies section to give Sum of xy (i.e. f.x),
> but not Sum of f.x^2. It is such a common statistical application that
> I feel I must be missing something!
[then a later follow-up/correction]:
> Okay, sorry, MID-POINT of the group. (standard stats frequency table)
> And also, I have a HP-48G.
The built-in "Frequencies" stat application (employing the BINS command)
assumes that every individual data point is already represented *separately*
in the SumDAT matrix, and then *computes* the frequencies of uniform-width
class intervals (bins) you specify, based on the *individual* data points
which you have previously entered.
The "Summary stats" are also intended for use on a statistical matrix in
which every individual point has been entered.
What you describe above appears to be the exact opposite of "bins"; i.e.
you specify the mid-points of a set of bins and the count of each bin,
and then you want to re-create the individual points and find their
original standard deviation. This is not precisely possible if each
"bin" represents a range of possible values, since some information
about the original individual points is lost, but it can be done under
the assumption that all the points in each bin have the same value.
While many generations of Casio scientific calculators offering a
Standard Deviation data entry mode have offered a means to specify
the number of occurrences of each value, none of the HP calculators
I have ever used has done so; the only way to use the built-in
functions to do the computation has been to repeat the entry of
an individual data point the required number of times; if there
are not too many total points (sum of all frequencies), then
repeating the rows in the SumDAT matrix according to each
(necessarily integer) frequency will accomplish the task.
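The row-repetition idea can be sketched in Python (hypothetical sample
data; the point is just that once each row is repeated, ordinary
one-variable statistics apply directly):

```python
# Expand each (value, frequency) row into `frequency` copies of the value,
# then use ordinary single-variable statistics on the expanded data.
# Frequencies must be integers for this method.
data = [(2.0, 3), (4.0, 5), (6.0, 2)]   # (mid-point, frequency) rows

expanded = [x for x, f in data for _ in range(f)]

n = len(expanded)
mean = sum(expanded) / n
var_pop = sum((x - mean) ** 2 for x in expanded) / n
```

The obvious drawback, as noted above, is that the expanded matrix grows
with the sum of all frequencies, which is wasteful on a calculator.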
It would be a modest task in User-RPL for the G/GX to create a generalized
function whose input would be a matrix and a column number (the column
containing the frequencies) and whose output would be a matrix whose
rows were repeated according to the frequencies, with the original
frequency column itself deleted. The result would then be ready
for use with all the other standard built-in statistical functions.
Another approach along similar lines, but mimicking the data entry approach
used with older HP calculators, would be to write a variant of the Sum+
command (which you find still present in the HP48 STAT DATA menu) which
would accept one more stack argument (the frequency) and repeat the
built-in Sum+ command the given number of times; this is most elementary
for input data which is only single real values or vectors, and slightly
trickier to generalize to accept "vectors" in the form of multiple real
values on separate stack levels (as the HP48 Sum+ command itself accepts);
however, you seem to be interested only in the standard deviation of a
single list of real values, which is the easiest one to implement.
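The Sum+-variant idea translates to something like the following Python
sketch: an accumulator whose add(x, f) simply repeats the single-point
accumulation f times, just as the proposed command would repeat the
built-in Sum+. The class and method names are my own illustration, not
anything from the HP48:

```python
# A "Sum+ with a frequency argument": repeat the ordinary one-point
# accumulation f times per call (integer frequencies only).
class FreqAccumulator:
    def __init__(self):
        self.n = 0          # number of points accumulated
        self.sum_x = 0.0    # running Sum x
        self.sum_x2 = 0.0   # running Sum x^2

    def add(self, x, f=1):
        for _ in range(f):  # mimic repeating the built-in Sum+ f times
            self.n += 1
            self.sum_x += x
            self.sum_x2 += x * x

    def mean(self):
        return self.sum_x / self.n

    def sdev(self):
        # sample standard deviation from the running sums
        v = (self.sum_x2 - self.sum_x ** 2 / self.n) / (self.n - 1)
        return v ** 0.5
```

Usage would be e.g. `acc.add(4.0, 5)` for a value of 4.0 with frequency 5.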
The Sum+ approach to stat data entry eliminates using the Matrix Writer
at all, which carries the added blessing of being a whole lot faster!
(Does anyone remember editing text using MSDOS with the TED editor, a mere
4096-byte program offering full-screen editing, cut/paste/search/undo/etc?
Remember this sometime while waiting for Word6 to start on Win95 with 16MB).
Finally, you could also simply abandon the use of built-in functions
(as you seem to be choosing to do anyway) and write your own function
to compute the sums and calculate the standard deviation. If you
wish to let the HP48 help you to do this using the built-in SumXY
statistical function, then you could append a final column to the matrix
containing the square of each data value, as suggested elsewhere, or you
could just sum all the values yourself during one pass through the list.
Whatever approach you choose simply requires a bit of programming on your
part to automate the job and eliminate doing it manually.
Or, perhaps somewhere amidst the zillions of megabytes of HP48 software
out there on FTP sites and CD-ROMS, there may be another statistical
package already accomplishing all this.
-----------------------------------------------------------
With best wishes from: John H Meyers ( jhme...@mum.edu )
> While many generations of Casio scientific calculators offering a
> Standard Deviation data entry mode have offered a means to specify
> the number of occurrences of each value, none of the HP calculators
> I have ever used has done so; the only way to use the built-in
> functions to do the computation has been to repeat the entry of
> an individual data point the required number of times; if there
> are not too many total points (sum of all frequencies), then
> repeating the rows in the SumDAT matrix according to each
> (necessarily integer) frequency will accomplish the task.
I have always felt this to be one of the 48's most puzzling omissions,
especially, as you say, as it is built into most Casios and TIs. I
spent weeks searching the manual for this functionality, but in the
end, I had to accept that it just wasn't there.
What is even more puzzling is that I have any number of HP advertising
brochures which claim that the 48, as most other HP's, can calculate
weighted means - which is essentially the same problem (the frequency
of a data item being to all intents and purposes a weight). 'Weighted
mean' is explicitly stated as a function of the 48 (and the 38), as
well as a host of older HP's - do none of them actually do it?
> It would be a modest task in User-RPL for the G/GX to create a generalized
> function whose input would be a matrix and a column number (the column
> containing the frequencies) and whose output would be a matrix whose
> rows were repeated according to the frequencies, with the original
> frequency column itself deleted. The result would then be ready
> for use with all the other standard built-in statistical functions.
>
> Another approach along similar lines, but mimicking the data entry approach
> used with older HP calculators, would be to write a variant of the Sum+
> command (which you find still present in the HP48 STAT DATA menu) which
> would accept one more stack argument (the frequency) and repeat the
> built-in Sum+ command the given number of times; this is most elementary
> for input data which is only single real values or vectors, and slightly
> trickier to generalize to accept "vectors" in the form of multiple real
> values on separate stack levels (as the HP48 Sum+ command itself accepts);
> however, you seem to be interested only in the standard deviation of a
> single list of real values, which is the easiest one to implement.
Both of the above methods have a drawback - neither will support
non-integer frequencies. Admittedly, this isn't really a problem for a
frequency-based problem, but in calculating weighted means, it is
quite possible for weights to have fractional values.
I was off work with 'flu last week, and decided to address this
problem once and for all. I have written a set of functions to
calculate the mean, variance and s-dev (both population and sample)
for a weighted set of data, with the value itself stored in column 1
of SigDAT, and the weight/frequency stored in column 2. This means
that the standard Sig+ and Sig- functions can be used, which seemed
to me the best approach to the problem.
I then decided that I should do the job properly, and started writing
a weighted version of the BINS function - I intend to distribute the
whole lot as one library (or as RPL source) to anyone who wants it.
(Feel free to drop me an email if you're interested, btw - it's
only a fairly elementary set of User-RPL progs.)
I discovered something really odd about BINS in the process. At first
I thought it was a bug, but on carefully reading the AUR, I discovered
that it is just strange, but documented, behaviour. Basically, the final 'bin'
is slightly wider than all the previous ones. As an example, store [1
2 3 4 5 6 7 8 9] into SigDAT, and then try to sort it into 3 bins, 2
wide, starting at 2. The result is [2 2 3], not (as you would expect)
[2 2 2]. The first bin runs from 2.0 to 3.99999, the second from 4.0
to 5.99999, but the third runs from 6.0 to 8.0 - very weird. Anyone
got any idea why this should be? As I said, the AUR implies that this
will happen in its definition of the maximum value which will be
assigned to a bin. (I can't recall the actual equation off-hand, but
effectively it uses a <= when I think it should use a <.) As a result,
the function displays this odd behaviour. I still can't decide whether
I should make my weighted BINS function behave in the same (to me,
erroneous) fashion or not...
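The example above is easy to reproduce. Here is a Python sketch of an
*assumed* reading of the AUR's definition (the closed upper edge on the
last bin only); the function name and structure are my own, and this is
a model of the described behaviour, not HP's actual code:

```python
# Mimic the described BINS behaviour: bins [2,4), [4,6) and a final
# bin [6,8] whose upper edge is closed (<= instead of <).
def bins(data, start, width, count):
    counts = [0] * count
    for x in data:
        for i in range(count):
            lo = start + i * width
            hi = lo + width
            last = (i == count - 1)
            # ordinary bins are half-open; the last bin also accepts hi itself
            if lo <= x < hi or (last and x == hi):
                counts[i] += 1
                break
    return counts

print(bins([1, 2, 3, 4, 5, 6, 7, 8, 9], 2, 2, 3))   # → [2, 2, 3]
```

With a strict `<` on every bin the result would be [2, 2, 2], since 8
would then fall outside the third bin just as 9 does.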
HTH,
Simon Long
Hmm, unfortunately I didn't get past the second chapter of the
programming manual, but now would seem to be a good time to start!
With the guidance above, I should be able to manage it over the weekend,
Thanks
Andrew
In article <328B17...@camcon.co.uk>, Simon Long <s...@camcon.co.uk> writes:
> I have any number of HP advertising brochures which claim that the 48,
> as most other HP's, can calculate weighted means - essentially the same
> problem (the frequency of a data item being to all intents and purposes a
> weight). 'Weighted mean' is explicitly stated as a function of the 48 (and
> the 38), as well as a host of older HP's - do none of them actually do it?
"Weighted Mean" - yes; "Standard Deviation of weighted values" - no.
I don't have my HP12C with me (and EduCalc's catalog photo has a big
clunky finger covering exactly the keys I need to see!), but as I
recall, the simple way to calculate weighted means is to input
pairs of values (older HP's like the 12C always use both x and y
as input values), with one of these always being the weight (BTW,
a non-integer weight is perfectly fine for this method). If it was
'x' which was the weight, then SumXY / SumX gives the average of the
weighted Y values (just think of it the opposite way to interchange the
roles of 'x' and 'y'); the HP12C has a specific keyboard function for
this particular calculation, while it is easy to perform on other models
(including HP48) by simply recalling the appropriate sums and dividing.
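The "recall the sums and divide" recipe is one line of arithmetic; a
Python sketch with hypothetical (weight, value) pairs, taking 'x' as
the weight column as described:

```python
# Weighted mean as SumXY / SumX, where the x column holds the weights.
# Non-integer weights are perfectly fine with this method.
pairs = [(1.0, 10.0), (2.5, 20.0), (0.5, 30.0)]   # (weight x, value y)

sum_x  = sum(w for w, y in pairs)          # SumX  (total weight)
sum_xy = sum(w * y for w, y in pairs)      # SumXY (weighted total)
weighted_mean = sum_xy / sum_x
```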
For weighted Standard Deviation, I used to simply use a trivial program
which repeated Sum+ a number of times given by the weight (in this case
only integers would do), then use the normal built-in functions. Older
HP calculators, like Casio's etc., used to accumulate only a running total
of the various sums in six dedicated registers (Casio and other brands
often use fewer accumulators, because they usually don't offer both
x-data and y-data standard deviations, and also neglect linear regression);
therefore the "weighting" of any input value a given number of times is
then simply via multiplying the contributions to each of the various sums
by the same weighting factor, whereas on the HP48, the approach is to
remember each separate data point in a matrix, and perform summations
only after collecting all data values.
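The older "six dedicated registers" design is easy to model; in that
scheme weighting is just a matter of multiplying each contribution by
the weight, so non-integer weights cost nothing. A Python sketch (the
register names are my own shorthand for the usual six accumulators):

```python
# Running-totals model of older HP/Casio stat entry: six accumulators,
# each contribution scaled by the weight w (non-integer w is fine).
regs = {"n": 0.0, "sx": 0.0, "sx2": 0.0, "sy": 0.0, "sy2": 0.0, "sxy": 0.0}

def sigma_plus(x, y, w=1.0):
    """Accumulate one (x, y) pair with weight w into the running sums."""
    regs["n"]   += w          # weighted count
    regs["sx"]  += w * x
    regs["sx2"] += w * x * x
    regs["sy"]  += w * y
    regs["sy2"] += w * y * y
    regs["sxy"] += w * x * y
```

The HP48's keep-all-the-data design trades this compactness for the
accuracy and editability advantages described below.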
Retaining all data points and performing sums only as required improves
final accuracy in cases where some points are entered and subsequently
deleted, or where larger data values precede smaller values.
One can of course rewrite all the internal statistics functions in terms
of a model where there are designated in SumDAT not only columns for an
independent and a dependent variable, but also a column for a weighting
factor which will be applied to all the summations when actually performed;
this would allow non-integer weights without restriction (hopefully there
is a statistical meaning corresponding to this generalization).
HP is a good company and produces good product, but they have always
been a bit backward with respect to their statistical stuff -- this
goes way back before micro computers. You can trust the statistical
calculations in the HP48 for small, non-challenging data sets, but
don't do any extensive, serious calculations on it -- especially
simulations with the random number generator -- random numbers are
too important to be left to chance.
The best and most accurate way to calculate the above is with a
simple looping function. All the serious computer packages use it,
because it is both the fastest and the most accurate way to do
the calculations. See Miller, Alan J. (1989). Updating means and
variances. Jour. Computational Physics.
Let X(i) be an array of i=1...N interval midpoints, and W(i) a
corresponding array of weights (frequencies). Use arrays SW(i), SX(i),
and M(i) to hold the calculations. Set SW(0)=SX(0)=M(0)=0, and then
repeat for i=0...N-1:
   SW(i+1) = SW(i) + W(i+1)
   d = [X(i+1) - M(i)] W(i+1)
   M(i+1) = M(i) + d/SW(i+1)
   SX(i+1) = SX(i) + d [X(i+1) - M(i+1)]
SX(N) will contain SW(N) times the variance (SD squared),
and M(N) the mean. You don't actually have to use arrays for
SW, SX, and M: scalars will do if you update them in place.
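The recipe transcribes almost line-for-line into Python. The function
name and the (mean, population variance) return shape below are my own
choices, not from the post; non-integer weights are handled naturally:

```python
# One-pass updating algorithm for weighted mean and variance:
# scalars SW, M, SX are updated in place for each (value, weight) pair.
def weighted_mean_var(xs, ws):
    """Return (weighted mean, weighted population variance)."""
    sw = 0.0   # running sum of weights, SW
    m  = 0.0   # running mean, M
    sx = 0.0   # running sum of weighted squared deviations, SX
    for x, w in zip(xs, ws):
        sw += w
        d = (x - m) * w
        m += d / sw
        sx += d * (x - m)
    return m, sx / sw
```

For the sample (rather than population) variance with integer
frequencies, divide SX by SW-1 instead of SW.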
Bob Wheeler, ECHIP, Inc.
%%HP: T(3)A(D)F(.);
@ WEIGHT2, Weighted (Grouped) Data Statistics, by Joseph K. Horn
DIR
WTOT @ Weighted Total
\<< \GSX*Y
\>>
WMEAN @ Weighted Mean
\<< \GSX*Y \GSY /
\>>
WSDEV @ Weighted Sample Standard Deviation
\<< \GSY \GSDAT OBJ\-> EVAL DROP 0 1 ROT
START ROT SQ ROT * +
NEXT OVER * \GSX*Y SQ - SWAP DUP 1 - * / \v/
\>>
WPSDEV @ Weighted Population Standard Deviation
\<< WMEAN 1 \GS+ WSDEV \GS- DROP
\>>
END
--
-Joseph K. Horn- <joe...@mail.liberty.com>
Press F13 to continue...
Been there, done that. It's called WEIGHT2 on Goodies Disk #8.
> Another approach along similar lines, but mimicking the data entry approach
> used with older HP calculators, would be to write a variant of the Sum+
> command (which you find still present in the HP48 STAT DATA menu) which
> would accept one more stack argument (the frequency) and repeat the
> built-in Sum+ command the given number of times; this is most elementary
> for input data which is only single real values or vectors, and slightly
> trickier to generalize to accept "vectors" in the form of multiple real
> values on separate stack levels (as the HP48 Sum+ command itself accepts);
> however, you seem to be interested only in the standard deviation of a
> single list of real values, which is the easiest one to implement.
Been there, done that too. It's called WEIGHT on Goodies Disk #7.
Don't use it; WEIGHT2 on GD8 is much smaller, faster, and more powerful.
For example, it lets you have frequencies in the thousands, which
the repeated-Sum+ method does not allow.
> Or, perhaps somewhere amidst the zillions of megabytes of HP48 software
> out there on FTP sites and CD-ROMS, there may be another statistical
> package already accomplishing all this.
Be sure to read WEIGHT.DOC and WEIGHT2.DOC; they explain pretty
thoroughly how "frequency statistics" and "weighted statistics"
only differ in concept but can be calculated the same way.
In article <328D1B...@echip.com>, Bob Wheeler <bwhe...@echip.com> writes:
> You can trust the statistical calculations in the HP48 for small,
> non-challenging data sets, but don't do any extensive, serious calculations
> on it -- especially simulations with the random number generator --
> random numbers are too important to be left to chance [!]
> The best and most accurate way to calculate [Mean & Standard Deviation] is
> with a simple looping function. All the serious computer packages use it,
> because it is both the fastest and the most accurate way to do the
> calculations. See Miller, Alan, J. (1989):
> Updating means and variances. Jour. Computational Physics.
> Let X(i) be an array of i=1...N interval midpoints,
> and W(i) a corresponding array of weights (frequencies).
> Use arrays SW(i), SX(i), and M(i) to hold the calculations.
> Set SW(0)=SX(0)=M(0)=0, and then repeat:
> SW(i+1) = SW(i) + W(i+1)
> d = [X(i+1) - M(i)] W(i+1)
> M(i+1) = M(i) + d/SW(i+1)
> SX(i+1) = SX(i) + d [X(i+1) - M(i+1)]
> SX(N) will contain SW times the variance (SD squared), and M(N) the mean.
> You don't actually have to use arrays for SW,SX, and M;
> scalars will do if you update them.
The above algorithm looked so neat that I couldn't help myself;
I had to program it and try it out:
The following program computes the Mean, Sample Standard Deviation, and
Population Standard Deviation of *weighted* data currently stored in
the HP48 Statistics Matrix (SumDAT), allowing *any* columns of SumDAT
to be specified as the "data" and "frequency/weight" columns,
in the same manner as other built-in HP48 statistical commands:
Note: This program relies on the G/GX commands COL- and DOLIST,
which can be simulated by suitable S/SX programs.
%%HP: T(3); @ \GS is Greek Sigma, \-> is right-arrow, \v/ is SquareRoot
\<< '\GSPAR' RCL 1 2 SUB EVAL 0 0 DUP2 \-> x f w d m v
\<< RCL\GS x COL- SWAP DROP ARRY\-> EVAL \->LIST
RCL\GS f COL- SWAP DROP ARRY\-> EVAL \->LIST
2 \<< DUP2 'w' STO+ m - * 'd' STO d w / 'm' STO+ m - d * 'v' STO+ \>>
DOLIST m v w DUP2 1 - / \v/ ROT ROT / \v/ \>> \>>
@ G/GX: 269 bytes (on stack), #17BBh checksum
Instructions:
Enter a statistical data matrix containing at least two columns,
with any particular column containing data values, and any other
column containing weights or frequencies (integer or non-integer).
The HP48 reserved variable SumDAT may also contain a 'name' of any
other variable, which in turn contains the actual statistics matrix.
If you use any of the G/GX Statistics data entry applications, a
default SumPAR will be created automatically, specifying column 1
as the "independent" variable (data values) and column 2 as the
"dependent" variable (weights). You may change the corresponding X-column
and/or Y-column designations in any of the Statistics data entry menus
("Single-var.." and "Frequencies.." use only X-column, while
"Fit data.." and "Summary stats.." use both X-column and Y-column).
You may also use the XCOL and YCOL commands to specify the
individual columns independently, or the COLSum command to
specify both XCOL and YCOL at the same time.
Executing the above program then returns three values:
Mean, Sample Standard Deviation, and Population Standard Deviation;
you may subsequently DROP any values you do not need.
If you want the Variances, rather than the Standard Deviations,
then simply omit taking the square roots (you may of course modify
the program to produce any desired combination of Total Weight (count),
Total Values, and Mean, Variances and/or Standard Deviations).
Note that none of the common "Summary stats" (SumX, SumY^2, etc.)
is actually computed by this program, nor will any of the built-in
HP48 Statistics functions except MAXsum and MINsum return any
correct answers using the "weighted" statistics matrix.
Many thanks to Bob Wheeler for posting the reference for the methods used.