Alois,
I know you are an experienced matlab user, so pardon me if I am being
oversimplistic in my post. If you read today's thread entitled
"Unique and NaN", you will see that NaN is a strange beast indeed.
In the computation of basic statistics, if NaN represents a missing
value, you may indeed wish to compute the desired statistic (e.g. std
or var or mean) using only the non-NaN values. The Matlab statistics
toolbox has the functions nanmax, nanmean, nanmedian, nanmin, nanstd
and nansum that do precisely that.
I guess the reason for the NaN-in NaN-out behaviour of many functions
in Matlab (such as sum, std, mean, median, corrcoef, detrend) is that
the folks at the Mathworks would rather leave it up to the user to
decide what to do with NaNs for their particular application or field
of expertise. With the isnan function, it is generally easy to tell
matlab whether NaNs should be ignored or not, as in
x = [0 3 4 NaN];
xmean = mean(~isnan(x));
Having said that, I will add that I generally (or always???) handle
NaNs as missing values in my programs. Therefore, your TSA toolbox
sounds interesting to me and I will definitely download it. ;-)
Just my own $0.02, Denis ;-)
Denis Gilbert wrote:
It should be
xmean = mean(x(~isnan(x)));
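To make the difference concrete, here is a small sketch using the same example vector. The variable names wrong and right are only illustrative and were not part of the original posts.

```matlab
x = [0 3 4 NaN];

% Mean of the logical mask itself: the fraction of non-NaN entries.
wrong = mean(~isnan(x));       % 0.75

% Mean of the values selected by the mask: the intended statistic.
right = mean(x(~isnan(x)));    % (0 + 3 + 4) / 3
```

The first expression silently returns a plausible-looking number, which is why the slip is easy to miss when the output is suppressed.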
Denis,
one purpose of posting these functions was to demonstrate how easy it is
to write efficient routines that can handle missing values. MEAN, SUM,
STD, VAR, DETREND etc. are simple examples; the routines in the
TSA-toolbox are more advanced.
I cannot follow your argument that it should be up to the user, because
it is always possible to perform an explicit check on the data with
isnan(x). Actually, I recommend explicit checks for NaNs rather than
implicit checks. This would make programs much clearer.
In cases where there is no need to check for NaNs, most users (like you and me)
do not want to care about whether to use MEAN or NANMEAN. Actually, MIN
and MAX treat NaNs as missing values, which is not consistent with the
behavior of MEAN, SUM etc. In cases without NaNs, MEAN and
NANMEAN give the same results. In the case of missing values, I want to get a
meaningful result, too. It is boring to use NANSUM, NANMEAN etc. (BTW,
NANSUM, NANMEAN sound like they would sum (or mean) the nan's !?!)
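The inconsistency can be seen in a few lines. This is only an illustrative sketch of the default behaviour described above; the variable names are arbitrary.

```matlab
x = [1 NaN 3];

a = max(x);     % 3:   MAX skips the NaN
b = min(x);     % 1:   MIN skips the NaN
c = sum(x);     % NaN: SUM propagates it
d = mean(x);    % NaN: MEAN propagates it
```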
The solution
xmean = mean(x(~isnan(x)));
is computationally not very efficient: the additional intermediate result
needs memory. Sometimes x is not explicitly available (e.g. x::=
x(f)). XMEAN is not correct if x is a matrix rather than a vector (i.e.
all(size(x)>1) ), because x(~isnan(x)) collapses into a single column.
And it is not very intuitive either.
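As a rough sketch of the alternative, a NaN-skipping mean that also works column-wise on matrices could look like the following. The name skipnan_mean is hypothetical; it is not a Matlab built-in and not the TSA-toolbox code, and it assumes a Matlab version in which FIND accepts a count argument. Note that it still works on a local copy of x, so it does not remove the memory concern entirely.

```matlab
% Hypothetical helper (not part of Matlab or the TSA toolbox):
% mean along dimension DIM, treating NaN as a missing value.
function m = skipnan_mean(x, dim)
    if nargin < 2
        % Default to the first non-singleton dimension, as SUM does.
        dim = find(size(x) ~= 1, 1);
        if isempty(dim), dim = 1; end
    end
    valid = ~isnan(x);          % logical mask of the non-missing entries
    x(~valid) = 0;              % zeros do not change the sum
    n = sum(valid, dim);        % number of valid entries in each slice
    m = sum(x, dim) ./ n;       % 0/0 gives NaN when a slice has no valid data
end
```

With this sketch, skipnan_mean([0 3 4 NaN]) gives 7/3, and skipnan_mean([1 NaN; 3 4]) gives the column means [2 4], whereas mean(x(~isnan(x))) would return a single number for the matrix.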
Your concerns are quite important. They show me the parts that I failed to
explain in the first posting. Thanks.
Alois
You are obviously right. I would have noticed my mistake if I had
left out the semi-colon ;-)
You make several good points here. I too would find matlab coding
easier if the default behavior of some of its functions (sum, mean,
median, detrend, corrcoef) were to skip NaNs while returning
meaningful non-NaN values. I will leave it at that and wait for
others to express their own views. Perhaps the above functions
should simply issue a warning to the effect that the input arrays
contain so many NaNs out of a total of so many input array elements?
You may find it interesting that Zhiping You of TMW once mentioned to
this newsgroup (02 Mar 1999) that NANMIN and NANMAX of the Statistics
Toolbox will eventually become obsolete because of the new NaN-handling
capability of the MIN and MAX functions that was introduced in matlab
5.x. Perhaps there's a trend here ;-)
Cheers, Denis.
Dear Matlab-users,
please note also that the stats toolbox is still at Matlab-4 level, i.e. it
does not support more than two dimensions. Compare
k=rand(3,4,1,4)
median(k)
nanmedian(k)
As a further inconsistency, the dimension concept of Matlab is not
satisfactory because it does not support vectors but n x 1 or 1 x n matrices
instead, which led me to write my own SUM that does sum up along the first
dimension *always*.
Regards,
Matthias Frühwirth
Matthias Fruehwirth wrote:
> "Alois Schlögl" <schl...@dpmi.tu-graz.ac.at
> >It is boring to use NANSUM, NANMEAN etc. (BTW,
> > NANSUM, NANMEAN sound like they would sum (or mean) the nan's !?!)
>
> Dear Matlab-users,
>
> please note also that the stats toolbox is still at Matlab-4 level, i.e. it
> does not support more than two dimensions. Compare
>
> k=rand(3,4,1,4)
> median(k)
> nanmedian(k)
>
Dear Matthias,
I think Matlab 5.3 (and Statistics Toolbox Version 2.2, R11) can handle these
cases, too.
However, for the same reasons as mentioned in my previous posting, I'd suggest
that median should handle missing values, too. Then, nanmedian is not needed
anymore.
>
> As a further inconsistency, the dimension concept of Matlab is not
> satisfactory because it does not support vectors but n x 1 or 1 x n matrices
> instead, which led me to write my own SUM that does sum up along the first
> dimension *always*.
>
You can force SUM to work on a defined dimension using SUM(x,DIM). For your
case, SUM(x,1) would do the same as your routine.
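A short illustration of the difference, as a sketch only (the variable names are arbitrary):

```matlab
r = [1 2 3];        % 1 x 3 row vector
c = [1; 2; 3];      % 3 x 1 column vector

s1 = sum(r);        % 6:       default sums along the first non-singleton dimension
s2 = sum(r, 1);     % [1 2 3]: forced to sum along dimension 1
s3 = sum(c, 1);     % 6:       dimension 1 is the long dimension here
```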
Alois
> Matthias Fruehwirth wrote:
>
> > k=rand(3,4,1,4)
> > median(k)
> > nanmedian(k)
> >
>
> Dear Matthias,
>
> I think Matlab 5.3 (and Statistics Toolbox Version 2.2, R11) can handle these
> cases, too.
Matlab 6 (Stats 3.0) cannot.
> However, for the same reasons as mentioned in my previous posting, I'd suggest
> that median should handle missing values, too. Then, nanmedian is not needed
> anymore.
I agree regarding statistics; there I also use NaN for missing values.
Yet I have a number of applications where I use NaN as an erroneous value that
I want to propagate.
> > ..., which led me to write my own SUM that does sum up along the first
> > dimension *always*.
> >
>
> You can force SUM to work on a defined dimension using SUM(x,DIM). For your
> case, SUM(x,1) would do the same as your routine.
Right, the above happened on Matlab 4, before SUM supported dimensions.
Still, I miss vectors (e.g. the default behaviour of built-in functions,
whether they return row or column vectors, has changed between releases, which
led to subsequent dimension errors in existing programs).
Matthias