Difficult question. Maybe there is something more on it available in
SARIMAX and the statespace models.
In most models, the scale (variance of error) is treated as a nuisance
parameter and concentrated out. Based on a quick check of the
ARMA.loglike, sigma is concentrated out of the loglikelihood and
estimated directly from the residuals.
The same is true in OLS.
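To make "concentrated out" concrete, here is a minimal sketch for the
linear regression case, not the actual ARMA or OLS code. The function
names and the simulated data are made up for illustration. It only
shows that plugging the residual-based scale into the full normal
loglikelihood gives the same value as the concentrated version.

```python
import numpy as np


def loglike_full(beta, sigma2, y, X):
    """Gaussian loglikelihood with the scale as an explicit parameter."""
    resid = y - X @ beta
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * resid @ resid / sigma2


def loglike_concentrated(beta, y, X):
    """Loglikelihood with sigma2 replaced by its MLE given beta."""
    resid = y - X @ beta
    n = len(y)
    sigma2_hat = resid @ resid / n  # scale estimated directly from residuals
    return -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1.0)


# simulated data, purely illustrative
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + rng.normal(scale=2.0, size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / n

llf_full = loglike_full(beta_hat, sigma2_hat, y, X)
llf_conc = loglike_concentrated(beta_hat, y, X)
print(llf_full, llf_conc)
```

The optimizer then only sees the mean parameters, and the scale never
shows up in `params` or in the Hessian used for cov_params.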
In general, we don't have a very consistent policy for the scale
estimation yet. In some cases we calculate the scale as part of the
loglike optimization but might throw away the related cov_params
information. I don't quite know what we should do when scale is
estimated as part of the MLE versus when it is estimated in some
other way.
The reason that we can treat the scale as a nuisance parameter in the
MLE calculation and don't have to treat it like the mean parameters is
that under the normal distribution assumption (or GLM/linear
exponential family in general), the mean parameters are orthogonal to
the scale parameters in the asymptotic distribution (or in the
expectation).
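A small numerical sketch of the orthogonality, again for the plain
linear model with simulated data and not for ARMA itself. In the linear
model the cross block of the Hessian between the mean parameters and
the variance is zero exactly at the estimate, because X'resid = 0. In
ARMA I would expect this to hold only in expectation or asymptotically.
The Hessian below is written out by hand for the parameterization
(beta, sigma2), which is my assumption for this example.

```python
import numpy as np

# simulated data, purely illustrative
rng = np.random.default_rng(12345)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + rng.normal(scale=2.0, size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / n

# analytical Hessian of the normal loglikelihood in (beta, sigma2)
k = X.shape[1]
hess = np.zeros((k + 1, k + 1))
hess[:k, :k] = -X.T @ X / sigma2_hat
hess[:k, k] = hess[k, :k] = -X.T @ resid / sigma2_hat**2
hess[k, k] = n / (2 * sigma2_hat**2) - resid @ resid / sigma2_hat**3

cross_block = hess[:k, k]  # mean parameters versus scale

# covariance of beta from the joint Hessian and from the mean block alone
cov_joint = np.linalg.inv(-hess)
cov_beta_separate = sigma2_hat * np.linalg.inv(X.T @ X)
print(cross_block)
print(np.max(np.abs(cov_joint[:k, :k] - cov_beta_separate)))
```

Because the cross block vanishes, inverting the joint matrix gives the
same cov_params for the mean parameters as ignoring the scale.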
If we use this orthogonality, then we could also estimate the variance
of the scale separately. Under the full normality assumption the
asymptotic variance of the scale would be just a function of scale
squared (i.e. sigma**4, which is proportional to the 4th moment); for
the variance estimate it should be 2 * sigma**4 / nobs.
That's as far as I remember. This would be feasible without changing
the core code of ARMA.
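As a rough check of the normality-based formula, here is a small Monte
Carlo sketch. It assumes mean-zero normal residuals and uses the
standard result var(sigma2_hat) = 2 * sigma**4 / nobs for the variance
estimate. The sample sizes and the seed are arbitrary choices, and this
is not tied to any ARMA code.

```python
import numpy as np

rng = np.random.default_rng(2024)
nobs, n_rep, sigma2 = 100, 20000, 4.0

# each row is one sample of mean-zero normal "residuals"
resid = rng.normal(scale=np.sqrt(sigma2), size=(n_rep, nobs))
scale_hat = (resid**2).mean(axis=1)

var_mc = scale_hat.var()  # Monte Carlo variance of the scale estimate
var_theory = 2 * sigma2**2 / nobs  # variance under normality

# plug-in standard error of the scale computed from a single sample
bse_scale = np.sqrt(2 * scale_hat[0]**2 / nobs)
print(var_mc, var_theory, bse_scale)
```

The plug-in standard error only needs the estimated scale and nobs, so
it could be attached to the results after the fit.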
Using the observed information matrix, i.e. the Hessian, instead of
the orthogonal expected information matrix would require considerable
changes, because scale would need to be added as an explicit element
of `params` instead of being a concentrated-out nuisance parameter. I
don't know whether that's easy or not in ARMA/ARIMA.
Aside: I have looked at related things every once in a while, and I
think a main problem is that the variance of the variance will not be
very reliable unless the sample size is very large, because 4th moments
are easily influenced by noise and by a few larger (outlying)
residuals. I never checked the details but, for example, in some cases
estimators that use var(scale) = scale**2 or f(scale**2) often have
much better small sample behavior than estimators that rely on the
empirical 4th moment.
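A small sketch of the sensitivity to outliers, with made-up helper
names and simulated residuals. It compares the normality-based
var(scale) = 2 * scale**2 / nobs with the version based on the
empirical 4th moment, (m4 - scale**2) / nobs, before and after
replacing one residual by a large value. It only illustrates the effect
of a single outlier and says nothing general about small sample
behavior.

```python
import numpy as np


def var_scale_normal(resid):
    """var(scale) = 2 * scale**2 / nobs, valid under normality."""
    scale = np.mean(resid**2)
    return 2 * scale**2 / len(resid)


def var_scale_empirical(resid):
    """var(scale) = (m4 - scale**2) / nobs using the empirical 4th moment."""
    scale = np.mean(resid**2)
    m4 = np.mean(resid**4)
    return (m4 - scale**2) / len(resid)


rng = np.random.default_rng(0)
resid = rng.normal(size=200)
resid_out = resid.copy()
resid_out[0] = 8.0  # a single large (outlying) residual

# how much each variance estimate is inflated by the one outlier
ratio_normal = var_scale_normal(resid_out) / var_scale_normal(resid)
ratio_empirical = var_scale_empirical(resid_out) / var_scale_empirical(resid)
print(ratio_normal, ratio_empirical)
```

The 4th-moment version reacts much more strongly to the single
outlier, since the outlier enters with power 4 instead of power 2.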
What does MATLAB use for the variance of the scale, the observed or
the expected information matrix?
Josef