Duan Smearing Estimator


Rashawn Devegowda

Aug 4, 2024, 8:53:17 PM

to tidejori

Researchers often transform the dependent variable of a regression before estimating parameter values. Performing the transformation, however, complicates the calculation of the expected value of the dependent variable on the untransformed scale. Assume that Yi is the dependent variable and that the function g is used to transform it as follows:
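In the usual notation, the setup can be sketched as below. The symbols X_i, \beta, and \varepsilon_i, and the definition of h as the inverse of g, are assumptions made here for illustration.

```latex
g(Y_i) = X_i \beta + \varepsilon_i, \qquad h = g^{-1}, \qquad
E[Y_i \mid X_i] = E\big[\, h(X_i \beta + \varepsilon_i) \,\big] \neq h(X_i \beta) \ \text{in general.}
```

The inequality is the whole problem: when h is non-linear, applying it to the fitted value alone does not recover the expected value on the original scale.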

The easiest way to picture these functions is to think of g as the ln function and h as the exp function. In health economics, researchers often use a log transformation to attenuate problems caused by a heavily right-skewed distribution. In this case, one would estimate the following regression:
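For the log case, the regression and the two competing retransformations can be sketched as follows. The notation (X_i, \beta, \varepsilon_i, and n residuals \hat{\varepsilon}_j) is assumed for illustration.

```latex
\ln(Y_i) = X_i \beta + \varepsilon_i
\\[4pt]
\text{naive: } \exp(X_i \hat{\beta}), \qquad
\text{Duan smearing: } \hat{E}[Y_i \mid X_i] = \exp(X_i \hat{\beta}) \cdot \frac{1}{n} \sum_{j=1}^{n} \exp(\hat{\varepsilon}_j)
```

The smearing estimator multiplies the naive value by the average of the exponentiated residuals, which is at least 1 whenever the residuals average to zero.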

To see how the smearing estimator works in practice, consider this example. I examine (made-up) figures of how Medicare spending varies by age. The average spending level is $6,516. Retransforming the dependent variable from the log scale with the naive estimate produces an estimated expected value of $4,853. Using the Duan smearing estimator, the estimated average spending level is $6,725, much closer to the actual figure than the naive estimate.

Hi - I just modeled, using multiple linear regression, log E. coli in terms of log turbidity and a rainfall variable, and now need to retransform the E. coli results back into arithmetic form. This retransformation technique was used to good effect in a similar project (USGS OFR 2020-1048). I can't find any reference to it in JMP.

I suggest that you save the residuals from the fitting platform, then create a new data column with a formula that back-transforms them. Then use Summary, Distribution, or Tabulate to calculate the smearing estimator (the average back-transformed residual).
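The same three steps can be sketched outside JMP. This is a minimal Python illustration, not JMP code; the function names and numbers are hypothetical, and it assumes natural logs (for a log10 model, replace np.exp(x) with 10**x).

```python
import numpy as np

def smearing_factor(log_residuals):
    """Average of the back-transformed (exponentiated) residuals."""
    return float(np.mean(np.exp(np.asarray(log_residuals, dtype=float))))

def retransform(log_predictions, log_residuals):
    """Naive back-transform exp(prediction), scaled by the smearing factor."""
    naive = np.exp(np.asarray(log_predictions, dtype=float))
    return naive * smearing_factor(log_residuals)

# Hypothetical saved residuals and predictions, both on the natural-log scale.
resid = np.array([-0.4, 0.1, 0.3, -0.2, 0.2])
pred = np.array([2.0, 3.5])

factor = smearing_factor(resid)        # a single number, slightly above 1 here
estimates = retransform(pred, resid)   # predictions on the arithmetic scale
```

The factor is computed once from the fitted model and then reused for every prediction.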

The other day on LinkedIn I made a point about how I think scikit-learn's TransformedTargetRegressor is very likely to mislead folks. In fact, the example use case in the docs for this function is a common mistake: fitting a model for log(y), getting predictions phat, and then simply exponentiating those predictions, exp(phat).

On LinkedIn I gave an example of how this is problematic for random forests, but here is a similar example for linear regression. For simplicity pretend we only have 3 potential residuals (all equally likely), either a residual of -1, 0, or 1.

So if we take the mean of our LinPred column, we then get a prediction of about 202. The prediction using this approach is much higher than the naive approach of simply exponentiating 5. The difference is that the exp(5) estimate is the median, and the above estimate taking into account residuals is the mean estimate.
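The arithmetic behind those two numbers can be checked directly. This is a small sketch, assuming a predicted value of 5 on the log scale and the three equally likely residuals from the example; the variable names are made up here.

```python
import numpy as np

log_pred = 5.0                           # prediction on the log scale
residuals = np.array([-1.0, 0.0, 1.0])   # three equally likely residuals

# Back-transform each possible outcome, then average on the original scale.
lin_pred = np.exp(log_pred + residuals)
mean_estimate = lin_pred.mean()          # about 202.1

# The naive approach ignores the residuals entirely.
naive_estimate = np.exp(log_pred)        # about 148.4

print(naive_estimate, mean_estimate)
```

The gap comes from exp(6) pulling the average up by far more than exp(4) pulls it down.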

While there are some cases you may want the median estimate, in that case it probably makes more sense to use a quantile estimator of the median from the get go, as opposed to doing the linear regression on log(y). I think for many (probably most) use cases in which you are predicting dollar values, this underestimate can be very problematic. If you are using these estimates for revenue, you will be way under for example. If you are using these estimates for expenses, holy moly you will probably get fired.

This problem will happen for any non-linear transformation. So while some transformations are OK (in scikit-learn, for example, the min-max or standard scalers), things like logs, square roots, or Box-Cox transformations are not. (To check whether a transformation is linear, make a scatterplot of the original vs. the transformed values: if it is a straight line it is OK, if it is a curved line it is not!)
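A quick numerical check of that distinction, as a sketch on simulated data (the lognormal parameters and variable names are arbitrary choices made here):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.lognormal(mean=5.0, sigma=1.0, size=10_000)

# Linear transformation (min-max scaling): inverting the mean of the
# scaled values recovers the mean of y.
lo, hi = y.min(), y.max()
scaled = (y - lo) / (hi - lo)
back_linear = scaled.mean() * (hi - lo) + lo

# Non-linear transformation (log): inverting the mean of log(y) gives the
# geometric mean, which sits below the arithmetic mean.
back_log = np.exp(np.log(y).mean())

print(y.mean(), back_linear, back_log)
```

The first two printed values agree up to floating point error; the third is noticeably smaller.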

But as we will see in a second, the exponentiated predictions are not so well behaved. To illustrate how the WrongTrans variable behaves, I show its distribution compared to the original y value. You can see that on average it is a much smaller estimate. Our sample values have a mean of 7.5 million, and the naive estimate here only has a mean of 4.6 million.

Now here is a way to get an estimate of the mean value. In a nutshell, what you do is take the observed residuals, pretty much like that little table I did in the intro of this blog post, generate predictions given those residuals, and then back transform them and take the mean.
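A minimal sketch of that procedure with NumPy on simulated data follows. This is not the post's original code: the names pred, resids, and dp are chosen to match how the text refers to them, and the data-generating values are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
x = rng.uniform(0, 1, n)
log_y = 1.0 + 2.0 * x + rng.normal(0, 0.5, n)   # simulated log outcome

# Ordinary least squares on the log scale.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, log_y, rcond=None)
pred = X @ beta            # predictions on the log scale
resids = log_y - pred      # observed residuals

# dp[i, j] = back-transformed prediction i combined with residual j.
dp = np.exp(pred[:, None] + resids[None, :])

smeared = dp.mean(axis=1)  # Duan smearing estimate of the mean
naive = np.exp(pred)       # naive back-transform
```

Each row of dp holds the back-transformed outcomes for one prediction under every observed residual, so the row mean is the smeared estimate. With a large dataset, dp can be kept small by drawing a subset of the residuals first, for example rng.choice(resids, size=500, replace=False).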

The residuals used for the smearing estimator do not need to have the same dimension as the data you are predicting on. So, for example, if you have a big data application, you may want to do something like resids = resids.sample(1000) above.

Another nice perk of this is that you can use dp above to give you prediction intervals, so np.quantile(dp,[0.025,0.975], axis=1).T would give you a 95% prediction interval on the linear scale as well.

Another approach, which may make sense given the application, is to estimate the distribution of the residuals instead of using the observed residuals as a non-parametric estimate. You can then use that distribution to compute an integral estimate of the smeared estimate back on the original scale. In the case of the logged regression, there is also a closed form solution.

The differences could be because the integral is simply an estimate (and you can see I did not integrate from negative to positive infinity, but chopped it off; I do not know if there is a better function to estimate the integral, or a better general approach here).
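As a rough illustration of the two routes, the sketch below assumes normally distributed residuals on the log scale, for which the closed form is exp(mu + sigma^2/2). The function names, the truncation width, and the grid size are all choices made here, not taken from the post.

```python
import numpy as np

def lognormal_mean_closed_form(mu, sigma):
    """E[exp(mu + e)] for e ~ Normal(0, sigma^2)."""
    return np.exp(mu + sigma**2 / 2.0)

def lognormal_mean_integral(mu, sigma, width=10.0, n=200_001):
    """Same quantity by trapezoid rule over [-width*sigma, width*sigma]."""
    e = np.linspace(-width * sigma, width * sigma, n)
    pdf = np.exp(-0.5 * (e / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    f = np.exp(mu + e) * pdf          # back-transformed value times density
    de = e[1] - e[0]
    return float(np.sum((f[:-1] + f[1:]) / 2.0) * de)

closed = lognormal_mean_closed_form(5.0, 1.0)
numeric = lognormal_mean_integral(5.0, 1.0)
print(closed, numeric)
```

With a wide enough range and a fine grid the two agree closely; a narrow truncation range cuts off the right tail and biases the integral downward.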

While this focuses on regression, I do not think this will perform all that badly for other types of models (such as random forests or xgboost). But for forests it may make sense to simply pull out the individual tree estimates, back transform them, and get the mean of that backtransformed estimate. I have a different blog post that has a function showing how to scoop up the individual predictions from a random forest model.
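The per-tree idea can be sketched with scikit-learn as below. This is not the function from the other blog post; the data are simulated and the variable names are made up. Note that it averages only over the spread between trees, not over residual noise.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(500, 2))
log_y = 1.0 + 2.0 * X[:, 0] + rng.normal(0, 0.5, 500)   # simulated log outcome

rf = RandomForestRegressor(n_estimators=50, min_samples_leaf=10, random_state=0)
rf.fit(X, log_y)

# One column of log-scale predictions per fitted tree.
tree_preds = np.column_stack([tree.predict(X) for tree in rf.estimators_])

# Back-transform each tree's prediction first, then average across trees.
mean_of_backtransformed = np.exp(tree_preds).mean(axis=1)

# Naive alternative: average on the log scale, then back-transform.
naive = np.exp(tree_preds.mean(axis=1))
```

The first estimate is never below the second, because the mean of exponentials is at least the exponential of the mean.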

It should also apply the same to any regression model with regularization. But if you want to do this, there are of course other alternative models you may consider that may be better suited towards your end goals of predictions on the linear/original scale.

For example, if you really want prediction intervals, it may make sense to not transform the data, and estimate a quantile regression model at the 5% and 95% quantiles. This would give you a 90% prediction interval.

Another approach is to use a different model, such as Poisson regression or negative binomial regression (or another generalized linear model). Even if your data are not integer counts, you can still use these models! (They just need to be 0 and above, no negative values.)

That Stata blog suggests using Poisson with robust standard errors, but that is a bad idea if you are really interested in predictions as well (see Gary King's comment and linked paper). But you can just use negative binomial models in most cases, and that is a better default than Poisson for many real world datasets.

I was recently reminded of this paper by Jung et al. on constructing simple rules via regression. In the past few posts I have talked about how RTM (1,2) is aimed at making simple models. This is via variable selection and/or simplifying the inputs to be binary yes/no. But in the end the final equation could be something like:

The paper linked above is about making the regression weights simple, so instead of a regression weight of 0.89728, you may just round the regression weight to 1. The Jung paper does a procedure where they use lasso regression and then round the weights. But there is a simpler approach IMO I will illustrate, just amend the lasso weights to push the coefficients to simple integers. (Also reminded by this example of using an iterative linear program to push weights to binary 0/1.)

So in lasso, you estimate your normal regression equation, but put a penalty on the weights that is typically something like lambda*(sum(abs(reg_weights)) - 1)**2. So if you have reg weights that add to more than 1, they are penalized by a particular amount (the lambda is a tuner to make the penalty higher/lower). And in the iterative algorithm to minimize your loss function plus this added penalty, it will converge to regression weights that meet the criteria of in total summing to around 1. Not exactly 1 but close.

You can however swap out that penalty term with whatever you want (or add to it additional penalties). I will show an example of using a penalty term to push regression coefficients towards integer values, creating simple regression weights.
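One possible shape for such a penalty is sketched below. It is an assumption for illustration only, not necessarily the penalty used in the post: sin(pi*w)^2 is zero whenever w is an integer and largest halfway between integers. The function names and lam are hypothetical.

```python
import numpy as np

def integer_penalty(weights):
    """Zero at integer weights, largest (1 per weight) at half-integers."""
    w = np.asarray(weights, dtype=float)
    return float(np.sum(np.sin(np.pi * w) ** 2))

def penalized_loss(weights, X, y, lam=1.0):
    """Mean squared error plus lam times the integer-pushing penalty."""
    w = np.asarray(weights, dtype=float)
    resid = y - X @ w
    return float(np.mean(resid ** 2)) + lam * integer_penalty(w)

# Hypothetical data: the unpenalized fit would sit near 0.9 and 2.1.
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = X @ np.array([0.9, 2.1]) + rng.normal(0, 0.1, 200)

loss_at_fit = penalized_loss([0.9, 2.1], X, y, lam=1.0)
loss_at_int = penalized_loss([1.0, 2.0], X, y, lam=1.0)
```

A gradient-based optimizer minimizing penalized_loss trades a little fit for weights near whole numbers; a larger lam pushes harder.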

Dan Simpson has a good blog post on the Jung paper and why simple models are sometimes preferable (and I also have a comment on why simple models like this tend to work out well for CJ datasets). But here are a few quick examples of why you might want simple model results.

Example 1: If you have people in the field who are tabulating data and making quick decisions, it may be they need to use pen/paper and make a quick decision. No time to input results into a computer and pop out a prediction. Imagine a nurse in the ER, or even your general practitioner. There may be quite a bit of utility in making a simple check list that says if +4 on this scale, do a more intensive treatment.

Example 2: You have a complicated, large database. It is easier to create a simple predictive model in SQL to serve up predictions (either because of latency or because of the complexity of the data pipeline). Instead of a complicated random forest, a linear regression with simple weights will be much easier to implement.

Example 3: Transparency. Complicated models are more difficult to understand and monitor. If you have a vested interest in presenting the model to outside parties, it may make sense to sacrifice some accuracy to make the model more interpretable. Also similar to lasso, I suspect these simple weights will reduce the variance of predictions.

Next I read in the data, which I have used as an example in prior blog posts on doctor visits for Medicare patients. One thing to note here is that I rescale the independent variables I am using to min/max. So the age variable, instead of going from 65-90 like in the original data, is now scaled to be between 0 and 1. This is a problem intrinsic to lasso as well, in that changing the scale of the input variables changes the weights. Here with the original data, the education variable has a tiny regression coefficient (0.2), but is highly statistically significant. So without rescaling that variable, the model said to hell with your penalty and still converged to a regression weight of 0.2. If you divide the education variable by 5 though, the corresponding regression weight would change to around 1.
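The scale dependence is easy to verify on simulated data. In the sketch below, the education variable, its range, and the true coefficient of 0.2 are made up to mirror the description above; this is not the Medicare data.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
educ = rng.integers(0, 21, n).astype(float)   # hypothetical years of education
y = 0.2 * educ + rng.normal(0, 1, n)

def slope(x, y):
    """OLS slope of y on x, fitted with an intercept."""
    X = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

b_raw = slope(educ, y)             # close to 0.2
b_div5 = slope(educ / 5.0, y)      # exactly 5 times b_raw, so close to 1

# Min-max scaling multiplies the slope by the variable's range.
educ_minmax = (educ - educ.min()) / (educ.max() - educ.min())
b_minmax = slope(educ_minmax, y)
```

Rescaling an input by a constant rescales its weight by the inverse, so any penalty defined on the weights depends on the units of the inputs.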