In my statistical teaching, I encounter some stubborn ideas/principles relating to statistics that have become popularised, yet seem to me to be misleading, or in some cases utterly without merit. I would like to solicit the views of others on this forum to see what are the worst (commonly adopted) ideas/principles in statistical analysis/inference. I am mostly interested in ideas that are not just novice errors; i.e., ideas that are accepted and practiced by some actual statisticians/data analysts. To allow efficient voting on these, please give only one bad principle per answer, but feel free to give multiple answers.
Very often, even on this website, I see people lamenting that their data are not normally distributed, and so t-tests or linear regression are out of the question. Even stranger, I see people try to rationalize their choice of linear regression on the grounds that their covariates are normally distributed.
I don't have to tell you that regression assumptions are about the conditional distribution, not the marginal. My absolute favorite way to demonstrate this flaw in thinking is to essentially compute a t-test with linear regression as I do here.
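The demonstration linked above isn't reproduced here, but here is a minimal Python sketch of the same point (with made-up, strongly skewed exponential data): the t statistic for the slope in a regression of the outcome on a group dummy is identical to the classical equal-variance two-sample t statistic, even though the pooled outcome is nowhere near normal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Two groups with skewed (exponential) marginals: the pooled y is far
# from normal, yet the t-test and the regression target the same thing.
g0 = rng.exponential(scale=1.0, size=200)
g1 = rng.exponential(scale=1.0, size=200) + 0.5

y = np.concatenate([g0, g1])
x = np.concatenate([np.zeros(200), np.ones(200)])  # group dummy

# OLS slope and its t statistic, computed by hand
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (len(y) - 2)
se_slope = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
t_reg = beta[1] / se_slope

# Classical two-sample t-test (equal variances)
t_classic, _ = stats.ttest_ind(g1, g0, equal_var=True)

print(t_reg, t_classic)  # identical up to floating-point error
```

The two statistics agree exactly because a two-sample t-test *is* a regression on a 0/1 dummy; the normality assumption, such as it is, concerns the residuals, not the marginal distribution of y or x.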
Some people have the intuition that post hoc power analysis could be informative because it could help explain why they attained a non-significant result. Specifically, they think maybe their failure to attain a significant result doesn't mean their theory is wrong... instead maybe it's just that the study didn't have a large enough sample size or an efficient enough design to detect the effect. So then a post hoc power analysis should indicate low power, and we can just blame it on low power, right?
The problem is that the post hoc power analysis does not actually add any new information. It is a simple transformation of the p-value you already computed. If you got a non-significant result, then it is a mathematical necessity that post hoc power will be low. Conversely, post hoc power is high if and only if the observed p-value is small. So post hoc power cannot possibly provide any support for the hopeful line of reasoning mentioned above.
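To see the "simple transformation" concretely, here is a sketch for a two-sided z-test (the function name is mine, and the z-test framing is a simplifying assumption; t-tests behave analogously): "observed power" is obtained by plugging the |z| implied by the p-value back in as if it were the true effect, so it is a deterministic, monotone function of p.

```python
from scipy.stats import norm

def post_hoc_power(p_value, alpha=0.05):
    """'Observed power' of a two-sided z-test, treating the observed
    effect as the true effect -- a pure function of the p-value."""
    z_obs = norm.isf(p_value / 2)    # |z| implied by the p-value
    z_crit = norm.isf(alpha / 2)
    return norm.sf(z_crit - z_obs) + norm.sf(z_crit + z_obs)

for p in [0.01, 0.049, 0.05, 0.2, 0.5]:
    print(p, round(post_hoc_power(p), 3))
```

At exactly p = 0.05 the observed power is essentially 0.5, and any non-significant p gives observed power below that: "we had low power" is just "we got a large p-value" restated.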
Note that the problem here is not the chronological issue of running a power analysis after the study is completed per se -- it is possible to run an after-the-fact power analysis in a way that is informative and sensible, by varying some of the observed statistics, for example to estimate what would have happened if you had run the study in a different way. The key problem with "post hoc power analysis" as defined in this post lies in simply plugging in all of the observed statistics when doing the power analysis.

The vast majority of the time that someone does this, the problem they are attempting to solve is better solved by just computing some sort of confidence interval around their observed effect size estimate. That is, if someone wants to argue that the reason they failed to reject the null is not that their theory is wrong but that the design was highly sub-optimal, then a more statistically sound way to make that argument is to compute the confidence interval around their observed estimate and point out that while it does include 0, it also includes large effect size values -- basically, the interval is too wide to conclude very much about the true effect size, and thus is not a very strong disconfirmation.
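A sketch of that confidence interval argument, with made-up data from a deliberately under-powered design (the pooled-df Student-t interval is a simplifying assumption; Welch's correction would differ slightly):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# A small, under-powered study: n = 15 per group, unit-variance noise.
a = rng.normal(0.4, 1.0, size=15)
b = rng.normal(0.0, 1.0, size=15)

diff = a.mean() - b.mean()
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
df = len(a) + len(b) - 2  # pooled df, for the sketch
half = stats.t.isf(0.025, df) * se
lo, hi = diff - half, diff + half
print(f"95% CI for the mean difference: [{lo:.2f}, {hi:.2f}]")
```

With samples this small the interval is wide; an interval that straddles 0 but also covers large effects says "inconclusive", which is the honest version of "maybe we were under-powered".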
It seems that many individuals have the idea that they not only can, but should, disregard data points that lie some number of standard deviations away from the mean. Even when there is no reason to suspect that the observation is invalid, and no conscious justification is offered for identifying/removing outliers, this strategy is often treated as a staple of data preprocessing.
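A quick simulation shows why a blanket k-standard-deviation rule is not harmless even in the best case: on perfectly clean, exactly normal data, a 2-SD cutoff throws away roughly 5% of valid observations and biases the spread estimate downward (a sketch of the failure mode, not a preprocessing recipe):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)  # perfectly clean, valid data

# The "staple" rule: drop anything beyond 2 SD of the mean.
kept = x[np.abs(x - x.mean()) <= 2 * x.std()]
frac_dropped = 1 - len(kept) / len(x)

print(f"dropped {frac_dropped:.1%} of perfectly valid points")
print(f"sd before: {x.std():.3f}, sd after: {kept.std():.3f}")
```

The surviving sample's standard deviation shrinks to about 0.88 of the truth, because truncation always clips the tails; with genuinely heavy-tailed data the rule deletes exactly the observations that carry the information about the tails.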
Just because you aren't performing a t.test on 1,000,000 genes doesn't mean you're safe from the multiple comparisons problem. One place it notably pops up is in studies that test an effect conditional on a previous effect being significant. Often in experiments the authors identify a significant effect of something, and then, conditional on it being significant, perform further tests to better understand it without adjusting for this sequential testing procedure. I recently read a paper specifically about the pervasiveness of this problem in experiments, "Multiple hypothesis testing in experimental economics", and it was quite a good read.
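A toy simulation of the conditional procedure described above (every null is true, and the choice of three unadjusted follow-up tests is my assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, alpha = 100_000, 0.05
# All effects are truly null, so every p-value is Uniform(0, 1).
p_primary = rng.uniform(size=n_sims)
p_followup = rng.uniform(size=(n_sims, 3))  # 3 unadjusted follow-up tests

# Follow-ups are only run (and reported) when the primary effect "works".
published = p_primary < alpha
extra_hits = (p_followup[published] < alpha).any(axis=1)

print(f"experiments with a 'significant' primary effect: {published.mean():.3f}")
print(f"...of which also report a further false finding:  {extra_hits.mean():.3f}")
```

Among the experiments that clear the first hurdle by chance, roughly 1 - 0.95^3 ≈ 14% pick up at least one additional spurious "finding", so the published record compounds one false positive with others.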
This seems like low-hanging fruit, but stepwise regression is one error that I see pretty frequently, even from some stats people. Even if you haven't read some of the very well-written answers on this site that address the approach and its flaws, I think if you just took a moment to understand what is happening (that you are essentially testing with the data that generated the hypothesis) it would be clear that stepwise selection is a bad idea.
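The "testing with the data that generated the hypothesis" problem can be seen in just the first step of forward selection (a sketch with pure-noise data; the simulation sizes are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k, n_sims = 100, 50, 500
selected_sig = 0
for _ in range(n_sims):
    X = rng.standard_normal((n, k))   # 50 pure-noise predictors
    y = rng.standard_normal(n)        # pure-noise response
    # "Step 1" of forward stepwise: keep the predictor with the
    # smallest univariate p-value -- then read that p-value naively.
    pvals = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(k)])
    selected_sig += pvals.min() < 0.05

print(f"noise fits whose 'best' predictor looks significant: "
      f"{selected_sig / n_sims:.2f}")
```

With 50 candidate predictors and no real signal, the selected variable's naive p-value falls below 0.05 in roughly 90% of runs: selection and inference on the same data makes the reported p-values meaningless.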
Most competent Data Scientists and ML Engineers who are generalists (in the sense that they don't specialize in time series forecasting or econometrics), as well as MBA types and people with general statistics backgrounds, will default to ARIMA as the baseline model for a time series forecasting problem. Most of the time they end up sticking with it. When they do evaluate it against other models, it is usually against more exotic entities like deep learning models, XGBoost, and so on.
On the other hand, most time series specialists, supply chain analysts, experienced demand forecasting analysts, etc., stay away from ARIMA. The accepted baseline model, and the one that is still very hard to beat, is Holt-Winters, also known as Triple Exponential Smoothing. See for example "Why the damped trend works" by E. S. Gardner Jr. & E. McKenzie. Beyond academic forecasting, many enterprise-grade forecasting solutions in the demand forecasting and supply chain space still use some variation of Holt-Winters. This isn't corporate inertia or bad design; it is simply that Holt-Winters, or Damped Holt-Winters, is still the best overall approach in terms of robustness and average accuracy.
Some history might be useful here: exponential smoothing models - Simple ES, Holt's model, and Holt-Winters - were developed in the 50s. They proved to be very useful and pragmatic, but were completely "ad hoc". They had no underlying statistical theory or first principles; they grew out of the question: How can we extrapolate time series into the future? Moving averages are a good first step, but we need to make the moving average more responsive to recent observations. Why don't we just add an $\alpha$ parameter that gives more importance to recent observations? This was how simple exponential smoothing was invented. Holt and Holt-Winters were simply the same idea, but with the trend and seasonality split out and estimated with their own weighted moving average models (hence the additional $\beta$ and $\gamma$ parameters). In fact, in the original formulations of ES, the parameters $\alpha$, $\beta$, and $\gamma$ were chosen manually, based on gut feeling and domain knowledge.
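For concreteness, the simple exponential smoothing recursion just described fits in a few lines (a plain-Python illustration, not any package's implementation); note that $\alpha = 1$ collapses to a naive last-value forecast:

```python
def simple_exponential_smoothing(series, alpha):
    """Level update: s_t = alpha * x_t + (1 - alpha) * s_{t-1}.
    The one-step-ahead forecast is simply the latest level."""
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level  # forecast for the next period

data = [10, 12, 11, 13, 30]  # a stable series with a late spike
print(simple_exponential_smoothing(data, 0.2))   # smooth: spike mostly damped
print(simple_exponential_smoothing(data, 0.95))  # reactive: almost a naive forecast
```

A small $\alpha$ averages over history and damps the spike; $\alpha$ near 1 chases the most recent observation, which is exactly why a request to "set $\alpha$ to 0.95" amounts to asking for a naive forecast.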
Even today, I occasionally have to respond to requests of the type "The sales for this particular product division are highly reactive, can you please override the automated model selection process and set $\alpha$ to 0.95 for us" (Ahhh - thinking to myself - why don't y'all set it to a naive forecast then??? But I am an engineer, so I can't say things like that to a business person).
Anyway, ARIMA, which was proposed in the 1970s, was in some ways a direct response to Exponential Smoothing models. While engineers loved ES models, statisticians were horrified by them. They yearned for a model that had at least some theoretical justification to it. And that is exactly what Box and Jenkins did when they came up with ARIMA models. Instead of the ad-hoc pragmatism of ES models, the ARIMA approach was built from the ground up using sound first principles and highly rigorous theoretical considerations.
And ARIMA models are indeed very elegant and theoretically compelling. Even if you don't ever deploy a single ARIMA model to production in your whole life, I still highly recommend that anyone interested in time series forecasting dedicate some time to fully grasping the theory behind how ARIMA works, because it will give a very good understanding of how time series behave in general.
I recall once working with a very smart business forecaster who had a strong statistics background, was unhappy that our production system used exponential smoothing, and wanted us to shift to ARIMA instead. So he and I worked together to test some ARIMA models. He shared with me that in his previous jobs, there was some informal wisdom that ARIMA models should never have values of $p$, $d$, or $q$ higher than 2. Ironically, this meant that the ARIMA models we were testing were all identical, or very close, to ES models. It is not my colleague's fault that he missed this irony, though. Most introductory graduate and MBA level material on time series modeling focuses significantly or entirely on ARIMA and implies (even if it doesn't explicitly say so) that ARIMA is the be-all and end-all of statistical forecasting. This is likely a holdover from the mindset that Hyndman described among academic forecasting experts of the 70s, who were "enamored" with ARIMA. Additionally, the general framework that unifies ARIMA and ES models is a relatively recent development that isn't always covered in introductory texts, and is also significantly more involved mathematically than the basic formulations of either ARIMA or ES models (I have to confess I haven't completely wrapped my head around it yet myself).
At this point, some would point to modern tools and packages that use ARIMA and perform very well on most reasonable time series (not too noisy or too sparse), such as auto.arima() from the R forecast package or BigQuery ARIMA. These tools in fact rely on sophisticated model selection procedures which do a pretty good job of ensuring that the selected $p, d, q$ orders are optimal (BigQuery ARIMA also uses far more sophisticated seasonality and trend modeling than standard ARIMA and SARIMA models do). In other words, they are not your grandparents' ARIMA (nor the one taught in most introductory graduate texts...) and will usually generate models with low $p, d, q$ values anyway (after proper pre-processing, of course). In fact, now that I think of it, I don't recall ever using auto.arima() on a work-related time series and getting $p, d, q > 1$, although I did once get $q = 3$ using auto.arima() on the Air Passengers time series.