Loaded questions!
Using statistical significance, AIC, Cp, or BIC for model selection backfires whether forward, backward, or all-possible-subsets selection is used. fastbw exists for the better-behaved approach of model approximation, and for minor trimming of highly insignificant variables in a context where the bootstrap can properly penalize for such data dredging (see the sketch below). Using P-values, whether from anova or not, to drop or select variables has all the same problems. fastbw also exists to satisfy users who really want to walk the tightrope.
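A minimal sketch of the "bootstrap penalizes for the dredging" idea, using rms::validate with bw=TRUE so that backward elimination is repeated inside every bootstrap resample. The toy data (x1 through x5, with only x1 truly predictive) are made up for illustration; fastbw, ols, and validate are real rms functions.

```r
library(rms)

## Hypothetical toy data: five candidate predictors, only x1 matters
set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                x4 = rnorm(n), x5 = rnorm(n))
d$y <- with(d, 1 + 0.5 * x1 + rnorm(n))

fit <- ols(y ~ x1 + x2 + x3 + x4 + x5, data = d, x = TRUE, y = TRUE)

## Minor trimming of highly insignificant variables
fastbw(fit, rule = "p", sls = 0.5)

## Bootstrap validation that redoes the backward elimination in each
## resample, so the optimism estimates pay for the variable selection
validate(fit, B = 200, bw = TRUE, rule = "p", sls = 0.5)
```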
Methods that simultaneously shrink and select variables (e.g., the lasso and elastic net) make the variable selection "honest" by penalizing the result for the number of candidate features. But a serious problem still lurks. Just as with naive unpenalized variable selection, the newer methods have a low probability of finding the truly important variables, especially when collinearity exists. This is exposed by running the algorithm multiple times using the bootstrap and noticing that the list of features "selected" is highly unstable, as in the sketch below.
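A sketch of that bootstrap instability check for the lasso, assuming glmnet is available. The data-generating details are made up: two deliberately collinear predictors, so the lasso's pick between them is essentially a coin flip from resample to resample.

```r
library(glmnet)

## Hypothetical data with induced collinearity between x1 and x2
set.seed(2)
n <- 150; p <- 20
x <- matrix(rnorm(n * p), n, p)
x[, 2] <- x[, 1] + rnorm(n, sd = 0.3)
y <- x[, 1] + 0.5 * x[, 3] + rnorm(n)

## Rerun cross-validated lasso on 100 bootstrap resamples and record
## which coefficients are nonzero each time
selected <- replicate(100, {
  i  <- sample(n, replace = TRUE)
  cv <- cv.glmnet(x[i, ], y[i], alpha = 1)
  b  <- coef(cv, s = "lambda.min")[-1]   # drop the intercept
  as.numeric(b != 0)
})

## Per-variable selection frequency; values far from 0 or 1 mean the
## "selected" feature list is unstable
rowMeans(selected)
```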
The desire for parsimony generally creates a large number of problems: loss of predictive discrimination, arbitrariness, instability, and, in the presence of collinearity, variable choices that are essentially noise.
The old ridge regression approach (a quadratic penalty with no attempt at parsimony) generally gives the best predictive discrimination while helping to achieve good calibration. Here the analyst gives up on trying to find "the" variables. The model can sometimes be simplified afterwards using model approximation (pre-conditioning); a sketch follows.
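A minimal sketch of both steps, reusing the toy data frame d from the first sketch and assuming the rms pattern of choosing the quadratic penalty with pentrace and then approximating the penalized fit. The penalty grid is arbitrary, and the approximation step simply regresses the ridge fit's predictions on the predictors and trims with fastbw.

```r
library(rms)

fit <- ols(y ~ x1 + x2 + x3 + x4 + x5, data = d, x = TRUE, y = TRUE)

## Ridge: pick the quadratic penalty over an (arbitrary) grid
pt   <- pentrace(fit, seq(0, 50, by = 1))
rfit <- update(fit, penalty = pt$penalty)

## Model approximation: fit a simpler model to the ridge predictions;
## R^2 against yhat shows how little is lost by each deletion
yhat <- predict(rfit)
afit <- ols(yhat ~ x1 + x2 + x3 + x4 + x5, data = d, sigma = 1)
fastbw(afit, aics = 10000)   # force out all terms to see the sequence
```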