Please bear with me; this is going to be a little lengthy.
I have one response and 6-7 predictors, and I have been trying different models to improve my predictions. I arrived at a model with an adjusted R-squared of around 56%.
I looked for problems and found three leverage points in the data (and thankfully no outliers) that were badly affecting my prediction accuracy.
When I removed those observations one by one, my adjusted R-squared increased to around 62%.
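For readers who want to reproduce this kind of check, here is a minimal sketch in base R. The data are simulated (not the data from this question), the variable names are hypothetical, and the 2p/n cutoff is only a common rule of thumb, not a formal test.

```r
# Simulated data (hypothetical; not the data from this question)
set.seed(1)
n <- 50
x <- c(rnorm(n - 1), 10)          # the last point has an extreme x value
y <- 2 + 3 * x + rnorm(n)
fit <- lm(y ~ x)

h <- hatvalues(fit)               # leverage of each observation
p <- length(coef(fit))            # number of estimated coefficients
high_lev <- which(h > 2 * p / n)  # rule-of-thumb cutoff (an assumption)
d <- cooks.distance(fit)          # influence on the fitted coefficients

print(high_lev)                   # indices of high-leverage observations
print(round(d[high_lev], 3))      # their Cook's distances
```

Leverage only says that a point is unusual in the predictors; Cook's distance shows whether it actually moves the fit, which is why the sketch prints both.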
To get an even better model, I looked carefully at the plots and tried a log transformation of one of my predictors. The results changed dramatically (adjusted R-squared of 71%).
I felt good about it, but the output says "(42 observations deleted due to missingness)".
R (which I love for regression modelling because of its flexibility) probably removes the observations where that variable is zero or nearly zero, since log is not defined at zero.
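One way to check this conjecture is to look at what log() returns for problematic inputs. The sketch below uses made-up values, not the actual predictor. In R, log(0) is -Inf and the log of a negative number is NaN; only NaN and NA are treated as missing by is.na(), so counting each kind shows where dropped rows could come from.

```r
# Hypothetical predictor values: one negative, one zero, one NA
x <- c(-1, 0, 0.001, 1, 10, NA)

# log(-1) emits a warning and returns NaN; log(0) returns -Inf
lx <- suppressWarnings(log(x))
print(lx)

n_missing <- sum(is.na(lx))        # counts NaN and NA, but not -Inf
n_neg_inf <- sum(is.infinite(lx))  # counts exact zeros in x

print(c(missing = n_missing, neg_inf = n_neg_inf))
```

Running the same two counts on the real predictor would show how many of the dropped rows are negative or already-missing values rather than zeros.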
To overcome this issue, I applied log(x+x/2) or log(x+c), with c being a small positive number. This further improved model accuracy and reduced the number of missing observations, although some are still missing.
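The two shifts mentioned above behave differently, which a short sketch makes visible. The values and the constant `c_small` are hypothetical. Note that log(x + x/2) equals log(1.5 * x) = log(x) + log(1.5), so it only changes the intercept and is still -Inf at zero, whereas log(x + c) is finite for every x > -c.

```r
# Hypothetical non-negative predictor that includes an exact zero
x <- c(0, 0.5, 2, 10)
c_small <- 0.01                    # assumed small positive constant

shifted <- log(x + c_small)        # finite for every x > -c_small
shifted1 <- log1p(x)               # log(1 + x), accurate for small x

# log(x + x/2) is log(x) plus a constant, checked on the positive values
xp <- x[x > 0]
same_up_to_constant <- isTRUE(all.equal(log(xp + xp / 2), log(xp) + log(1.5)))

print(shifted)
print(shifted1)
print(same_up_to_constant)
print(log(0 + 0 / 2))              # still -Inf at zero
```

The choice of c is arbitrary and can change the fit, so it is worth reporting whichever value is used.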
Now my concerns are:
- Should I report the regression model with the log transformation (adjusted R-squared around 75%, but ~20 missing observations) or the previous model without the log transformation (adjusted R-squared of 62%, no missing observations)?
- Would you suggest looking for another transformation, such as beta, gamma, or Poisson, for that predictor?
- Is it OK to remove the three high-leverage observations that affect the model badly?
Thanks for being patient. Please share your valuable insights.
Thanks,
Nisha Arora