modeling a left skewed distribution


Quresh Latif

Jul 4, 2024, 4:49:01 PM
to nimble-users
I am wondering if anyone has suggestions for modeling (as elegantly as possible) left-skewed and somewhat leptokurtic data. My data are day-of-year values of observations of human occurrence on the landscape (this is for a study of recreation effects on birds). I am attaching histograms showing the distribution of observed values, the distribution of median simulated values from a gamma regression, and three of the individual simulated datasets from that gamma regression. For other datasets, gamma regression fits just fine (posterior predictive GOF p-values based on deviance ≈ 0.4), but the fit for this one is pretty terrible. There are a few covariates in the gamma regression, but it does seem that I am missing something, and scouring the world for additional covariates isn't really a viable option for my purposes.

I have read a little on the skew-normal and exponentially modified Gaussian distributions. Neither seems quite right. I imagine my primary option would be some sort of mixture of two distributions. Alternatively, I could specify a categorical or continuous latent covariate, but I worry about computational efficiency as well as taking information away from the actual covariates.

I may actually try a model that specifies a latent covariate after posting this, but I've taken to posting my thoughts in case they prove helpful to anyone and in case anyone can suggest some super elegant solution that I am missing.
Simulated_DOYs_from_gamma_regression_median.png
Simulated_DOYs_from_gamma_regression_sim2.png
Observed_DOYs.png
Simulated_DOYs_from_gamma_regression_sim1.png
Simulated_DOYs_from_gamma_regression_sim3.png

Quresh Latif

Jul 5, 2024, 11:38:45 AM
to PierGianLuca, nimble-users
Not really.

Actually, since posting my message yesterday, I've identified the reason for the funky distribution. Day-of-year (along with time-of-day) is non-linearly confounded with traffic volume. The reason is that I am averaging day-of-year across the human traffic pings in each of my sampling units, so units with more pings also have more day-of-year observations, and the average for such a unit gets pulled towards the overall population mean. In contrast, units with low traffic volume have few day-of-year observations, resulting in more heterogeneity in their averages. I think if I were to really care about traffic timing, I'd need to compile timing covariates in a manner that disentangles the artificial relationship with traffic volume. For the purposes of this particular analysis, however, I think I may just end up dropping this variable, as we don't have strong hypotheses for its effects on birds.
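
For anyone who wants to see the averaging artifact in isolation, here is a minimal sketch in Python (standard library only). Everything in it is an assumption for illustration and not the actual data: each ping's day-of-year is drawn from a single normal distribution (mean 200, SD 30), and the function name `simulate_unit_means` and the ping counts of 2 and 50 are hypothetical. Since every unit shares the same underlying timing, any difference in spread between low- and high-traffic units comes purely from how many pings get averaged; the SD of a unit mean should shrink roughly as 1/sqrt(n), so about 30/sqrt(2) ≈ 21 days versus 30/sqrt(50) ≈ 4 days here.

```python
import random
import statistics


def simulate_unit_means(ping_counts, doy_mean=200.0, doy_sd=30.0, seed=1):
    """Return one mean day-of-year per sampling unit.

    Assumption for illustration: every ping's day-of-year comes from the
    same normal distribution, so traffic volume has no real effect on
    timing. Any pattern across units is created by the averaging alone.
    """
    rng = random.Random(seed)
    unit_means = []
    for n_pings in ping_counts:
        pings = [rng.gauss(doy_mean, doy_sd) for _ in range(n_pings)]
        unit_means.append(statistics.fmean(pings))
    return unit_means


# Hypothetical design: 2000 low-traffic units (2 pings each) and
# 2000 high-traffic units (50 pings each).
low_traffic = simulate_unit_means([2] * 2000, seed=1)
high_traffic = simulate_unit_means([50] * 2000, seed=2)

# Spread of the unit-level averages in each group.
sd_low = statistics.stdev(low_traffic)
sd_high = statistics.stdev(high_traffic)

print(f"SD of unit means, 2 pings per unit:  {sd_low:.1f}")
print(f"SD of unit means, 50 pings per unit: {sd_high:.1f}")
```

Pooling units with very different ping counts then gives a covariate whose spread depends on traffic volume even though timing and volume are unrelated by construction, which is the kind of artificial relationship described above. The sketch does not try to reproduce the skew in the observed histogram.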

Ended up being another lesson pointing to the value of data exploration before throwing covariates into a model.

Quresh S. Latif
Biometrician
Bird Conservancy of the Rockies
230 Cherry St., Ste. 150, Fort Collins, CO 80521
970-482-1707 (ext. 15)
Connecting people, birds and land


On Thu, Jul 4, 2024 at 3:29 PM PierGianLuca <lu...@magnaspesmeretrix.org> wrote:
Hi Quresh,

Interesting problem. Is a nonparametric – that is, model-free – approach unfeasible?

Cheers,
Luca