What residual value can be considered a good fit?

Hao Hu

Feb 18, 2016, 2:12:50 PM
to dadi-user
Dear Ryan,

I constructed my favorite model for my two-population dataset after experimenting with quite a few other models.  If I look at the data vs. model frequency spectrum, it looks like a reasonably good fit (see attached plots). However, the residual values are usually between -20 and 20, which concerns me. 

According to the dadi manual, the residuals should be approximately normally distributed (at least for large sample sizes). A residual of -20 therefore corresponds to a one-tailed probability of about p=2.75e-89. Should I be concerned about my residuals and try to find a better model in which all residuals are smaller than 3 in absolute value? What would you consider acceptable residual values?
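For reference, the tail probability quoted above can be reproduced with a few lines of standard-library Python. This is only a sketch of the arithmetic, assuming the residuals follow a standard normal distribution; `normal_lower_tail` is a name made up for this illustration and is not part of dadi.

```python
import math

def normal_lower_tail(z):
    """P(Z <= z) for a standard normal, via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

# One-tailed probability of a residual at or below -20.
p_minus_20 = normal_lower_tail(-20.0)

# Two-sided probability of a residual beyond +/-3, for comparison.
p_beyond_3 = 2 * normal_lower_tail(-3.0)

print(f"P(Z <= -20) = {p_minus_20:.3g}")
print(f"P(|Z| >= 3) = {p_beyond_3:.3g}")
```

Using `erfc` rather than `1 - erf` keeps precision in the far tail, where the probability is far smaller than floating-point rounding error around 1.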

Thanks so much for your help,

Hao


1D-1.png
1D-2.png
2D.png

Gutenkunst, Ryan N - (rgutenk)

Feb 19, 2016, 5:00:30 PM
to dadi...@googlegroups.com
Hello Hao,

Unfortunately, I can't give a concrete answer here, because it's a judgement call.

If you were to fit unlinked data simulated from your model, you should indeed find residuals that are normally distributed. The fact that you have larger residuals suggests that your model does not fully account for the data. In particular, your model underpredicts the number of high- and low-frequency alleles. Whether that's okay or not is really up to you. My philosophy is generally to strive for the simplest possible model that explains the data well, even if it isn't perfect. You could probably add parameters (particularly more growth) to your model to better fit those entries in the frequency spectrum. But it's not clear whether that would tell you much more biologically. Often when we do that we find that the model no longer converges, suggesting that we're overfitting. We often stop when our series of increasingly complex models converges to the same biological story for what we care about, for example, the divergence time.
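To make the residual idea concrete, here is a minimal sketch in plain Python. It assumes a linear Poisson residual of the form (model - data) / sqrt(model), so that a negative value means the model underpredicts that entry. The counts are invented for illustration, and `linear_poisson_residuals` is a stand-in written for this example rather than a call into dadi.

```python
import math

def linear_poisson_residuals(model, data):
    """Per-entry residual (model - data) / sqrt(model); model entries must be > 0."""
    return [(m - d) / math.sqrt(m) for m, d in zip(model, data)]

# Hypothetical expected (model) and observed (data) counts for five
# frequency classes, from lowest to highest frequency.
model = [1000.0, 400.0, 250.0, 180.0, 150.0]
data = [1500, 410, 240, 185, 400]

residuals = linear_poisson_residuals(model, data)

# Entries whose residual would be surprising under a standard normal.
flagged = [i for i, r in enumerate(residuals) if abs(r) > 3]

for i, r in enumerate(residuals):
    print(f"entry {i}: residual = {r:+.2f}")
print("entries with |residual| > 3:", flagged)
```

In this made-up example only the first and last entries are flagged, both with negative residuals, which is the pattern of underpredicted low- and high-frequency classes described above.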

Best,
Ryan

--
Ryan Gutenkunst
Assistant Professor
Molecular and Cellular Biology
University of Arizona
phone: (520) 626-0569, office LSS 325
Latest paper: "Whole genome sequence analyses of Western Central African Pygmy hunter-gatherers reveal a complex demographic history and identify candidate genes under positive natural selection"

Hao Hu

Feb 19, 2016, 6:18:40 PM
to dadi-user
Ryan,

Thanks for the informative answer, which makes a lot of sense to me. On a side note, I am indeed concerned about over-parameterization of my model: I started 200 dadi runs with the same model specification but random starting parameters, and few of them converged. I could switch to simpler models, but the log-likelihood would decrease from my current best of -20000 to around -30000. Would it make sense for me to switch to simpler models so that the fits converge, even though the log-likelihood would decrease?
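One simple way to summarize convergence across many runs is to count how many end within a small tolerance of the best log-likelihood found. The sketch below assumes the final log-likelihood of each run has already been collected into a list. The values, the tolerance of 1.0, and the function name `count_converged` are all hypothetical choices for illustration.

```python
def count_converged(log_likelihoods, tol=1.0):
    """Return the best log-likelihood and how many runs ended within tol of it."""
    best = max(log_likelihoods)
    n_close = sum(1 for ll in log_likelihoods if best - ll <= tol)
    return best, n_close

# Hypothetical final log-likelihoods from six independent optimization runs.
runs = [-20000.3, -20000.9, -23150.0, -20000.5, -41000.2, -27500.7]

best, n_close = count_converged(runs)
print(f"best log-likelihood: {best}")
print(f"{n_close} of {len(runs)} runs within 1.0 of the best")
```

Several runs reaching essentially the same best value is reassuring. A single run far ahead of all others suggests the optimum has not been found reliably.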

Best,
Hao

Gutenkunst, Ryan N - (rgutenk)

Feb 20, 2016, 4:29:26 PM
to dadi...@googlegroups.com
That's a substantial difference in log-likelihood. I would probably stick with the more complex model, as long as it eventually converges.