Question about choosing appropriate distribution


Ruqin Ren

Dec 2, 2019, 12:00:20 AM
to Aster Analysis User Group
Hi group,

I am new to aster models, so please excuse me if this is an easy question.
I am trying to model the evolution of a cultural artifact, say a webpage. The graphical model looks like this:

                                                          
root --(Bernoulli)--> survival of a webpage (0 or 1) --(?)--> quality score of content (1 to 5) --(zero-truncated Poisson)--> pageview counts

My question is: what would be an appropriate distribution for the arrow from the binary survival indicator to the quality score of the page? The quality score is an interval variable ranging from 1 to 5 points.

Thank you very much!

Ruqin

John Stanton-Geddes

Dec 3, 2019, 12:56:58 PM
to Ruqin Ren, Aster Analysis User Group
Hi Ruqin,
This is an interesting application of aster models! As you've certainly noted, most of the current applications are in the biological world, but that doesn't mean that aster models can't and shouldn't be used elsewhere. While I started with aster in biological data, I'm now in the web analytics world and have also played with using aster models for such data. All my comments should be interpreted with caution since I haven't actively used aster models in years, but this might help get you started until someone corrects me :) 

A few questions for you to help guide the answer:

Is your quality score of content discrete or continuous? I assume it's continuous, in which case a Poisson family wouldn't work. You can see the list of families that are available to `aster` models in the documentation: https://www.rdocumentation.org/packages/aster/versions/1.0-3/topics/families (As noted there, you could construct your own by hand, though I'm not sure how "easy" this is if you aren't very sure of what you're doing.)

How is the quality score variable distributed? If you're lucky, it's roughly Gaussian and you can use `fam.normal.location`, noting that you'll need to specify the standard deviation. If it's right-skewed, `fam.negative.binomial` would likely be a better choice (it is a count family, so this assumes the score is integer-valued or rounded). If it's left-skewed (I would find it hard to believe that most websites are 'high' quality, but who am I to judge?), then I'd probably do something dumb like take the inverse and use the negative binomial, but someone here would probably have a more statistically appropriate suggestion.
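To make the "how is it distributed?" question concrete, here is a minimal sketch of a skewness check, written in plain Python rather than R so it does not depend on the `aster` package. The function name `sample_skewness` and the `scores` values are made up for illustration; they are not from Ruqin's data.

```python
from statistics import mean, pstdev


def sample_skewness(values):
    """Moment skewness: the mean of the cubed z-scores.

    Positive values indicate a right-skewed sample, negative values a
    left-skewed one, and values near zero a roughly symmetric one.
    """
    m = mean(values)
    s = pstdev(values)  # population standard deviation
    if s == 0:
        raise ValueError("skewness is undefined for constant data")
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


# Hypothetical quality scores for surviving pages (illustrative only).
scores = [1.0, 1.5, 2.0, 2.0, 2.5, 3.0, 4.5, 5.0]

skew = sample_skewness(scores)

# Count families only make sense for integer-valued data, so it is
# worth checking whether the scores are whole numbers as well.
is_discrete = all(float(v).is_integer() for v in scores)

print(f"skewness = {skew:.3f}, integer-valued = {is_discrete}")
```

A histogram of the scores among surviving pages would tell you the same thing more directly; the number is just a quick way to compare against the cases described above.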

Another comment based on experience: you say that you'll use a zero-truncated Poisson for the final node, "pageview counts". I'm not sure what your time scale is, but most websites get hundreds to millions of views, and a Poisson distribution, which forces the variance to equal the mean, would not be an appropriate fit. You'll probably want to use the negative binomial family again, as a few websites tend to be heavily visited while most have far fewer views. Even this may not be enough of a correction, and you may want to first take the log of pageview counts.
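One quick way to see whether a Poisson is plausible for the pageview node is to compare the sample variance with the sample mean, since a Poisson implies the two are equal. The sketch below is plain Python with invented numbers; `dispersion_index` and `pageviews` are hypothetical names, not part of the `aster` package.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Variance-to-mean ratio of a sample of counts.

    Roughly 1 is consistent with a Poisson; values well above 1 indicate
    overdispersion, which is where a negative binomial is usually a
    better candidate.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("dispersion index is undefined when the mean is zero")
    return variance(counts) / m


# Hypothetical pageview counts for surviving pages (illustrative only):
# most pages get few views, one page gets a very large number.
pageviews = [3, 5, 8, 12, 20, 45, 150, 2300]

ratio = dispersion_index(pageviews)
print(f"variance / mean = {ratio:.1f}")
```

Keep in mind that log-transformed counts are no longer integers, so a count family would no longer apply to the transformed variable.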

Good luck! 

John



Ruqin Ren

Dec 3, 2019, 11:10:05 PM
to John Stanton-Geddes, Aster Analysis User Group
Thank you so much for the detailed information! This is enormously helpful. I appreciate it!

Ruqin Ren

