Goodness of fit


Philipp Singer

unread,
Jun 11, 2013, 2:28:19 PM6/11/13
to powerlaw...@googlegroups.com
Hey!

As suggested in the Santa Fe paper, one can also use the p-value approach
for determining the goodness of fit. Even though you suggest that the
candidate distribution comparison is more elegant, I would like to do both.

Nevertheless, I am not quite sure how to do so. Maybe you can help me out.

Regards,
Philipp

Jeff Alstott

unread,
Jun 12, 2013, 11:26:53 AM6/12/13
to powerlaw...@googlegroups.com
Right now the Monte Carlo/bootstrapping method for calculating the probability that a real power law distribution would have created the observed data (the p-value) is not implemented. The reasons for this are two-fold:

1. I think it is insufficient and frequently unnecessary. 
2. It is sloooow to run. 

The reason for the slowness is that we generate samples of synthetic data and refit that data, and that refitting takes time. In order to estimate the p-value with much statistical power you'll need to run at least 1,000 synthetic samples. So if you have a dataset that originally took 10 seconds to fit, it will take nearly 3 hours to calculate a p-value. This problem obviously gets worse the larger your original dataset is, as the run time scales with (number_of_data_points * number_of_samples).

With all that said, it is not hard to implement. Adam Ginsburg has an implementation here:
The function is "test_pl". The important bit is lines 520-542.
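
For the curious, the Monte Carlo procedure described above can be sketched in pure Python. This is an illustrative, simplified version, not powerlaw's or Adam Ginsburg's actual code: it fixes xmin rather than re-estimating it on each synthetic sample (the full Clauset et al. procedure re-fits xmin too, and also resamples the body of the data below xmin, which is where most of the run time goes). All function names here are made up for the sketch.

```python
import math
import random

def fit_alpha(data, xmin):
    """Continuous MLE for the power-law exponent: alpha = 1 + n / sum(ln(x/xmin))."""
    tail = [x for x in data if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)

def ks_distance(data, alpha, xmin):
    """KS distance between the empirical tail and the fitted power-law CDF."""
    tail = sorted(x for x in data if x >= xmin)
    n = len(tail)
    d = 0.0
    for i, x in enumerate(tail):
        cdf = 1.0 - (x / xmin) ** (1.0 - alpha)  # theoretical CDF at x
        # empirical CDF steps from i/n to (i+1)/n at each sorted point
        d = max(d, abs(cdf - i / n), abs(cdf - (i + 1) / n))
    return d

def powerlaw_sample(n, alpha, xmin, rng):
    """Inverse-transform sampling from a continuous power law with lower bound xmin."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]

def bootstrap_pvalue(data, xmin, n_sims=100, seed=0):
    """Fraction of synthetic power-law datasets that fit *worse* than the real data."""
    rng = random.Random(seed)
    alpha = fit_alpha(data, xmin)
    d_obs = ks_distance(data, alpha, xmin)
    n = sum(1 for x in data if x >= xmin)
    worse = 0
    for _ in range(n_sims):
        synth = powerlaw_sample(n, alpha, xmin, rng)
        a = fit_alpha(synth, xmin)  # refit each synthetic sample -- the slow part
        if ks_distance(synth, a, xmin) >= d_obs:
            worse += 1
    return worse / n_sims
```

Under the Clauset et al. convention, the power-law hypothesis is ruled out when this p-value falls below 0.1. The refitting loop is exactly the (number_of_data_points * number_of_samples) cost mentioned above.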

I have now received multiple questions about the bootstrapping method, since Clauset et al. discuss it at length. I'm ambivalent about whether to implement it in powerlaw. On the one hand there is interest, but on the other hand I think it is usually distracting and will likely lead people to stop with a p-value before they've considered alternative distributions. 

With that said, if someone submits a Github pull request with a well-integrated implementation of the bootstrapping, I'll put it in.




--
You received this message because you are subscribed to the Google Groups "powerlaw-general" group.
To unsubscribe from this group and stop receiving emails from it, send an email to powerlaw-general+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.



sanjasc...@gmail.com

unread,
Mar 30, 2017, 9:45:13 AM3/30/17
to powerlaw-general
Hi Jeff,

First, many thanks for the wonderfully simple and usable package, and second, a sub-question to your answer here.

Like Philipp, I was also looking for the p-value for the goodness of fit of the power law. Could you expand on your point 1, why it is unnecessary? I have some data for which other distributions are ruled out with the values shown below, but in order to include these results in a scientific paper, I also wanted a p-value to confirm my claim that my data indeed follow a power law.

results.distribution_compare('power_law', 'lognormal_positive', normalized_ratio=1)
*********** lognormal positive **************
8.05684637454 7.82879392126e-16

and similarly:

*********** exponential **************
8.32175273739 8.66719720664e-17
*********** lognormal **************
12.4683114412 1.11156529301e-35

Basically, are you suggesting that these comparisons to other distributions are enough, and could you briefly explain why?

I am now running plpva.py by Joel Ornstein and will also try the code you suggested. However, as you say, this is very slow, especially since my datasets are ~100K points. Moreover, I have another 10 datasets I would want to evaluate, so it would be great if I could skip this calculation...

Many thanks!
Sanja

Jeff Alstott

unread,
Mar 30, 2017, 9:57:31 AM3/30/17
to powerlaw...@googlegroups.com
Hi Sanja,

Check out the PLoS ONE paper. Specifically this section:

"Practically, bootstrapping is more computationally intensive and loglikelihood ratio tests are faster. Philosophically, it is frequently insufficient and unnecessary to answer the question of whether a distribution “really” follows a power law. Instead the question is whether a power law is the best description available. In such a case, the knowledge that a bootstrapping test has passed is insufficient; bootstrapping could indeed find that a power law distribution would produce a given dataset with sufficient likelihood, but a comparative test could identify that a lognormal fit could have produced it with even greater likelihood. On the other hand, the knowledge that a bootstrapping test has failed may be unnecessary; real world systems have noise, and so few empirical phenomena could be expected to follow a power law with the perfection of a theoretical distribution. Given enough data, an empirical dataset with any noise or imperfections will always fail a bootstrapping test for any theoretical distribution. If one keeps absolute adherence to the exact theoretical distribution, one can enter the tricky position of passing a bootstrapping test, but only with few enough data [6].
Thus, it is generally more sound and useful to compare the fits of many candidate distributions, and identify which one fits the best. "
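
As a rough illustration of what the loglikelihood ratio comparison does under the hood, here is a self-contained sketch of a Vuong-style normalized ratio test between a power law and an exponential. This is not powerlaw's actual implementation, and the function names are made up; it only shows how a normalized R and a p-value like the ones in Sanja's output can arise.

```python
import math

def loglike_powerlaw(x, alpha, xmin):
    # log pdf of a continuous power law: ((alpha-1)/xmin) * (x/xmin)**(-alpha)
    return math.log((alpha - 1.0) / xmin) - alpha * math.log(x / xmin)

def loglike_exponential(x, lam, xmin):
    # log pdf of an exponential shifted to start at xmin: lam * exp(-lam*(x-xmin))
    return math.log(lam) - lam * (x - xmin)

def loglikelihood_ratio(data, alpha, lam, xmin):
    """Normalized loglikelihood ratio R and p-value, Vuong-style.
    R > 0 favours the power law, R < 0 the exponential; p is the
    chance of an |R| this large if the two fits were equally good."""
    tail = [x for x in data if x >= xmin]
    pts = [loglike_powerlaw(x, alpha, xmin) - loglike_exponential(x, lam, xmin)
           for x in tail]
    n = len(pts)
    R = sum(pts)
    mean = R / n
    sigma = math.sqrt(sum((p - mean) ** 2 for p in pts) / n)
    R_norm = R / (sigma * math.sqrt(n))          # normalized ratio
    p = math.erfc(abs(R_norm) / math.sqrt(2.0))  # two-sided normal tail
    return R_norm, p
```

Note that Sanja's numbers are consistent with this picture: a normalized ratio of ~8.3 corresponds to a p-value of ~1e-17 under a standard normal, i.e. the power law fits decisively better than the exponential.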


sanjasc...@gmail.com

unread,
Mar 30, 2017, 10:39:30 AM3/30/17
to powerlaw-general
Thanks a lot! Now it is clear, as is your decision not to implement it. I did read the paper, but somehow still missed that relevant part.

Many thanks and have a great day!
