empirical cdf

antd...@gmail.com

May 30, 2018, 3:28:41 AM

to powerlaw-general

Hello,

Thanks for developing the package; it seems to have great functionality.

I have been comparing the results of my own code, which does something very similar, to the results of powerlaw.

My fit concerns a small (<50) continuous sample, with xmin and xmax set to the min and max values present in the data, but the issue also appears in generic cases.

When I plot my results against the ones from powerlaw, there are clear discrepancies, which I traced to different values of the empirical distribution of the data.

I tried reading the source code and looked at past discussions, but could not answer my question, so here it goes:

From my understanding of the empirical CDF, I calculate it as follows: for sorted data (x_1, x_2, x_3, ..., x_n), ecdf = (1/n, 2/n, 3/n, ..., n/n = 1), where n is the size of the sample.

But to match the empirical CDF returned by powerlaw, I need to start at zero, that is, ecdf = (0, 1/n, 2/n, ..., (n-1)/n).

Running

x, y = fitparam.cdf()
print(x, y)

confirms that this is the case, as the first point in y always seems to be zero.
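For reference, the two conventions discussed here can be sketched in a few lines of NumPy. This is a minimal illustration, not powerlaw's own implementation; the function names `ecdf_inclusive` and `ecdf_exclusive` are hypothetical, and the sketch assumes the sample has no tied values.

```python
import numpy as np


def ecdf_inclusive(data):
    """ECDF as the fraction of points <= x: 1/n, 2/n, ..., 1."""
    x = np.sort(np.asarray(data, dtype=float))
    n = x.size
    return x, np.arange(1, n + 1) / n


def ecdf_exclusive(data):
    """ECDF as the fraction of points < x: 0, 1/n, ..., (n-1)/n."""
    x = np.sort(np.asarray(data, dtype=float))
    n = x.size
    return x, np.arange(n) / n
```

At every data point the two versions differ by exactly 1/n, which is the offset described above.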

Am I understanding something wrong in how the powerlaw code works (or how the ecdf should be defined in general)?

If not, then what is the reason for defining the ecdf like that?

This obviously also has consequences when xmin is left free to be determined by the code, as the ECDF is used to calculate the KS statistic.

Many thanks in advance,

Danai

Jeff Alstott

May 30, 2018, 9:03:56 AM

to powerlaw...@googlegroups.com

Thanks Danai,

Nobody agrees on how CDFs work.

If I understand your question correctly (and I may not), the issue is whether you define the CDF as X<x or X<=x. powerlaw defines it as X<x, while you're using X<=x. One of the two might be the one on Wikipedia, but in practice both are used a lot.

The greatest difficulty comes up with defining the CCDF as 1-CDF.

If CDF == X<x, then CCDF == X>=x.

If CDF == X<=x, then CCDF == X>x.

So there's no way to get away from an "or equal to". It's just a question of which edge cases cause you the most irritation. powerlaw was written so that CCDFs start with a value of 1 (and thus CDFs start with a value of 0). This is the definition used in Clauset et al. (I'm assuming because it makes a bunch of subsequent math work).

It is possible to produce different combinations of edge cases by breaking the definition of CCDF = 1-CDF, but that way lies madness.
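To make the trade-off concrete, here is a small sketch of how each CDF convention fixes the matching CCDF once CCDF = 1 - CDF is enforced. The sample values and variable names are made up for illustration, and the snippet does not call powerlaw.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])  # already sorted, no ties
n = data.size

# Convention A: CDF is the fraction of points < x, so CCDF is the fraction >= x.
cdf_lt = np.arange(n) / n          # starts at 0
ccdf_ge = 1.0 - cdf_lt             # starts at 1

# Convention B: CDF is the fraction of points <= x, so CCDF is the fraction > x.
cdf_le = np.arange(1, n + 1) / n   # ends at 1
ccdf_gt = 1.0 - cdf_le             # ends at 0

print(ccdf_ge)
print(ccdf_gt)
```

Under convention A the CCDF never reaches 0 at a data point, which is convenient on log-log axes; under convention B the last CCDF value is 0.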


antd...@gmail.com

May 30, 2018, 2:37:35 PM

to powerlaw-general

Thank you very much for the quick reply.

Yes, my question was indeed whether the ECDF is defined with X<x instead of X<=x, and what the reason behind that choice is.

Thanks for clarifying both.

If I might ask one more thing:

When I plot my ECDF for X<=x on top of the best-fit theoretical CDF of powerlaw, they coincide much better than with the ECDF(X<x).

The KS statistic is smaller too (verified by my own code, which leads to almost identical best-fit parameters for the power law).

This is to be expected, since I specify xmax as the maximum value in the dataset, so the theoretical CDF(xmax)=1 and so does the ECDF(X<=x).
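The comparison described above can be sketched as follows. The CDF here is the textbook closed form for a continuous power law p(x) proportional to x^(-alpha) truncated to [xmin, xmax], valid for alpha != 1; it is an assumption that this matches what powerlaw computes internally. The function names are hypothetical, and the distance is evaluated only at the data points, under one ECDF convention at a time.

```python
import numpy as np


def truncated_power_law_cdf(x, alpha, xmin, xmax):
    """CDF of p(x) ~ x**(-alpha) on [xmin, xmax], for alpha != 1."""
    x = np.asarray(x, dtype=float)
    e = 1.0 - alpha
    return (xmin**e - x**e) / (xmin**e - xmax**e)


def ks_distance(data, alpha, convention="inclusive"):
    """Max |ECDF - CDF| at the data points, with xmin/xmax set to the sample min/max."""
    x = np.sort(np.asarray(data, dtype=float))
    n = x.size
    theo = truncated_power_law_cdf(x, alpha, x[0], x[-1])
    if convention == "inclusive":      # fraction of points <= x
        emp = np.arange(1, n + 1) / n
    else:                              # fraction of points < x
        emp = np.arange(n) / n
    return np.max(np.abs(emp - theo))
```

With xmin and xmax fixed to the sample extremes, the theoretical CDF is 0 at the smallest point and 1 at the largest, so the "exclusive" ECDF agrees exactly at the first point and the "inclusive" ECDF agrees exactly at the last.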

I am now confused as to whether the theoretical CDF formula should somehow take into account whether one works with the X<x or X<=x 'definition' :

Naively, I would think that if I define ECDF(X<=x) then the same should apply to my theoretical CDF (and equivalently for X<x).

But our definitions of the theoretical CDF seem identical (from overplotting and checking them for many different samples; I could not spot the actual mathematical definition in the powerlaw source code), so I wonder if I am defining it wrong (since I compare it to ECDF(X<=x)).

Any insights on the above would be very welcome.

Alternatively, could you point me to the part/function of the code that computes the theoretical CDF? Does it have an option to return the values, like fit.cdf() does? I tried calling it from the Distribution object but I get AttributeErrors.

Again, thanks for your help and time!

D.

Jeff Alstott

May 30, 2018, 8:33:40 PM

to powerlaw...@googlegroups.com

Meta note: If the sample is of any appreciable size and covers any appreciable range, the difference between X<x and X<=x should be immaterial. It's literally just one number's difference. Any effect on measures of significance should be immaterial, unless there's a problem with the data. The biggest issue here is dealing with a small sample: power laws are interesting because of the statistics of their rare events, yet we need lots of data to definitively say things are power laws.

I don't quite follow your situation, I think because I don't quite follow your wording/nomenclature. However! The part where CDFs are calculated is here:

https://github.com/jeffalstott/powerlaw/blob/master/powerlaw.py#L1850

Unless there's an error, all CDFs in powerlaw are calculated with that function's logic. If you find anything else going on in powerlaw, definitely let us know, and maybe submit a pull request to fix the issue.

Thanks!
Jeff

superno...@gmail.com

May 31, 2018, 3:04:59 AM

to powerlaw-general

Hi Jeff,

Brilliant, thanks.

Sorry for my wording and thanks for trying to reply.

I managed to reconstruct (by tracing through _cdf_base, _cdf_xmin, etc.) the theoretical CDF that powerlaw calculates when both xmin and xmax are given, and confirmed that it reaches 1 at xmax.

For small samples it matches an ECDF defined with X<=x better than one defined with X<x.

This is indeed only an issue for small samples, as the difference in ECDF will go as 1/n, so with big enough n it doesn't really matter.
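A quick numerical check of the 1/n argument, as a sketch only: it draws synthetic samples from an untruncated continuous power law with xmin = 1 via NumPy's Pareto sampler, uses the true alpha rather than a fitted one, and measures how far apart the two KS distances are. The helper name `ks_gap` and the chosen alpha are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 2.5  # density ~ x**(-alpha) for x >= 1


def ks_gap(n):
    """|D(X<=x) - D(X<x)| for one synthetic sample of size n."""
    # Lomax draws shifted by 1 follow a classical Pareto with xmin = 1.
    x = np.sort(rng.pareto(alpha - 1.0, size=n) + 1.0)
    theo = 1.0 - x ** (1.0 - alpha)  # CDF of the untruncated power law
    d_incl = np.max(np.abs(np.arange(1, n + 1) / n - theo))
    d_excl = np.max(np.abs(np.arange(n) / n - theo))
    return abs(d_incl - d_excl)


for n in (20, 50, 1000, 100000):
    print(n, ks_gap(n), 1.0 / n)
```

Because the two ECDFs differ by exactly 1/n at every point, the two KS distances can never differ by more than 1/n, so the gap shrinks as the sample grows.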

For small n, it slightly affects the 'optimal' xmin found by the KS test.

I might try to implement a correction for it.

Conclusions about power law distributions should not be drawn from such small samples, but there are situations where theory predicts a power law, so it's still worth trying to estimate its parameters. Unfortunately, I'm dealing with very rare natural events, so even though both tails of the distribution are well 'sampled', the total number of data points always remains small.

Many thanks,
