How to use statsmodels.distributions.ECDF

Surfcast23

Aug 5, 2012, 9:00:20 PM
to pystat...@googlegroups.com
Hi, I am trying to calculate the empirical CDF of two samples so I can run a KS test on them. When I try to print the result of ecdf = sm.distributions.ECDF(C) I get

 <statsmodels.distributions.empirical_distribution.ECDF object at 0xbe01fec>

which leads me to believe that I am not using statsmodels.distributions.ECDF correctly. The documentation did not enlighten me much, so I would appreciate it if someone could point out my mistake. Thanks in advance!

My Code

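import numpy as np
import statsmodels.api as sm

#load arrays with data
# S and F are the paths to the two data files (not shown here)
sdss_data = np.loadtxt(S)
data = np.loadtxt(F)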
C = data[:,3]
N = sdss_data[:,0]
f_obs = sdss_data[:,1]

#Calculate Statistics
num_halo = sum(C)
avg = np.mean(C)
sigma = np.std(C)
variance = np.var(C)
N_cells = len(C)
ecdf = sm.distributions.ECDF(C)
ecdf_sdss = sm.distributions.ECDF(f_obs)
print(ecdf)

josef...@gmail.com

Aug 6, 2012, 12:18:32 AM
to pystat...@googlegroups.com
ecdf and ecdf_sdss are instances of the ECDF class; each one represents the ECDF "function" of its sample, which is why print only shows the object.

To get numbers out, you need to evaluate it at some points, e.g.

>>> import numpy as np
>>> import statsmodels.api as sm
>>> x = np.random.randn(10)
>>> xnew = np.linspace(-3, 3, 21)
>>> ecdfx = sm.distributions.ECDF(x)
>>> ecdfx
<statsmodels.distributions.empirical_distribution.ECDF object at 0x05530690>
>>> ecdfx(x)
array([ 0.7,  0.2,  0.5,  0.1,  0.6,  1. ,  0.3,  0.9,  0.4,  0.8])
>>> ecdfx(xnew)
array([ 0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0.2,  0.3,  0.5,
        0.6,  0.6,  0.7,  0.7,  0.7,  0.9,  0.9,  1. ,  1. ,  1. ])
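
For reference, the values above are just the fraction of sample points less than or equal to each evaluation point (the largest sample point maps to 1.0). A minimal NumPy-only sketch of that definition follows; the helper name `ecdf_at` and the toy sample are made up for illustration and are not part of statsmodels.

```python
import numpy as np

def ecdf_at(sample, points):
    """Fraction of `sample` values <= each entry of `points` (hypothetical helper)."""
    sample_sorted = np.sort(np.asarray(sample, dtype=float))
    # side="right" counts every sample value that is <= the query point
    counts = np.searchsorted(sample_sorted, points, side="right")
    return counts / sample_sorted.size

# toy data, chosen only to make the step heights easy to check by hand
sample = np.array([3.0, 1.0, 2.0, 2.0])
values = ecdf_at(sample, [0.5, 1.0, 2.0, 2.5, 3.0])
print(values)  # fractions 0, 0.25, 0.75, 0.75 and 1
```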

With a large sample, the ECDF is close to the normal CDF at the same points:

>>> x = np.random.randn(1000)
>>> ecdfx = sm.distributions.ECDF(x)
>>> ecdfx(xnew)
array([ 0.001,  0.005,  0.01 ,  0.019,  0.032,  0.059,  0.098,  0.16 ,
        0.26 ,  0.354,  0.48 ,  0.601,  0.729,  0.816,  0.897,  0.939,
        0.964,  0.984,  0.992,  0.997,  0.999])
>>> from scipy import stats
>>> stats.norm.cdf(xnew)
array([ 0.0013499 ,  0.00346697,  0.00819754,  0.01786442,  0.03593032,
        0.0668072 ,  0.11506967,  0.18406013,  0.27425312,  0.38208858,
        0.5       ,  0.61791142,  0.72574688,  0.81593987,  0.88493033,
        0.9331928 ,  0.96406968,  0.98213558,  0.99180246,  0.99653303,
        0.9986501 ])

If you just want the two-sample KS test, then the direct calculation of the max, as in stats.ks_2samp, might be better.
I have never tried whether the ECDF gets the correct max difference in a simple way.
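
As a rough sketch of the direct route, scipy's `stats.ks_2samp` takes the two raw samples and returns the statistic and the p-value, so no ECDF object is needed. The arrays `a` and `b` below are synthetic stand-ins (the seed, sizes and shift are arbitrary assumptions), not the data from the question.

```python
import numpy as np
from scipy import stats

# synthetic stand-in samples; replace with the two real 1-D samples
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(loc=0.5, size=200)

# the result unpacks into the KS statistic (max ECDF difference) and the p-value
statistic, pvalue = stats.ks_2samp(a, b)
print(statistic, pvalue)
```

Both inputs should be samples of observations; if one of the arrays holds binned frequencies rather than raw values, this test does not apply to it as is.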

Josef

josef...@gmail.com

Aug 6, 2012, 12:26:22 AM
to pystat...@googlegroups.com

OK, it looks like it works as in the textbook definition:

>>> y = np.random.randn(10)
>>> x = np.random.randn(10)
>>> stats.ks_2samp(x, y)
(0.5, 0.11084033741322809)
>>> xy = np.concatenate((x,y))
>>> ecdfx = sm.distributions.ECDF(x)
>>> ecdfy = sm.distributions.ECDF(y)
>>> np.max(np.abs(ecdfx(xy) - ecdfy(xy)))
0.5


Try a sample size that is not a "nice" number:

>>> y = np.random.randn(51)
>>> x = np.random.randn(51)
>>> xy = np.concatenate((x,y))
>>> ecdfx = sm.distributions.ECDF(x)
>>> ecdfy = sm.distributions.ECDF(y)
>>> stats.ks_2samp(x, y)
(0.11764705882352944, 0.84971026341538447)
>>> np.max(np.abs(ecdfx(xy) - ecdfy(xy)))
0.11764705882352949
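
Evaluating both ECDFs on the pooled sample is enough because the difference of two step functions can only change at an observed data point, so the maximum is attained at one of them. A NumPy-only sketch of the same calculation, without statsmodels, might look like this; the function name `max_ecdf_gap` and the toy inputs are hypothetical.

```python
import numpy as np

def max_ecdf_gap(x, y):
    """Largest absolute difference between the ECDFs of x and y (hypothetical helper)."""
    x_sorted = np.sort(np.asarray(x, dtype=float))
    y_sorted = np.sort(np.asarray(y, dtype=float))
    # the gap can only change at observed points, so the pooled sample suffices
    pooled = np.concatenate((x_sorted, y_sorted))
    ecdf_x = np.searchsorted(x_sorted, pooled, side="right") / x_sorted.size
    ecdf_y = np.searchsorted(y_sorted, pooled, side="right") / y_sorted.size
    return np.max(np.abs(ecdf_x - ecdf_y))

# toy inputs: the ECDFs differ by at most 0.5 (for example at the point 2)
d = max_ecdf_gap([1, 2, 3, 4], [3, 4, 5, 6])
print(d)
```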

Josef


Khary Richardson

Aug 7, 2012, 12:59:00 AM
to pystat...@googlegroups.com
Okay, I will give it another go.
--
StriperCoast SurfCasters Club
