How to use statsmodels.distributions.ECDF

Surfcast23

Aug 5, 2012, 9:00:20 PM

to pystat...@googlegroups.com

Hi, I am trying to calculate the empirical CDF of two samples so I can run a KS test on them. When I try to print the result of ecdf = sm.distributions.ECDF(C), I get

<statsmodels.distributions.empirical_distribution.ECDF object at 0xbe01fec>

which leads me to believe that I am not using statsmodels.distributions.ECDF correctly. The documentation did not enlighten me much, so I would appreciate it if someone could point out my mistake. Thanks in advance!

My Code

import numpy as np
import statsmodels.api as sm

# Load arrays with data (S and F are the input file paths, defined elsewhere)
sdss_data = np.loadtxt(S)
data = np.loadtxt(F)
C = data[:, 3]
N = sdss_data[:, 0]
f_obs = sdss_data[:, 1]

# Calculate statistics
num_halo = np.sum(C)
avg = np.mean(C)
sigma = np.std(C)
variance = np.var(C)
N_cells = len(C)
ecdf = sm.distributions.ECDF(C)
ecdf_sdss = sm.distributions.ECDF(f_obs)
print(ecdf)  # prints the object's repr, not the ECDF values

josef...@gmail.com

Aug 6, 2012, 12:18:32 AM

to pystat...@googlegroups.com

ecdf and ecdf_sdss are instances of the ECDF class; each one represents the ECDF as a callable "function".

To get values, you need to evaluate it at some points, e.g.

>>> import numpy as np
>>> import statsmodels.api as sm
>>> x = np.random.randn(10)
>>> xnew = np.linspace(-3, 3, 21)
>>> ecdfx = sm.distributions.ECDF(x)
>>> ecdfx
<statsmodels.distributions.empirical_distribution.ECDF object at 0x05530690>
>>> ecdfx(x)
array([ 0.7, 0.2, 0.5, 0.1, 0.6, 1. , 0.3, 0.9, 0.4, 0.8])
>>> ecdfx(xnew)
array([ 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.2, 0.3, 0.5,
        0.6, 0.6, 0.7, 0.7, 0.7, 0.9, 0.9, 1. , 1. , 1. ])

With a large sample, the ECDF values get close to the normal CDF:

>>> x = np.random.randn(1000)
>>> ecdfx = sm.distributions.ECDF(x)
>>> ecdfx(xnew)
array([ 0.001, 0.005, 0.01 , 0.019, 0.032, 0.059, 0.098, 0.16 ,
        0.26 , 0.354, 0.48 , 0.601, 0.729, 0.816, 0.897, 0.939,
        0.964, 0.984, 0.992, 0.997, 0.999])
>>> from scipy import stats
>>> stats.norm.cdf(xnew)
array([ 0.0013499 , 0.00346697, 0.00819754, 0.01786442, 0.03593032,
        0.0668072 , 0.11506967, 0.18406013, 0.27425312, 0.38208858,
        0.5       , 0.61791142, 0.72574688, 0.81593987, 0.88493033,
        0.9331928 , 0.96406968, 0.98213558, 0.99180246, 0.99653303,
        0.9986501 ])

If you just want the two-sample KS test, then the direct calculation of the max as in stats.ks_2samp might be better.
I never tried whether the ECDF would give the correct max difference in a simple way.
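
For the direct route mentioned above, a minimal sketch of calling stats.ks_2samp on two samples could look like this. The arrays sample_a and sample_b are made-up stand-ins for the two data columns in the original post (C and f_obs), and the seed, sizes, and shift are arbitrary assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-ins for the two samples (C and f_obs in the original post).
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=0.0, scale=1.0, size=200)
sample_b = rng.normal(loc=0.5, scale=1.0, size=150)

# ks_2samp returns the KS statistic (largest distance between the two
# empirical CDFs) followed by the p-value.
statistic, pvalue = stats.ks_2samp(sample_a, sample_b)
print(f"D = {statistic:.4f}, p-value = {pvalue:.4g}")
```

No ECDF object is needed for this; the function takes the raw samples.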

Josef

josef...@gmail.com

Aug 6, 2012, 12:26:22 AM

to pystat...@googlegroups.com

OK, it looks like it works as in the textbook definition:

>>> y = np.random.randn(10)
>>> x = np.random.randn(10)
>>> stats.ks_2samp(x, y)
(0.5, 0.11084033741322809)
>>> xy = np.concatenate((x,y))
>>> ecdfx = sm.distributions.ECDF(x)
>>> ecdfy = sm.distributions.ECDF(y)
>>> np.max(np.abs(ecdfx(xy) - ecdfy(xy)))
0.5

Trying a sample size that is not a "nice" number:

>>> y = np.random.randn(51)
>>> x = np.random.randn(51)
>>> xy = np.concatenate((x,y))
>>> ecdfx = sm.distributions.ECDF(x)
>>> ecdfy = sm.distributions.ECDF(y)
>>> stats.ks_2samp(x, y)
(0.11764705882352944, 0.84971026341538447)
>>> np.max(np.abs(ecdfx(xy) - ecdfy(xy)))
0.11764705882352949
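
To make the textbook definition explicit, here is a NumPy-only sketch of the same calculation. The helper names ecdf_values and ks_statistic are hypothetical, and the sketch assumes the usual right-continuous step function (the fraction of observations less than or equal to each point); it is not a copy of the statsmodels implementation.

```python
import numpy as np

def ecdf_values(sample, points):
    """Fraction of `sample` that is <= each value in `points`."""
    sorted_sample = np.sort(sample)
    # side="right" counts observations <= point, which gives a
    # right-continuous step function.
    counts = np.searchsorted(sorted_sample, points, side="right")
    return counts / sorted_sample.size

def ks_statistic(x, y):
    """Largest absolute gap between the two ECDFs over the pooled sample."""
    pooled = np.concatenate((x, y))
    return np.max(np.abs(ecdf_values(x, pooled) - ecdf_values(y, pooled)))

# Small made-up samples, chosen so the result is easy to check by hand.
x = np.array([0.1, 0.4, 0.7, 1.2])
y = np.array([0.5, 0.9, 1.5, 2.0])
print(ks_statistic(x, y))  # 0.5
```

Evaluating at the pooled sample points is enough because both step functions only change value at observed data points.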

Josef

Khary Richardson

Aug 7, 2012, 12:59:00 AM

to pystat...@googlegroups.com

Okay, I will give it another go.

--
StriperCoast SurfCasters Club
