Is there a fast inplace group demean?

josef...@gmail.com

Jan 21, 2016, 9:19:35 AM

to pyd...@googlegroups.com

I'm trying out various ways of handling fixed effects panel data, partially with the intention of replacing the functionality of the deprecated pandas panel data estimation.

I can use numpy, scipy.sparse or pandas.

The specific feature I'm looking at right now is repeated demeaning within groups. Since it is repeated, it should be fast and in place. Is there a way to do this with pandas?

Related: Are there any examples or notebooks for working with panel data and estimation for it in pandas?

Josef

Jeff

Jan 21, 2016, 10:06:06 AM

to PyData

Josef

is this what you are after? This is quite fast.

# (session assumes `import numpy as np` and `from pandas import DataFrame`)
In [10]: np.random.seed(1234)

In [11]: df = DataFrame({'A' : np.random.randint(0,10,size=100), 'B' : np.random.randn(100)})

In [12]: df['C'] = df['B']-df.groupby('A')['B'].transform('mean')

In [13]: df.head()
Out[13]:
   A         B         C
0  3  0.299347  0.412738
1  6  0.127277  0.161831
2  5  0.926190  0.838768
3  4  2.455240  1.855498
4  8 -0.320890  0.129203

In [14]: df[df.A==3]
Out[14]:
    A         B         C
0   3  0.299347  0.412738
21  3 -0.063758  0.049633
26  3 -1.000889 -0.887498
28  3  0.159520  0.272911
30  3  0.028340  0.141731
32  3 -0.712358 -0.598967
51  3 -0.288721 -0.175330
60  3  0.418900  0.532291
99  3  0.139099  0.252490

In [15]: df[df.A==3].B.mean()
Out[15]: -0.11339119534010732

josef...@gmail.com

Jan 21, 2016, 10:34:00 AM

to pyd...@googlegroups.com

On Thu, Jan 21, 2016 at 10:06 AM, Jeff <jeffr...@gmail.com> wrote:

Josef

is this what you are after? This is quite fast.

In [10]: np.random.seed(1234)
In [11]: df = DataFrame({'A' : np.random.randint(0,10,size=100), 'B' : np.random.randn(100)})
In [12]: df['C'] = df['B']-df.groupby('A')['B'].transform('mean')

Thanks Jeff

Yes, that works for the first part and several of the intended use cases. (I didn't know which groupby methods are fast now.)

For the iterated version, I would prefer in-place changes, something like

In [12]: df['B'] -= df.groupby('A')['B'].transform('mean')

which kind of works when I try it out, but it seems to drop other columns, so it might not really be in place.

Josef
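A minimal sketch of the in-place variant discussed above, checking that the other columns survive the update. The extra column `D` and the variable names are hypothetical and are not part of the original session. Whether pandas reuses the underlying memory is an implementation detail that this sketch does not test; it only shows that the column is overwritten on the same DataFrame.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1234)
df = pd.DataFrame({
    "A": rng.integers(0, 10, size=100),   # group labels
    "B": rng.standard_normal(100),        # values to demean
    "D": np.arange(100),                  # hypothetical extra column
})

columns_before = list(df.columns)

# Subtract each group's mean from B, overwriting B on the same DataFrame
df["B"] -= df.groupby("A")["B"].transform("mean")

# After demeaning, every group mean of B should be numerically zero
group_means = df.groupby("A")["B"].mean()
```

Here `columns_before` can be compared with `list(df.columns)` to confirm that nothing was dropped.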


Jeff

Jan 22, 2016, 12:15:41 PM

to PyData

The inplace updating works. What gets dropped?

On Thursday, January 21, 2016 at 9:19:35 AM UTC-5, Josef Pktd wrote:

josef...@gmail.com

Jan 22, 2016, 12:53:07 PM

to pyd...@googlegroups.com

On Fri, Jan 22, 2016 at 12:15 PM, Jeff <jeffr...@gmail.com> wrote:

The inplace updating works. What gets dropped?

Sorry, `UserError`: I was jumping around in the notebook and most likely used the wrong DataFrame.

Nothing is dropped when I run everything again in sequence.

Thanks

As an aside:

It looks like a two-way fixed effect can be removed in just a few iterations. Convergence is very fast in the examples I tried, with the standard firm and time effects pattern and 10% missing cells.

So users can do this easily with pandas directly (once we add the required options to OLS), assuming they don't need the details about the fixed effects.
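The iteration described above can be sketched as alternating group demeaning over the two keys until the remaining group means are negligible. This is only an illustration under stated assumptions, not the code used in the thread: the simulated panel, the function name `demean_two_way`, and the tolerance and iteration limit are all hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_firms, n_periods = 50, 20

# Full firm x time grid, then drop roughly 10% of the cells at random
panel = pd.DataFrame({
    "firm": np.repeat(np.arange(n_firms), n_periods),
    "time": np.tile(np.arange(n_periods), n_firms),
})
panel = panel[rng.random(len(panel)) > 0.1].reset_index(drop=True)

# Simulated outcome: firm effect + time effect + noise
firm_effect = rng.standard_normal(n_firms)
time_effect = rng.standard_normal(n_periods)
panel["y"] = (
    firm_effect[panel["firm"].to_numpy()]
    + time_effect[panel["time"].to_numpy()]
    + rng.standard_normal(len(panel))
)


def demean_two_way(df, col, keys=("firm", "time"), tol=1e-10, max_iter=100):
    """Alternately subtract group means for each key until they vanish."""
    values = df[col].copy()  # work on a copy; df[col] is left untouched
    for n_iter in range(1, max_iter + 1):
        largest_mean = 0.0
        for key in keys:
            means = values.groupby(df[key]).transform("mean")
            values -= means
            largest_mean = max(largest_mean, means.abs().max())
        if largest_mean < tol:
            break
    return values, n_iter


resid, n_iter = demean_two_way(panel, "y")
```

With a balanced panel a single pass over both keys is already exact; the loop matters only because the missing cells make the panel unbalanced. The returned `n_iter` shows how many passes were needed for the simulated data.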