Is there a fast inplace group demean?


josef...@gmail.com

Jan 21, 2016, 9:19:35 AM
to pyd...@googlegroups.com
I'm trying out various ways of handling fixed effects panel data, partially with the intention of replacing the functionality of the deprecated pandas panel data estimation.

I can use numpy, scipy.sparse or pandas.


The specific feature I'm looking at right now is repeated demeaning within groups. Because the demeaning is repeated many times, it should be fast and work in place. Is there a way to do this with pandas?


Related: Are there any examples or notebooks for working with panel data and estimation for it in pandas?


Josef

Jeff

Jan 21, 2016, 10:06:06 AM
to PyData
Josef

Is this what you are after? It is quite fast.

In [10]: np.random.seed(1234)

In [11]: df = DataFrame({'A' : np.random.randint(0,10,size=100), 'B' : np.random.randn(100)})

In [12]: df['C'] = df['B']-df.groupby('A')['B'].transform('mean')

In [13]: df.head()
Out[13]: 
   A         B         C
0  3  0.299347  0.412738
1  6  0.127277  0.161831
2  5  0.926190  0.838768
3  4  2.455240  1.855498
4  8 -0.320890  0.129203

In [14]: df[df.A==3]
Out[14]: 
    A         B         C
0   3  0.299347  0.412738
21  3 -0.063758  0.049633
26  3 -1.000889 -0.887498
28  3  0.159520  0.272911
30  3  0.028340  0.141731
32  3 -0.712358 -0.598967
51  3 -0.288721 -0.175330
60  3  0.418900  0.532291
99  3  0.139099  0.252490

In [15]: df[df.A==3].B.mean()
Out[15]: -0.11339119534010732
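A minimal self-contained sketch of the same group demean, with the imports spelled out and a check that every group mean of the demeaned column is zero up to rounding. The variable names `group_mean` and `max_abs_group_mean` are illustrative and not from the thread, and the values are not claimed to match the session above.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(1234)
df = pd.DataFrame({"A": rng.randint(0, 10, size=100), "B": rng.randn(100)})

# transform('mean') returns a Series aligned with df's index,
# holding the mean of each row's own group
group_mean = df.groupby("A")["B"].transform("mean")
df["C"] = df["B"] - group_mean

# After demeaning, the mean of C within every group should be ~0
max_abs_group_mean = df.groupby("A")["C"].mean().abs().max()
print(max_abs_group_mean)
```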

josef...@gmail.com

Jan 21, 2016, 10:34:00 AM
to pyd...@googlegroups.com

Thanks Jeff

Yes, that works for the first part and several of the intended use cases. (I didn't know which groupby methods are fast now.)

For the iterated version, I would prefer the inplace changes. Something like

In [12]: df['B'] -= df.groupby('A')['B'].transform('mean')

which sort of works when I try it out, but it drops the other columns, so it might not really be in place.

Josef
 


Jeff

Jan 22, 2016, 12:15:41 PM
to PyData
The inplace updating works. What gets dropped?



josef...@gmail.com

Jan 22, 2016, 12:53:07 PM
to pyd...@googlegroups.com

Sorry, `UserError`: I was jumping around in the notebook and most likely had the wrong dataframe.
Nothing is dropped when I run everything again in sequence.
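
A minimal sketch of that check, on a small made-up frame (the column `other` and its values are hypothetical). It only confirms that the augmented assignment updates `B` and keeps the remaining columns. It does not show whether pandas reuses the same memory buffer, which may depend on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [0, 0, 1, 1, 1],
    "B": [1.0, 3.0, 2.0, 4.0, 6.0],
    "other": list("vwxyz"),  # hypothetical extra column that must survive
})
columns_before = list(df.columns)

# Subtract each row's group mean from B and assign the result back to df
df["B"] -= df.groupby("A")["B"].transform("mean")

columns_after = list(df.columns)
print(columns_before == columns_after)
print(df)
```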

Thanks

As an aside:
It looks like a two-way fixed effect can be removed in just a few iterations. Convergence is very fast in the examples I tried with the standard firm-time effects pattern and 10% missing cells.
So users can easily do this with pandas directly (once we add the required options to OLS), assuming they don't need the details about the fixed effects.
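
A rough sketch of what such an iteration could look like: alternate the group demean over firms and over time periods until both sets of group means are numerically zero. The simulated panel, the tolerance, the iteration cap, and the helper `max_group_mean` are all assumptions made for illustration, not the code used for the examples mentioned above.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
n_firms, n_periods = 50, 20

# Full firm x time grid, then drop roughly 10% of the cells (unbalanced panel)
panel = pd.DataFrame({
    "firm": np.repeat(np.arange(n_firms), n_periods),
    "time": np.tile(np.arange(n_periods), n_firms),
})
panel = panel[rng.uniform(size=len(panel)) > 0.1].reset_index(drop=True)

# Simulated outcome: firm effect + time effect + noise
firm_effect = rng.randn(n_firms)
time_effect = rng.randn(n_periods)
panel["y"] = (firm_effect[panel["firm"].to_numpy()]
              + time_effect[panel["time"].to_numpy()]
              + rng.randn(len(panel)))


def max_group_mean(frame, col):
    """Largest absolute group mean of `col` over firm groups and time groups."""
    return max(frame.groupby(key)[col].mean().abs().max()
               for key in ("firm", "time"))


# Alternate the two one-way demeaning steps until both effects are removed
for n_iter in range(1, 101):
    panel["y"] -= panel.groupby("firm")["y"].transform("mean")
    panel["y"] -= panel.groupby("time")["y"].transform("mean")
    if max_group_mean(panel, "y") < 1e-10:
        break

print(n_iter, max_group_mean(panel, "y"))
```

With a fully balanced panel a single pass over both dimensions would already be enough; the iteration is only needed because the missing cells make the two demeaning steps interfere with each other.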

Josef


 