Apply groupby and transpose operations on huge dataset using modin


SELVA MUTHU KUMARAN SATHAPPAN

unread,
Oct 21, 2019, 3:59:32 AM
to modi...@googlegroups.com
Hello Everyone,

I am dealing with an input dataframe with more than 4 million rows and 30 columns.

I would like to perform certain groupby and transpose operations on the dataframe to obtain a final output dataframe.

Basically, what I am trying to do is also explained in this SO post. Based on my logic, the number of columns might grow very large, but that's fine as long as the row count is reduced.


Anyway, I am sharing the code here as well:

import pandas as pd

df = pd.DataFrame({
    'subject_id': [1,1,1,1,2,2,2,2,3,3,4,4,4,4,4],
    'readings':   ['READ_1','READ_2','READ_1','READ_3','READ_1','READ_5','READ_6','READ_8','READ_10','READ_12','READ_11','READ_14','READ_09','READ_08','READ_07'],
    'val':        [5,6,7,11,5,7,16,12,13,56,32,13,45,43,46],
})

N = 2  # every N distinct subject_ids form one chunk, dividing into two dataframes here for parallel processing
dfs = [x for _, x in df.groupby(pd.factorize(df['subject_id'])[0] // N)]

import multiprocessing as mp

def transpose_ope(df):  # this function does the transformation I want
    df_op = (df.groupby(['subject_id', 'readings'])['val']
               .describe()                          # count/mean/std/min/quartiles/max per (subject, reading)
               .unstack()                           # move readings into the columns
               .swaplevel(0, 1, axis=1)             # (stat, reading) -> (reading, stat)
               .reindex(df['readings'].unique(), axis=1, level=0))
    df_op.columns = df_op.columns.map('_'.join)     # flatten to names like 'READ_1_mean'
    df_op = df_op.reset_index()
    return df_op

def main():
    with mp.Pool(mp.cpu_count()) as pool:
        res = pool.map(transpose_ope, dfs)
    return res

if __name__ == '__main__':
    res = main()
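
One gap in the script above: the per-chunk results are collected but never combined. Assuming the final output should be a single frame, the pieces can be concatenated once the pool finishes; a minimal sketch, using the res list returned by main():

final = pd.concat(res, ignore_index=True, sort=False)  # columns are unioned; readings absent from a chunk become NaN
print(final.shape)

On the 15-row sample above this yields 4 rows (one per subject_id) and 105 columns (13 distinct readings x 8 describe statistics, plus subject_id).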



Dummy huge dataframe:

import numpy as np

df_size = int(3e7)
N = 30000000  # same as df_size
s_arr = pd.util.testing.rands_array(10, N)  # N random 10-character strings
df = pd.DataFrame(dict(subject_id=np.random.randint(1, 1000, df_size),
                       readings=s_arr,
                       val=np.random.rand(df_size)))
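
A caveat worth flagging: rands_array(10, N) draws 30 million random 10-character strings, so virtually every readings value is unique, and describe().unstack() on this dummy frame would try to build roughly 8 x 30M columns, which no tool can handle. Presumably the real data has far fewer distinct readings. A quick check:

print(df['readings'].nunique())  # ~3e7 for random strings; presumably much smaller in the real data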


I have provided a sample dataframe with de-identified records from the original dataframe, and a dummy dataframe of more than 4M records (the data itself is random).

Can someone please help? I have been trying to figure this out for the past two days but have been unable to.


Thanks
Selva

Devin Petersohn

unread,
Oct 21, 2019, 10:11:28 PM
to SELVA MUTHU KUMARAN SATHAPPAN, modin-dev
Hi Selva,

Modin doesn't yet have a parallel implementation of unstack, which is probably the issue. When you run the code, it should tell you which functions are not yet implemented.

We have an implementation plan, and unstack is on the list for the next few months. If you want to see which functions are implemented, visit the documentation page: https://modin.readthedocs.io/en/latest/UsingPandasonRay/dataframe_supported.html

Devin
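
For reference, one way to see the fallback explicitly is to capture the warnings Modin emits when an operation defaults to pandas. A minimal sketch (the tiny sample frame is illustrative, and which calls warn depends on the Modin version):

import warnings
import modin.pandas as mpd

mdf = mpd.DataFrame({'subject_id': [1, 1, 2],
                     'readings': ['READ_1', 'READ_2', 'READ_1'],
                     'val': [5.0, 6.0, 7.0]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    out = mdf.groupby(['subject_id', 'readings'])['val'].describe().unstack()

for w in caught:
    print(w.message)  # look for "defaulting to pandas"-style messages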

On Mon, Oct 21, 2019 at 4:34 PM SELVA MUTHU KUMARAN SATHAPPAN <selvasat...@gmail.com> wrote:
Hi Devin,

When I read the file using Modin pandas and performed the same operations as in my code, I got a message that it "defaults to normal pandas" because some feature is not present, or something like that.

The read_csv was faster, but I couldn't do the rest of the operations with Modin and it defaulted to normal pandas. That's the problem.

There was no error message as such; performance is the issue. Is it because Modin doesn't have all the pandas functions?

Did you try the code that I shared in the link?

Thanks
Selva 

On Tue, Oct 22, 2019, 00:13 Devin Petersohn <devin.p...@gmail.com> wrote:
Hi Selva,

I read the post you linked, but didn't see any mention of Modin. Did you get a chance to try it yet?

For transpose, you can just call `df.T` just like in pandas and it should work. The number of columns in this case shouldn't be prohibitive for completing other operations. Would you be able to tell me how it works in Modin? If there is a particular operation it hangs on, that would be very informative.
Thanks!

Devin
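
As a quick illustration of the df.T suggestion above, a sketch of the drop-in usage (the file name is hypothetical):

import modin.pandas as pd  # drop-in replacement for the pandas import

df = pd.read_csv("records.csv")  # hypothetical input; read_csv is parallelized
transposed = df.T                # same API as pandas.DataFrame.T
print(transposed.shape)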



SELVA MUTHU KUMARAN SATHAPPAN

unread,
Oct 21, 2019, 10:19:20 PM
to Devin Petersohn, modin-dev
Hi Devin, 

Thanks for the response. 

I would like to know whether you, or any of the members of the forum with your expertise, can help me make my code execute faster. I am new to Python, so I am not sure whether I am missing something. I have tried Modin, pandarallel, multiprocessing, etc.

Am I doing the parallel processing right? 

Is there any other way to do this? 

Any help would be really appreciated.


Thanks
Selva

Devin Petersohn

unread,
Oct 21, 2019, 10:26:53 PM
to SELVA MUTHU KUMARAN SATHAPPAN, modin-dev
No problem. I will let the other members chime in if they would like to offer any input.

Devin