import pandas as pd

df = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    'readings': ['READ_1', 'READ_2', 'READ_1', 'READ_3', 'READ_1', 'READ_5',
                 'READ_6', 'READ_8', 'READ_10', 'READ_12', 'READ_11',
                 'READ_14', 'READ_09', 'READ_08', 'READ_07'],
    'val': [5, 6, 7, 11, 5, 7, 16, 12, 13, 56, 32, 13, 45, 43, 46],
})
N = 2  # split the data into chunks of N subjects each for parallel processing
dfs = [x for _, x in df.groupby(pd.factorize(df['subject_id'])[0] // N)]
import multiprocessing as mp

def transpose_ope(df):  # this function does the transformation like I want
    df_op = (df.groupby(['subject_id', 'readings'])['val']
               .describe()
               .unstack()
               .swaplevel(0, 1, axis=1)
               .reindex(df['readings'].unique(), axis=1, level=0))
    df_op.columns = df_op.columns.map('_'.join)
    return df_op.reset_index()
def main():
    with mp.Pool(mp.cpu_count()) as pool:
        res = pool.map(transpose_ope, dfs)
    return res

if __name__ == '__main__':
    results = main()
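To check the pipeline end to end without multiprocessing, here is a minimal serial sketch that applies the same transformation per chunk and combines the results with `pd.concat` (columns absent from a chunk come back as NaN after alignment). The chunking logic and `transpose_ope` are the same as above; the serial loop is just a stand-in for `pool.map`.

```python
import pandas as pd

df = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4],
    'readings': ['READ_1', 'READ_2', 'READ_1', 'READ_3', 'READ_1', 'READ_5',
                 'READ_6', 'READ_8', 'READ_10', 'READ_12', 'READ_11',
                 'READ_14', 'READ_09', 'READ_08', 'READ_07'],
    'val': [5, 6, 7, 11, 5, 7, 16, 12, 13, 56, 32, 13, 45, 43, 46],
})

def transpose_ope(chunk):
    # one row per subject; one column per (reading, describe-statistic) pair
    out = (chunk.groupby(['subject_id', 'readings'])['val']
                .describe()
                .unstack()
                .swaplevel(0, 1, axis=1)
                .reindex(chunk['readings'].unique(), axis=1, level=0))
    out.columns = out.columns.map('_'.join)
    return out.reset_index()

N = 2  # subjects per chunk
chunks = [x for _, x in df.groupby(pd.factorize(df['subject_id'])[0] // N)]

# serial stand-in for pool.map; concat aligns the differing column sets
combined = pd.concat([transpose_ope(c) for c in chunks],
                     ignore_index=True, sort=False)
```

Each chunk only produces columns for the readings it actually contains, so the combined frame has the union of all `<reading>_<stat>` columns, with NaN where a subject's chunk never saw that reading.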
**Dummy huge dataframe**
import numpy as np

df_size = int(3e7)  # 30 million rows
s_arr = pd.util.testing.rands_array(10, df_size)
df = pd.DataFrame(dict(subject_id=np.random.randint(1, 1000, df_size),
                       readings=s_arr,
                       val=np.random.rand(df_size)))
Hi David,

When I read the file using Modin pandas and performed the same operations as in my code, I got a message that it "defaults to normal pandas" because some feature is not present, or something like that. The read_csv was faster, but I couldn't do the rest of the operations with Modin and it defaulted to normal pandas. That's the problem: there is no error message as such; performance is the issue. Is it because Modin doesn't have all the pandas functions? Did you try the code that I shared in the link?

Thanks,
Selva

On Tue, Oct 22, 2019, 00:13 Devin Petersohn <devin.p...@gmail.com> wrote:

Hi Selva,

I read the post you linked, but didn't see any mention of Modin. Did you get a chance to try it yet?

For transpose, you can just call `df.T` just like in pandas and it should work. The number of columns in this case shouldn't be prohibitive for completing other operations. Would you be able to tell me how it works in Modin? If there is a particular operation that it is hanging on, that will be very informative.

Thanks!
Devin
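For reference, the `df.T` call Devin mentions behaves identically in plain pandas; with Modin the only change would be swapping the import for `import modin.pandas as pd` (and, as noted above, operations Modin does not implement fall back to normal pandas with a warning). A small pandas-only illustration of the transpose:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# rows become columns and vice versa; the original column
# labels 'a' and 'b' become the index of the transposed frame
t = df.T
```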
On Mon, Oct 21, 2019 at 12:59 AM SELVA MUTHU KUMARAN SATHAPPAN <selvasat...@gmail.com> wrote:

Hello Everyone,

I am dealing with an input dataframe of more than 4 million rows and 30 columns. I would like to perform certain groupby and transpose operations on the dataframe to obtain a final output dataframe. What I am trying to do is also explained in this SO post: https://codereview.stackexchange.com/questions/231078/how-to-make-my-code-execute-faster-on-big-data. Based on my logic, the number of columns might increase to an unimaginable level, but that's fine as long as the row count gets reduced.

I have provided a sample dataframe with de-identified records from the original dataframe, and a dummy dataframe of more than 4M records (but the data is random). Can someone please help? I have been trying to figure this out for the past two days but am unable to do it.

Thanks,
Selva