Pivoting a large dataset (15 GB)

2,294 views

Enrico Bergamini

Nov 13, 2016, 19:00:49
to PyData
Hi everybody, 
I'm trying to create a pivot table for some data I collected. Here's a post with the details, the code I've been using, and the shape of the data I'm dealing with: https://medium.com/@enricobergamini/creating-non-numeric-pivot-tables-with-python-pandas-7aa9dfd788a7#.a0yx47lqk

Now I have to scale up the size of the datasets, up to a 15 GB dataset, and perform the same operation. Loading the CSV obviously returns a memory error.
Is there a way to iterate through the dataset so that it doesn't run out of memory? Would reading with chunksize=n help me, or would it corrupt the output by splitting in the middle of an 'id' column and creating two or more rows for the same id?

Thank you all in advance,
Enrico
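
For context on the option Enrico asks about: read_csv's chunksize argument makes pandas return an iterator of DataFrames instead of loading the whole file at once. A minimal sketch, with a placeholder file name and chunk size:

import pandas as pd

# iterate over the large file in pieces of up to 500,000 rows;
# 'data.csv' and the chunk size are placeholders
for chunk in pd.read_csv('data.csv', chunksize=500_000):
    # each 'chunk' is an ordinary DataFrame covering a slice of the file
    print(len(chunk))

The pitfall Enrico raises is real, though: a chunk boundary can split a run of rows that share the same id, which is what the reply below works around.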

Amol Sharma

Nov 20, 2016, 12:59:34
to pyd...@googlegroups.com

--
Thanks and Regards,
Amol Sharma



Pietro Battiston

Nov 20, 2016, 14:17:13
to pyd...@googlegroups.com
On Fri, 11/11/2016 at 02:11 -0800, Enrico Bergamini wrote:
Hi Enrico,

yes, in your case chunksize=n does risk creating two or more rows for
the same id. What you could do, if the ids in the original file are
ordered, is (see the sketch below):
1) read a chunk
2) look at the id in its last row
3) process all of its rows except those carrying that id (which will
be the last ones)
4) read another chunk and concatenate it to the rows left over from the
previous one
5) if the original file is not finished, restart from 2); otherwise
process what's left (the last id)
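
A minimal sketch of the steps above, assuming the file really is sorted by the 'id' column; the file name, column name, chunk size and the process() placeholder are illustrative, not Enrico's actual pivot code:

import pandas as pd

def process(frame):
    # placeholder for the real per-id work (e.g. the pivot): every id
    # this receives is guaranteed to be complete
    print(frame['id'].nunique(), 'ids processed')

leftover = None
# 1) read the file one chunk at a time
for chunk in pd.read_csv('big_file.csv', chunksize=100_000):
    # 4) prepend the rows held back from the previous chunk
    if leftover is not None:
        chunk = pd.concat([leftover, chunk], ignore_index=True)
    # 2) the id in the last row may continue into the next chunk
    last_id = chunk['id'].iloc[-1]
    # 3) process everything except the rows carrying that last id
    complete = chunk[chunk['id'] != last_id]
    if not complete.empty:
        process(complete)
    leftover = chunk[chunk['id'] == last_id]
# 5) the file is over: process what is left (the last id)
if leftover is not None:
    process(leftover)

Memory use stays around one chunk plus the rows of a single id at a time, rather than the whole file.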

Pietro