Pivoting a large dataset (15 GB)

2,294 views

Enrico Bergamini

Nov 13, 2016, 19:00:49
to PyData
Hi everybody, 
I'm trying to create a pivot table for some data I collected. Here's a post with the details, the code I've been using, and the shape of the data I'm dealing with: https://medium.com/@enricobergamini/creating-non-numeric-pivot-tables-with-python-pandas-7aa9dfd788a7#.a0yx47lqk

Now I have to scale up the size of the datasets, up to a 15 GB dataset, and perform the same operation. Loading the CSV obviously returns a memory error.
Is there a way to iterate through the dataset so that it doesn't run out of memory? Would reading with chunksize=n help me, or would it corrupt the output by splitting in the middle of an 'id' column and creating two or more rows for the same id?

Thank you all in advance,
Enrico
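
For context on the option Enrico asks about: read_csv's chunksize argument makes pandas return an iterator of DataFrames instead of loading the whole file at once. A minimal sketch, with a placeholder file name and chunk size:

import pandas as pd

# iterate over the large file in pieces of up to 500,000 rows;
# 'data.csv' and the chunk size are placeholders
for chunk in pd.read_csv('data.csv', chunksize=500_000):
    # each 'chunk' is an ordinary DataFrame covering a slice of the file
    print(len(chunk))

The pitfall Enrico raises is real, though: a chunk boundary can split a run of rows that share the same id, which is what the reply below works around.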

Amol Sharma

Nov 20, 2016, 12:59:34
to pyd...@googlegroups.com

--
Thanks and Regards,
Amol Sharma



Pietro Battiston

Nov 20, 2016, 14:17:13
to pyd...@googlegroups.com
On Fri, 11/11/2016 at 02:11 -0800, Enrico Bergamini wrote:
Hi Enrico,

yes, in your case chunksize=n does risk creating two or more rows for
the same id. What you could do, if the ids in the original file are
ordered, is (see the sketch below):
1) read a chunk
2) look at the id in its last row
3) process all of its rows except those carrying that id (which will
be the last ones)
4) read another chunk and concatenate it to the rows left over from the
previous one
5) if the original file is not finished, restart from 2); otherwise
process what's left (the last id)
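
A minimal sketch of the steps above, assuming the file really is sorted by the 'id' column; the file name, column name, chunk size and the process() placeholder are illustrative, not Enrico's actual pivot code:

import pandas as pd

def process(frame):
    # placeholder for the real per-id work (e.g. the pivot): every id
    # this receives is guaranteed to be complete
    print(frame['id'].nunique(), 'ids processed')

leftover = None
# 1) read the file one chunk at a time
for chunk in pd.read_csv('big_file.csv', chunksize=100_000):
    # 4) prepend the rows held back from the previous chunk
    if leftover is not None:
        chunk = pd.concat([leftover, chunk], ignore_index=True)
    # 2) the id in the last row may continue into the next chunk
    last_id = chunk['id'].iloc[-1]
    # 3) process everything except the rows carrying that last id
    complete = chunk[chunk['id'] != last_id]
    if not complete.empty:
        process(complete)
    leftover = chunk[chunk['id'] == last_id]
# 5) the file is over: process what is left (the last id)
if leftover is not None:
    process(leftover)

Memory use stays around one chunk plus the rows of a single id at a time, rather than the whole file.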

Pietro