Memory Error with pandas processing big data


jesus s.h

Jun 18, 2014, 2:02:27 AM
to panda-pro...@googlegroups.com
Hi,

I am having a problem with memory usage (32-bit Python on 64-bit Windows, 8 GB RAM).

I am reading data from a fairly big PyTables table (22 million rows) and using pandas to calculate statistics (mean, std), and I am getting a MemoryError.

I am not an expert with Python, and even less with pandas, so I do not know how to read the data from the PyTables file while avoiding this MemoryError, since I need all the data together to calculate statistics like the mean or std.

As an example, here is one of the scripts I am using, which works well with small tables:


#####################################################
import pandas as pd
from tables import openFile  # PyTables 2.x API


def getPageStats(pathToH5, pages, versions, sheets):
    with openFile(pathToH5, 'r') as f:
        tab = f.getNode("/pageTable")

        # Dicts are used only for fast membership tests below.
        dversions = dict((i, None) for i in versions)
        dsheets = dict((i, None) for i in sheets)
        dpages = dict((i, None) for i in pages)

        # Note: each list comprehension materialises every matching row
        # in memory at once before the DataFrame is built.
        df = pd.DataFrame([[row['page'], row['index0'], row['value0']]
                           for row in tab.where('(firstVersion == 0) & (ok == 1)')
                           if row['version'] in dversions
                           and row['sheetNum'] in dsheets
                           and row['pages'] in dpages],
                          columns=['page', 'index0', 'value0'])

        df2 = pd.DataFrame([[row['page'], row['index1'], row['value1']]
                            for row in tab.where('(firstVersion == 1) & (ok == 1)')
                            if row['version'] in dversions
                            and row['sheetNum'] in dsheets
                            and row['pages'] in dpages],
                           columns=['page', 'index1', 'value1'])

        for i in dpages:
            m10 = df.loc[df['page'] == i]['index0'].mean()
            s10 = df.loc[df['page'] == i]['index0'].std()

            m20 = df.loc[df['page'] == i]['value0'].mean()
            s20 = df.loc[df['page'] == i]['value0'].std()

            m11 = df2.loc[df2['page'] == i]['index1'].mean()
            s11 = df2.loc[df2['page'] == i]['index1'].std()

            m21 = df2.loc[df2['page'] == i]['value1'].mean()
            s21 = df2.loc[df2['page'] == i]['value1'].std()

            yield (i, m10, s10), (i, m11, s11), (i, m20, s20), (i, m21, s21)

#####################################################


I have been reading some of the pandas documentation and the cookbook, but I do not yet get how I should work when the data is stored in a big PyTables file and needs to be processed.


I also know about the memory limitations of 32-bit Python (a 32-bit process on Windows can normally address only about 2 GB), but I still think this should be able to work on a 32-bit machine.


Any help would be appreciated.


Thanks.

wm higgins

Jun 18, 2014, 2:20:11 AM
to panda-pro...@googlegroups.com
jesus:

I think you're mixing up the pandas data analysis library (http://pandas.pydata.org/) with the PANDA data library for newsrooms (http://pandaproject.net/).

This group is for the newsroom data library. Sorry for the confusion.

I do feel your pain, as I've been wrestling with a large data set as well. Is there any way to break your data up into smaller sets for processing?
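
Something along these lines, maybe: a minimal, untested sketch of reading the table in fixed-size slices, so only one slice is ever in memory at a time. It assumes the same PyTables 2.x openFile/getNode API and the /pageTable node from your script; the chunk size is just a guess to tune.

#####################################################
import pandas as pd
from tables import openFile  # PyTables 2.x API, as in the original script

CHUNK = 500000  # rows per slice; tune to whatever fits in your memory


def iterChunks(pathToH5):
    # Yield /pageTable as a sequence of small DataFrames, one slice at a time.
    with openFile(pathToH5, 'r') as f:
        tab = f.getNode("/pageTable")
        for start in range(0, tab.nrows, CHUNK):
            # Table.read returns a NumPy record array for just this slice.
            recs = tab.read(start=start, stop=start + CHUNK)
            yield pd.DataFrame.from_records(recs)
#####################################################

Each chunk can then be filtered and reduced to small aggregates (counts, sums) before moving on to the next one, so the full data set never has to be held at once.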

jesus s.h

Jun 18, 2014, 2:34:53 AM
to panda-pro...@googlegroups.com

Hey wm higgins,

Sorry for the mistake and thanks for the answer.

About breaking up the data, I have two problems. First, of course, I am inexperienced, so I do not know how to do that; any example or recommendation would be great. Second, even if I can split the data into small pieces, statistics like the standard deviation are not easy to calculate piece by piece.

What do you think about that? How do you deal with very large data sets?

Also, could you recommend any reading or pages where I can find information or examples about dealing with very large data sets in Python, pandas, etc.?

Thanks again, any help is always welcome!

w higgins

Jun 18, 2014, 12:03:48 PM
to panda-pro...@googlegroups.com
I'm no expert myself, but I might try isolating the column you need to work on and splitting it out as a single-column table, which might be small enough to work on in memory, rather than holding the whole data set.
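
On the std point from your last message: the mean and std can in fact be combined across chunks from running sums, so you never need all the values in memory at once. A rough, untested sketch below; the /pageTable node and the 'index0' column name are just taken from your script, and the chunk size is arbitrary.

#####################################################
import math
from tables import openFile  # PyTables 2.x API, as in the original script


def columnStats(pathToH5, column, chunk=500000):
    # Mean and sample std of one column, read in slices.
    n = 0
    s = 0.0   # running sum
    ss = 0.0  # running sum of squares
    with openFile(pathToH5, 'r') as f:
        tab = f.getNode("/pageTable")
        for start in range(0, tab.nrows, chunk):
            # field=... pulls just this one column for the slice,
            # which is far smaller than reading whole rows.
            vals = tab.read(start=start, stop=start + chunk,
                            field=column).astype('float64')
            n += len(vals)
            s += vals.sum()
            ss += (vals * vals).sum()
    mean = s / n
    # Sample std (ddof=1), matching what pandas' .std() computes.
    # Caveat: this sum-of-squares formula can lose precision on badly
    # scaled data; Welford's online algorithm is the robust alternative.
    std = math.sqrt((ss - s * s / n) / (n - 1))
    return mean, std

# usage: mean, std = columnStats(pathToH5, 'index0')
#####################################################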



jesus s.h

Jun 19, 2014, 1:46:26 AM
to panda-pro...@googlegroups.com
Thanks for the help!