pandas merge memory error?


yarden

Dec 4, 2012, 11:50:41 AM
to pystat...@googlegroups.com
Hi all,

I am using pandas-0.9.0-py2.6-linux-x86_64.egg and have been doing a merge operation on a bunch of smallish tables parsed by read_table (~100k rows by 15 or so columns, all containing strings and integers).  

Sometimes I get a memory error from merge, where the trace looks like this:

 File "/home/yarden/.local/lib/python2.6/site-packages/pandas-0.9.0-py2.6-linux-x86_64.egg/pandas/tools/merge.py", line 33, in merge
   return op.get_result()
 File "/home/yarden/.local/lib/python2.6/site-packages/pandas-0.9.0-py2.6-linux-x86_64.egg/pandas/tools/merge.py", line 190, in get_result
   result_data = join_op.get_result()
 File "/home/yarden/.local/lib/python2.6/site-packages/pandas-0.9.0-py2.6-linux-x86_64.egg/pandas/tools/merge.py", line 663, in get_result
   res_blk = self._get_merged_block(klass_blocks)
 File "/home/yarden/.local/lib/python2.6/site-packages/pandas-0.9.0-py2.6-linux-x86_64.egg/pandas/tools/merge.py", line 674, in _get_merged_block
   self.result_items, copy=self.copy)
 File "/home/yarden/.local/lib/python2.6/site-packages/pandas-0.9.0-py2.6-linux-x86_64.egg/pandas/tools/merge.py", line 760, in reindex_block
   axis=axis)
 File "/home/yarden/.local/lib/python2.6/site-packages/pandas-0.9.0-py2.6-linux-x86_64.egg/pandas/core/internals.py", line 114, in reindex_axis
   fill_value=fill_value)
 File "/home/yarden/.local/lib/python2.6/site-packages/pandas-0.9.0-py2.6-linux-x86_64.egg/pandas/core/common.py", line 365, in take_fast
   axis=axis, fill_value=fill_value)
 File "/home/yarden/.local/lib/python2.6/site-packages/pandas-0.9.0-py2.6-linux-x86_64.egg/pandas/core/common.py", line 326, in take_2d
   out = np.empty(out_shape, dtype=arr.dtype)
MemoryError


I'm not sure how to start debugging this. Are there known issues with memory and merge that might be fixed in a later version? I.e. could this be fixed by an upgrade? If not, are there any hints on how to best debug this? I'm doing this on a computer with roughly 64 GB of RAM.  Thanks.

yarden

Dec 4, 2012, 2:33:57 PM
to pystat...@googlegroups.com
This is probably better suited for GitHub, so I created a self-contained test that reproduces this MemoryError and described it in an issue on GitHub, available here (#2427):



My GitHub issue is reproduced below. 

==
I'm performing the following set of consecutive `merge` operations in pandas on a few small tables (100k rows by ~15 columns) that are read from file using `read_table`. Although the tables are small and I run it on a machine with over 60 GB of RAM, I consistently get `MemoryError` from the `merge` operation, and the traceback points to:

```
  File "/home/yarden/.local/lib/python2.6/site-packages/pandas-0.9.0-py2.6-linux-x86_64.egg/pandas/tools/merge.py", line 437, in _get_join_indexers
    return join_func(left_group_key, right_group_key, max_groups)
  File "join.pyx", line 105, in pandas.lib.left_outer_join (pandas/src/tseries.c:118232)
  File "join.pyx", line
```

The following zip contains the short script (`make_table.py`) and the necessary text files to reproduce the bugs:


To generate the error, simply run: 

```
python make_table.py
```

which for me yields the output:


```
Merging known to Ensembl...
  - Merge took 0.48 secs
Merging kgXref...
Traceback (most recent call last):
  File "make_table.py", line 137, in <module>
    main()
  File "make_table.py", line 124, in main
    right_on=["kgID"])
  File "/home/yarden/.local/lib/python2.6/site-packages/pandas-0.9.0-py2.6-linux-x86_64.egg/pandas/tools/merge.py", line 33, in merge
    return op.get_result()
  File "/home/yarden/.local/lib/python2.6/site-packages/pandas-0.9.0-py2.6-linux-x86_64.egg/pandas/tools/merge.py", line 180, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "/home/yarden/.local/lib/python2.6/site-packages/pandas-0.9.0-py2.6-linux-x86_64.egg/pandas/tools/merge.py", line 255, in _get_join_info
    sort=self.sort, how=self.how)
  File "/home/yarden/.local/lib/python2.6/site-packages/pandas-0.9.0-py2.6-linux-x86_64.egg/pandas/tools/merge.py", line 437, in _get_join_indexers
    return join_func(left_group_key, right_group_key, max_groups)
  File "join.pyx", line 105, in pandas.lib.left_outer_join (pandas/src/tseries.c:118232)
  File "join.pyx", line 190, in pandas.lib._get_result_indexer (pandas/src/tseries.c:119829)
MemoryError
```


I'm happy to provide any information that might be needed and/or test code that might fix this. Any insights on what might be causing this will be greatly appreciated. 

Jeff Hsu

Dec 5, 2012, 12:39:36 AM
to pystat...@googlegroups.com
Try dropping the NAs from kgID. Also, I don't think your first couple of merges work, since main_table is using the default index rather than the name (transcript ID) column. Set index_col when you load both main_table and the subsequent tables.

I'm a big fan of MISO by the way.  

yarden

Dec 5, 2012, 9:22:36 AM
to pystat...@googlegroups.com
Thanks very much Jeff, I appreciate it! Dropping the NAs was key.
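
For anyone hitting the same error, here is a minimal sketch of why dropping the NAs can matter. pandas matches null join keys against each other, so rows with a missing key on both sides multiply into a cross product. The tables, column values, and the use of `on="kgID"` below are made up for illustration (the real script's `left_on` column is not shown in this thread), and the code uses current pandas rather than 0.9.0.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the tables in this thread; the real data is not shown.
main_table = pd.DataFrame({
    "name": ["tx0", "tx1", "tx2", "tx3", "tx4", "tx5"],
    "kgID": ["uc001", "uc002", np.nan, np.nan, np.nan, np.nan],
})
kgXref = pd.DataFrame({
    "kgID": ["uc001", "uc002", np.nan, np.nan, np.nan],
    "geneSymbol": ["A", "B", "C", "D", "E"],
})

# Null keys are matched against each other: the 4 NaN rows on the left each
# pair with the 3 NaN rows on the right, adding 12 rows to the result.
merged_with_na = pd.merge(main_table, kgXref, how="left", on="kgID")

# Dropping rows with a missing key on the right side removes the blow-up;
# a left merge still keeps every row of main_table.
kgXref_clean = kgXref.dropna(subset=["kgID"])
merged_clean = pd.merge(main_table, kgXref_clean, how="left", on="kgID")

print(len(merged_with_na), len(merged_clean))
```

With tens of thousands of missing keys on each side, the same effect would ask for billions of result rows, which would be consistent with a `MemoryError` raised while the join indexers are being built.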

It seems to me that the first few merges do work though, because I set their headers so that the tables have overlapping column names, and I think the default behavior is to merge on those shared columns, even when there's no index set. E.g.

    print "Before merge: "
    print main_table.set_index("name").ix["ENST00000516583"]
    main_table = pandas.merge(main_table, ensGene_to_names,
                              # try left index
                              how="left")
    t2 = time.time()
    print "  - Merge took %.2f secs" %(t2 - t1)
    print "After: "
    print main_table.set_index("name").ix["ENST00000516583"]

Reveals that the info is added to the table.
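
A small self-contained sketch of that default, using made-up rows and column names (only `name`, `main_table`, and `ensGene_to_names` come from the snippet above). When neither `on` nor the index options are given, `pandas.merge` joins on the columns the two frames have in common. It is written for current pandas, so it uses `.loc` where the snippet above uses `.ix`.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables; the values are invented.
main_table = pd.DataFrame({
    "name": ["ENST1", "ENST2", "ENST3"],
    "chrom": ["chr1", "chr2", "chr3"],
})
ensGene_to_names = pd.DataFrame({
    "name": ["ENST1", "ENST3"],
    "gene_name": ["G1", "G3"],
})

# No `on`, `left_on`, or index arguments: the shared column "name" is the key.
merged = pd.merge(main_table, ensGene_to_names, how="left")

# Naming the key explicitly gives the same result and documents the intent.
explicit = pd.merge(main_table, ensGene_to_names, how="left", on="name")

print(merged.set_index("name").loc["ENST1"])
```

Passing `on` explicitly is a reasonable habit here, since an accidental second shared column would otherwise silently become part of the join key.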

Jeff Hsu

Dec 5, 2012, 10:40:06 AM
to pystat...@googlegroups.com
Ah whoops, sorry you're right.  