merge_asof doc: duplicate handling, index preservation

868 views
Skip to first unread message

RobinFiveWords

unread,
Sep 22, 2016, 10:42:37 PM9/22/16
to PyData
For the pd.merge_asof function, I believe that when right contains duplicates for the key, the values returned are from the last row, as if right.drop_duplicates(key, keep='last'). I haven't dived deeply enough to understand where this occurs, but I would have expected getting back either a one-to-many merge (rarely desired here, I'm sure) or the first row found (like Excel). Of course, there's no reason to want duplicate rows in right. What would be appropriate here, documenting the behavior and seeing if there's a need to clean it up down the road?

Also, in some cases, sorting left on the key may have no relevance, and it can be useful to return the result with the original left.index. (The merge wipes out the original index.) One way to do this, assuming left and right are not already sorted:

left = left.sort_values(...).reset_index()  # adds 'index' column to left
right = right.sort_values(...)
result = (pd.merge_asof(left, right, ...)
          .set_index('index')
          .sort_index())

Would it be helpful to include an example like this in the documentation?

I learned about merge_asof this morning. I'd just written https://github.com/RobinFiveWords/merge_closest (also available in PyPI), using np.searchsorted(), indices, and slices, to handle the simplest case; it’s basically a port of Excel’s VLOOKUP when the last argument is TRUE. Naturally, the pandas function seems to be faster, except in the case where the original index is desired, at which point merge_closest seems to be faster than the wrapped version of merge_asof above. merge_closest may also be a little more intuitive for Excel users new to Python; it doesn't require either DataFrame to be sorted.

Tim Michelsen

unread,
Sep 23, 2016, 4:26:27 PM9/23/16
to pyd...@googlegroups.com
> Would it be helpful to include an example like this in the documentation?
+1

I had a similar issue and needed to find out to sort first...


Christopher Aycock

unread,
Sep 30, 2016, 3:03:18 PM9/30/16
to PyData
On Thursday, September 22, 2016 at 10:42:37 PM UTC-4, RobinFiveWords wrote:
For the pd.merge_asof function, I believe that when right contains duplicates for the key, the values returned are from the last row...

On Friday, September 23, 2016 at 4:26:27 PM UTC-4, Timmie wrote:

I had a similar issue and needed to find out to sort first...

 The docstring lists both of these at the top:

For each row in the left DataFrame, we select the last row in the right
DataFrame whose 'on' key is less than or equal to the left's key. Both
DataFrames must be sorted by the key.

Christopher Aycock

unread,
Sep 30, 2016, 3:03:33 PM9/30/16
to PyData
Regarding indices instead of a regular column to join on, this something I have requested too:


And regarding pd.merge_asof() vs np.searchsorted(), the performance will be similar when matching only on timestamp. When matching on an additional key with the by parameter, pd.merge_asof() will be much faster than grouping the DataFrames and running np.searchsorted() iteratively.
Reply all
Reply to author
Forward
0 new messages