For the
pd.merge_asof function, I believe that when
right contains duplicates for the key, the values returned are from the last row, as if
right.drop_duplicates(key, keep='last'). I haven't dived deeply enough to understand where this occurs, but I would have expected getting back either a one-to-many merge (rarely desired here, I'm sure) or the first row found (like Excel). Of course, there's no reason to want duplicate rows in right. What would be appropriate here, documenting the behavior and seeing if there's a need to clean it up down the road?
Also, in some cases, sorting left on the key may have no relevance, and it can be useful to return the result with the original left.index. (The merge wipes out the original index.) One way to do this, assuming left and right are not already sorted:
left = left.sort_values(...).reset_index() # adds 'index' column to left
right = right.sort_values(...)
result = (pd.merge_asof(left, right, ...)
.set_index('index')
.sort_index())
Would it be helpful to include an example like this in the documentation?
I learned about
merge_asof this morning. I'd just written
https://github.com/RobinFiveWords/merge_closest (also available in PyPI)
, using np.searchsorted(), indices, and slices, to handle the simplest case; it’s basically a port of Excel’s VLOOKUP when the last argument is TRUE. Naturally, the pandas function seems to be faster, except in the case where the original index is desired, at which point merge_closest seems to be faster than the wrapped version of merge_asof above. merge_closest may also be a little more intuitive for Excel users new to Python; it doesn't require either DataFrame to be sorted.