How can I avoid Pandas merge creating duplicates


Xtian Simon

unread,
May 14, 2017, 6:43:01 PM5/14/17
to PyData
I've been reconciling sales data between two systems using Excel's Power Query, but this process is repetitive, and so ripe for scripting.

I've been working on using Pandas with pandas.merge():

    sales_03_19 = pd.merge(df1, df2, how="inner", left_on='Subtotal', right_on='Net Total' )

The problem is that I can't seem to suppress the duplicated entries. If `df1` has shape (40, 5) and `df2` has shape (38, 6) (shown in bold), I expect the result (using "inner") to be only the subset of matched records (something < 40). BUT I'm getting a resulting df with shape (60, 11), with rows like those shown below.

These are sales transactions. A transaction in the left `df1` shows up in the right `df2`, and each transaction should have exactly one corresponding match. Duplicates within one system are repeat sales, not duplicated entries, so I can't just drop duplicates; they may be legitimate sales. What's worse, some systems don't have a timestamp, only a date stamp (see below: the left `df1` has just a date, while the right `df2` has date and time).

How can I get Pandas to act more like Power Query and not create duplicated entries during merge()?
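To illustrate with toy data (made-up numbers, not the real files): an inner merge on a key that has repeats on both sides produces every left/right pairing, which is where the extra rows come from.

```python
import pandas as pd

# Hypothetical toy frames: the amount 15.0 appears twice on each side.
df1 = pd.DataFrame({'ID': [1, 3, 5], 'Subtotal': [10.0, 15.0, 15.0]})
df2 = pd.DataFrame({'ID': [2, 4, 6], 'Net Total': [10.0, 15.0, 15.0]})

merged = pd.merge(df1, df2, how='inner', left_on='Subtotal', right_on='Net Total')

# Each 15.0 on the left matches BOTH 15.0s on the right:
# 1 match for 10.0 + 2*2 matches for 15.0 = 5 rows, not 3.
print(len(merged))  # 5
```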

   ID   Date        Subtotal  Tax   Total   ID   Time                 Net Total  Tax   Total Due
0  238  2017-03-19  27.5      2.44  $34.43  346  2017-03-19 10:36:00  27.5       2.44  29.94
1  239  2017-03-19  34.5      3.06  $41.32  255  2017-03-19 12:05:00  34.5       3.05  37.55
2  239  2017-03-19  34.5      3.06  $41.32  409  2017-03-19 08:46:00  34.5       3.05  37.55
3  269  2017-03-19  34.5      3.06  $41.31  255  2017-03-19 12:05:00  34.5       3.05  37.55
4  269  2017-03-19  34.5      3.06  $41.31  409  2017-03-19 08:46:00  34.5       3.05  37.55

Joris Van den Bossche

unread,
May 15, 2017, 3:57:46 AM5/15/17
to PyData
A reproducible example would help to understand what you exactly mean.
But if you say that you get duplicated entries in the merged result, that means that you have duplicate values in the merge key. In that case, it is the responsibility of the user to deal with those, I think (e.g. keep the first, or keep the last, ...).
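For instance, with hypothetical data, "keep the first" before merging would look like this (drop_duplicates on the key column):

```python
import pandas as pd

# Hypothetical right-hand frame with a repeated merge key (15.0).
df2 = pd.DataFrame({'Net Total': [10.0, 15.0, 15.0], 'ID': [2, 4, 6]})

# Keep only the first row per merge-key value before merging.
deduped = df2.drop_duplicates(subset='Net Total', keep='first')
print(list(deduped['ID']))  # [2, 4]
```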

Joris

--
You received this message because you are subscribed to the Google Groups "PyData" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pydata+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Xtian Simon

unread,
May 15, 2017, 8:27:09 AM5/15/17
to PyData
Hi Joris. Thanks for your reply.

I'm really looking to understand how to use Pandas to make 1:1 comparisons like Excel's Power Query plugin. It works like this: transaction matching takes one sale for $5 in the left DF and tries to find a second $5 in the right DF. If a match is found, then both lines should be excluded from further matching. This is clearly illustrated in the table above, where there are four yellow highlights on the left (#239, #269) and two highlights on the right (#409) marking duplicate IDs. This is the minimal requirement.

A reproducible example would help to understand what you exactly mean.
These are financial transactions. I'm sure you can understand the problem context--can't make imaginary sales! Or you wouldn't want the store to charge you twice for the same purchase.
That said, I did post a basic solution on StackOverflow. The problem has taken a different turn as I've used more data from the wild, and this duplication has become more apparent.


But, if you say that you get duplicated entries in the merged results, that means that you have duplicate values in the merge key.
This isn't a problem in the Excel's PowerQuery plugin, so I don't see how this isn't just a case of finding the right idiom in Pandas to recreate this matching functionality.
 
In that case, it is the responsibility of the user to deal with those I think (eg keep the first, or keep the last, ..).
I've read that in the documentation. That's why I'm here. There must be someone dealing with transaction data using Pandas. It works so well with transaction data in every other respect.
 

Joris Van den Bossche

unread,
May 15, 2017, 8:45:10 AM5/15/17
to PyData
I wanted to ask whether you had taken a look at merge_asof, but based on that StackOverflow question, you apparently already have.
Can you try to explain how merge_asof is not solving your problem?


This isn't a problem in the Excel's PowerQuery plugin, so I don't see how this isn't just a case of finding the right idiom in Pandas to recreate this matching functionality.

But you have still not explained which idiom you want


If a match is found, then both lines should be excluded from further matching. 

So you want the first matching row? Why can't you then first drop all duplicates except for the first row, and then do the merge?

Joris


Xtian Simon

unread,
May 15, 2017, 1:22:46 PM5/15/17
to PyData
Can you try to explain how merge_asof is not solving your problem?
Let me show you. Here is the same step using merge_asof()

    sales_03_19_b = pd.merge_asof(df1, df2, left_on='Subtotal', right_on='Net Total')
 
As you can see, the resulting merged DF (shown below) contains duplicates on the df2 side of the merge_asof:

    HTML(sales_03_19_b.to_html())

   ID   Order         Date        Subtotal  Tax    Total   ID   Time                 Net Total  Tax   Total Due
2  257  O-2138298788  2017-03-19  13.0      $1.15  $16.27  347  2017-03-19 10:35:00  13.0       1.16  14.16
3  251  O-1940600619  2017-03-19  13.5      $1.20  $14.70  342  2017-03-19 10:44:00  13.5       1.19  14.69
4  254  O-1344085405  2017-03-19  13.5      $1.20  $17.64  342  2017-03-19 10:44:00  13.5       1.19  14.69
5  253  O-2040237287  2017-03-19  14.0      $1.24  $17.61  343  2017-03-19 10:41:00  14.0       1.24  15.24
6  245  O-1220583480  2017-03-19  14.5      $1.29  $18.16  311  2017-03-19 11:14:00  14.5       1.29  15.79
7  247  O-2139159005  2017-03-19  14.5      $1.29  $18.16  311  2017-03-19 11:14:00  14.5       1.29  15.79
8  258  O-1372725154  2017-03-19  14.5      $1.29  $18.16  311  2017-03-19 11:14:00  14.5       1.29  15.79
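That duplication is consistent with how merge_asof works: it does a nearest-key, many-to-one match, so one right-hand row can be reused for several left-hand rows. A toy sketch with made-up values:

```python
import pandas as pd

# Three left rows share the amount 14.5; the right side has only one.
df1 = pd.DataFrame({'Subtotal': [14.5, 14.5, 14.5]})
df2 = pd.DataFrame({'Net Total': [14.5], 'ID': [311]})

# merge_asof requires sorted keys; every left 14.5 matches the single
# right-hand row, so that row is repeated three times in the output.
out = pd.merge_asof(df1, df2, left_on='Subtotal', right_on='Net Total')
print(list(out['ID']))  # [311, 311, 311]
```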
 
But you have still not explained which idiom you want

That's why I'm reaching out to the community. (^_^) 

Why can't you then first drop all duplicates except for the first row, and then do the merge?

I think we're going in circles. The reason you can't drop duplicates, as I said in my second email, is that duplicates in either DF (not the merged result) are repeat sales transactions.
Tell me, what about the problem domain of sales transactions is still unclear to you?
Let's talk about that, and then see what idioms in Pandas can be used to address this domain of problems.

Alain Ledon

unread,
May 15, 2017, 2:19:50 PM5/15/17
to pyd...@googlegroups.com
Could you post a sample of your dataset? It's hard to figure out what exactly is going on without headers in your data.

Which column is 'Subtotal'? Which one is 'Net Total'?


Xtian Simon

unread,
May 15, 2017, 2:54:56 PM5/15/17
to PyData
Hey Little A1,

I intentionally left the data abstract (df1, df2) so we could focus on the problem of two data frames joining on columns with repeat (non-unique) values. And this is illustrated by the table using bold for df2. Then, I draw your attention to the repeated rows as if to say, Yes, it's real data from a real example, and these duplicated rows are the problem.

That said, I understand to someone who isn't familiar with the datasets, this may seem...a bridge too far. Here are the headers,
("DF1", ['ID', 'Date', 'Subtotal', 'Tax', 'Total'])
("DF2", ['Time', 'Net Total', 'Tax', 'Total Due'])

What's more, if you're the type of person who thinks better in code than at the pseudo-level, I totally dig it. Attached here are two CSV files--have at them! (Perfect data masks other problems, so I manually coded two errors in: one in df2 on line 10 and one in df1 on line 20, and I dropped a row; not to mention, df1 doesn't have a timestamp--only a date.)
df1.csv
df2.csv

Joris Van den Bossche

unread,
May 15, 2017, 4:33:11 PM5/15/17
to PyData


2017-05-15 19:22 GMT+02:00 Xtian Simon <xtian...@gmail.com>:
 
But you have still not explained which idiom you want

That's why I'm reaching out to the community. (^_^) 

Why can't you then first drop all duplicates except for the first row, and then do the merge?

I think we're going in circles. The reason you can't drop duplicates, as I answered in my second email, is because duplicates in either DF (not the merged) are duplicate sales transactions.
Tell me, what about the problem domain of sales transactions is still unclear to you?
Let's talk about that, and then see what idioms in Pandas can be used to address this domain of problems.

I don't know anything about sales transactions, so again, you will have to explain what behaviour you exactly want, instead of assuming that we know/can guess that (whether it is about sales transactions or not, that doesn't matter).

You can do that, for example, by giving small example data (the CSV files you provided at least make it reproducible, but it would be even clearer if you slimmed it down to the smallest possible reproducible example, e.g. leave out columns that do not matter, use fewer rows, ...), and also by providing your expected result (up to now you have only shown the output you didn't want).

Joris
 

Xtian Simon

unread,
May 15, 2017, 8:18:46 PM5/15/17
to PyData
I don't know anything about sales transactions,

Ok.
TL;DR

These are sales transactions. One transaction in the left `df1` shows up in the right `df2`, and each transaction should have one corresponding match.  Duplicates in one system are duplicate sales, and not duplicated entries.

A sales transaction is a purchase of goods for money, right? You buy things all the time, right?

Now let's pretend you're a Merchant. You put up on eBay a "1993 SP Foil Derek Jeter Upper Deck Premiere Prospects Rookie Card #279" and it sells for $52. You ring this sale up in your iPad Point of Purchase System (POS) on the day of the sale. Then, you take the baseball card and pack it up for shipping.

At the end of the month you download a report of all your POS sales, and another report of all your eBay sales in CSV format. What you need to do is match all the POS sales to be certain that eBay hasn't cheated you and shorted the payout on any of your sales.

You're a power seller; you sell a lot of different items. In addition to selling unique auction items on eBay, you also sell commodity items at a fixed price. Let's say, at the end of the month, you've sold 50 baseball card protectors at 5 for $3.99. Now you have 10 duplicate entries for $3.99 in your POS and eBay reports, which are 10 separate sales to 10 separate individuals.

But this is eBay. Sales happen whenever someone buys the goods and you log the sales when you have time. The timestamps between reports don't correlate. And the User IDs on eBay and the POS system don't correlate. In the POS system, it might just get entered as 'EBAY SALE'. Again, duplicate column values--not unique. You can't make a unique 1:1 match with these columns.

What do you do?
The next best match you can make is a count. In my POS I see I have 10 x $3.99  'EBAY SALE' in the POS system. So as the Merchant, you want to see 10 x $3.99 sales paid in the eBay report.

That's all you need to know about matching sales as DFs.
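The count check described above can be sketched like this (hypothetical amounts, not the real reports):

```python
import pandas as pd

# Ten $3.99 commodity sales plus one $52 auction sale on each side.
pos = pd.DataFrame({'Amount': [3.99] * 10 + [52.00]})
ebay = pd.DataFrame({'Amount': [3.99] * 10 + [52.00]})

# Compare per-amount counts instead of matching individual rows.
diff = pos['Amount'].value_counts().sub(ebay['Amount'].value_counts(), fill_value=0)

# An empty non-zero remainder means the two systems reconcile.
print(diff[diff != 0].empty)  # True
```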

You can for example do that by giving a small example data (the csv files you provided at least make it reproducible, but it would be even clearer if you would slim it down to the smallest possible reproducible example, eg leave out columns that do not matter, use less rows, ..),
and also provide your expected result (up to now you only showed the output you didn't want).

I've been testing (and posting my questions on SO) using toy datasets where I drop everything but an index and the join columns (subtotals):

    import numpy as np

    df1 = np.array([
      [1.0, 10.00],
      [3.0, 15.00],
      [5.0, 15.00],
      [7.0, 20.00]])

    df2 = np.array([
      [2.0, 10.00],
      [4.0, 15.00],
      [6.0, 15.00],
      [8.0, 15.75]])

 
This produces the positive match,


   df1   X      df2   Y
0  1.00  10.00  2.00  10.00
1  3.00  15.00  4.00  15.00
2  5.00  15.00  6.00  15.00

And the negative match,

   df1   X      df2   Y
0  7.00  20.00  NaN   NaN
1  NaN   NaN    8.00  15.75

And, Importantly, the undesired result:

   df1   X      df2   Y
1  3.00  15.00  4.00  15.00
2  5.00  15.00  4.00  15.00
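For what it's worth, one idiom that yields exactly this kind of positive and negative match is to number repeated amounts with groupby(...).cumcount(), so each occurrence of an amount becomes unique, and then merge on the (amount, occurrence) pair. This is only a sketch of that idea, not necessarily what Power Query does internally:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1.0, 3.0, 5.0, 7.0],
                    'Amount': [10.00, 15.00, 15.00, 20.00]})
df2 = pd.DataFrame({'ID': [2.0, 4.0, 6.0, 8.0],
                    'Amount': [10.00, 15.00, 15.00, 15.75]})

# Number repeat amounts on each side: first 15.00 -> 0, second 15.00 -> 1.
df1['n'] = df1.groupby('Amount').cumcount()
df2['n'] = df2.groupby('Amount').cumcount()

# Each (amount, occurrence) pair is now unique, so the merge is 1:1:
# 3 matched rows, plus one unmatched row from each side under "outer".
out = pd.merge(df1, df2, how='outer', on=['Amount', 'n'], suffixes=('_1', '_2'))
print(len(out))            # 5 rows total
print(len(out.dropna()))   # 3 matched rows
```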


Joris Van den Bossche

unread,
May 16, 2017, 6:28:34 PM5/16/17
to PyData
You are still not clear on what your desired result is.
Is it the "positive match" ?
And if that is the case, again, what is the logic that you want? It seems that for duplicate keys you just matched the first occurrence with the first occurrence, and the second with the second?

Joris

Xtian Simon

unread,
May 16, 2017, 9:23:47 PM5/16/17
to PyData
You are still not clear on what your desired result is.  [...] And if that is the case, again, what is the logic that you want?
I don't know what else I can give you. You have the scenario:
  • "suppress the duplicated entries"
  • "1:1 comparisons"
  • "like Excel's PowerQuery Plugin"
  • "duplicates in either DF (not the merged) are [repeat] sales transactions"
  • "columns with repeat (non-unique) values "
  • "best match you can make is a count"
    Because the join is on a column with duplicates, the best we can expect is 1-for-1 matches.
    It's a process of elimination.
