Transaction Deduplication


gmcg...@gmail.com

Aug 26, 2021, 8:12:27 PM
to hledger
Hi,

I am familiar with the hledger import method, the hledger diff functionality, and also ledger-autosync. I'm wondering whether everybody else finds these methods sufficient for transaction deduplication, or whether anybody does anything else.

I used to use ledger-autosync, but I found that the OFX IDs that appear in some banks' QFX/OFX files sometimes change, months after staying the same! This messes up deduplication and causes transactions to be erroneously duplicated. Also, a few places I import data from did not support well-formatted QFX/OFX. I've ended up writing my own scripts to do deduplication, but so far they are fairly cumbersome to integrate with a workflow that otherwise relies on hledger to process data, including the CSV import rules, which I like using.

My issue with the existing hledger "import" command is that it misses transactions when two transactions clear in a different order than their transaction dates would suggest. Here is an example with a credit card:

* I pay for dinner at a restaurant on Monday
* I pay for gas on Tuesday
* On Wednesday, the gas transaction clears and hits the transaction log, dated Tuesday
* I import my data on Wednesday morning which includes the gas transaction dated Tuesday
* Later on Wednesday, the dinner transaction clears (the restaurant finally entered the tips and pushed the transaction through)
* I import my data again on Thursday morning and hledger misses my restaurant transaction because it is dated 1 day earlier (Monday) than the already imported gas transaction (Tuesday)

Obviously, this confusion would go away if the bank published the cleared date as the transaction date rather than the original transaction date, but this is not always done; different banks handle it differently. I try to design my strategies so that they always work, even when banks do weird things like this.

My current script does this (pseudocode):
  • for transaction in existingtransactions (array of transactions from the existing ledger)
    • transactionweeks = floor( convert_to_unix_epoch( transaction.date ) / 604800 )  # 604800 is the number of seconds in 1 week
    • tuple = (transaction.firstaccount, transactionweeks, transaction.amount)
    • existing[ tuple ]++  (building a map to "index" existing transactions)

  • for transaction in importtransactions (array of transactions from the recently downloaded CSV)
    • transactionweeks = floor( convert_to_unix_epoch( transaction.date ) / 604800 )
    • tuple1 = (transaction.firstaccount, transactionweeks, transaction.amount)
    • tuple2 = (transaction.firstaccount, transactionweeks - 1, transaction.amount)
    • tuple3 = (transaction.firstaccount, transactionweeks + 1, transaction.amount)
    • if existing[ tuple1 ]
      • existing[ tuple1 ]--  (decrement map to mark transaction as "found")
    • else if existing[ tuple2 ]  (allow the date to be a little behind)
      • existing[ tuple2 ]--  (decrement map to mark transaction as "found")
    • else if existing[ tuple3 ]  (allow the date to be a little ahead)
      • existing[ tuple3 ]--  (decrement map to mark transaction as "found")
    • else
      • print transaction  (found a new, non-duplicate transaction!)

This algorithm (coded in Python) has worked almost flawlessly for me for over a year; it runs in O(n) and never seems to mess up, as long as I download several weeks of transactions each time I download a CSV. The thing is, I wonder if there is a simpler approach that other people use which would allow me to use hledger alone for my data import, rather than an external script that must parse the ledger format and manipulate the ledger externally.
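
For what it's worth, a minimal Python sketch of this idea might look like the following (illustrative only, not the actual script; it assumes each transaction is a simple dict with "date", "account", and "amount" keys, where "account" plays the role of firstaccount in the pseudocode above):

from collections import defaultdict
from datetime import datetime, timezone

SECONDS_PER_WEEK = 604800

def week_bucket(date_str):
    # Convert an ISO date ("2021-08-05") to a whole number of weeks since the Unix epoch.
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp()) // SECONDS_PER_WEEK

def find_new_transactions(existing_txns, import_txns):
    # Index the existing transactions by (account, week bucket, amount).
    existing = defaultdict(int)
    for txn in existing_txns:
        existing[(txn["account"], week_bucket(txn["date"]), txn["amount"])] += 1

    new_txns = []
    for txn in import_txns:
        weeks = week_bucket(txn["date"])
        # Try the same week bucket, then one bucket behind, then one bucket ahead.
        for offset in (0, -1, +1):
            key = (txn["account"], weeks + offset, txn["amount"])
            if existing[key] > 0:
                existing[key] -= 1   # mark one existing transaction as "found"
                break
        else:
            new_txns.append(txn)     # no match anywhere: a new, non-duplicate transaction
    return new_txns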

Thanks for any thoughts or ideas!
Garret

Simon Michael

Aug 26, 2021, 8:22:27 PM
to hledger

On Aug 26, 2021, at 2:12 PM, gmcg...@gmail.com <gmcg...@gmail.com> wrote:

Hi Garret,

A nice example, thanks. As you say, that behaviour defeats hledger import, unfortunately.
(https://hledger.org/hledger.html#deduplication -> new items always have the newest dates.)

It would be cool if we can come up with a variant that handles this case. 



I think no one recipe can be reliable with everyone's bank, even the one you describe here - but I'd be happy to be wrong. 

Could you give a brief high-level description of how it works, and if possible list the conditions required for it to work?

gmcg...@gmail.com

Aug 26, 2021, 10:49:25 PM
to hledger
On Thursday, August 26, 2021 at 8:22:27 PM UTC-4 Simon Michael (sm) wrote:


It would be cool if we can come up with a variant that handles this case. 

Yes, I have thought a bit about how to make a subtle variant that handles this case, but I haven't come up with one yet.

Could you give a brief high-level description of how it works, and if possible list the conditions required for it to work?

Sure, the high-level description of my algorithm is that it is similar to hledger diff's algorithm but it is not date-agnostic. It matches transactions based on a particular account and it diffs the list of amounts.  Additionally (unlike "hledger diff"), before marking two transactions as duplicates of each other, it will make sure that the date is "kinda close" within some tolerance. 

Hledger diff:
  • Matches on account name
  • Matches on posting amount
Garret's 3-tuple algorithm:
  • Matches on account name
  • Matches on posting amount
  • Also fuzzy-matches on transaction date, meaning that the two transactions' dates must be within a couple of weeks of each other before they are considered duplicates. This prevents a brand-new transaction whose amount happens to match a very old existing transaction's amount from being skipped during import.
 
Here is a pros / cons list, from my experimentation:

Garret's 3-tuple algorithm
  • Pros
    • Is order-agnostic. (I've had the order of transactions shuffle as transactions clear, in some banks' CSVs)
    • Can correctly import transactions even if the dates change by a small amount (this happens very often with some banks and never with others)
    • Is stateless: no external state file must be maintained and synchronized
  • Cons
    • You must choose a particular account to index the transactions on during deduplication. This differentiates my approach from the simplicity of the "hledger import" approach. Alternatively, the user could designate the "account on the first posting" as the account to index. This is what I do in my files, but of course transactions generated by hledger's CSV import can reference multiple bank accounts.
    • This approach requires that you download CSVs with a substantial amount of account transaction history. At least one month of transactions is sufficient, but with a shorter transaction history it begins to suffer from the same flaws as "hledger diff"
    • This approach is sensitive to the amounts on transactions changing (I have never actually seen this happen in real life)

hledger import
  • Pros
    • Is account- and posting-agnostic, which means the import process can work based on dates alone
    • Works even if you use CSVs with limited account transaction history
  • Cons
    • It is stateful, which means a separate file must be maintained
    • It is sensitive to the order of transactions being imported. If the order of transactions changes in the CSV, transactions can be duplicated or missed

hledger diff  (While "hledger diff" is not meant to be a transaction importer utility, it can be used as one)
  • Pros
    • Same pros as Garret's 3-tuple algorithm
  • Cons
    • Easily misses transactions when the imported data contains transactions whose amounts match similar historical amounts (example: a recurring cable bill is often the exact same amount each time)
    • Of course, this one is also sensitive to the amounts on transactions changing across downloads (not a common problem)

Simon Michael

Aug 27, 2021, 12:51:13 PM
to hledger
On Aug 26, 2021, at 4:49 PM, gmcg...@gmail.com <gmcg...@gmail.com> wrote:

Sure, the high-level description of my algorithm is that it is similar to hledger diff's algorithm but it is not date-agnostic. It matches transactions based on a particular account and it diffs the list of amounts.  Additionally (unlike "hledger diff"), before marking two transactions as duplicates of each other, it will make sure that the date is "kinda close" within some tolerance. 

And now a description somewhere in between these two (the detailed pseudocode and the high-level summary)? Sorry :-) I don't remember hledger diff's algorithm exactly, and I'm wondering about the details of "kinda close", what's happening with the chunking into weeks, and what the situations are in which this algorithm fails.


gmcg...@gmail.com

Aug 27, 2021, 4:28:05 PM
to hledger
Did you miss the "pros and cons" part of my previous post? It looks like your email client might have cut it off; I'm not sure. I have attached an additional chart to help explain it further.

My algorithm's failure modes tend to involve maintenance and user education rather than data correctness. I don't think my algorithm would be a good candidate to completely replace hledger import because of this. That being said, I think my algorithm is very good at keeping the data from becoming corrupted by accumulating duplicates or by missing transactions that ought to have been imported.

To answer your question, "kinda close" is flexible. For my banks, 5-7 days seems to be the appropriate chunk. The reason for chunking the dates into "weeks" is to keep my fixed internet bill of $79.99 (or whatever it might be) from being ignored during data import simply because it matched up with a historical internet bill of exactly the same amount from months ago. Since my algorithm finds duplicates based primarily upon *account name* and *transaction amount*, it's easy to see how data import issues would arise if I didn't include *transaction date converted to weeks* as the third criterion.
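
As a small Python illustration of the chunking arithmetic (the date here is an arbitrary example):

from datetime import datetime, timezone

SECONDS_PER_WEEK = 604800  # 7 * 24 * 60 * 60

# 2021-08-05 00:00 UTC is Unix time 1628121600, which is exactly 2692 weeks after the epoch.
epoch_seconds = int(datetime(2021, 8, 5, tzinfo=timezone.utc).timestamp())
print(epoch_seconds)                      # 1628121600
print(epoch_seconds // SECONDS_PER_WEEK)  # 2692

# Every date from 2021-08-05 through 2021-08-11 lands in bucket 2692, and with the
# +/- 1 bucket tolerance a transaction dated 2021-08-05 can match existing transactions
# in buckets 2691 through 2693, i.e. roughly 2021-07-29 through 2021-08-18.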

Here is example #1:

Existing transactions:

May 5 - Internet bill - $79.99
May 10 - Grocery store - $143.54
June 5 - Internet bill - $79.99
June 10 - Grocery store - $155.12
July 5 - Internet bill - $79.99
July 10 - Grocery store - $140.20
August 5 - Internet bill - $79.99

Downloaded CSV from bank (downloading past 3 months):

June 5 - Internet bill - $79.99
June 10 - Grocery store - $155.12
July 5 - Internet bill - $79.99
July 10 - Grocery store - $140.20
August 5 - Internet bill - $79.99
August 10 - Grocery store - $166.32
September 5 - Internet bill - $79.99

If my algorithm simply deduplicated by counting the number of $79.99 transactions that currently exist in this account (there are 4), it would see that the newly imported CSV also includes 4 $79.99 transactions, and it would therefore ignore the September 5th internet bill. BUT if the algorithm pays attention to the dates of the imported transactions, it will see that the most recent $79.99 is new, since it does not fall into a previously used "weeks bucket".

Hopefully this example shows the advantage of using my approach, assuming that you "feed it" a sufficient amount of data.
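
To make that concrete, here is example #1 run through the hypothetical find_new_transactions sketch from earlier in the thread (the "cc" account name and the dict layout are illustrative assumptions):

existing_txns = [
    {"date": "2021-05-05", "account": "cc", "amount": "79.99"},   # internet bill
    {"date": "2021-05-10", "account": "cc", "amount": "143.54"},  # grocery store
    {"date": "2021-06-05", "account": "cc", "amount": "79.99"},
    {"date": "2021-06-10", "account": "cc", "amount": "155.12"},
    {"date": "2021-07-05", "account": "cc", "amount": "79.99"},
    {"date": "2021-07-10", "account": "cc", "amount": "140.20"},
    {"date": "2021-08-05", "account": "cc", "amount": "79.99"},
]
import_txns = [
    {"date": "2021-06-05", "account": "cc", "amount": "79.99"},
    {"date": "2021-06-10", "account": "cc", "amount": "155.12"},
    {"date": "2021-07-05", "account": "cc", "amount": "79.99"},
    {"date": "2021-07-10", "account": "cc", "amount": "140.20"},
    {"date": "2021-08-05", "account": "cc", "amount": "79.99"},
    {"date": "2021-08-10", "account": "cc", "amount": "166.32"},
    {"date": "2021-09-05", "account": "cc", "amount": "79.99"},
]
print(find_new_transactions(existing_txns, import_txns))
# Only the August 10 grocery transaction and the September 5 internet bill come back
# as new; every other imported row falls into an already-indexed week bucket.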

Here is an example which shows why you need to feed my algorithm a sufficient amount of data.

Example #2

Existing transactions:

May 5 - City parking - $3.00
May 10 - Groceries - $90.34
May 18 - City parking - $3.00
June 5 - City parking - $3.00
June 10 - Groceries - $73.44
June 11 - City parking - $3.00
June 19 - City parking - $3.00

Downloaded CSV from bank (downloaded 2 months of transactions)

May 18 - City parking - $3.00
June 5 - City parking - $3.00
June 10 - Groceries - $73.44
June 11 - City parking - $3.00
June 19 - City parking - $3.00
June 21 - City parking - $3.00  (this transaction is properly recognized as a new transaction, since all the other transactions have been marked as duplicates)

Downloaded CSV from bank (downloaded 2 weeks of transactions, which is NOT sufficient to avoid issues)

June 10 - Groceries - $73.44
June 11 - City parking - $3.00 (THIS TRANSACTION IS ERRONEOUSLY MATCHED WITH THE June 5th "City parking" transaction because it is within 1 week of that transaction)
June 19 - City parking - $3.00 (subsequently, this transaction is erroneously matched with the June 11th "City Parking" transaction)
June 21 - City parking - $3.00 (this transaction is therefore NOT recognized as new because it is matched with the existing June 19 "City Parking" transaction)

Hopefully these examples show why this algorithm tends to work, as long as you download a lot of overlapping transaction history.

Unfortunately, there is also a subtle weakness in my algorithm: if you have a very consistent repeated transaction every single week for months on end, an error can cascade through each "weeks bucket" over a very long dataset and cause the import to miss a transaction. Thankfully, this case is rare in my experience. Of course, all algorithms have weaknesses.

Please let me know if I can explain this more.

Garret
[Attachment: hledger-import-algorithms.png]