On Aug 26, 2021, at 2:12 PM, gmcg...@gmail.com <gmcg...@gmail.com> wrote:

Hi,

I am familiar with the hledger import method, the hledger diff functionality, and also ledger-autosync. I'm wondering whether everybody else finds these methods sufficient for transaction deduplication, or whether anybody does anything else.

I used to use ledger-autosync, but I found that the OFX IDs that appear in some banks' QFX/OFX files sometimes change, after months of staying the same! This breaks deduplication and causes errant transactions to be duplicated. Also, a few places I import data from do not produce well-formed QFX/OFX. In the end I've written my own scripts to do deduplication, but so far they are fairly cumbersome to integrate with a workflow that otherwise relies on hledger to process data, including the CSV import rules, which I like using.

My issue with the existing hledger "import" command is that it misses transactions when two transactions clear in a different order than their transaction dates. Here is an example with a credit card:

* I pay for dinner at a restaurant on Monday.
* I pay for gas on Tuesday.
* On Wednesday, the gas transaction clears and hits the transaction log dated Tuesday.
* I import my data on Wednesday morning, which includes the gas transaction dated Tuesday.
* Later on Wednesday, the dinner transaction clears (the restaurant finally entered the tip and pushed the transaction through).
* I import my data again on Thursday morning, and hledger misses my restaurant transaction because it is dated one day earlier (Monday) than the already-imported gas transaction (Tuesday).

Obviously, this confusion would go away if the bank published the cleared date as the transaction date rather than the original transaction date, but that is not always done; different banks handle it differently. I try to code my strategies so that they always work, even when banks do weird things like this.
My current script does this (pseudocode):

    for transaction in existingtransactions:        # transactions from the existing ledger
        transactionweeks = floor( convert_to_unix_epoch( transaction.date ) / 604800 )   # 604800 seconds = 1 week
        tuple = (transaction.firstaccount, transactionweeks, transaction.amount)
        existing[ tuple ]++                         # build a map "indexing" existing transactions

    for transaction in importtransactions:          # transactions from the recently downloaded CSV
        transactionweeks = floor( convert_to_unix_epoch( transaction.date ) / 604800 )
        tuple1 = (transaction.firstaccount, transactionweeks,     transaction.amount)
        tuple2 = (transaction.firstaccount, transactionweeks - 1, transaction.amount)   # allow the date to be a little behind
        tuple3 = (transaction.firstaccount, transactionweeks + 1, transaction.amount)   # allow the date to be a little ahead
        if existing[ tuple1 ]:
            existing[ tuple1 ]--                    # decrement to mark the transaction as "found"
        else if existing[ tuple2 ]:
            existing[ tuple2 ]--
        else if existing[ tuple3 ]:
            existing[ tuple3 ]--
        else:
            print transaction                       # found a new, non-duplicate transaction!
This algorithm (coded in Python) has worked almost flawlessly for me for over a year; it runs in O(n) time and never seems to mess up, as long as I download several weeks of transactions each time I download a CSV. The thing is, I wonder if there is a simpler approach that other people use, one that would let me use hledger alone for my data import rather than an external script that must parse ledger format and manipulate the ledger externally.
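The pseudocode above can be sketched in Python roughly as follows (illustrative names and data shapes, not my actual script):

```python
from collections import Counter

SECONDS_PER_WEEK = 604800  # 7 * 24 * 60 * 60

def week_number(epoch_seconds):
    """Bucket a unix timestamp into an absolute week number."""
    return epoch_seconds // SECONDS_PER_WEEK

def find_new_transactions(existing_txns, import_txns):
    """Return transactions from import_txns that have no match in existing_txns.

    Each transaction is a dict with keys 'firstaccount', 'epoch' (unix
    seconds), and 'amount'. A match requires the same account and amount,
    and a week bucket within +/- 1 of an existing transaction.
    """
    # Index existing transactions as a multiset keyed by (account, week, amount).
    seen = Counter()
    for t in existing_txns:
        seen[(t['firstaccount'], week_number(t['epoch']), t['amount'])] += 1

    new = []
    for t in import_txns:
        w = week_number(t['epoch'])
        for dw in (0, -1, +1):  # exact week, one week behind, one week ahead
            key = (t['firstaccount'], w + dw, t['amount'])
            if seen[key] > 0:
                seen[key] -= 1  # consume one existing copy: marked as "found"
                break
        else:
            new.append(t)  # no match anywhere: a genuinely new transaction
    return new
```

The +/- 1 week probes matter because the fixed week buckets have arbitrary boundaries: two copies of the same transaction dated two days apart can still land in adjacent buckets.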
On Aug 26, 2021, at 2:12 PM, gmcg...@gmail.com <gmcg...@gmail.com> wrote:
> [quoted message snipped]

Hi Garret,

A nice example, thanks.
As you say, that behaviour defeats hledger import, unfortunately (https://hledger.org/hledger.html#deduplication: new items always have the newest dates). It would be cool if we could come up with a variant that handles this case.
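To make the failure mode concrete, here is a rough Python model of that date-based rule. It is a deliberate simplification; as I understand it, hledger's .latest files also track how many transactions share the latest date, which this sketch ignores.

```python
import datetime as dt

def naive_import(previously_seen_latest, downloaded):
    """Simplified model of hledger import's date-based deduplication:
    keep only transactions dated strictly after the latest date already
    imported."""
    return [t for t in downloaded if t['date'] > previously_seen_latest]

# Wednesday morning's import recorded the gas transaction dated Tuesday.
latest = dt.date(2021, 8, 24)  # Tuesday
thursday_download = [
    {'date': dt.date(2021, 8, 24), 'desc': 'gas'},         # already imported
    {'date': dt.date(2021, 8, 23), 'desc': 'restaurant'},  # cleared late, dated Monday
]
print(naive_import(latest, thursday_download))  # prints []: the restaurant transaction is dropped
```

Because the restaurant transaction carries a date on or before the remembered latest date, the date-only rule cannot distinguish it from an already-imported item.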
I think no one recipe can be reliable with everyone's bank, even the one you describe here, but I'd be happy to be wrong. Could you give a brief high-level description of how it works, and if possible list the conditions required for it to work?
On Aug 26, 2021, at 4:49 PM, gmcg...@gmail.com <gmcg...@gmail.com> wrote:
> Could you give a brief high-level description of how it works, and if possible list the conditions required for it to work?
Sure. The high-level description of my algorithm is that it is similar to hledger diff's algorithm, but it is not date-agnostic. It matches transactions on a particular account and diffs the list of amounts. Additionally (unlike "hledger diff"), before marking two transactions as duplicates of each other, it makes sure the dates are "kinda close", within some tolerance.
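That matching rule could be sketched like this (hypothetical helper names; a quadratic formulation for clarity, unlike the O(n) map version above):

```python
import datetime as dt

def kinda_close(d1, d2, tolerance_days=7):
    """True when two transaction dates fall within the tolerance window."""
    return abs((d1 - d2).days) <= tolerance_days

def diff_amounts(existing, imported):
    """Diff two lists of (date, amount) pairs for one account, treating
    equal amounts with 'kinda close' dates as the same transaction.
    Returns the imported pairs that matched nothing."""
    remaining = list(existing)
    new = []
    for date, amount in imported:
        for i, (edate, eamount) in enumerate(remaining):
            if amount == eamount and kinda_close(date, edate):
                del remaining[i]  # matched: consume the existing entry
                break
        else:
            new.append((date, amount))
    return new
```

The date check is what distinguishes this from a purely date-agnostic amount diff: a recurring charge months apart is treated as two transactions, while the same charge shifted a couple of days by late clearing is treated as one.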