Duplicate detection during import


redst...@gmail.com

Mar 30, 2021, 6:57:42 AM
to Beancount
Regarding class SimilarityComparator in similarity.py:

The final check is:
        # Here, we have found at least one common account with a close
        # amount. Now, we require that the set of accounts are equal or that
        # one be a subset of the other.
        return accounts1.issubset(accounts2) or accounts2.issubset(accounts1)

I've been instead using a slightly modified version, where I just check for intersection:
        return accounts1.intersection(accounts2)

For my use cases, this has worked better in every case. The common case is an import of a credit card transaction that is modified post-import. On a subsequent import (with an overlapping date range), dedupe does not work with the original heuristic.
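To make the difference concrete, here is a minimal sketch of the two heuristics on hypothetical account sets (the names are illustrative, not from similarity.py). A post-import edit that changes one posting's account breaks the subset relation but still leaves a shared account:

```python
def subset_match(a, b):
    # Original heuristic: one account set must contain the other.
    return a.issubset(b) or b.issubset(a)

def intersection_match(a, b):
    # Relaxed heuristic: any shared account counts as a match.
    # Wrapped in bool() so the function returns True/False, not a set.
    return bool(a & b)

# Imported entry vs. the same entry after its expense leg was recategorized.
accounts1 = {"Liabilities:CreditCard", "Expenses:Groceries"}
accounts2 = {"Liabilities:CreditCard", "Expenses:Food:Groceries"}

print(subset_match(accounts1, accounts2))        # False: edit broke the subset relation
print(intersection_match(accounts1, accounts2))  # True: the card account is still shared
```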

I can't help but wonder if this would be universally better for everyone. Thoughts?

If not, perhaps an option might help users fine-tune for their use cases? Suggestions:
--aggressive_match
--heuristic=match_on_one_common_posting  (--heuristic would take in a list)

Making dupe detection better further cuts down ingest effort (links to 5min ledger update article).

Martin, would you be opposed to one of the approaches above?

Thanks,
-red

Daniele Nicolodi

Mar 30, 2021, 7:11:26 AM
to bean...@googlegroups.com
On 30/03/2021 12:57, redst...@gmail.com wrote:
> Reg. class SimilarityComparator in similarity.py:
>
> The final check is:
>         # Here, we have found at least one common account with a close
>         # amount. Now, we require that the set of accounts are equal or that
>         # one be a subset of the other.
>         return accounts1.issubset(accounts2) or accounts2.issubset(accounts1)
>
> I've been instead using a slightly modified version, where I just check
> for intersection:
>         return accounts1.intersection(accounts2)
>
> For my use cases, this has worked better in every case. The common case
> is an import of a credit card transaction that is modified post-import.
> On a subsequent import (with an overlapping date range), dedupe does not
> work with the original heuristic.
>
> I can't help but wonder if this would be universally better for
> everyone. Thoughts?

Which entries to consider duplicates in the fuzzy matching implemented
by SimilarityComparator depends on how the ledger is organized. For
example, I have the impression that your relaxed check would flag as
duplicates any pair of transactions that use a transfer account to
record transactions posted and cleared on different days.
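A minimal sketch of this failure mode, with illustrative account names: two unrelated legs that share only a transfer account are flagged by the intersection check but not by the subset check.

```python
def subset_match(a, b):
    # Original heuristic: one account set must contain the other.
    return a.issubset(b) or b.issubset(a)

def intersection_match(a, b):
    # Relaxed heuristic: any shared account counts as a match.
    return bool(a & b)

# Two distinct transactions that both touch a shared transfer account.
xfer_out = {"Assets:Checking", "Assets:Transfer"}
xfer_in = {"Assets:Transfer", "Assets:Savings"}

print(intersection_match(xfer_out, xfer_in))  # True: a false positive
print(subset_match(xfer_out, xfer_in))        # False: correctly kept distinct
```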

Deduplication in beancount.ingest is handled by a hook that can be
customized, thus I don't think there is a need to provide command line
switches.

In Beangulp, the successor of beancount.ingest, deduplication will be
delegated to the importer (with a default implementation doing pretty
much what the current one does), which will allow finer (and easier)
customization.

Cheers,
Dan

redst...@gmail.com

Mar 30, 2021, 7:32:36 AM
to Beancount
> Deduplication in beancount.ingest is handled by a hook that can be
> customized, thus I don't think there is a need to provide command line
> switches.

Right. I'd looked at this ages ago and forgot that hooks already exist. That works well.

> In Beangulp, the successor of beancount.ingest, deduplication will be
> delegated to the importer (with a default implementation doing pretty
> much what the current one does), which will allow finer (and easier)
> customization.

Even better.

Thanks,
-red

Martin Blais

Mar 30, 2021, 8:01:22 AM
to Beancount
Dedup detection is definitely far from perfect and was just something I tried at the time.

In the new version - beangulp, which Daniele is driving - dedup can be done per importer. I think per-importer custom dedup is best. For example, any importer whose source provides a unique ID per transaction should leverage it.
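As a rough sketch of the per-importer idea (this is not the beangulp API; the "txn_id" metadata key and the dict layout are illustrative assumptions), dedup against a source-provided unique ID might look like:

```python
def dedup_by_id(new_entries, existing_entries):
    """Drop any new entry whose txn_id already appears in the ledger.

    Entries are represented here as plain dicts with a "meta" mapping;
    this stands in for whatever entry type the importer actually sees.
    """
    seen = {e["meta"].get("txn_id") for e in existing_entries}
    seen.discard(None)  # entries without an ID never suppress anything
    return [e for e in new_entries if e["meta"].get("txn_id") not in seen]

existing = [{"meta": {"txn_id": "abc123"}, "narration": "Coffee"}]
new = [
    {"meta": {"txn_id": "abc123"}, "narration": "Coffee"},     # duplicate
    {"meta": {"txn_id": "def456"}, "narration": "Groceries"},  # genuinely new
]
print([e["narration"] for e in dedup_by_id(new, existing)])  # ['Groceries']
```

Because the match is exact rather than fuzzy, this sidesteps the subset-vs-intersection question entirely for sources that supply stable IDs.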


redst...@gmail.com

Mar 31, 2021, 10:46:46 PM
to Beancount
It does a pretty good job for something you just tried :). Agreed: custom, per-importer dedup solves all the problems here. Thanks!