custom SimilarityComparator for bean-extract

Stefano Zacchiroli

Sep 2, 2018, 8:23:22 AM

to bean...@googlegroups.com

Heya,
I'm using the built-in CSV importer (beancount.ingest.importers.csv)
with bean-extract and, in spite of being documented as bare-bones, it
works perfectly fine for my needs :)

The only issue I'm facing is that I want to customize the behavior of
beancount.ingest.similar.SimilarityComparator and I didn't find a way to
do so.

(In short, I have a special metadata key, bank-label, which I import from
my CSV files and which I trust as a quasi-unique ID for deduplicating
transactions. That key + transaction date would be my ideal
deduplication criterion. SimilarityComparator() is both stricter than
what I want, e.g., it requires dates to be relatively near in time,
without a way to pass a different time window; and more lax, e.g., it
allows amounts to vary a bit.)
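
The criterion described above (same transaction date plus same bank-label) can be sketched as a two-argument predicate. This is only an illustration: `Txn` is a hypothetical stand-in for a Beancount transaction reduced to the two fields used here, `same_bank_label` is a made-up name, and the assumption that a comparator is a callable taking two entries and returning a boolean should be checked against beancount.ingest.similar.

```python
import datetime
from collections import namedtuple

# Hypothetical stand-in for a Beancount transaction: only date and meta.
Txn = namedtuple("Txn", "date meta")

BANK_LABEL = "bank-label"  # the metadata key mentioned in the post


def same_bank_label(entry1, entry2):
    """Return True when both entries carry the same bank-label and date.

    Entries without a bank-label never match, so they are never
    treated as duplicates of each other by this predicate.
    """
    label1 = (entry1.meta or {}).get(BANK_LABEL)
    label2 = (entry2.meta or {}).get(BANK_LABEL)
    return label1 is not None and label1 == label2 and entry1.date == entry2.date


# Small illustration with made-up labels.
a = Txn(datetime.date(2018, 9, 1), {BANK_LABEL: "REF-001"})
b = Txn(datetime.date(2018, 9, 1), {BANK_LABEL: "REF-001"})
c = Txn(datetime.date(2018, 9, 2), {BANK_LABEL: "REF-001"})
print(same_bank_label(a, b))
print(same_bank_label(a, c))
```

Unlike the built-in comparator as described above, this ignores amounts and accounts entirely and requires an exact date match rather than a time window.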

Ideally, I'd like to write my own SimilarityComparator and pass it down
to bean-extract via the importer configuration, but the configuration
API doesn't allow that ATM. Would such a generalization be welcome
to you, Martin? (as bug report and/or patch)

Cheers
--
Stefano Zacchiroli . za...@upsilon.cc . upsilon.cc/zack . . o . . . o . o
Computer Science Professor . CTO Software Heritage . . . . . o . . . o o
Former Debian Project Leader & OSI Board Director . . . o o o . . . o .
« the first rule of tautology club is the first rule of tautology club »

Martin Blais

Sep 2, 2018, 1:01:56 PM

to Beancount

I made a change a little while ago that allows you to turn your import file into a script:

    from beancount.ingest.scripts_utils import ingest
    ...
    CONFIG = [ .. list of importer instances .. ]
    ...
    ingest(CONFIG)

This turns your .import file into a script that you can run with an "identify", "extract" or "file" subcommand.

(You can still use the bean-identify, bean-extract, bean-file programs with it as before.)

But why am I mentioning this?

Well, because the purpose of doing that was to allow you to insert code before and/or after running the ingestion processes, and also to pass in arguments to the ingestion tools to customize them; see here:

https://bitbucket.org/blais/beancount/src/353d874f678149eb4af951d1e57b92041f7bbc7b/beancount/ingest/scripts_utils.py#lines-29

What you're looking for here is the "detect_duplicates_func".

You should be able to insert

ingest(CONFIG, my_duplicates_func)

at the bottom of your .import file and it should be invoked.
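
As a rough idea of what such a function could look like, here is a minimal sketch. The calling convention is an assumption, not something confirmed in this thread: it supposes the function receives a list of (key, new entries) pairs plus the existing ledger entries, and returns the same structure with duplicates flagged through a `__duplicate__` metadata key. Please check scripts_utils.py and extract.py in your checkout before relying on it. `Txn` and `dedup_key` are hypothetical names used so the sketch runs without Beancount installed.

```python
import datetime
from collections import namedtuple

# Hypothetical stand-in for a Beancount entry: only date and meta.
Txn = namedtuple("Txn", "date meta")

DUPLICATE_META = "__duplicate__"  # assumed marker key; verify in your version


def dedup_key(entry):
    """Return (date, bank-label), or None when the entry has no bank-label."""
    label = (entry.meta or {}).get("bank-label")
    return None if label is None else (entry.date, label)


def my_duplicates_func(new_entries_list, existing_entries):
    """Flag new entries whose (date, bank-label) already exists in the ledger.

    Assumed shapes: new_entries_list is a list of (key, entries) pairs,
    and the return value mirrors it with flagged copies of the entries.
    """
    seen = {dedup_key(entry) for entry in existing_entries} - {None}
    result = []
    for key, new_entries in new_entries_list:
        marked = []
        for entry in new_entries:
            if dedup_key(entry) in seen:
                # Copy the metadata so the original entry is left untouched.
                entry = entry._replace(meta={**entry.meta, DUPLICATE_META: True})
            marked.append(entry)
        result.append((key, marked))
    return result


# Small illustration with made-up data.
existing = [Txn(datetime.date(2018, 9, 1), {"bank-label": "REF-001"})]
new = [("bank.csv", [
    Txn(datetime.date(2018, 9, 1), {"bank-label": "REF-001"}),
    Txn(datetime.date(2018, 9, 3), {"bank-label": "REF-002"}),
])]
flagged = my_duplicates_func(new, existing)
```

With real Beancount entries the same logic should apply, since directives expose `date` and `meta` attributes and are namedtuples supporting `_replace`.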

If for whatever reason it doesn't fulfill your customization need, please let me know.


Stefano Zacchiroli

Sep 2, 2018, 1:53:14 PM

to bean...@googlegroups.com

On Sun, Sep 02, 2018 at 01:01:42PM -0400, Martin Blais wrote:
> I made a change a little while ago that allows you to turn your import
> file into a script:
>
> from beancount.ingest.scripts_utils import ingest

So basically, in the relatively short period between my last "hg
pull" and my asking about this here, you implemented the solution to a
problem I didn't know I had yet :-)

Amazing, thank you! (And, in passing, I also like the design a lot.)

I'll report back here if it's not flexible enough for my needs, but at
first sight it totally fits the bill.
