Categorizing transactions automatically on import


Martin Michlmayr

Apr 8, 2020, 10:15:31 PM
to bean...@googlegroups.com
The "Importing External Data in Beancount" document says:

"Beancount does not currently provide a mechanism to automatically
categorize transactions. You can build this into your importer code. I
want to provide a hook for the user to register a completion function
that could run across all the importers where you could hook that code
in."

I have several importer scripts now that work as well as
beancount's OFX import, but I'd like to assign account names, payees,
narrations and meta-data automatically.

At the moment I pipe the output of bean-extract to a Perl file to
add the data, but there must be a better way.

What would be an elegant way to wrap the importers into code that
can run over the entries and improve the transactions?

I looked at some importers on GitHub and I haven't seen anything.
I believe smart_importer does something like that, but I couldn't
figure out how it works or how I could do the same (without
smart_importer). (My Python isn't fantastic.)

--
Martin Michlmayr
https://www.cyrius.com/

Patrick Ruckstuhl

Apr 9, 2020, 1:47:05 AM
to bean...@googlegroups.com, Martin Michlmayr
Hi Martin,

Smart importer works by learning from your existing transactions. You configure it as a wrapper around your importers.

The workflow I'm using is to run my importers through Fava. That way I see the new transactions and can adjust them if needed, and afterwards they are placed in the right location in the right file based on Fava's config.

If you want to do it just from the command line, you need to run bean-extract with the option to pass your existing beancount file so smart importer can use it to train its model.

Regards,
Patrick

Martin Michlmayr

Apr 9, 2020, 1:51:37 AM
to Patrick Ruckstuhl, bean...@googlegroups.com
* Patrick Ruckstuhl <pat...@ch.tario.org> [2020-04-09 07:46]:
> Smart importer works by learning from your existing transactions. You configure it as an wrapper around your importers.

Sorry if my question wasn't clear. Basically I'm asking how I can
wrap my own tagging mechanism around importers. I haven't found
any examples (apart from smart_importer, but I can't figure out
how they do it).

Maybe I should just use smart_importer, but for now I'm leaning
towards a simple match approach.

Stefano Zacchiroli

Apr 9, 2020, 5:13:11 AM
to bean...@googlegroups.com
On Thu, Apr 09, 2020 at 10:15:23AM +0800, Martin Michlmayr wrote:
> The "Importing External Data in Beancount" document says:
>
> "Beancount does not currently provide a mechanism to automatically
> categorize transactions. You can build this into your importer code. I
> want to provide a hook for the user to register a completion function
> that could run across all the importers where you could hook that code
> in."
[...]
> What would be an elegant way to wrap the importers into code that
> can run over the entries and improve the transactions?

For now I'm using only a custom CSV importer, which inherits from the
built-in CSV importer, and the only thing I'm doing is categorizing
transactions, i.e., determining the account of one leg of the
transaction. To that end, I'm reusing the notion of "categorizer"
supported by the built-in CSV importer. It's a function from and to
entries; so even if I'm using it only for changing the account, you
could in theory change whatever you want of the extracted entries. But I
understand you're not using the CSV importer, so the categorizer
abstraction is probably not what you're looking for.

Still, the importer.py protocol, which you've probably implemented, has
this:

extract(): Extract directives from a file's contents and return a
list of entries.

which:

Returns:
A list of new, imported directives (usually mostly Transactions)
extracted from the file.

So, assuming you already have an importer that returns "raw"
transactions, you can just have a function that improves your
transactions and map it onto the output of extract(), I believe. It can
also be a wrapper class around any importer that turns a "dumb" importer
into a "smart" one, that enriches transactions as you please.
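A minimal sketch of such a wrapper class, using delegation rather than inheritance so that it can wrap any importer object. The names EnrichingImporter, FakeImporter and add_test_meta are hypothetical, and FakeImporter only mimics the extract()/identify()/name() methods of a real importer; whether bean-extract accepts an object that does not subclass its importer base class is not tested here.

```python
class EnrichingImporter:
    """Wrap any importer object and post-process what its extract() returns."""

    def __init__(self, importer, enrich):
        self._importer = importer  # the wrapped "dumb" importer
        self._enrich = enrich      # function: entry -> improved entry

    def __getattr__(self, name):
        # Only called for attributes not found on the wrapper itself, so
        # identify(), name(), etc. are forwarded to the wrapped importer.
        return getattr(self._importer, name)

    def extract(self, file, existing_entries=None):
        entries = self._importer.extract(file, existing_entries)
        return [self._enrich(entry) for entry in entries]


class FakeImporter:
    """Hypothetical stand-in for a real importer, for illustration only."""

    def name(self):
        return "fake"

    def identify(self, file):
        return file.endswith(".ofx")

    def extract(self, file, existing_entries=None):
        # Plain dicts stand in for real Beancount entries.
        return [{"narration": "ALDI 123", "meta": {}}]


def add_test_meta(entry):
    entry["meta"]["test"] = "foo"
    return entry


wrapped = EnrichingImporter(FakeImporter(), add_test_meta)
entries = wrapped.extract("statement.ofx")
```

Everything except extract() behaves as on the wrapped importer, so the same wrapper could in principle be put around each importer in CONFIG.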

Is that what you're looking for here?

Cheers
--
Stefano Zacchiroli . za...@upsilon.cc . upsilon.cc/zack . . o . . . o . o
Computer Science Professor . CTO Software Heritage . . . . . o . . . o o
Former Debian Project Leader & OSI Board Director . . . o o o . . . o .
« the first rule of tautology club is the first rule of tautology club »

Martin Michlmayr

Apr 9, 2020, 6:20:26 AM
to bean...@googlegroups.com
* Stefano Zacchiroli <za...@upsilon.cc> [2020-04-09 11:13]:
> It can also be a wrapper class around any importer that turns a
> "dumb" importer into a "smart" one, that enriches transactions as
> you please.
>
> Is that what you're looking for here?

Right, this is exactly what I'm trying to do. Write a wrapper so
I can do:

CONFIG = [
    mywrapper(ofx.Importer('1234', 'Assets:ABC')),
    ...

and then mywrapper would exactly act like any importer (i.e. it would
just call the original importer) except that extract() would first
call the original importer and then iterate over the entries and
fix them up.

I'm sure this is basic Python stuff but unfortunately I can't figure
out how to do it. Has anyone done this and can give an example of
such a wrapper?

Stefano Zacchiroli

Apr 9, 2020, 7:35:19 AM
to bean...@googlegroups.com
On Thu, Apr 09, 2020 at 06:20:18PM +0800, Martin Michlmayr wrote:
> CONFIG = [
>     mywrapper(ofx.Importer('1234', 'Assets:ABC')),
>     ...
>
> and then mywrapper would exactly act like any importer (i.e. it would
> just call the original importer) except that extract() would first
> call the original importer and then iterate over the entries and
> fix them up.
>
> I'm sure this is basic Python stuff but unfortunately I can't figure
> out how to do it. Has anyone done this and can give an example of
> such a wrapper?

Here's an untested sketch:

------------------------------------------------------------------------

class mywrapper(ofx.Importer):

    def extract(self, file, existing_entries=None):
        entries = super().extract(file, existing_entries)
        return list(map(categorize_entry, entries))

------------------------------------------------------------------------

where categorize_entry is your function that takes an entry as input and
returns an improved version of it. (Note that due to the extract
interface your function should really handle all possible Beancount
entries, not only transactions. Presumably you'll return all other
entries unchanged.)
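A small sketch of a categorize_entry that follows this advice and leaves non-transaction entries alone. The Transaction and Balance named tuples below are simplified stand-ins so that the snippet runs without Beancount; in real code the check would presumably be made against the Transaction type from beancount/core/data.py.

```python
from collections import namedtuple

# Simplified stand-ins for Beancount's directive types (assumption: the real
# ones are also named tuples, but with more fields than shown here).
Transaction = namedtuple("Transaction", "meta date narration postings")
Balance = namedtuple("Balance", "meta date account amount")


def categorize_entry(entry):
    """Tag transactions; return every other kind of entry unchanged."""
    if not isinstance(entry, Transaction):
        return entry
    entry.meta["test"] = "foo"
    return entry


entries = [
    Transaction({}, "2020-04-09", "ALDI 123", []),
    Balance({}, "2020-04-10", "Assets:ABC", "100.00 USD"),
]
result = list(map(categorize_entry, entries))
```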

Let me know if it works :-)

Patrick Ruckstuhl

Apr 9, 2020, 7:37:58 AM
to bean...@googlegroups.com

Hi,

In smart importer it's done in https://github.com/beancount/smart_importer/blob/master/smart_importer/hooks.py


and then applied like this in the import config


apply_hooks(my_bank_importer, [PredictPostings(), PredictPayees()])


Where the hooks PredictPostings and PredictPayees extend the ImporterHook and implement def __call__(self, importer, file, imported_entries, existing_entries):
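For readers who want the hook idea without any extra dependency, here is a rough sketch of how such a mechanism could look. This is not smart_importer's code: apply_simple_hooks, FakeImporter and TagHook are hypothetical names, and only the hook call signature is taken from the description above.

```python
def apply_simple_hooks(importer, hooks):
    """Patch importer.extract so each hook post-processes the imported entries."""
    original_extract = importer.extract

    def patched_extract(file, existing_entries=None):
        entries = original_extract(file, existing_entries)
        for hook in hooks:
            # Each hook receives the entries produced so far and returns
            # the (possibly modified) list.
            entries = hook(importer, file, entries, existing_entries)
        return entries

    importer.extract = patched_extract
    return importer


class FakeImporter:
    """Hypothetical stand-in for a real importer, for illustration only."""

    def extract(self, file, existing_entries=None):
        return [{"narration": "ALDI 123", "meta": {}}]


class TagHook:
    def __call__(self, importer, file, imported_entries, existing_entries):
        for entry in imported_entries:
            entry["meta"]["test"] = "foo"
        return imported_entries


importer = apply_simple_hooks(FakeImporter(), [TagHook()])
entries = importer.extract("statement.ofx")
```

Because the importer object itself is patched and returned, it keeps all its other methods, and different hooks can be given to different importers.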

Regards,

Patrick

Martin Michlmayr

Apr 10, 2020, 12:15:03 AM
to 'Patrick Ruckstuhl' via Beancount
Thank you Patrick and Zack for your patience and explanations!
I got both approaches working.

Zack's approach seems simpler to me, but the problem is that it's a
sub-class of ofx.Importer whereas I was hoping for a wrapper that
I could apply to any importer. In the case I'm trying to solve,
I only have OFX so I might go with this approach.

Patrick's hook mechanism is more flexible in that regard since
I can apply it to any importer and I could even apply different
hooks to different importers. The downside is that it relies
on smart_importer which has a lot of ML dependencies.

Patrick, have you (or other smart_importer developers) proposed that
hook mechanism for inclusion into beancount?

For those as Python-challenged as me, here's some example code:

Note: both attach the meta-data to non-transaction entries (like
'balance') as well. There needs to be a check for transactions.

1) Using Zack's sub-class approach:

def categorize_entry(entry):
    entry.meta['test'] = 'foo'
    return entry

class myOFX(ofx.Importer):

    def extract(self, file, existing_entries=None):
        entries = super().extract(file, existing_entries)
        return list(map(categorize_entry, entries))

CONFIG = [
    myOFX('1...', 'Assets:...'),
]

2) Using smart_importer's hook mechanism:

from smart_importer.hooks import apply_hooks, ImporterHook

class SPIImporter(ImporterHook):
    def __call__(self, importer, file, imported_entries, existing_entries):
        self.account = importer.file_account(file)
        for entry in imported_entries:
            entry.meta['test'] = 'foo'
        return imported_entries

CONFIG = [
    apply_hooks(ofx.Importer('1...', 'Assets:...'), [SPIImporter()]),
]

Stefano Zacchiroli

Apr 10, 2020, 4:23:42 AM
to bean...@googlegroups.com
On Fri, Apr 10, 2020 at 12:14:55PM +0800, Martin Michlmayr wrote:
> Zack's approach seems simpler to me, but the problem is that it's a
> sub-class of ofx.Importer whereas I was hoping for a wrapper that
> I could apply to any importer. In the case I'm trying to solve,
> I only have OFX so I might go with this approach.

I'm pretty sure it can be made fully general with a mixin that takes any
importer and uses your categorizer before returning results. Try
something like this (again, untested):

------------------------------------------------------------------------

class CategorizerMixin():

    @staticmethod
    def categorize_entry(entry):
        entry.meta['test'] = 'foo'
        return entry

    def extract(self, file, existing_entries=None):
        entries = super().extract(file, existing_entries)
        return list(map(self.categorize_entry, entries))


class MyOFX(ofx.Importer, CategorizerMixin): pass


class MyOtherImporter(OtherImporter, CategorizerMixin): pass


CONFIG = [
    myOFX('1...', 'Assets:...'),
    MyOtherImporter(...),
]

------------------------------------------------------------------------

There might be some issues with MRO that I haven't got right without
testing, but there's definitely a way along the above lines to make a
fully general wrapper that "lifts" any (non-categorizing) importer to
one that categorizes as you please. Happy to chat more offline if you
want to give this a try.
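The MRO concern can be checked without Beancount. The sketch below uses a hypothetical BaseImporter as a stand-in for a real importer such as ofx.Importer (assumed to define its own extract()) and compares the two possible orders of base classes.

```python
class BaseImporter:
    """Hypothetical stand-in for a real importer that defines extract()."""

    def extract(self, file, existing_entries=None):
        return ["raw entry"]


class CategorizerMixin:

    @staticmethod
    def categorize_entry(entry):
        return entry + " (categorized)"

    def extract(self, file, existing_entries=None):
        entries = super().extract(file, existing_entries)
        return list(map(self.categorize_entry, entries))


class MixinFirst(CategorizerMixin, BaseImporter):
    pass


class MixinLast(BaseImporter, CategorizerMixin):
    pass


# Mixin first: its extract() is found first and calls on to BaseImporter.
first = MixinFirst().extract("statement.ofx")

# Mixin last: BaseImporter.extract() is found first and never calls the mixin.
last = MixinLast().extract("statement.ofx")
```

With the mixin listed first, the entries come back categorized; with the mixin listed last, no error is raised but the categorizer is silently skipped.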

TRS-80

Jul 15, 2020, 10:14:21 AM
to bean...@googlegroups.com
On 2020-04-10 04:23, Stefano Zacchiroli wrote:
> I'm pretty sure it can be made fully general with a mixin that takes
> any
> importer and use your categorizer before returning results. Try
> something like this (again, untested):
>
> ------------------------------------------------------------------------
>
> class CategorizerMixin():
>
>     @staticmethod
>     def categorize_entry(entry):
>         entry.meta['test'] = 'foo'
>         return entry
>
>     def extract(self, file, existing_entries=None):
>         entries = super().extract(file, existing_entries)
>         return list(map(self.categorize_entry, entries))
>
>
> class MyOFX(ofx.Importer, CategorizerMixin): pass
>
>
> class MyOtherImporter(OtherImporter, CategorizerMixin): pass
>
>
> CONFIG = [
>     myOFX('1...', 'Assets:...'),
>     MyOtherImporter(...),
> ]
>


@Martin,

I was curious if you ever got this working?

@All,

Thanks a lot Stefano for that example, it is greatly appreciated! Like
Martin, I am also still learning Python. But with the help of your
example, I now have it working.

One thing I wanted to point out, not to be a grammar Nazi, but for
others like us who are Python novices, is that the "MyOFX" class was
capitalized where it was declared, but then in the "CONFIG = [" section
it was referenced as "myOFX(...)". Just a small typo, but enough to
result in a "NameError: name 'myOFX' is not defined" error which I
started searching the Internet about for a little while until I realized
it was just a capitalization mistake. :)

Since this is one of the few threads that come up when you search the
mailing list for "categorizer OFX" (and by far the most relevant, IMO) I
will share my more complete working example. A few notes:

1. All I did was basically combine Stefano's wrapper with my
previously working (with CSV anyway) categorizer, which itself I found
some time ago searching around for "dumb categorizer."

2. I also added a few comments, mostly to remind myself of things the
next time I touch this which might be a long time from now. :) Maybe
they will also be helpful for others, so I left them in.

3. I changed the name of the function from "categorize_entry" to
"new_categorizer" (to distinguish it from the previous
"dumb_categorizer" it was based upon). Oh yes and I changed "entry" to
"txn." I am not sure which is better / more correct, but this was the
way all 100+ of my pre-existing rules were written, so I decided it was
easier to change this a couple places in the invocation rather than all
100+ of my existing rules (even with a good editor). ;)

4. Again, I am a Python noob, but from the little I read about Python
mixins (https://www.ianlewis.org/en/mixins-and-python), I think the
class hierarchy is built from right to left, so the leftmost base class
takes priority when a method is looked up. Therefore we are supposed to
write them like:

"class MyOFX(CategorizerMixin, ofx.Importer):" instead of like:
"class MyOFX(ofx.Importer, CategorizerMixin):"

With the mixin listed last, Python finds the extract() of ofx.Importer
first and the mixin's extract() is never called, so here the order does
matter.

5. Be careful with the order of your rules, the first one that matches will
"win." I mostly keep mine alphabetical (to keep them organized) however
I have had to move a few to the bottom for prioritization reasons. I
make sure to note them accordingly in their own section.

6. Note the ".lower()" function, which will transform the
incoming/existing txn.narration to all lower case (just for purposes of
rule matching; it doesn't change it permanently). Which is also why all
the rules are also lower case.

7.a. If you are looking for a reference as to what other fields might be
available for you to work with in a categorizer, then you can find your
answers in "beancount/core/data.py"[0], in particular the "Transaction"
and "Posting" directives.

7.b. Somewhat related to above, I think this can be used as a base to
extend the "dumb categorizer" quite far in custom directions, without
the need of using "AI" or any sort of Bayesian "smarts" which I don't
really want. Personally, I by far prefer to explicitly define my
categorizing rules, and I figure that there are probably others out
there who also feel the same way (I don't want any "surprises" nor do I
want to fight with my machines; I like them to do /exactly/ as I tell
them). ;) For those who feel differently, there is smart_importer
of course (already mentioned further up thread).

8. There are a couple of different ways you can attach the Postings to the
Transaction (whether to sort values or not), which I explain at the
bottom, after the example. Other than that, the rest of this post will
be example.

[0]:
https://github.com/beancount/beancount/blob/master/beancount/core/data.py#L168

Alright then, without further ado:

class CategorizerMixin():

    @staticmethod
    def new_categorizer(txn):

        # If you want to add any meta data, do it here for all directives
        # including eg. Balance assertions (which do not have any legs):
        #
        txn.meta['meta_for_all_directives'] = 'foo'

        # At this time the txn has only one posting
        try:
            posting1 = txn.postings[0]
        except IndexError:
            return txn
        # Ex. Balance objects don't have any postings, either
        except AttributeError:
            return txn

        # Otherwise to add metadata to all normal transactions (with one or
        # more legs), add them here (after things above return):
        #
        txn.meta['meta_for_transactions'] = 'bar'


        # Guess the account(s) of the other posting(s)
        #
        # Standard searches, listed alphabetically. Better for a search
        # string to be longer than to be shorter and end up with false
        # positives.

        if 'aldi' in txn.narration.lower():
            account = 'Expenses:Groceries'

        elif 'aliexpress' in txn.narration.lower():
            account = 'Expenses:Unknown:CheckReceipt'

        elif 'amazon' in txn.narration.lower():
            account = 'Expenses:Unknown:CheckReceipt'

        elif 'amzn' in txn.narration.lower():
            account = 'Expenses:Unknown:CheckReceipt'

        elif 'anytime fit' in txn.narration.lower():
            account = 'Expenses:Self:Fitness'

        elif 'applebees' in txn.narration.lower():
            account = 'Expenses:Social:EatingDrinkingOut'

        elif 'arby\'s' in txn.narration.lower():
            account = 'Expenses:Food:EatingOut'

        # pretty broad but so far working, includes deposits
        elif 'atm' in txn.narration.lower():
            account = 'Assets:Self:Cash:Wallet-C'

        # ... 100+ more 'elif' statements ... ;)

        else:
            # default to this if nothing else
            account = 'Expenses:Unknown:NewOneLeg'

        # Make the other posting(s)
        posting2 = posting1._replace(
            account=account,
            units=-posting1.units
        )

        # Insert / Append the posting into the transaction (see note below)
        txn.postings.append(posting2)

        return txn


    def extract(self, file, existing_entries=None):
        entries = super().extract(file, existing_entries)
        return list(map(self.new_categorizer, entries))

END OF EXAMPLE
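As an alternative to a long elif chain, the same first-match-wins logic could be written as an ordered list of rules. This is only a sketch: RULES, DEFAULT_ACCOUNT and guess_account are hypothetical names, and only a few of the rules from the example are repeated here.

```python
# Ordered (search string, account) pairs; the first match wins, so rules that
# need priority must come earlier in the list. Search strings are lower case.
RULES = [
    ("aldi", "Expenses:Groceries"),
    ("amazon", "Expenses:Unknown:CheckReceipt"),
    ("amzn", "Expenses:Unknown:CheckReceipt"),
    ("anytime fit", "Expenses:Self:Fitness"),
    ("atm", "Assets:Self:Cash:Wallet-C"),
]
DEFAULT_ACCOUNT = "Expenses:Unknown:NewOneLeg"


def guess_account(narration, rules=RULES, default=DEFAULT_ACCOUNT):
    """Return the account of the first rule whose string occurs in narration."""
    lowered = narration.lower()  # only for matching; narration is not changed
    for needle, account in rules:
        if needle in lowered:
            return account
    return default


groceries = guess_account("ALDI SUED 42")
unknown = guess_account("Some Shop")
# A short search string such as 'atm' also matches inside 'treatment'.
false_positive = guess_account("Dental Treatment")
```

Inside the categorizer, the whole if/elif/else block would then reduce to a single call such as account = guess_account(txn.narration).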

OK, so about that "Insert / Append" section. Originally the "dumb
categorizer" as I found it contained the following code. What this does
is to sort the Posting legs such that the smaller amount is always
first. For example, subtracting some amount out of a Checking account
would be first, then the + to Expense account would be second. Which is
fine, until you make a deposit. I prefer that whatever is happening to
the account in question (whether + or -) be listed first, and then the
"other" account be listed second. It's all a matter of preference, I
just wanted to point it out. I include the original code below in case
you prefer the other way.

# Insert / Append the posting into the transaction
if posting1.units < posting2.units:
    txn.postings.append(posting2)
else:
    txn.postings.insert(0, posting2)

return txn
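To see the difference between the two orderings on a deposit, here is a small sketch. Posting is a simplified stand-in with a plain Decimal as units, and the two function names and the account names are hypothetical.

```python
from collections import namedtuple
from decimal import Decimal

# Simplified stand-in for Beancount's Posting; units is a plain Decimal here.
Posting = namedtuple("Posting", "account units")


def imported_account_first(posting1, posting2):
    """Keep the leg that came from the statement first."""
    return [posting1, posting2]


def smaller_amount_first(posting1, posting2):
    """Put the leg with the smaller (more negative) amount first."""
    if posting1.units < posting2.units:
        return [posting1, posting2]
    return [posting2, posting1]


# A deposit: the checking account (the imported leg) receives +100.
deposit = Posting("Assets:Checking", Decimal("100.00"))
other = Posting("Income:Unknown", Decimal("-100.00"))

by_import = [p.account for p in imported_account_first(deposit, other)]
by_amount = [p.account for p in smaller_amount_first(deposit, other)]
```

For a withdrawal both functions give the same order; only for deposits does the sorted variant move the "other" account to the top.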