Automatically classifying transactions imported from ledgerhub


Jason Chu

Dec 14, 2015, 12:12:16 AM
to Beancount
Is there a way to automatically classify common transactions as the expenses they're associated with?  I plan on using ledgerhub to import my credit card transactions and I'm used to gnucash, which can classify transactions with at least 70% accuracy...

All I'm really looking for is a way to say "TRADER JOES" transactions are Expenses:Groceries, etc.

Martin Blais

Dec 14, 2015, 12:28:26 AM
to Beancount
On Mon, Dec 14, 2015 at 12:12 AM, Jason Chu <xen...@gmail.com> wrote:
Is there a way to automatically classify common transactions as the expenses they're associated with?  I plan on using ledgerhub to import my credit card transactions and I'm used to gnucash, which can classify transactions with at least 70% accuracy...

All I'm really looking for is a way to say "TRADER JOES" transactions are Expenses:Groceries, etc.

I haven't built anything for that so far. My own 10-year-old file lives with messy payees and I'd like to fix it eventually.

The best way to do this IMHO would be to use the contents of the existing ledger and somehow figure out a way to automatically classify new entries from seen entries. It presupposes that at least some of the entries in the ledger are "labeled" with correct payees and that you're able to figure out which they are.

Or, you can just write a script that does that for you using some rules. You can do that at import time, or you could write a plugin that corrects imperfect input by fixing up the payees at parse time.
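A parse-time fix-up of this kind could be sketched as a Beancount plugin. Beancount plugins are modules that list their entry points in `__plugins__` and expose functions that take `(entries, options_map)` and return `(entries, errors)`. The rule table `PAYEE_RULES` and the function names below are hypothetical, and transactions are detected by duck typing on the assumption that only transaction directives carry a `payee` field.

```python
import re

# Beancount reads this attribute to find the plugin's entry points.
__plugins__ = ("normalize_payees",)

# Hypothetical rules: messy statement text -> clean payee name.
PAYEE_RULES = [
    (re.compile(r"TRADER\s+JOE'?S", re.IGNORECASE), "Trader Joe's"),
    (re.compile(r"AMZN|AMAZON", re.IGNORECASE), "Amazon"),
]


def clean_payee(text):
    """Return the clean payee for the first matching rule, else None."""
    for pattern, clean in PAYEE_RULES:
        if pattern.search(text):
            return clean
    return None


def normalize_payees(entries, options_map):
    """Rewrite the payee of each transaction whose payee or narration matches a rule."""
    new_entries = []
    for entry in entries:
        # Assumption: only transaction directives have a `payee` attribute.
        if hasattr(entry, "payee"):
            text = " ".join(filter(None, [entry.payee, entry.narration]))
            clean = clean_payee(text)
            if clean is not None:
                # Directives are immutable namedtuples, so build a modified copy.
                entry = entry._replace(payee=clean)
        new_entries.append(entry)
    return new_entries, []  # this sketch never reports errors
```

Assuming the module were saved as `payee_cleanup.py` (a made-up name) somewhere on the Python path, it would be enabled from the ledger file with a `plugin "payee_cleanup"` directive.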

These are just ideas.

Jason Chu

Dec 14, 2015, 1:18:49 AM
to Beancount
I would really just like something that could look at the proposed transactions and pre-classify some of them for me.  Even if I had to curate the pre-classification by hand, it would save a lot of time when you repeatedly go to the same places month after month.

I was hoping to get away without having to write something myself in python, but I might end up having to do that.  Can you point me to the APIs for modifying transactions from within python?

--
You received this message because you are subscribed to the Google Groups "Beancount" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beancount+...@googlegroups.com.
To post to this group, send email to bean...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/beancount/CAK21%2BhPc8Nc%3De9GJnKiXRadg1TwXWn2iBwoqWUoe6nZbg6QQaQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Martin Blais

Dec 14, 2015, 1:54:03 AM
to Beancount
On Mon, Dec 14, 2015 at 1:18 AM, Jason Chu <xen...@gmail.com> wrote:
I would really just like something that could look at the proposed transactions and pre-classify some of them for me.  Even if I had to curate the pre-classification by hand, it would save a lot of time when you repeatedly go to the same places month after month.

In my experience, if you have account name completion set up properly, the categorization takes too little time to bother.


I was hoping to get away without having to write something myself in python, but I might end up having to do that.  Can you point me to the APIs for modifying transactions from within python?


Once you've got it built, let us know what approach worked well for you.
I suspect that a little mapping of regexp to account name is probably all that's required to automate away most of it.
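A minimal sketch of such a regexp-to-account mapping, in Python since that is what Beancount is written in. The patterns, account names, and the `classify` function are illustrative assumptions, not part of Beancount or ledgerhub.

```python
import re

# Hypothetical rules, checked in order; the first match wins.
ACCOUNT_RULES = [
    (r"TRADER\s+JOE", "Expenses:Groceries"),
    (r"SHELL|CHEVRON", "Expenses:Car:Gas"),
    (r"NETFLIX", "Expenses:Entertainment"),
]
_COMPILED = [(re.compile(p, re.IGNORECASE), a) for p, a in ACCOUNT_RULES]


def classify(description, default=None):
    """Return the account of the first rule matching the description."""
    for pattern, account in _COMPILED:
        if pattern.search(description):
            return account
    return default  # unmatched entries are left for manual categorization
```

An importer could call `classify()` on each statement description and fill in the second posting only when it returns an account, leaving everything else to be categorized by hand.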





Jeremy Maitin-Shepard

Dec 14, 2015, 3:47:14 AM
to bean...@googlegroups.com
I wrote a program that does this a little while ago.  I used it to import several years of transactions into Beancount, and continue to use it to import new transactions.

It is in an extremely unpolished state, but I just posted the code here because it may be useful to you:

https://github.com/jbms/beancount-import

It is specifically designed to work with CSV files downloaded from Mint, but could very easily be extended to work with any bank/credit card statement CSV export file, and could be extended for more general purposes.

The input format effectively specifies one of the two postings, and it then uses a decision tree classifier to predict the other account name, based on the source account and the description text from the input file.  (The decision tree features consist of all consecutive subsequences of one or more words from the description text).  You can then either accept the transaction and have it added to the end of the journal, or manually enter a new account.

The posting corresponding to the source account is given two metadata fields, date and source_data.  These are used to skip importing transactions that have already been imported, and the source_data is also used to train the classifier the next time the program is run.

Additionally, if the source account and amount exactly match and the date approximately matches an existing posting in the journal that does not already have a source_data field, then the user is also given the option to indicate that it is a correct match, in which case the existing posting in the journal is edited to include the source_data and date metadata fields.
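The feature extraction described above (all consecutive subsequences of one or more words) can be illustrated roughly as follows. This is a sketch and not code taken from beancount-import: the function name `ngram_features` and the lowercasing step are assumptions.

```python
def ngram_features(description):
    """Return every run of one or more consecutive words in the description."""
    words = description.lower().split()  # lowercasing is an assumption
    features = set()
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            features.add(" ".join(words[start:end]))
    return features
```

For "TRADER JOES NY" this yields the three single words, the two word pairs, and the full three-word phrase. Feature sets like these could then be one-hot encoded and passed to a decision tree, for instance with scikit-learn's `DictVectorizer` and `DecisionTreeClassifier`.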

The prediction process works pretty well, though it does depend on how consistent your transactions are.  Recurring bills/direct deposit tend to get handled perfectly.  Naturally when the same store is treated as different types of expenses (or perhaps Assets:Reimbursable sometimes), it doesn't work.

Right now the UI is implemented in npyscreen and is rather hacky.  The editing of existing transactions is also done in a rather hacky way (but works).

Martin Blais

Dec 14, 2015, 9:30:07 AM
to Beancount
Added a link to the contributions page. Thanks for sharing.


redst...@gmail.com

Dec 16, 2015, 8:38:37 PM
to Beancount
As a note, this is one of the things I use Yodlee to do. It gets well over 90% right. It uses its inbuilt data to figure out most of the expenses itself, and you can override its guesses permanently, using rules, in cases where it's needed.

However, this might not be a solution for everyone, and it certainly doesn't work for historical records.

Martin Blais

Dec 17, 2015, 12:48:52 AM
to Beancount
On Wed, Dec 16, 2015 at 8:38 PM, <redst...@gmail.com> wrote:
As a note, this is one of the things I use Yodlee to do. It gets well over 90% right. It uses its inbuilt data to figure out most of the expenses itself, and you can override its guesses permanently, using rules, in cases where it's needed.

Yodlee was very attractive to me as a piece of technology, until I met someone who's a client of their data feeds and I learned that they essentially sell your personal transaction data to banks and hedge funds so that they can derive economic predictions from it. This data access is pretty scary to me. Not sure how well anonymized the data is, but even so, someone with access to another data feed can probably join Yodlee transaction data with it and narrow down to the individual. *Shivers*

(That being said, I think it's entirely possible that one of the institutions I deal with has a clause that says they can share my data with one of their partners, and Yodlee could be one of them, and my data shows up there too even if I don't have an account there. *Double shivers*)


redst...@gmail.com

Dec 17, 2015, 2:51:15 AM
to Beancount
Yes. My understanding though is, unfortunately, Yodlee does this anyway in many cases if you have an account with one of the institutions that use Yodlee's technology in their backend. And unfortunately, Yodlee has a rather impressive list of clients. So when I looked into this, my own conclusion was, Yodlee does what they do with the data anyway; signing up allows you to get access to it, nothing else.

Also note: there are (at least) two ways to sign up for Yodlee. One way is directly at their website. The other way is if you bank with one of their customers who give you access - like Bank of America's My Portfolio. I don't know if there are differences in terms and conditions with the two methods.

Anyway, on a different note, OP, have you tried:

Jeremy Maitin-Shepard

Dec 17, 2015, 2:07:46 PM
to bean...@googlegroups.com
That was the basis for the tool that I wrote.  I tried using it, but found it to be insufficiently robust, and the naive Bayes classifier didn't work as well as a decision tree classifier.

Martin Blais

Dec 17, 2015, 10:49:09 PM
to Beancount
Ideally, a classifier would operate on imported Beancount transactions directly rather than at the import level, so that it's reusable across a variety of different importers. In other words, the automatic categorization / completion of a transaction is a feature orthogonal to that of its creation via import.


Jeremy Maitin-Shepard

Dec 18, 2015, 2:04:02 AM
to bean...@googlegroups.com
On Thu, Dec 17, 2015 at 7:49 PM, Martin Blais <bl...@furius.ca> wrote:
Ideally, a classifier would operate on imported Beancount transactions directly rather than at the import level, so that it's reusable across a variety of different importers. In other words, the automatic categorization / completion of a transaction is a feature orthogonal to that of its creation via import.

I completely agree that the classification should operate independently of the source, but there are some complications:

- Classification is only easy to define for the case where you have a transaction with exactly two postings: one posting from a known "source account", and one unknown posting for the opposite amount in an unknown account, to be determined by the classifier.

- You need some way to determine, for each imported posting, a description string (or other source of features), that will be used along with the source account, and possibly the amount and date, for classification.

- You need to be able to determine for each existing posting in the beancount journal that was imported (or manually entered but later matched with an imported posting) what the original description string used to import it was.  This is used to train the classifier based on a beancount journal.  Just using the payee/narration field of the transaction isn't a good idea, since the user may have edited it, and we only want to train based on the description strings seen when importing.
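The three requirements above can be made concrete with a small sketch that collects training examples from a journal. The `Posting` and `Transaction` classes here are simplified stand-ins rather than Beancount's own types, and the `source_data` metadata key follows the convention described earlier in the thread.

```python
from dataclasses import dataclass, field


@dataclass
class Posting:
    account: str
    meta: dict = field(default_factory=dict)


@dataclass
class Transaction:
    postings: list


def training_examples(transactions, key="source_data"):
    """Yield (source_account, description, target_account) triples."""
    for txn in transactions:
        if len(txn.postings) != 2:
            continue  # classification is only well defined for two postings
        tagged = [p for p in txn.postings if key in p.meta]
        if len(tagged) != 1:
            continue  # need exactly one imported (source) side
        source = tagged[0]
        target = next(p for p in txn.postings if p is not source)
        # Train on the original import string, not the editable narration.
        yield source.account, source.meta[key], target.account
```

Transactions with more than two postings, or with no imported side, are simply skipped, so hand-edited payee and narration text never reaches the classifier.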

A good way to achieve this would be to have a single import tool that supports arbitrary backends that provide the actual transactions to import.
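One way such a backend interface might look, purely as an illustration. `RawEntry`, `Backend`, and `propose` are hypothetical names that do not correspond to any existing Beancount or beancount-import API.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Iterable, Protocol


@dataclass(frozen=True)
class RawEntry:
    """One imported posting, before the opposite account is known."""
    source_account: str
    posted: date
    amount: Decimal
    description: str


class Backend(Protocol):
    """Anything that can produce raw entries: a CSV reader, an OFX reader, etc."""
    def entries(self) -> Iterable[RawEntry]: ...


def propose(backends, classify):
    """Pair each raw entry from each backend with a predicted account."""
    for backend in backends:
        for raw in backend.entries():
            yield raw, classify(raw.source_account, raw.description)
```

With this split, the classifier only ever sees `(source_account, description)` pairs, so it stays independent of where the transactions came from.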