Categorizing spending

Aaron Stacy

Dec 20, 2021, 6:56:04 PM
to ledge...@googlegroups.com
Hi, I'm looking for suggestions for categorizing spending (not so much things like paychecks, brokerage transactions, etc., but stuff like credit card spending for budgeting). My ledger has around 2800 transactions over about 2 years, so it's not a ton of data, but it seems like enough that I could leverage something smarter than just string matching on the transaction narrations.

Does anyone have recommendations for categorizing spending?

I'm thinking of applying a full text search index as follows:

- Each expense account is a "document".
- The document's contents are the narrations of every transaction for that account.
- To categorize a new transaction, use an engine like Lucene or sklearn's TfidfVectorizer and pick the most likely account.
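
The steps above could be sketched roughly like this with sklearn. This is only a minimal illustration: the (narration, account) pairs and the `rank_accounts` helper are made up for the example, and a real ledger would need its narrations extracted first.

```python
from collections import defaultdict

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical (narration, account) pairs standing in for an existing ledger.
history = [
    ("ELECTRIC COMPANY AUTOPAY", "Expenses:Electricity"),
    ("ELECTRIC COMPANY ONLINE PAYMENT", "Expenses:Electricity"),
    ("GAS COMPANY BILL", "Expenses:Gas"),
    ("SHOPPING CENTRE GROCERIES", "Expenses:Food"),
    ("CORNER MARKET GROCERIES", "Expenses:Food"),
]

# One "document" per expense account: all of its narrations joined together.
docs = defaultdict(list)
for narration, account in history:
    docs[account].append(narration)
accounts = sorted(docs)
corpus = [" ".join(docs[account]) for account in accounts]

# Index the per-account documents as TF-IDF vectors.
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(corpus)


def rank_accounts(narration):
    """Return (account, cosine similarity) pairs, best match first."""
    query = vectorizer.transform([narration])
    scores = cosine_similarity(query, doc_matrix)[0]
    return sorted(zip(accounts, scores), key=lambda pair: pair[1], reverse=True)


best_account, best_score = rank_accounts("GAS COMPANY DIRECT DEBIT")[0]
print(best_account, round(float(best_score), 2))
```

Words that never appear in the history (here "DIRECT" and "DEBIT") are simply ignored by the vectorizer, so a narration with no known words scores zero against every account.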

Any thoughts on this approach? (Aside from it being over-engineered. I'm an engineer, IDK what to tell you, it's what I do.)

Thanks!

Daniele Nicolodi

Dec 21, 2021, 6:29:33 AM
to ledge...@googlegroups.com, Aaron Stacy
On 21/12/2021 00:55, Aaron Stacy wrote:
> Hi, I'm looking for suggestions for categorizing spending (not so much
> things like paycheck, brokerage transactions, etc, but stuff like credit
> card spending for budgeting). My ledger has around 2800 transactions
> over about 2 years, so it's not a ton of data, but it seems like enough
> that I could leverage something smarter than just string matching
> the transaction narrations.
>
> Does anyone have recommendations for categorizing spending?
>
> I'm thinking of applying a full text search index as follows:
>
> - Each expense account is a "document".
> - The document contents is the narration of every transaction for that
> account.
> - To categorize a new transaction, use an engine like Lucene
> <https://lucene.apache.org> or sklearn's TfidfVectorizer and pick the
> most likely account.
>
> Any thoughts on this approach? (aside from being over-engineered. I'm an
> engineer, IDK what to tell you it's what I do)

I use Beancount. To assign accounts to transactions I use a machine
learning classifier, implemented with sklearn and trained on my existing
ledger.

This works reasonably well for recurring transactions but is not
infallible. I found that putting a threshold on the confidence score
from the classifier is essential for not ending up with completely bogus
account assignments.
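
To illustrate the thresholding idea, here is a minimal sketch. It assumes a classifier that exposes class probabilities (LogisticRegression is used purely for illustration, not necessarily what any given setup uses), and the training data, the `suggest_account` helper, and the 0.5 default are all hypothetical values to be tuned against a real ledger.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: narrations and the accounts they were booked to.
narrations = [
    "ELECTRIC COMPANY AUTOPAY",
    "ELECTRIC COMPANY ONLINE PAYMENT",
    "GAS COMPANY BILL",
    "GAS COMPANY AUTOPAY",
    "SHOPPING CENTRE GROCERIES",
    "CORNER MARKET GROCERIES",
]
accounts = [
    "Expenses:Electricity",
    "Expenses:Electricity",
    "Expenses:Gas",
    "Expenses:Gas",
    "Expenses:Food",
    "Expenses:Food",
]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(narrations, accounts)


def suggest_account(narration, threshold=0.5):
    """Return the most probable account, or None if the model is unsure."""
    probabilities = model.predict_proba([narration])[0]
    best = probabilities.argmax()
    if probabilities[best] < threshold:
        return None  # leave the posting unassigned for manual review
    return str(model.classes_[best])


for text in ["ELECTRIC COMPANY AUTOPAY", "UNKNOWN VENDOR XYZ"]:
    print(text, "->", suggest_account(text))
```

Returning None rather than the best guess means an unfamiliar payee shows up as an unassigned posting instead of silently landing in the wrong account.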

Cheers,
Dan

o1bigtenor

Dec 21, 2021, 6:58:12 AM
to Ledger
Greetings

Not sure how many transactions are in my file (a command to count them
would be wonderful).
My file is more complicated as I run a few sole proprietorships as well.
I started with the GIFI codes from my gooberment. For me that just wasn't
enough granularity (most would likely find my system WAY over the top!!!)
so I added 6 more digits (sometimes I could use a couple more, in fact) in a
xxxx.xx.xx.xx pattern. I enter my additions into the document (the list of GIFI
codes) that I started with. That document started as an 8 page file. It's now
a 54 page file after about 7 years of use. Dunno how many actual line items,
but I separate out a lot of things to help me do good long term thinking
(the type of product works better than that for longer term).
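
On the "how many transactions" question: a rough count can be had by counting the lines that start with a date. The sketch below is only an approximation under stated assumptions: it expects dates written as YYYY/MM/DD or YYYY-MM-DD in column 0, it does not follow include directives, and `count_transactions` is a made-up helper name.

```python
import re

# A transaction header starts in column 0 with a date such as
# 2021/12/01 or 2021-12-01 (other date formats are not handled here).
DATE_AT_LINE_START = re.compile(r"^\d{4}[/-]\d{1,2}[/-]\d{1,2}")


def count_transactions(journal_text):
    """Count the lines that look like transaction headers."""
    return sum(
        1 for line in journal_text.splitlines() if DATE_AT_LINE_START.match(line)
    )


sample = """\
2021/12/01 * Electric Company
   Expenses:Electricity           $250.00
   Bank:Mastercard

2021/12/02 * Gas Company
   Expenses:Gas                   $110.00
   Bank:Mastercard
"""

print(count_transactions(sample))
```

For a real file, read it with `open(path).read()` and pass the text to the function.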

When I was looking for record keeping (most commonly called accounting)
software, the level of granularity I thought I needed (and am now using)
would have moved me into a cost area enjoyed by very large companies. I
have found that ledger enables me to do my record keeping fairly quickly
and very, very accurately, and it is so very easy to pull reports from. I
tend to need to ask questions to find good queries, but I am finding the
support here excellent and so I now endorse 'ledger' for this use. I can
call any level of account (and even combinations of accounts) as a
'document' and every transaction is listed. As for categorization: well, I
look to my account doc and find the area the item (or service) belongs in,
and if it's new to me so far, well, I add a new account number and we're
off to the races.

HTH

psionl0

Dec 22, 2021, 1:16:45 AM
to Ledger
I'm not sure I see the problem. Ledger is designed to have any categories in any grouping you want. You might have Expenses:Electricity, Expenses:Gas, Expenses:Food, etc. Your bal report would list subtotals for each of these categories, and you could do a reg report on any individual category if you wanted more detail.
E.g.:
2021/12/01 * Electric Company
   Expenses:Electricity           $250.00
   Bank:Mastercard

2021/12/02 * Gas Company
   Expenses:Gas                   $110.00
   Bank:Mastercard

2020/12/03 * Shopping Centre
   Expenses:Food                  $140.00
   Bank:Mastercard

$ ledger -f sample1.txt bal | cat
            $-500.00  Bank:Mastercard
             $500.00  Expenses
             $250.00    Electricity
             $140.00    Food
             $110.00    Gas
--------------------
                   0

$ ledger -f sample1.txt reg "Food" | cat
20-Dec-03 Shopping Centre       Expenses:Food               $140.00      $140.00

$ ledger -f sample1.txt reg "Mastercard" and @Electric | cat
21-Dec-01 Electric Company      Bank:Mastercard            $-250.00     $-250.00

Daniele Nicolodi

Dec 24, 2021, 11:08:21 AM
to ledge...@googlegroups.com, psionl0
On 22/12/2021 07:16, psionl0 wrote:
> I'm not sure I see the problem. Ledger is designed to have any
> categories in any grouping you want. You might have Expenses:Electricity,
> Expenses:Gas, Expenses:Food etc. Your bal report would list subtotals
> for each of these categories and you could do a reg report on any
> individual categories if you wanted more detail.

The goal is to leverage the existing ledger history and some machine
learning to automatically assign accounts (usually expense accounts) to
transactions when they are imported from account statements (which
usually are associated with some assets).

Cheers,
Dan

o1bigtenor

Dec 24, 2021, 2:00:54 PM
to Ledger
Greetings

If you are willing and able to share what you come up with I would love
to try something like this.

TIA

Aaron Stacy

Jan 3, 2022, 1:14:41 PM
to Daniele Nicolodi, ledge...@googlegroups.com
Ah that makes sense, thank you! Any recommendation on which algorithm works well?

Daniele Nicolodi

Jan 13, 2022, 4:22:04 PM
to ledge...@googlegroups.com
I quickly uploaded the code here:
https://gist.github.com/dnicolodi/063d1898f4deb2ca64f7df0ef410333e

It can be used as a standalone script or can be integrated into a
Beancount importer. Making it work with Ledger syntax would require some
adaptation.

Cheers,
Dan

Daniele Nicolodi

Jan 13, 2022, 4:24:51 PM
to ledge...@googlegroups.com
On 03/01/2022 19:14, Aaron Stacy wrote:
> Ah that makes sense, thank you! Any recommendation on which algorithm
> works well?

I use sklearn TfidfVectorizer with a LinearSVC classifier and it works
reasonably well. I haven't extensively tried other approaches. I think
smart_importer https://github.com/beancount/smart_importer uses a
slightly different approach.
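
For readers who want a starting point, the combination can be wired up roughly as below. This is a minimal sketch and not the code from the gist: the narrations and accounts are made-up examples, and all parameters are left at sklearn's defaults.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical training data extracted from an existing ledger.
narrations = [
    "ELECTRIC COMPANY AUTOPAY",
    "ELECTRIC COMPANY ONLINE PAYMENT",
    "GAS COMPANY BILL",
    "GAS COMPANY AUTOPAY",
    "SHOPPING CENTRE GROCERIES",
    "CORNER MARKET GROCERIES",
]
accounts = [
    "Expenses:Electricity",
    "Expenses:Electricity",
    "Expenses:Gas",
    "Expenses:Gas",
    "Expenses:Food",
    "Expenses:Food",
]

# TF-IDF features feeding a linear support vector classifier.
classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
classifier.fit(narrations, accounts)

new_narration = "SHOPPING CENTRE WEEKLY GROCERIES"
predicted = classifier.predict([new_narration])[0]

# decision_function returns one signed margin per class. These are not
# probabilities, but the largest margin can still be compared to a cutoff.
margins = classifier.decision_function([new_narration])[0]
print(predicted, dict(zip(classifier.classes_, margins.round(2))))
```

Since LinearSVC has no predict_proba, any confidence threshold has to be chosen on the margin scale, which is easiest to calibrate by looking at margins for transactions that are already categorized.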

Cheers,
Dan