Beancount with large journals


mick.p...@gmail.com

Feb 10, 2019, 11:34:47 AM
to Beancount
Hi.

I've been using Beancount and fava to report on microinvestment transactions. I'm hitting serious performance issues, as the journal for a single account is approaching 11 MB. (This is no criticism of either fava or Beancount, as I think this use case is probably far beyond their intended usage.)

* Are there any big performance hits I could avoid (e.g. does relying on auto-posting have a significant impact)?
* Does anyone know of any tools out there for aggregating journal entries into summary journals (or has anyone had any success using Beancount's API to do this)?

Cheers.

-Mick.

Martin Blais

Feb 10, 2019, 11:07:17 PM
to Beancount
On Sun, Feb 10, 2019 at 11:34 AM <mick.p...@gmail.com> wrote:
Hi.

I've been using Beancount and fava to report on microinvestment transactions. I'm hitting serious performance issues, as the journal for a single account is approaching 11 MB. (This is no criticism of either fava or Beancount, as I think this use case is probably far beyond their intended usage.)

Big file. My entire history is around 4MB now, and it's starting to bother me (even with the cache).


* Are there any big performance hits I could avoid (e.g. does relying on auto-posting have a significant impact)?

I don't think so, though never say never: a pointed performance sprint by someone who can profile C / Python well might yield some savings.
I've been thinking about rewriting all of beancount.core in C++, but that's not going to happen tomorrow (I'm resisting; I have very few cycles of personal time as of late), and I'd also have to reimplement the plugins (see below).

You can view the breakdown in time with the -v option to bean-check:
$ bean-check -v $L
INFO    : Operation: 'beancount.parser.parser.parse_file'             Time:                732 ms
INFO    : Operation: 'beancount.parser.parser.parse_file'             Time:                  7 ms
INFO    : Operation: 'beancount.parser.parser'                        Time:          740 ms
INFO    : Operation: 'parse'                                          Time:          755 ms
INFO    : Operation: 'booking'                                        Time:         1219 ms
INFO    : Operation: 'beancount.ops.pad'                              Time:                125 ms
INFO    : Operation: 'beancount.ops.documents'                        Time:                128 ms
INFO    : Operation: 'beancount.plugins.ira_contribs'                 Time:                 21 ms
INFO    : Operation: 'beancount.plugins.implicit_prices'              Time:                171 ms
INFO    : Operation: 'beancount.plugins.sellgains'                    Time:                 23 ms
INFO    : Operation: 'beancount.plugins.check_closing'                Time:                 18 ms
INFO    : Operation: 'washsales.commissions'                          Time:                 29 ms
INFO    : Operation: 'beancount.plugins.check_commodity'              Time:                 31 ms
INFO    : Operation: 'beancount.plugins.commodity_attr'               Time:                  4 ms
INFO    : Operation: 'office.options'                                 Time:                  5 ms
INFO    : Operation: 'office.share_caroline'                          Time:                 19 ms
INFO    : Operation: 'beancount.plugins.divert_expenses'              Time:                  7 ms
INFO    : Operation: 'beancount.ops.balance'                          Time:                616 ms
INFO    : Operation: 'run_transformations'                            Time:         1470 ms
INFO    : Operation: 'function: validate_open_close'                  Time:                  6 ms
INFO    : Operation: 'function: validate_active_accounts'             Time:                 38 ms
INFO    : Operation: 'function: validate_currency_constraints'        Time:                 25 ms
INFO    : Operation: 'function: validate_duplicate_balances'          Time:                  8 ms
INFO    : Operation: 'function: validate_duplicate_commodities'       Time:                  4 ms
INFO    : Operation: 'function: validate_documents_paths'             Time:                  5 ms
INFO    : Operation: 'function: validate_check_transaction_balances'  Time:                264 ms
INFO    : Operation: 'function: validate_data_types'                  Time:                100 ms
INFO    : Operation: 'beancount.ops.validate'                         Time:          450 ms
INFO    : Operation: 'beancount.loader (total)'                       Time:   4529 ms

That's on a ~4MB file running on my little Intel NUC.
As you can see, the parsing, booking, and plugins (transformations) code are the big hitters.
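For reference, the same breakdown can be produced from a Python script; roughly something like this (untested sketch against the v2 loader API, with a placeholder filename; load_file() accepts a log_timings function):

import logging

from beancount import loader

logging.basicConfig(level=logging.INFO, format="%(message)s")

# Each loading phase logs an "Operation: ... Time: ... ms" line,
# as in the bean-check -v output above.
entries, errors, options_map = loader.load_file(
    "ledger.beancount", log_timings=logging.info)

print(len(entries), "entries,", len(errors), "errors")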

 
* Does anyone know of any tools out there for aggregating journal entries into summary journals (or has anyone had any success using Beancount's API to do this)?

I may have tried to do that in the past, but Beancount itself doesn't provide a tool.
If I did, it lives somewhere under experiments/ (honestly, I can't remember).
You can also run bean-query with a FROM CLOSE ON <date> clause to compute the balances.
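Roughly like this, if you'd rather drive it from Python (untested sketch using run_query() from beancount.query.query; the filename and date are placeholders):

from beancount import loader
from beancount.query import query

entries, errors, options_map = loader.load_file("ledger.beancount")

# CLOSE ON summarizes everything before the date; this is also what
# inserts the Conversions transaction discussed below.
row_types, rows = query.run_query(
    entries, options_map,
    "SELECT account, sum(position) FROM CLOSE ON 2019-01-01 GROUP BY account")

for row in rows:
    print(row)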

TBH, a script to do this is probably easy to write. The only potentially problematic part is the automatically inserted Conversions transaction, which is used to bring all the income accounts precisely to zero.

(Further, note that this conversions transaction (generated by CLOSE ON) would disappear should we implement the Selinger "currency accounts" plugin, which would be easy to do. This plugin would automatically insert currency account postings on transactions that have more than one currency group, implementing Selinger's method. Then you don't need a special conversions transaction when you close the year, because in every transaction every currency group sums to zero. What I'm talking about is this doc: https://www.mathstat.dal.ca/~selinger/accounting/tutorial.html)
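For the simple single-currency case, the aggregation script could be as small as this (untested sketch with a placeholder filename and cutoff date; it only sums units per account and ignores cost basis):

import collections
import datetime

from beancount import loader
from beancount.core import data

CUTOFF = datetime.date(2019, 1, 1)

entries, errors, options_map = loader.load_file("ledger.beancount")

# Sum the posted units per account and per currency over all
# transactions dated before the cutoff.
balances = collections.defaultdict(collections.Counter)
for entry in entries:
    if isinstance(entry, data.Transaction) and entry.date < CUTOFF:
        for posting in entry.postings:
            balances[posting.account][posting.units.currency] += posting.units.number

# Each line could become one posting of a single summary transaction
# in a new, smaller journal.
for account, currencies in sorted(balances.items()):
    for currency, number in sorted(currencies.items()):
        print("{:60} {:>16} {}".format(account, number, currency))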


 

Cheers.

-Mick.


mick.p...@gmail.com

Feb 14, 2019, 12:20:28 AM
to Beancount
Thanks for the info, Martin.

On my laptop, most of the time is spent in the parser and the validator.

The heads-up about the conversions is good to know. Fortunately, the account with the largest number of transactions has no conversions to worry about (they're microloans, so it's all the same currency), so I can probably aggregate those without any headaches.

Thanks for the link to the Selinger tutorial: tracking multiple currencies is something I'd been thinking about ahead of a move later this year, so it's very useful.

Stefano Zacchiroli

Feb 14, 2019, 2:44:11 AM
to bean...@googlegroups.com
On Sun, Feb 10, 2019 at 11:07:03PM -0500, Martin Blais wrote:
> You can view the breakdown in time with the -v option to bean-check:

You've probably already thought about that, so out of curiosity: how
much of this is potentially parallelizable, as an avenue for "easily"
getting a performance boost? I guess not much, due to either I/O
constraints or the GIL lock, right? I'm curious about whether
validation, booking, and plugins might be made parallelizable in the
future.

--
Stefano Zacchiroli . za...@upsilon.cc . upsilon.cc/zack . . o . . . o . o
Computer Science Professor . CTO Software Heritage . . . . . o . . . o o
Former Debian Project Leader & OSI Board Director . . . o o o . . . o .
« the first rule of tautology club is the first rule of tautology club »

Martin Blais

Feb 18, 2019, 1:22:24 PM
to Beancount
On Thu, Feb 14, 2019 at 2:44 AM Stefano Zacchiroli <za...@upsilon.cc> wrote:
On Sun, Feb 10, 2019 at 11:07:03PM -0500, Martin Blais wrote:
> You can view the breakdown in time with the -v option to bean-check:

You've probably already thought about that, so out of curiosity: how
much of this is potentially parallelizable, as an avenue for "easily"
getting a performance boost? I guess not much, due to either I/O
constraints or the GIL lock, right? I'm curious about whether
validation, booking, and plugins might be made parallelizable in the
future.

None.
It's a sequential process.
Something that /might/ have an impact is to sequence all the operations as a chain of streams consuming each other (think generators/iterators), for memory locality, but at this (small) scale I doubt it would make any difference, TBH. Some of the plugins do multiple passes over the stream, which breaks this approach and would require pirouettes to harvest opportunities for reusing already-computed quantities (e.g. results from getters.py).
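The shape I have in mind is just chained generators, something like this (a toy illustration only, not Beancount code):

def parse(lines):
    for line in lines:
        yield line.strip()      # stand-in for real parsing

def book(directives):
    for directive in directives:
        yield directive         # stand-in for booking logic

def validate(directives):
    for directive in directives:
        yield directive         # stand-in for validation

# One lazy pass over the data; nothing is materialized until the final
# consumer iterates. A plugin needing multiple passes breaks this.
pipeline = validate(book(parse(open("ledger.beancount"))))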

No, I think what should be done for the next major release is a rewrite.
At the very coarse level, it looks like this in my mind:
- Beancount reports/web gets deleted in favor of Fava.
- Beancount query/SQL gets forked to a separate project operating on arbitrary schemas (via protobufs as a common representation for various sources of data) and has support for Beancount integration (e.g. a Decimal type, and simple aggregators with the semantics of beancount.core.Inventory/Position/Amount). That's all that's needed, and it would enable the query language to work on CSV files and other data sources. Moreover, this version would be tested properly, and have data types in its compiler (no exceptions at runtime).
- Beancount core, parser, booking and plugins get rewritten in simple C++ (no boost/templates, but rather on top of a bazel + absl + protobuf + clif base with functional-style and a straightforward subset of C++, no classes), providing its parsed and booked contents as a stream of protobuf objects.
- All tests would remain in Python (I'm not rewriting those). Comprehensive clean Python bindings for beancount.core would be provided, to do as much scripting as is done today, except with types implemented fully in C++.
- Moreover, all the big ticket items would have to be addressed, e.g. explicitly setting the precision instead of inference, currency trading accounts, reports of trades built-in, etc.



Shreedhar Hardikar

Feb 18, 2019, 3:35:55 PM
to bean...@googlegroups.com
Will the rewrite in C++ really help speed that much? I mean, C++ does come with a number of additional costs, so do you believe that the benefit of C++ (execution speed) for an accounting tool like Beancount ultimately outweighs those costs?

Here are some of my thoughts:
  1. C++ cross-platform dependency management & build - I personally use Beancount on a FreeBSD system, and I have to build it manually (even when installing from pip) because there are some C/C++ library dependencies for the parser etc. I can say that part is not much fun. If the entire thing is written in C++, care would have to be taken not to use "fancy" C++ features, because that means not being able to build on certain systems (because they have older compilers or don't have the specific libraries). Perhaps bazel solves that?
  2. Ease of development & hacking on the code - One prime reason I chose Beancount over Ledger was the fact that the data structures and algorithms were written in Python and therefore easier to grok. I am fairly adept in C++, but wading through .h & .cpp & Make & inheritance hierarchies is much more work in C++ than in other languages. It was difficult for me to follow the datatypes available in Ledger and how the Python integration really worked; perhaps some more documentation would have helped. Also, C++ bugs give segfaults a lot more often than Python code does - a different beast from the stack-trace bugs of Python. I'm not saying it's not possible to write segfault-free code, but it gets harder very fast as the complexity goes up.
  3. Also, I'm not sure what design you have in mind, but if you are going to expose Python bindings for plugins (which, according to the docs, are a fundamental part of Beancount's extension model), won't you need to be constantly converting between Python objects and C++ objects anyway? That might nullify all the benefits of C++. Caveat here: I'm not very familiar with Python/C++ bindings; there may be a way to do this efficiently, and maybe google/clif solves that problem superbly.
Finally, I reckon you could get a lot of execution speed from another compiled language. Have you considered Go? It should give much faster execution on integers/decimals, with easier development, maintenance (and package management), etc. Caveat here: I have not used Go very much - I know only the basics and what I've heard from others - but it may solve the problem Beancount is facing in an elegant manner.

Anyway, I do hope you take these points in the good spirit in which they were intended. Beancount is a great product and I can't wait until it gets even better with all the features you listed here!

Thanks,
Shreedhar


Jason Chu

Feb 18, 2019, 3:51:40 PM
to Beancount
If my vote counted (which I don't expect it does), I'd vote for Go over C++, because I'm more familiar with it - my job involves a lot more of that kind of coding day to day.

Martin Blais

Feb 18, 2019, 4:45:38 PM
to Beancount
On Mon, Feb 18, 2019 at 3:35 PM Shreedhar Hardikar <shreedha...@gmail.com> wrote:
Will the rewrite in C++ really help speed that much?

It should be well below 1s; a feeling of "instantaneous" is what I'm after.
In any case, a quick prototype would be written, to assess how long it takes.


I mean, C++ does come with a number of additional costs, so do you believe that the benefit of C++ (execution speed) for an accounting tool like Beancount ultimately outweighs those costs?

5-10 seconds of processing is just too much at the moment. Using the cache or the web interface offers a good workaround, but ultimately, even with the cache, I find myself annoyed at how long it takes to process when I just edit and save my file.

I'm with you about the hassle of C++ maintenance, but I want something stable, simple, and built for the ages.


Here's some of my thoughts:
  1. C++ cross-platform dependency management & build - I personally use Beancount on a FreeBSD system, and I have to build it manually (even when installing from pip) because there are some C/C++ library dependencies for the parser etc. I can say that part is not much fun. If the entire thing is written in C++, care would have to be taken not to use "fancy" C++ features, because that means not being able to build on certain systems (because they have older compilers or don't have the specific libraries). Perhaps bazel solves that?
I absolutely share your sentiment!
I would avoid unnecessary dependencies as much as possible, and my personal C++ is closer to C (I generally avoid object-oriented programming and too much overloading, and I use a small number of constructs very selectively). The same way that my Python looks a bit "flat" - I'm using namedtuples everywhere - the C++ in which this would be written would be very conservative. While I might enjoy learning and fiddling with bleeding edge features of C++, I recognize that it's a bit of an adolescent exercise in bravado, and I'm very sensitive to the complexity they add so in that codebase I'd avoid those, specifically for portability and ease of long-term maintenance.

This is my main worry, and a central question: how much support should the project offer for that? E.g., should I have to support somebody coming with a question about a compiler from 4 years ago on a platform I've never used (e.g. Arch)? What about Windows support? Packaging? I don't really have the time. The parameters would be fairly narrow (Debian/Ubuntu, recent compilers). On the other hand, I'd be shooting for simple dependencies which have had a lot of testing (mostly open-source Google tech - ABSL is designed for the long term).

I'm not sure yet. That's why I'd like to build a prototype with the more recently appeared tools and share it for people to try.


  2. Ease of development & hacking on the code - One prime reason I chose Beancount over Ledger was the fact that the data structures and algorithms were written in Python and therefore easier to grok. I am fairly adept in C++, but wading through .h & .cpp & Make & inheritance hierarchies is much more work in C++ than in other languages. It was difficult for me to follow the datatypes available in Ledger and how the Python integration really worked; perhaps some more documentation would have helped. Also, C++ bugs give segfaults a lot more often than Python code does - a different beast from the stack-trace bugs of Python. I'm not saying it's not possible to write segfault-free code, but it gets harder very fast as the complexity goes up.
Absolutely. I would write C++ code that is mostly free of classes (not object-oriented), and would use exactly the same schema as I do now -- number, amount, position, posting, inventory. "Naked" data structures where everything is public and, if not immutable, in practice used as if it were. I'd probably use Protocol Buffers to define and represent those in memory, along with a library of (stateless) functions to replace beancount/core. I would basically just mirror the schema that I'm using now (I think it does the job well) but in protos and C++ functions. It wouldn't be a redesign, mainly a rewrite, fixing some things along the way (e.g. tolerance/precision). Anyone already familiar with beancount.core.data/position/inventory would immediately feel at home. Moreover, the support for Python would be first-class: all the unit tests would remain in Python, and I really care about being able to quickly put things together in a quick Python script myself, so I would guarantee that this works well. (I have yet to experiment with CLIF, so that's something I still have to assess.)
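To give an idea, the flat schema in question boils down to something like this (a heavily abridged sketch; the real types in beancount.core carry more fields, e.g. cost, price, flags, and metadata):

import datetime
import decimal
import typing

class Amount(typing.NamedTuple):
    number: decimal.Decimal
    currency: str

class Posting(typing.NamedTuple):
    account: str
    units: Amount

class Transaction(typing.NamedTuple):
    date: datetime.date
    narration: str
    postings: typing.List[Posting]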
 
  3. Also, I'm not sure what design you have in mind, but if you are going to expose Python bindings for plugins (which, according to the docs, are a fundamental part of Beancount's extension model), won't you need to be constantly converting between Python objects and C++ objects anyway? That might nullify all the benefits of C++. Caveat here: I'm not very familiar with Python/C++ bindings; there may be a way to do this efficiently, and maybe google/clif solves that problem superbly.
Good point; something to keep in mind indeed. I've seen cases where crossing the language barrier (e.g., between Go and C++) was done by serializing and deserializing entire messages on the other side, which is (relatively) slow. (Go maintains its own copy of the representation in its runtime, which offers advantages.) If I recall, protobufs have two targets for Python bindings: one that is purely Python using some generic C library calls, and a "protoc" one that manipulates the C objects directly (I'm not sure, I need to dig into the details). The latter would be cheap to send across Python/C. That's something I'd test for sure when writing a prototype (a cheap cross-language barrier - passing a pointer - would be ideal). This is one of the reasons why I think the core data structures, parser, booking code, ops and the main plugins would be C++. Plugins would probably have to be written in C++ (though the API would be very simple, as it is now), but possibly a Python API for them would also be there (it might just slow down your processing a bit). Some of the functionality that's currently in plugins might also become default behavior (e.g. requiring commodities to be declared, implicit prices), so maybe fewer plugins and options where it makes sense.
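For context, on the Python side the current plugin API really is this small (a no-op skeleton; the module gets enabled with a plugin directive in the input file):

__plugins__ = ('noop_check',)

def noop_check(entries, options_map):
    """Pass all entries through unchanged and report no errors."""
    errors = []
    return entries, errors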

 
Finally, I reckon that you can get a lot from your execution speeds by using other compiled language. Have you considered Go? It should give much faster execution speeds of integers/decimals with easier development, maintenance (and package management) etc. Caveat here: I have not used Go very much, that is, I know only basics, and what I've heard from others. It may work really well to solve the problem beancount is facing in an elegant manner.

I have a lot to say about Go -- I've led a team for a few years where we implemented a project entirely in Go from scratch -- I know it very well. I don't really want to get in the details (no time or place here), but Go is not my favorite choice for this project and I won't be implementing it in Go. One of the good things about this redesign idea is that this new base (the "first third") of Beancount would output a stream of protocol buffer objects. These could be parsed and processed in any language (Go included).

Ultimately, my goal is to maintain only about a third of the current codebase, so I have enough cycles to improve the core features and focus on that. The interface/web activity has already migrated to Fava, and the hope is that a generalized query framework that operates on any type of data might take on a life of its own.


Anyway, I do hope you take these points in good spirit - as they were well intentioned. Beancount is a great product and I can't wait till it gets even better with all the features you listed out here!

Absolutely!


 

Daniele Nicolodi

Feb 18, 2019, 7:05:03 PM
to bean...@googlegroups.com
Hello Martin,

(trying a second time because Google Groups seems to have lost my
previous message)

On 18/02/2019 11:22, Martin Blais wrote:
> - Beancount core, parser, booking and plugins get rewritten in simple
> C++ (no boost/templates, but rather on top of a bazel + absl +
> protobuf + clif base with functional-style and a straightforward subset
> of C++, no classes), providing its parsed and booked contents as a
> stream of protobuf objects.
> - All tests would remain in Python (I'm not rewriting those).
> Comprehensive clean Python bindings for beancount.core would be
> provided, to do as much scripting as is done today, except with types
> implemented fully in C++.

How do you see the possibility of using Cython instead of C++?
Advantages would include the possibility of an (easier) piecewise
conversion instead of a rewrite, and not having to solve the problem of
generating Python bindings from a C++ codebase.

Cheers,
Dan

Martin Blais

Feb 18, 2019, 10:14:44 PM
to Beancount
On Mon, Feb 18, 2019 at 7:05 PM Daniele Nicolodi <dan...@grinta.net> wrote:
Hello Martin,

(trying a second time because Google Groups seems to have lost my
previous message)

On 18/02/2019 11:22, Martin Blais wrote:
> - Beancount core, parser, booking and plugins get rewritten in simple
> C++ (no boost/templates, but rather on top of a bazel + absl +
> protobuf + clif base with functional-style and a straightforward subset
> of C++, no classes), providing its parsed and booked contents as a
> stream of protobuf objects.
> - All tests would remain in Python (I'm not rewriting those).
> Comprehensive clean Python bindings for beancount.core would be
> provided, to do as much scripting as is done today, except with types
> implemented fully in C++.

How do you see the possibility of using Cython instead of C++?

Too much magical incantation.
I'm using Cython at work here and there, and while robertwb's work is amazing, I want something really straightforward.


Advantages would include the possibility of an (easier) piecewise
conversion instead of a rewrite, and not having to solve the problem of
generating Python bindings from a C++ codebase.

That's true, but I'm not considering a piecewise conversion; I'm considering a full conversion of the bottom third.
And at that level I'd personally rather rewrite the extension modules in pure C - I know the API well enough, and it's more straightforward IMO.



Cheers,
Dan



Stefano Zacchiroli

Feb 19, 2019, 10:23:48 AM
to bean...@googlegroups.com
On Mon, Feb 18, 2019 at 04:45:23PM -0500, Martin Blais wrote:
> Plugins would probably have to be written in C++ (though the API would
> be very simple, as it is now), but possibly a Python API for them
> would also be there (it might just slow down your processing a
> bit).

Requiring end users to write plugins in C++ would be a major setback
w.r.t. the current state of affairs. So, yeah, if you go that way
(which is an understandable requirement if you notice that *some*
plugins need a considerable speedup), you should also make it a
requirement to *also* support native Python plugins. That way the
trade-off between the flexibility of Python plugins and the
efficiency of C++ ones will be up to users. You will have to pay the
price of supporting multiple ways of writing plugins, but I speculate it
will be entirely worth it in terms of user adoption.

(I realize this kind of feedback is pointless until it turns into actual
code, but you seem to be welcoming the discussion, so I'll bite :-))

Cheers

Alen Šiljak

Apr 15, 2019, 8:01:58 AM
to Beancount
Just out of curiosity - would changing the data format shorten the time required for processing? I know this is plain-text accounting, but it would be interesting to see what effect using SQLite would have on performance.
It might help reduce the load time of transactions simply due to the nature of the technology.

Stefano Zacchiroli

Apr 15, 2019, 8:06:42 AM
to bean...@googlegroups.com
It won't help in the (common) scenario in which you don't actually parse
the text file but rely on the pickle cache instead. My understanding is
that Martin wants to improve performance in that scenario as well.

--
Stefano Zacchiroli . za...@upsilon.cc . upsilon.cc/zack . . o . . . o . o
Computer Science Professor . CTO Software Heritage . . . . . o . . . o o
Former Debian Project Leader . OSI Board Director . . . o o o . . . o .

Aamer Abbas

Apr 15, 2019, 8:08:02 AM
to Beancount
Sort of defeats the purpose of plain-text accounting, though. I think the product would lose something special if it moved to some non-plain-text format. If it were to go this route, I think the better solution would be to try some sort of caching.

For example, it could be interesting to cache files in a serialized format: check whether the file size or timestamp has changed, and invalidate the cache in that event. This would give you the option of moving your older transactions into separate files (for example, a different file for each year).
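A rough sketch of what I mean (hypothetical helper names; parse_file stands in for the expensive loading step):

import os
import pickle

def load_with_cache(filename, parse_file):
    """Reuse a pickled result while the file's size and mtime are unchanged."""
    cache = filename + ".cache"
    stat = os.stat(filename)
    key = (stat.st_size, stat.st_mtime)
    if os.path.exists(cache):
        with open(cache, "rb") as f:
            cached_key, result = pickle.load(f)
        if cached_key == key:
            return result
    result = parse_file(filename)
    with open(cache, "wb") as f:
        pickle.dump((key, result), f)
    return result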

On Mon, Apr 15, 2019 at 3:02 PM Alen Šiljak <alen....@gmx.com> wrote:
Just out of curiosity - would changing the data format shorten the time required for processing? I know this is plain-text accounting, but it would be interesting to see what effect using SQLite would have on performance.
It might help reduce the load time of transactions simply due to the nature of the technology.


Aamer Abbas

Apr 15, 2019, 8:11:10 AM
to Beancount
Stefano, does it already do caching? Where is the pickle cache stored? I see __pycache__ files in my importer folders, but not anywhere else. Sorry, I haven't really done much Python in the past, so I'm probably asking a dumb question.


Martin Blais

Apr 15, 2019, 8:13:40 AM
to Beancount
On Mon, Apr 15, 2019 at 8:02 AM Alen Šiljak <alen....@gmx.com> wrote:
Just out of curiosity - would changing the data format shorten the time required for processing? I know this is plain-text accounting, but it would be interesting to see what effect using SQLite would have on performance.

None.
The query time is small, it's parsing and processing the input file that takes most of the time.
 
It might help reduce the load time of transactions simply due to the nature of the technology.

And FWIW, I already investigated using the SQLite API as a replacement for bean-query. It doesn't allow enough customization for that to be possible: I need custom rendering routines and custom aggregators, and besides, I really like the shortcuts (e.g. optional group-by).

Martin Blais

Apr 15, 2019, 8:18:17 AM
to Beancount
On Mon, Apr 15, 2019 at 8:08 AM Aamer Abbas <aa...@aamerabbas.com> wrote:
Sort of defeats the purpose of plain-text accounting, though. I think the product would lose something special if it moved to some non-plain-text format. If it were to go this route, I think the better solution would be to try some sort of caching.

That's already done. There's a pickle cache right next to the top-level input file.


For example, it could be interesting to cache files in a serialized format: check whether the file size or timestamp has changed, and invalidate the cache in that event. This would give you the option of moving your older transactions into separate files (for example, a different file for each year).

I already thought of doing this; it's in the big TODO file where I jot down everything.
Though I noted at the time that it would be a "great idea for performance", I don't think it would make much of a difference now.
Run bean-check -v on a large file and you'll see parsing is only a fraction of the time spent.



On Mon, Apr 15, 2019 at 3:02 PM Alen Šiljak <alen....@gmx.com> wrote:
Just out of curiosity - would changing the data format shorten the time required for processing? I know this is plain-text accounting, but it would be interesting to see what effect using SQLite would have on performance.
It might help reduce the load time of transactions simply due to the nature of the technology.


Alen Šiljak

Apr 15, 2019, 8:20:20 AM
to bean...@googlegroups.com
One thing I was thinking of is that most users filter down to a specific subset of the data, which would mean that only that subset needs to be parsed and processed.

Stefano Zacchiroli

Apr 15, 2019, 8:25:34 AM
to bean...@googlegroups.com
On Mon, Apr 15, 2019 at 03:10:31PM +0300, Aamer Abbas wrote:
> Stefano, does it already do caching? Where is the pickle cache stored? I
> see __pycache__ files in my importer folders, but not anywhere else. Sorry,
> I have not really done much python in the past, so I am probably asking a
> dumb question.

Not dumb at all, especially because it's not entirely trivial to
reproduce :-) If you have a foo.beancount file, Beancount will use a
.foo.beancount.picklecache cache file, but *only* if loading the textual
file took more than 1 second. See:

https://bitbucket.org/blais/beancount/src/fa1edde3bcd02a277fac193f460a39c9a1461161/beancount/loader.py?at=default&fileviewer=file-view-default#loader.py-53

Hence it will probably not create the cache file in artificial examples
like running bean-check on the output of bean-example. But it will use
cache files in any real-life scenario with a significant number of
transactions.
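In other words (illustrative snippet only; the exact constants live in
loader.py at the link above):

import os

def cache_path(filename):
    # For foo.beancount the cache sits next to it as
    # .foo.beancount.picklecache, and is only written when loading
    # took more than ~1 second.
    dirname, basename = os.path.split(filename)
    return os.path.join(dirname, "." + basename + ".picklecache")

print(cache_path("/home/me/foo.beancount"))
# -> /home/me/.foo.beancount.picklecache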

Hope this helps,
Cheers
--
Stefano Zacchiroli . za...@upsilon.cc . upsilon.cc/zack . . o . . . o . o
Computer Science Professor . CTO Software Heritage . . . . . o . . . o o
Former Debian Project Leader & OSI Board Director . . . o o o . . . o .

Aamer Abbas

Apr 15, 2019, 8:26:13 AM
to Beancount
That's already done. There's a pickle cache right next to the top-level input file.

Hmm, not for me. Maybe it's something specific to my version of Python or the way it's set up on my computer. Here's the output I see ...
 

~/g/beancount (master|✔) $ ls -a
.DS_Store  .gitignore     .vscode    downloads            importers  personal.beancount  plugins        README
.git       .pytest_cache  documents  example_queries.txt  includes   personal.import     price_sources

~/g/beancount (master|✔) $ bean-check personal.beancount

~/g/beancount (master|✔) $ ls -a
.DS_Store  .gitignore     .vscode    downloads            importers  personal.beancount  plugins        README
.git       .pytest_cache  documents  example_queries.txt  includes   personal.import     price_sources

~/g/beancount (master|✔) $

Aamer Abbas

Apr 15, 2019, 8:28:37 AM
to Beancount
Hmm, not for me. Maybe it's something specific to my version of Python or the way it's set up on my computer. Here's the output I see ...

Never mind - Stefano answered my question. "beancount.loader (total)" is 315 ms, so I'm under the 1s threshold.

Thanks!

huy...@gmail.com

Nov 19, 2019, 7:30:31 AM
to Beancount
I love Beancount. It has changed everything I do with accounting.

But I'm also hitting the performance issue whenever I save the file.

Have you ever thought of using Django to take Beancount to the next level? It wouldn't lose much of the plain-text accounting spirit, since we could easily export the whole transactions table to a beancount file at any time.

If not, is this approach doable? I recently set up a structure like this:
  • myfolder/journals.beancount (imports all archived beancount files)
  • myfolder/archived/2018trans.beancount
  • myfolder/archived/2017trans.beancount
  • myfolder/archived/2016trans.beancount
All files in archived/ should be cached indefinitely (until the archived files change).

So the main beancount file is always small and should be fast. But of course, that's not the case right now: it's still slow for me every time I save my journals.beancount.

Best,
Huy.

Aamer Abbas

Nov 19, 2019, 9:27:04 AM
to Beancount
Not sure I get the relevance of Django here. However, I asked this question in the past and learned that Beancount already does caching using pickle.

Anything that takes longer than the threshold of 1 second to load will be cached (https://bitbucket.org/blais/beancount/src/43f8972d57f4ac40757a0462af57b2be5feb311e/beancount/loader.py#lines-54).


huy...@gmail.com

Nov 19, 2019, 10:07:43 AM
to Beancount
Sorry, please ignore the Django comment for now.

Hmm, interesting. 

bean-check -v journal.beancount
INFO    : Operation: 'beancount.loader (total)'                       Time:    171 ms

But every time I insert a new transaction, the beancount logo at the top-left corner spins a little (definitely not instantaneous like in the first few months). On my main computer it's about a second; on my other machine it definitely takes more than 2 seconds per transaction. I have a total of 50k lines of data, and I have now split them into multiple files, but the performance is about the same.

Is the slowness I'm seeing on a slower machine normal?

I don't know if it's relevant, but I'm using Fava with Beancount. Sorry if I'm asking the wrong question or posting it in the wrong group.

Best,
Huy.

Martin Blais

Nov 19, 2019, 6:55:04 PM
to Beancount
It's reloading the file because you saved it.
It will assume the file has changed and has to reload.
The cache is only used if no changes have occurred.


Justus Pendleton

Nov 20, 2019, 9:20:11 AM
to Beancount
On Tuesday, November 19, 2019 at 10:07:43 PM UTC+7, huy...@gmail.com wrote:
But every time I insert a new transaction [...]

I don't know if it's relevant but I'm using beancount fava.

Having to wait 1-2 seconds after each transaction you enter would make anyone frustrated. That said, it sounds like you're using Beancount in a somewhat unusual workflow if you're entering transactions regularly through Fava's webapp. Most of us use Beancount - or one of its plain-text accounting cousins - because we largely add and edit transactions in a normal text editor, not an app. Instead of entering transactions via the Fava webapp, have you tried just using Notepad or Emacs or Vim or whatever your favorite editor is? Then you'll be able to enter all the transactions you want, as fast as you want.

If your performance problem is with Fava itself, you might want to repost, but clearly call that out so the Fava developers see it.