Beangulp import/extract options

Eric Altendorf

Jul 21, 2023, 5:06:35 PM

to bean...@googlegroups.com

I'm trying to figure out whether I can use the Beangulp import driver with hooks, or if I need to write my own driver to call my importers and do postprocessing. As you may recall, my workflow is atypical, as I have no curated Beancount ledger file; my source of truth are my input data files and the Beancount ledger is a built artifact for running analysis.

There are two things I'd like to do that I don't think are currently possible; I'd appreciate feedback on whether these seem like things Beangulp should support (I could contribute a patch), or if I'm better off finding a different solution:

- I'd like to deduplicate entries among different importers in a single run, not just dedup against a pre-existing ledger

- I'd like to be able to emit the output file globally sorted by date (first the official entry date, then secondarily by a timestamp attached to the metadata) rather than grouped by import file. (Broadly this will make it easier for me to debug issues sequentially, and ordering within-day may alleviate some of the issues I've seen with same-day purchase & transfer transactions.)

And just to double check that this should already be possible:

- I'd like to be able to add entries (i.e., account declarations, initial balance pads, etc.) via a hook

Thanks,

eric

Daniele Nicolodi

Jul 22, 2023, 6:34:21 AM

to bean...@googlegroups.com

On 21/07/23 23:06, Eric Altendorf wrote:
> I'm trying to figure out whether I can use the Beangulp import driver
> with hooks, or if I need to write my own driver to call my importers and
> do postprocessing. As you may recall, my workflow is atypical, as I
> have no curated Beancount ledger file; my source of truth are my input
> data files and the Beancount ledger is a built artifact for running
> analysis.
>
> There are two things I'd like to do that I don't think are currently
> possible; I'd appreciate feedback on whether these seem like things
> Beangulp should support (I could contribute a patch), or if I'm better
> off finding a different solution:
>
> - I'd like to deduplicate entries among different importers in a single
> run, not just dedup against a pre-existing ledger

I was going to reply that this is already supported, then I realized
that I never merged the patch implementing it
https://github.com/beancount/beangulp/pull/64 I'm going to rebase and
merge it ASAP.
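
To make the request concrete, here is a minimal sketch of what deduplicating across importers in a single run means. It is not the code from the pull request: the Txn stand-in type, the is_duplicate criterion (equal amounts, dates within two days) and the deduplicate helper are all assumptions made up for illustration.

```python
import datetime
from collections import namedtuple
from decimal import Decimal

# Hypothetical stand-in for an extracted entry; real Beancount
# directives carry many more fields.
Txn = namedtuple("Txn", ["date", "amount", "source"])


def is_duplicate(a, b, window_days=2):
    """Illustrative criterion: same amount and dates close together."""
    return a.amount == b.amount and abs((a.date - b.date).days) <= window_days


def deduplicate(batches):
    """Keep the first occurrence of each entry across importer batches."""
    kept = []
    for batch in batches:
        # Compare only against entries kept from earlier batches, so two
        # distinct same-amount entries in one file both survive.
        earlier = list(kept)
        for txn in batch:
            if not any(is_duplicate(txn, seen) for seen in earlier):
                kept.append(txn)
    return kept


# The same transfer shows up in the output of two importers.
checking = [
    Txn(datetime.date(2023, 7, 20), Decimal("500.00"), "checking"),
    Txn(datetime.date(2023, 7, 21), Decimal("12.50"), "checking"),
]
savings = [
    Txn(datetime.date(2023, 7, 21), Decimal("500.00"), "savings"),
]
unique = deduplicate([checking, savings])
```

Here the copy of the transfer reported by the second importer is dropped, while both entries from the first importer are kept.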

> - I'd like to be able to emit the output file globally sorted by date
> (first the official entry date, then secondarily by a timestamp attached
> to the metadata) rather than grouped by import file. (Broadly this will
> make it easier for me to debug issues sequentially, and ordering
> within-day may alleviate some of the issues I've seen with same-day
> purchase & transfer transactions.)

It is trivial to post-process the output of beangulp to apply any
ordering you like. Indeed I do something very similar for ledgers.
Writing from memory:

from beancount.parser import parser, printer

def key(entry):
    # Sort by entry date, then by the timestamp stored in the metadata.
    return (entry.date, entry.meta['timestamp'])

# filename is the path to the beangulp output file.
entries, errors, options = parser.parse_file(filename)
entries.sort(key=key)
printer.print_entries(entries)
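
One caveat with a key like this: it raises KeyError for any entry whose metadata has no 'timestamp'. A minimal sketch of a more tolerant key follows, using a stand-in Entry type instead of real Beancount directives so that it runs on its own. The assumptions are that the timestamps are strings in a format that sorts lexicographically (for example HH:MM:SS), and that entries without one should come first within their day.

```python
import datetime
from collections import namedtuple

# Stand-in for a Beancount directive: only the two fields the key uses.
Entry = namedtuple("Entry", ["date", "meta"])


def sort_key(entry):
    # Entries without a timestamp get "" and so sort first within a day.
    return (entry.date, entry.meta.get("timestamp", ""))


entries = [
    Entry(datetime.date(2023, 7, 22), {"timestamp": "09:15:00"}),
    Entry(datetime.date(2023, 7, 21), {}),
    Entry(datetime.date(2023, 7, 22), {"timestamp": "08:00:00"}),
    Entry(datetime.date(2023, 7, 22), {}),
]
entries.sort(key=sort_key)
```

The same key function could be passed to entries.sort() on the list returned by the parser, since it only relies on the date and meta attributes.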

> And just to double check that this should already be possible:
>
> - I'd like to be able to add entries (i.e., account declarations,
> initial balance pads, etc.) via a hook

You can do this as part of the sorting post-processing step, or with a
beancount plugin. See for example the beancount.plugins.auto_accounts
(and other) plugins.
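
As a rough sketch of the plugin approach: a Beancount plugin is a function that takes (entries, options_map) and returns (entries, errors). The example below follows that shape but uses hypothetical stand-in types (Open, Posting, Transaction as plain namedtuples) rather than the real directives from beancount.core.data, so it illustrates the idea only and is not the auto_accounts implementation.

```python
import datetime
from collections import namedtuple

# Hypothetical stand-ins; the real directives have more fields.
Open = namedtuple("Open", ["date", "account"])
Posting = namedtuple("Posting", ["account"])
Transaction = namedtuple("Transaction", ["date", "postings"])


def add_missing_opens(entries, options_map=None):
    """Plugin-shaped function: returns (new_entries, errors)."""
    opened = {e.account for e in entries if isinstance(e, Open)}
    first_use = {}
    for entry in entries:
        if not isinstance(entry, Transaction):
            continue
        for posting in entry.postings:
            account = posting.account
            if account in opened:
                continue
            # Remember the earliest date on which the account is used.
            if account not in first_use or entry.date < first_use[account]:
                first_use[account] = entry.date
    new_opens = [Open(date, account)
                 for account, date in sorted(first_use.items())]
    return new_opens + list(entries), []


entries = [
    Transaction(datetime.date(2023, 7, 21),
                [Posting("Assets:Checking"), Posting("Expenses:Food")]),
    Open(datetime.date(2023, 1, 1), "Assets:Checking"),
]
result, errors = add_missing_opens(entries)
```

Only Expenses:Food gets a new Open entry here, dated at its first use; Assets:Checking already has one.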

Cheers,
Dan

Daniele Nicolodi

Jul 22, 2023, 7:12:14 AM

to bean...@googlegroups.com

On 22/07/23 12:34, Daniele Nicolodi wrote:
> On 21/07/23 23:06, Eric Altendorf wrote:
>> I'm trying to figure out whether I can use the Beangulp import driver
>> with hooks, or if I need to write my own driver to call my importers and
>> do postprocessing. As you may recall, my workflow is atypical, as I
>> have no curated Beancount ledger file; my source of truth are my input
>> data files and the Beancount ledger is a built artifact for running
>> analysis.
>>
>> There are two things I'd like to do that I don't think are currently
>> possible; I'd appreciate feedback on whether these seem like things
>> Beangulp should support (I could contribute a patch), or if I'm better
>> off finding a different solution:
>>
>> - I'd like to deduplicate entries among different importers in a single
>> run, not just dedup against a pre-existing ledger
>
> I was going to reply that this is already supported, then I realized
> that I never merged the patch implementing it
> https://github.com/beancount/beangulp/pull/64 I'm going to rebase and
> merge it ASAP.

I've merged it now. Please give it a spin and let me know if it works
for your use case (or if I broke something when resolving the merge
conflicts).

Cheers,
Dan

Eric Altendorf

Jul 22, 2023, 9:58:33 PM

to bean...@googlegroups.com

On Sat, Jul 22, 2023 at 3:34 AM Daniele Nicolodi <dan...@grinta.net> wrote:

> On 21/07/23 23:06, Eric Altendorf wrote:
>> I'm trying to figure out whether I can use the Beangulp import driver
>> with hooks, or if I need to write my own driver to call my importers and
>> do postprocessing. As you may recall, my workflow is atypical, as I
>> have no curated Beancount ledger file; my source of truth are my input
>> data files and the Beancount ledger is a built artifact for running
>> analysis.
>>
>> There are two things I'd like to do that I don't think are currently
>> possible; I'd appreciate feedback on whether these seem like things
>> Beangulp should support (I could contribute a patch), or if I'm better
>> off finding a different solution:
>>
>> - I'd like to deduplicate entries among different importers in a single
>> run, not just dedup against a pre-existing ledger
>
> I was going to reply that this is already supported, then I realized
> that I never merged the patch implementing it
> https://github.com/beancount/beangulp/pull/64 I'm going to rebase and
> merge it ASAP.

That's great! I have pulled the latest code, and it doesn't seem to be deduplicating the expected items. Let me check my assumptions:

I'm not sure how one is supposed to run multiple importers at once, the doc kind of only describes running one. So I'm currently running with a Python script that builds a list of importers, then runs Ingest, as follows; is this correct, or am I missing some other setup code?

import beangulp

if __name__ == '__main__':
    importers = get_importers()  # my own function that builds the list of importers
    hooks = []
    cli = beangulp.Ingest(importers, hooks).cli
    cli()

The deduplication is supposed to run by default, correct?

There seems to be a fairly good default implementation of similarity comparison, yes?

Deduplication will happen among entries from *different* importers running in the same run, right?

>> - I'd like to be able to emit the output file globally sorted by date
>> (first the official entry date, then secondarily by a timestamp attached
>> to the metadata) rather than grouped by import file. (Broadly this will
>> make it easier for me to debug issues sequentially, and ordering
>> within-day may alleviate some of the issues I've seen with same-day
>> purchase & transfer transactions.)
>
> It is trivial to post-process the output of beangulp to apply any
> ordering you like. Indeed I do something very similar for ledgers.
> Writing from memory:
>
> from beancount.parser import parser, printer
>
> def key(entry):
>     # Sort by entry date, then by the timestamp stored in the metadata.
>     return (entry.date, entry.meta['timestamp'])
>
> # filename is the path to the beangulp output file.
> entries, errors, options = parser.parse_file(filename)
> entries.sort(key=key)
> printer.print_entries(entries)

Hmm, OK, that may work fine, thanks.

>> And just to double check that this should already be possible:
>>
>> - I'd like to be able to add entries (i.e., account declarations,
>> initial balance pads, etc.) via a hook
>
> You can do this as part of the sorting post-processing step, or with a
> beancount plugin. See for example the beancount.plugins.auto_accounts
> (and other) plugins.

Cool, sounds good. I hadn't dug into plugins yet.

Thank you!

eric

> Cheers,
> Dan

Daniele Nicolodi

Jul 23, 2023, 6:19:05 AM

to bean...@googlegroups.com

On 23/07/23 03:58, Eric Altendorf wrote:
> That's great! I have pulled the latest code, and it doesn't seem to be
> deduplicating the expected items. Let me check my assumptions:
>
> I'm not sure how one is supposed to run multiple importers at once, the
> doc kind of only describes running one.

This is a design document, not user documentation. What has been
implemented does not necessarily reflect that design exactly.
Unfortunately the documentation for beangulp is coming along slowly. The
best resource available is probably the examples in the package.

> So I'm currently running with a Python script that builds a list of
> importers, then runs Ingest, as follows; is this correct, or am I
> missing some other setup code?
>
> if __name__ == '__main__':
>     importers = get_importers()
>     hooks = []
>     cli = beangulp.Ingest(importers, hooks).cli
>     cli()

This looks correct, but you can simplify it a tiny bit: the Ingest
object can be called directly:

ingest = beangulp.Ingest(importers, hooks)
ingest()

> The deduplication is supposed to run by default, correct?

It should, but I just found that indeed I overlooked some divergence in
the code paths between the time the intra-importer deduplication patches
were written and the time I merged them. As a result, the intra-importer
deduplication is not run. This is what you get when you leave patches
not merged for years. I'm fixing it now.

> There seems to be a fairly good default implementation of similarity
> comparison, yes?

"fairly good" ?

> Hmm, OK, that may work fine, thanks.

"may work fine" ?

These tools have been written to scratch the itch of the persons that
wrote them. If you think that you can find something better elsewhere,
or that you can write something better yourself, you are free to do so.
Just let me know and I'll process a refund for your support contract.

Daniele Nicolodi

Jul 23, 2023, 6:55:01 PM

to bean...@googlegroups.com

On 23/07/23 12:19, Daniele Nicolodi wrote:
> It should, but I just found that indeed I overlooked some divergence in
> the code paths between the time the intra-importer deduplication patches
> were written and the time I merged them. As a result, the intra-importer
> deduplication is not run. This is what you get when you leave patches
> not merged for years. I'm fixing it now.

It is fixed now.

Cheers,
Dan

Eric Altendorf

Jul 23, 2023, 8:37:16 PM

to bean...@googlegroups.com

On Sun, Jul 23, 2023 at 3:19 AM Daniele Nicolodi <dan...@grinta.net> wrote:

> On 23/07/23 03:58, Eric Altendorf wrote:

...

>> I'm not sure how one is supposed to run multiple importers at once, the
>> doc kind of only describes running one.
>
> This is a design document, not user documentation. What has been
> implemented does not necessarily reflect that design exactly.
> Unfortunately the documentation for beangulp is coming along slowly. The
> best resource available is probably the examples in the package.

Sorry, I just meant to imply that I have tried to follow guidance where I could. Looking at the code was indeed how I got it running, but I was looking for confirmation because sometimes also code isn't up to date or doing things in the recommended ways :)

> This looks correct, but you can simplify it a tiny bit: the Ingest
> object can be called directly:
>
> ingest = beangulp.Ingest(importers, hooks)
> ingest()

Cool. What does the `cli` object do, anyway? I'm not super experienced in Python and I wasn't sure I fully understood what was going on there.

> It should, but I just found that indeed I overlooked some divergence in
> the code paths between the time the intra-importer deduplication patches
> were written and the time I merged them. As a result, the intra-importer
> deduplication is not run. This is what you get when you leave patches
> not merged for years. I'm fixing it now.

Thank you for the fix, I believe it's working for me now!

>> There seems to be a fairly good default implementation of similarity
>> comparison, yes?
>
> "fairly good" ?
>
>> Hmm, OK, that may work fine, thanks.
>
> "may work fine" ?
>
> These tools have been written to scratch the itch of the persons that
> wrote them. If you think that you can find something better elsewhere,
> or that you can write something better yourself, you are free to do so.
> Just let me know and I'll process a refund for your support contract.

I am very sorry, that was a poor choice of words and came across wrong. I had looked over the similarity/deduping code and it appeared to me that it had been written with the expectation that the default setup would work out of the box for many cases users would run into. I just wanted to confirm that was the intention, as opposed to, say, that it was an example implementation only, or only for particular use cases, with the expectation that users ought to override it with their own implementation.

I also have an open source project on github and you provide 100x better service than I ever did :)

eric
