beangulp adds newly extracted entries to existing entries before running hooks?

200 views
Skip to first unread message

Justus Pendleton

unread,
Mar 2, 2025, 11:01:20 PM3/2/25
to Beancount
According to examples/import.py hooks take two parameters

Args:
    extracted_entries_list: A list of (filename, entries) pairs, where
'entries' are the directives extract from 'filename'.
     ledger_entries: If provided, a list of directives from the existing
ledger of the user. This is non-None if the user provided their
ledger file as an option.

Returns:
A possibly different version of extracted_entries_list, a list of
(filename, entries), to be printed.

But "ledger_entries" is never None -- that is, it is non-None even if a user didn't provide their ledger file as an option.

This is because of the deduplicate logic in __init__.py/_extract which extends existing entries before calling hooks

    # Deduplicate.
    for filename, entries, account, importer in extracted:
        importer.deduplicate(entries, existing_entries)
        existing_entries.extend(entries)

    # Invoke hooks.
    for func in ctx.hooks:
        extracted = func(extracted, existing_entries)

This was somewhat surprising to me (especially since it was contrary to the quasi-documentation/comment I quoted) as I wouldn't expect (or want) to have the newly imported entries merged into the existing entries before hooks have run.

Is this intended? Is there another easy way to get the pristine set of entries from within a hook short of just running beancount.loader myself?

My actual use case: I want to know the most recent Balance statement of an account in the ledger, which I am using as a proxy for "last imported date". But the most recent Balance statement I actually find will the one auto-generated by beangulp and then merged into existing_entries.

Cheers,
Justus

Justus Pendleton

unread,
Mar 6, 2025, 2:06:10 AM3/6/25
to Beancount
I'm increasingly convinced this is a bug and submitted a PR to address it.


The deduplicate logic merges the newly imported & deduped entries into the existing entries because it looks at the existing entries to decide what to deduplicate. But handling it this way pollutes the existing_entries for any hooks that get run later. As one other example of the problem: if you run ML training on "existing_data" it is wrong (with varying degrees of wrongness) because your newly imported entries (which btw only have a single leg so aren't fully valid beancount yet) have been merged into the training data.

I think the right way to handle this is just track existing_entries and "newly imported and deduped entries" separately for purposes of deduplication.

Cheers,
Justus

Philip Xie

unread,
May 8, 2025, 6:57:45 PM5/8/25
to Beancount
I'm surprised this hasn't gotten more attention, because in my experience this completely broke smart_importer for me. I couldn't get any matches from the training at all, until I dug around and found the same line you did. I still have it "fixed" locally via a direct edit to the file in my venv, which is totally the proper way to patch things right? 

Great PR, you did more investigating into the intent of the original code than I was able to. Would love to see it merged.
Reply all
Reply to author
Forward
0 new messages