Code Review: Importer with duplicate detection and transformation


Florian Lindner

May 6, 2019, 8:49:57 AM
to Beancount
Hello,

As you might have guessed from my previous questions, I am currently working on importing and de-duplication. For that reason, I wrote an importer for the Frankfurter Sparkasse 1822direkt CSV input data.


* It does de-duplication by computing a hash of each CSV input line and storing it in the metadata as "hash". Entries from the same input line are not imported again.

* It converts "Rechnungsabschlüsse" (periodic account statements) in the CSV files to balance assertions.

* It adds some additional information, such as "empfaenger" (recipient) and "buchungsart" (booking type), as metadata.

* The importer transforms payees based on regular expressions and sets accounts based on Python expressions, which allows for more flexible rules. The latter might be interesting to you.

* It can also be used to transform an existing beancount file and apply the aforementioned transformations.
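
The hash-based de-duplication described above can be sketched roughly as follows. This is only an illustration, not the importer's actual code: the choice of SHA-256, the function names `line_hash` and `filter_new_lines`, and the use of plain dicts to stand in for entry metadata are all assumptions.

```python
import hashlib


def line_hash(line: str) -> str:
    """Return a stable hex digest for one raw CSV line (SHA-256 is an assumption)."""
    return hashlib.sha256(line.strip().encode("utf-8")).hexdigest()


def filter_new_lines(csv_lines, existing_metas):
    """Yield (line, digest) for lines whose digest is not yet stored under "hash".

    existing_metas is an iterable of metadata dicts taken from entries
    that are already in the ledger.
    """
    seen = {meta["hash"] for meta in existing_metas if "hash" in meta}
    for line in csv_lines:
        digest = line_hash(line)
        if digest not in seen:
            # Remember the digest so a repeated line is skipped as well.
            seen.add(digest)
            yield line, digest
```

One consequence of hashing the whole line: two genuinely distinct transactions that produce byte-identical CSV lines would be treated as duplicates, so a real importer may need an extra disambiguator.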

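The payee and account rules mentioned in the list above might look something like this. It is a minimal sketch under stated assumptions: the rule tables, patterns, payee names, account names and both function names are hypothetical, and evaluating rule strings with `eval` is just one possible way to support "Python expressions".

```python
import re

# Hypothetical rule tables; patterns, payees and accounts are made up.
PAYEE_RULES = [
    (re.compile(r"REWE|EDEKA", re.IGNORECASE), "Supermarket"),
]
ACCOUNT_RULES = [
    ("payee == 'Supermarket'", "Expenses:Groceries"),
    ("amount < 0 and 'MIETE' in narration.upper()", "Expenses:Rent"),
]


def normalize_payee(raw):
    """Return the payee of the first matching regex rule, else the raw text."""
    for pattern, payee in PAYEE_RULES:
        if pattern.search(raw):
            return payee
    return raw


def pick_account(payee, narration, amount, default="Expenses:Unknown"):
    """Return the account of the first rule expression that evaluates truthy."""
    names = {"payee": payee, "narration": narration, "amount": amount}
    for expr, account in ACCOUNT_RULES:
        # eval() can run arbitrary code; rules must come from a trusted file.
        if eval(expr, {"__builtins__": {}}, names):
            return account
    return default
```

Expressions are more flexible than plain regexes because one rule can combine several fields, at the cost of having to trust whoever writes the rule file.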

Questions / Remarks:

* Is "hash" the best metadata key name for storing the hash? Is there some notion of hidden or internal-use-only metadata names, such as "__hash__" (which is invalid, as bean-check told me)?

* UTF-8 in metadata key names would be nice; for me specifically, the German umlauts (ö, ü, ä).

* I am open to any other suggestions or remarks, as this is my first piece of code using the Beancount API.

Best, and thanks,
Florian

Martin Blais

May 7, 2019, 12:41:23 AM
to Beancount
On Mon, May 6, 2019 at 8:50 AM Florian Lindner <mailin...@xgm.de> wrote:

* Is "hash" the best metadata key name for storing the hash? Is there some notion of hidden or internal-use-only metadata names, such as "__hash__" (which is invalid, as bean-check told me)?

SGTM
In theory I've tried pretty hard to avoid having Beancount itself use metadata and to leave it to users, but a few instances of special keys have crept in:

bergamot [hg|default]:~/p/beancount$ grep -Esrn "^[A-Z][A-Z_]+ = ['\"]__[a-z]+"  beancount
beancount/core/interpolate.py:199:AUTOMATIC_META = '__automatic__'
beancount/core/interpolate.py:202:AUTOMATIC_RESIDUAL = '__residual__'
beancount/core/interpolate.py:205:AUTOMATIC_TOLERANCES = '__tolerances__'
beancount/ingest/extract.py:31:DUPLICATE_META = '__duplicate__'

I'd like to remove these eventually and put this information in the schema at a more appropriate place.
Just avoid the __...__ names and you should be alright.
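
As a rough illustration of that advice, an importer could check its own metadata keys before writing them. This is a sketch, not Beancount code: `is_safe_meta_key` is a hypothetical helper, and the regular expression is only an assumed approximation of the lexer's KEY rule (a lowercase ASCII letter followed by ASCII letters, digits, "-" or "_"); check lexer.l for the authoritative definition.

```python
import re

# Assumed approximation of the KEY token; not copied from lexer.l.
KEY_RE = re.compile(r"[a-z][a-zA-Z0-9_-]+")


def is_safe_meta_key(key: str) -> bool:
    """Return True for keys that look valid and are not __dunder__ names."""
    if key.startswith("__") and key.endswith("__"):
        # Names such as __duplicate__ are used internally by Beancount.
        return False
    return KEY_RE.fullmatch(key) is not None
```

Under this approximation, "hash" and "import-hash" pass, while "__hash__" and a key containing an umlaut such as "empfänger" do not.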

 

* UTF-8 in metadata key names would be nice; for me specifically, the German umlauts (ö, ü, ä).

Not impossible and perhaps not even difficult; somebody else has already done the legwork to add UTF-8 to account names.
You'd have to change the KEY token in lexer.l to use some of the UTF-* definitions carefully.
See rule key_value in grammar.y
It's pretty isolated, I don't think it would break much else.

