Re: ledger-autosync: CSV support


Martin Blais

Aug 7, 2016, 1:05:35 PM
to ledger-cli, hle...@googlegroups.com, Beancount
On Sun, Aug 7, 2016 at 12:36 PM, Simon Michael <si...@joyful.com> wrote:
Using a checksum for deduplication won't handle identical CSV records well, right ? Those are unlikely with our usual banks but I think quite possible if you consider CSV data generally.

Here's my recent plan for hledger.

Why reinvent the wheel? 
Write a custom HLedger printer (very simple) and reuse all the Beancount tools for importing.
See also:


Let reading CSV files work as it currently does, but add a separate import command which does some additional things: append the converted entries to the main journal file, and save a position marker in CSVFILE.lastimport. Also when reading, if the marker file is found, skip any CSV records before the marked position.

This much is generic and could be used with any data format, but I think it makes sense mainly for "import" data which you want to move into a main journal, and which is somewhat sequential, eg CSV, OFX, QIF. It would be harder and is probably not needed for eg journal or timeclock files.

For CSV specifically, I'm thinking the position marker will be the last CSV record processed. It could be elaborated later to consider timestamps, checksums etc. if needed.





On 8/6/16 10:19 PM, Erik Hetzner wrote:
Hi Martin,

On Sat, 06 Aug 2016 21:16:40 -0700,
Martin Blais <bl...@furius.ca> wrote:


Storing a checksum for the imported row suffers from the problem that if
the user does not immediately copy the result of our conversion into the
ledger, the row will not be offered for import again and could get lost.

Beancount cross-checks extracted transactions against the contents of its
destination ledger, but because the user often massages the transactions it
has to use heuristics in order to perform an approximate match to determine
which transactions have already been seen. The heuristic I have in place
doesn't work too well at the moment (but it could be improved easily to be
honest).

A better idea would be to store a unique tag computed from the checksum of
the input row and to cross-check the imported transactions against that
special tag. That uses both your insight around validating the input
instead of the resulting transaction, and uses the ledger instead of a
temporary cache. It's the best of both worlds.

Thanks for the comment. I'm not sure I understand the distinction you are making
here. What I do (and I admit I only thought it through for a few minutes, as I
don't actually use Mint but just wanted a simple CSV format for examples) is:

1. Take the input key-value pairs for the row, e.g. Date=2016/01/10
2. Sort by key
3. Generate a string from the key-value pairs and calculate the MD5 checksum
4. Check against a metadata value in ledger using the checksum,
   a. If the row has already been imported, do nothing
   b. If the row is new (no match), import it.

Here is an example of a generated ledger transaction:

2016/08/02 Amazon
    ; csvid: mint.a7c028a73d76956453dab634e8e5bdc1
    1234                                      $29.99
    Expenses:Shopping                        -$29.99

As you can see, the csvid metadata field is what we query against using ledger
to see if the transaction is already present.
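The steps above can be sketched in Python. This is only an illustration of the idea, not ledger-autosync's actual code; the joining format and the "mint" prefix are assumptions based on the example above.

```python
import hashlib

def csvid(row, prefix="mint"):
    """Compute a deduplication tag for one CSV row: sort the key-value
    pairs, join them into a canonical string, and take its MD5 checksum.
    (Sketch only; the serialization format here is an assumption.)"""
    canonical = ";".join(f"{k}={v}" for k, v in sorted(row.items()))
    digest = hashlib.md5(canonical.encode("utf-8")).hexdigest()
    return f"{prefix}.{digest}"

row = {"Date": "2016/08/02", "Description": "Amazon", "Amount": "29.99"}
tag = csvid(row)
# The importer would then query the ledger for a matching `csvid:`
# metadata value and skip the row if one is found.
```

Because the pairs are sorted before hashing, the tag is stable regardless of the column order in the CSV file.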

Similarly, Beancount has a powerful but admittedly immature CSV importer
in the works:
https://bitbucket.org/blais/beancount/src/9f3377eb58fe9ec8cfea8d9e3d56f2446d05592f/src/python/beancount/ingest/importers/csv.py

I've switched to using this and CSV file formats whenever I have them
available - banking, credit cards, 401k.

I'd like to add a routine to auto-detect the columns eventually; at the
moment, they must be configured when creating the importer
configuration.

Thanks for the pointer - it does look a lot more flexible than my
implementation.

I decided it was simpler, for my needs, to require a new class for each type of
CSV file. It was too much trouble to try to make it configurable. The core code
handles reading the CSV file, deduplicating, and all of that. The CSV class
simply implements a `convert(row)` method which returns a `Transaction` data
structure. I hope that if others need to import a particular type of CSV file,
e.g. from their bank, they can contribute that back to the project.
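A minimal sketch of that per-format class design, assuming hypothetical field names and a simplified Transaction shape (the real ledger-autosync classes are richer than this):

```python
import csv
import io
from dataclasses import dataclass

@dataclass
class Transaction:
    date: str
    payee: str
    amount: str

class BankCsvConverter:
    """One subclass per CSV dialect; only convert(row) is
    format-specific, while the core code handles reading and
    deduplication. Column names here are hypothetical."""
    def convert(self, row):
        return Transaction(date=row["Date"],
                           payee=row["Description"],
                           amount=row["Amount"])

data = "Date,Description,Amount\n2016/08/02,Amazon,-29.99\n"
converter = BankCsvConverter()
txns = [converter.convert(r) for r in csv.DictReader(io.StringIO(data))]
```

Supporting a new bank then means writing one small subclass rather than a configuration language.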

best, Erik
--
Sent from my free software system <http://fsf.org/>.



--

You received this message because you are subscribed to the Google Groups "Ledger" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ledger-cli+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Erik Hetzner

Aug 7, 2016, 1:17:39 PM
to ledge...@googlegroups.com, hle...@googlegroups.com, bean...@googlegroups.com, Simon Michael
Hi Simon,

On Sun, 07 Aug 2016 09:36:13 -0700,
Simon Michael <si...@joyful.com> wrote:
>
> Using a checksum for deduplication won't handle identical CSV records well,
> right ? Those are unlikely with our usual banks but I think quite possible if
> you consider CSV data generally.

No, that could be an issue. But with a date + unique amount it seems unlikely.

The MD5 method is only the basic method. If your CSV file provides a unique ID
field, which PayPal's does, that is used instead.

> Here's my recent plan for hledger. Let reading CSV files work as it currently
> does, but add a separate import command which does some additional things:
> append the converted entries to the main journal file, and save a position
> marker in CSVFILE.lastimport. Also when reading, if the marker file is found,
> skip any CSV records before the marked position.
>
> This much is generic and could be used with any data format, but I think it
> makes sense mainly for "import" data which you want to move into a main journal,
> and which is somewhat sequential, eg CSV, OFX, QIF. It would be harder and is
> probably not needed for eg journal or timeclock files.
>
> For CSV specifically, I'm thinking the position marker will be the last CSV
> record processed. It could be elaborated later to consider timestamps, checksums
> etc. if needed.

That does seem a lot simpler than the method ledger-autosync uses. On the other
hand it ties the system to ensuring that file ordering does not change. Would it
work with multiple input sources? Another tricky issue might be overlap. A nice
feature of ledger-autosync is that it can handle overlapping imports, e.g.
importing from a file where some portion of the transactions are new and some
portion are old (known).

best, Erik

PS: I’ve also updated the ledger-autosync README to include some basic
information about tracking 401k and investment accounts. ledger-autosync now
includes code to (mostly) properly import these types of transactions.

Simon Michael

Aug 7, 2016, 3:49:36 PM
to bean...@googlegroups.com, ledge...@googlegroups.com, hle...@googlegroups.com
On 8/7/16 10:17 AM, Erik Hetzner wrote:
> That does seem a lot simpler than the method ledger-autosync uses. On the other
> hand it ties the system to ensuring that file ordering does not change.

Yes, it's for that case. For more randomly changing data, checksums
would be more important.

> Would it work with multiple input sources?

It should; each input file/source has its own position marker.

> Another tricky issue might be overlap. A nice
> feature of ledger-autosync is that it can handle overlapping imports, e.g.
> importing from a file where some portion of the transactions are new and some
> portion are old (known).

That's the point of this scheme - to reliably import just the new CSV
records from an intermittently updated bank download.

I like the idea of the position marker (contents of FILE.lastimport)
being flexible, so the deduplication strategy can be changed without
changing the UI.

Writing extra files is a bit of clutter, but hledger users are used to
seeing FILE.csv.rules, and I think it's better to keep them visible (not
dot files) to remind that they exist, can be edited/deleted to adjust
the last import position, and might contain sensitive data.

> PS: I’ve also updated the ledger-autosync README to include some basic
> information about tracking 401k and investment accounts. ledger-autosync now
> includes code to (mostly) properly import these types of transactions.

Nice, docs++.


Jeremy Maitin-Shepard

Aug 9, 2016, 12:54:40 AM
to bean...@googlegroups.com, ledge...@googlegroups.com, hle...@googlegroups.com
I handle this problem by including both the date and the precise description text as metadata on each posting imported from or matched with an entry in the CSV file. This allows reliably determining which CSV entries have yet to be matched or imported, and it detects problems when a given (date, description_text) pair occurs more times in the beancount journal than in the CSV file. (Note that counting occurrences handles the case of two CSV entries with the same date and the exact same description text that correspond to two different transactions.) It also makes it possible to train a classifier that automatically labels CSV entries with the appropriate "other" account, since the association with the original CSV description text is preserved.
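The multiset comparison Jeremy describes can be sketched like this. The data shapes (plain (date, description) tuples) are assumptions for illustration, not Jeremy's actual code:

```python
from collections import Counter

def unmatched_csv_entries(csv_entries, journal_postings):
    """Compare (date, description) pairs as multisets: each CSV entry
    consumes at most one matching journal posting, so duplicate rows
    are handled by counting occurrences. Sketch only."""
    have = Counter(journal_postings)   # (date, description) pairs
    unmatched = []
    for key in csv_entries:
        if have[key] > 0:
            have[key] -= 1             # consume one matched posting
        else:
            unmatched.append(key)
    # Leftover counts mean the journal has more copies of a pair than
    # the CSV does, which signals a problem.
    extras = +have                     # keep only positive counts
    return unmatched, dict(extras)
```

Keeping the raw description text in the journal also supplies labeled training data, since each posting retains its link to the original CSV wording.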




--
To view this discussion on the web visit https://groups.google.com/d/msgid/beancount/no83bo%24s77%241%40blaine.gmane.org.