Thinking about this more, there's the potential for a nice big project independent of all our Ledger implementations, to deal with external data. Here's the idea, five components of a single project:
- "Fetching": code that can automatically obtain the data by connecting to various data sources. The ledger-autosync attempts to do this using ofxclient for institutions that support OFX. This could include a scraping component for other institutions.
- "Recognition": given a filename and its contents, automatically guess which institution and account it is for. Beancount's import package deals with this by allowing the user to specify a list of regexps that the file must match. I'm not entirely sure this can always be done irrespective of the user, as the account-id is often a required part of a regexp, but it might. This is used to automate "figuring out what to do" given a bunch of downloaded files in a directory, a great convenience. There is some code in ledger-autosync and the beancount.sources Python package.
- "Extraction": parse the file, CSV or OFX or otherwise, and extract a list of double-entry transactions data structures from it in some sort of generic internal format, independent of Ledger / HLedger / Beancount / other. The Reckon project aims to do this for CSV files.
- "Export": convert the internal transactions data structure to the syntax of one particular double-entry language implementation, Ledger or other. This spits out text.
Martin,

I really like the idea of a staged system, perhaps with a set of programs and drivers (see below). I'd be interested in helping with a project along these lines. Unfortunately my programming skills are rusty, but I work with a colleague who might help out.

My own processing approach is similar to yours. Apologies for the length and level of detail. I have not looked at Reckon in detail yet, so perhaps some of these ideas are already employed in other manners. My comments on each stage (and one of my own, added) are below...

--Andy
On Tuesday, February 11, 2014 3:40:41 PM UTC-5, Martin Blais wrote:
Thinking about this more, there's the potential for a nice big project independent of all our Ledger implementations, to deal with external data. Here's the idea, five components of a single project:
- thanks for dissecting things so nicely.
- "Fetching": code that can automatically obtain the data by connecting to various data sources. The ledger-autosync attempts to do this using ofxclient for institutions that support OFX. This could include a scraping component for other institutions.
- the output of this stage would be a number of files of different formats -- OFX, a spectrum of CSV file formats, and others.
- "Recognition": given a filename and its contents, automatically guess which institution and account it is for. Beancount's import package deals with this by allowing the user to specify a list of regexps that the file must match. I'm not entirely sure this can always be done irrespective of the user, as the account-id is often a required part of a regexp, but it might. This is used to automate "figuring out what to do" given a bunch of downloaded files in a directory, a great convenience. There is some code in ledger-autosync and the beancount.sources Python package.
- I really like the approach CSV2Ledger takes with its FileMatches.yaml file (https://github.com/jwiegley/CSV2Ledger/blob/master/FileMatches.yaml). I think defining a spec for FileMatches.yaml that Perl, Python, or whatever code could employ for the following stages might be worthwhile. FileMatches.yaml (or the equivalent) would provide key information for future processing stages of files from different sources. For CSV files, information about field separators, field names, a regex for "real" records, etc. can be specified here. The result of "Recognition" would be to pass the file off to a customized driver (see my next comment, and the sketch below).
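A hedged sketch, in Python, of what this recognition-plus-dispatch could look like. The rule keys (institution, filename, content, driver) are my own invention for illustration, not CSV2Ledger's actual FileMatches.yaml schema:

    # Sketch of the "Recognition" stage: match a downloaded file against
    # per-institution rules. The rule structure is hypothetical.
    import re

    RULES = [
        {"institution": "examplebank",           # hypothetical institution key
         "filename": r"statement.*\.csv$",       # pattern on the file name
         "content": r"Date,Description,Amount",  # pattern the contents must match
         "driver": "drivers/examplebank_csv"},   # driver to hand the file to
    ]

    def recognize(filename, contents):
        """Return the first rule matching both filename and contents, or None."""
        for rule in RULES:
            if re.search(rule["filename"], filename) and re.search(rule["content"], contents):
                return rule
        return None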
- "Extraction": parse the file, CSV or OFX or otherwise, and extract a list of double-entry transactions data structures from it in some sort of generic internal format, independent of Ledger / HLedger / Beancount / other. The Reckon project aims to do this for CSV files.
- I suggest employing small driver programs, written by others, that ingest custom formats. The path to the appropriate driver program would be included in the FileMatches.yaml file (or its equivalent). These drivers would ingest files output by the "Fetching" stage and generate the "generic internal format" you mention. However, in support of flexibility, I suggest that the result of this stage be a CSV file in a strictly specified format, to be processed by the next stage (a sketch follows below).
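Here is a hypothetical driver along those lines, in Python. The column names on both sides are illustrative; no intermediate schema has actually been specified:

    # Hypothetical driver for the "Extraction" stage: ingest one institution's
    # CSV export and emit a strictly specified intermediate CSV.
    import csv
    import sys

    INTERMEDIATE_FIELDS = ["date", "payee", "amount", "currency"]  # assumed schema

    def extract(infile, outfile):
        reader = csv.DictReader(infile)            # institution-specific columns
        writer = csv.DictWriter(outfile, fieldnames=INTERMEDIATE_FIELDS)
        writer.writeheader()
        for row in reader:
            writer.writerow({
                "date": row["Transaction Date"],   # illustrative source column names
                "payee": row["Description"],
                "amount": row["Amount"],
                "currency": "USD",                 # assume this bank reports only USD
            })

    if __name__ == "__main__":
        extract(sys.stdin, sys.stdout)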
- I add an additional stage here that I'll call "AccountAssignment". I examine several fields of the imported record (things like employeeID, PONumber, etc. that are associated with the transaction) to determine which DEB account name to assign it to. Account names for all DEB systems should be hierarchical, so this could still be done in a DEB-software-agnostic manner. A more sophisticated version of CSV2Ledger's PreProcess.yaml (https://github.com/jwiegley/CSV2Ledger/blob/master/PreProcess.yaml) could help drive this stage. The output of this stage is the same CSV as above with a "DEBAccount" field appended to each record (see the sketch below).
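A sketch of this "AccountAssignment" stage in Python; the rule fields and account names are invented for illustration:

    # Apply ordered rules to each intermediate-CSV record and append a
    # "DEBAccount" field. Rule fields and accounts are hypothetical.
    import re

    ASSIGNMENT_RULES = [
        # (field to inspect, regex, hierarchical account to assign)
        ("payee", r"(?i)grocery|supermarket", "Expenses:Food:Groceries"),
        ("PONumber", r"^PO-42", "Expenses:Office:Supplies"),
    ]
    DEFAULT_ACCOUNT = "Expenses:Uncategorized"

    def assign_account(record):
        """Return the record with a DEBAccount key appended."""
        for field, pattern, account in ASSIGNMENT_RULES:
            if field in record and re.search(pattern, record[field]):
                record["DEBAccount"] = account
                return record
        record["DEBAccount"] = DEFAULT_ACCOUNT
        return record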
- "Export": convert the internal transactions data structure to the syntax of one particular double-entry language implementation, Ledger or other. This spits out text.- I once again like the approach of CSV2Ledger.pl (see source code at https://github.com/jwiegley/CSV2Ledger/blob/master/CSV2Ledger.pl#L138). It allows for the FileMatches.yaml file to include a variable called TxnOutputTemplate that specifies how to setup the ledger-cli transaction in your journal file. A similar templating approach could be used for other double-entry language file formats.
Here is the design doc for it: Design Doc for LedgerHub. Please (anyone) feel free to comment in the margins (right-click -> Comment...).
Just a thought on the internal format of a library: I would be tempted to use OFX as an internal format, and then convert from there to ledger/beancount format. Because OFX is a well-defined format, it should hold any kind of financial data without problems. This would also make it easier for other tools to adopt, because they might already have an OFX import function.
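If OFX were adopted that way, extraction could lean on an existing parser rather than a custom one. A hedged sketch using the third-party ofxparse Python library, assuming a single-account statement file (the filename is hypothetical):

    # Read transactions out of an OFX file with the third-party ofxparse
    # library, if OFX were used as the internal interchange format.
    from ofxparse import OfxParser

    with open("statement.ofx", "rb") as f:   # hypothetical downloaded file
        ofx = OfxParser.parse(f)

    # Each parsed transaction carries a date, payee, and amount, ready to be
    # mapped onto whichever double-entry syntax the "Export" stage targets.
    for txn in ofx.account.statement.transactions:
        print(txn.date, txn.payee, txn.amount)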
- "Fetching": code that can automatically obtain the data by connecting to various data sources. The ledger-autosync attempts to do this using ofxclient for institutions that support OFX. This could include a scraping component for other institutions.- "Recognition": given a filename and its contents, automatically guess which institution and account it is for.
Martin et al,

Nice initiative. A suggestion:

On Wed, Feb 12, 2014 at 2:10 AM, Martin Blais <bl...@furius.ca> wrote:
- "Fetching": code that can automatically obtain the data by connecting to various data sources. The ledger-autosync attempts to do this using ofxclient for institutions that support OFX. This could include a scraping component for other institutions.
- "Recognition": given a filename and its contents, automatically guess which institution and account it is for.

The "fetching" module already has information about where the data was downloaded from. Wouldn't it be better to retain this meta-data somewhere, to help with the "recognition" / "identification" step?
Recognition from a filename or contents alone seems flaky to me.
The "fetching" module already has information about where the data was downloaded from. Wouldn't it be better to retain this meta-data somewhere, to help with the "recognition" / "identification" step?
Hmmm, that's an interesting idea. But I'm not convinced we can effectively implement fetching reliably yet. I like to be able to recognize just from files stashed in ~/Downloads.