Automation being the key to success (aka Know Thyself) featuring ofxclient and Selenium WebDriver


TRS-80

Jul 15, 2020, 2:32:33 PM
to bean...@googlegroups.com
OK, so this got a little long. Go grab yourself your favorite tasty,
cold, delicious adult beverage and get comfortable. I did put in some
headings at least to mitigate the wall of text. :D

I am on my second or third go 'round with Beancount, over a period of
some years. I have had various levels of success or failure, but for me
personally, I really felt like I was the most successful when automating
as much as possible.

This post will be about how I arrived at that conclusion, and some
things I learned along the way. I hope it ends up being useful to others
who may have similar ideas, but perhaps have not put all the pieces
together yet. Or maybe need some encouragement, or...?

* Know Thyself

I guess I felt the need to make this post because Martin himself,
throughout the docs, seems to put forth his more "manual" way of doing
things (as a way to keep more "in touch" with his numbers, if I am
reading him correctly).

But perhaps I read too much into that. I can only say that generally I
do try to follow the recommendations made in the docs by the founder and
those more involved in a particular project (especially when I am just
starting out). I suppose I figure there must be some reason for it, even
if I do not yet understand what those reason(s) might be... So maybe
this is just me finally gaining enough experience to know what is what,
and perhaps more importantly "knowing myself" enough to recognize what
works for me, personally.

If you prefer the more "manual" approach (or any other approach, for
that matter) I encourage you to "do what works for you." Thankfully we
have such flexible tools available to us...

* Automate as much as Possible

For me, what seems to work is "as much automation as possible." I still
end up manually doing some stuff of course, but if I can get that down
to 10% or 5% (or whatever), the way I see it I have eliminated 90-95%
of the work (and drudgery) involved. I mean, this is what computers are
best at, isn't it?

* Moving from CSV to OFX import

Along those lines, I recently moved from CSV import to OFX. It's still
early, but I am well on my way to nearly /completely/ automating my
download, import, and categorization.

Before, with CSV, I had to log on to my bank and click through stuff,
save the file (and then remember my file naming scheme), etc. Sometimes
that just became too much friction, and sooner or later I would start
falling behind from the simple drudgery of it all.

Further complicating the issue, my bank only keeps "transactions" around
for 90 days, so if I got busy or fell behind, I would be back to
/manually/ entering any "missed" transactions (yeah, right!).

Enter OFX (via ofxclient), which solves these problems by being a
completely scriptable (and thus automatable) tool.

There were a couple of bugbears with ofxclient[0], however. The guy is
not really actively maintaining it. However, after fixing a couple of
missing apostrophes I /finally/ got it to work. I guess my Python must
be getting a little better, because 1-2 years ago I had already failed
once or twice at this exact same task. :)

So, on to the next hurdle...

* No (built in) OFX "categorizer"

Anyway, it was then a little disappointing to learn that there is no
callable "categorizer" available in the OFX importer example the same
way that there was in the CSV importer example.

That is, until I found a recent post titled "Categorizing transactions
automatically on import", which solved that particular part of the
problem. I left a more fleshed out example as a reply to that thread for
anyone who is interested (search the mailing list for that or "OFX
categorizer" etc.).
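
For anyone following along, the general idea of a categorizer can be
sketched without any Beancount machinery at all. The following is a
minimal, standalone illustration. It is not the code from the thread
mentioned above and not part of Beancount's importer API: the rule
patterns, the account names, and the function name `categorize` are all
made up for the example.

```python
import re

# Hypothetical rules: (pattern matched against the narration, account).
# Patterns and account names are invented for illustration only.
RULES = [
    (re.compile(r"GROCER|SUPERMARKET", re.IGNORECASE), "Expenses:Food:Groceries"),
    (re.compile(r"SHELL|EXXON", re.IGNORECASE), "Expenses:Auto:Fuel"),
    (re.compile(r"PAYROLL", re.IGNORECASE), "Income:Salary"),
]

DEFAULT_ACCOUNT = "Expenses:Uncategorized"


def categorize(narration, rules=RULES, default=DEFAULT_ACCOUNT):
    """Return the account of the first rule whose pattern matches."""
    for pattern, account in rules:
        if pattern.search(narration):
            return account
    # Nothing matched: leave it for manual review under a catch-all account.
    return default
```

In a real importer the returned account would presumably be used to add
the second posting to each imported transaction; rules are checked in
order, so more specific patterns should come first.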

* Next steps (Selenium WebDriver)

At this point I am satisfied enough with my progress (and have learned
enough) that I felt it would be worth sharing that progress with others.
But already I am looking forward to next steps. And I am getting excited
about Beancount again. :)

Over the last few days I have been reading the docs for Selenium
WebDriver. I have heard about Selenium before of course, but what I
think motivated me to really give it a try now was an article I recently
came across over at plaintextaccounting.org[1] by Lee Yingtong Li titled
"Using selenium to scrape/import bank transactions for ledger-cli."[2]
This is a quite recent article (2020-04-29), as you can see by the link.

Anyway, he is using it to get his "transactions", but that is not what I
plan on using it for (I have OFX for that). For me, the only remaining
piece of the puzzle that is left to automate is...

* Automatically downloading PDF statements

Like my "transactions", downloading these PDF "statements" was an
exercise in drudgery, for all the reasons already mentioned above
(clicking through the bank website, remembering the file naming
convention, etc.).
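
The "remembering the file naming convention" part, at least, is easy to
hand over to a function. Below is a small sketch that assumes a
date-prefixed scheme (YYYY-MM-DD first, so files sort chronologically).
The scheme itself, the function name `statement_filename`, and its
parameters are hypothetical and not taken from this post.

```python
import datetime
import re


def statement_filename(statement_date, institution, account, ext="pdf"):
    """Build a name like '2020-06-30.my-bank-checking.statement.pdf'."""

    def slug(text):
        # Lowercase, then turn every run of non-alphanumerics into one '-'.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    return (
        f"{statement_date.isoformat()}."
        f"{slug(institution)}-{slug(account)}.statement.{ext}"
    )
```

For example, `statement_filename(datetime.date(2020, 6, 30), "My Bank",
"Checking")` gives `2020-06-30.my-bank-checking.statement.pdf`.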

First I tried doing this through the OFX protocol itself. And maybe
there is a way? The standard would seem to indicate maybe there is. But
I made posts about this not only here but on the ledger mailing list
before, and have received exactly zero replies so far (which is also why
I am not even going to bother looking them up in order to link to them).
So I gave up on that way (for now).

So then I got the idea to maybe automate this drudgery using Selenium
(WebDriver).

* Arguments for Selenium WebDriver (in general)

Now, I have not even got this actually working yet, and the
implementation details will of course be very bank (web site) dependent.
So why bother bringing it up now (or at all, for that matter)?

Well, for the same reason as posted very early on: mainly I have heard
this sort of thing referred to as "too much trouble" and took that
assessment at face value. But is it? Some things I learned in my
research the last few days started to change my mind:

1. So far, the Selenium WebDriver docs[3] seem to be very good. Simple
   and to the point.

2. There are bindings for several different languages. And the language
   bindings (I was looking at Python mostly) seem to be quite clean,
   straightforward, and easy to remember / intuitive.

3. It appears to be quite a mature and reliable thing nowadays, with
   browser vendors like Google and Mozilla (and others) actually
   maintaining their own drivers for each particular browser. No more
   "PhantomJS" and feeling like you are in some never-ending cat and
   mouse game with an opponent.

4. Not only that, apparently the whole notion of automated browser /
   site testing has actually become a W3C Recommendation by now(!). [4]

It really appears to me to be a completely different dynamic nowadays.
Therefore I would challenge the notion that the ROI is not there. Not
only is this looking quite easy but, dare I say, /well supported/
even! :)

Of course if I run into some brick wall (or get along swimmingly) I will
try and make some time and remember to report back in either case. :)
Which leads me into my final point...

* Choice of tools

At some point during this whole adventure (a while back) I thought long
and hard about choice of tools.

There are other ways to accomplish "automation." Mainly online
"aggregators" like Plaid, Mint, and probably some others. I actually had
signed up for a Plaid developer account at one point, before getting
ofxclient working. Those are certainly viable, perhaps even preferable,
depending on your personal proclivities. But not for me, and here is
why.

First, it is a matter of dependence. Do I want to come to rely on some
centralized service, which could change its API or "developer" terms at
any time and lock me out? Personally, no, I do not.

The second aspect is trust/security. Do I really trust a third party to
hold all my various banking credentials? Personally, no, I do not.

And finally, independence and learning new skills in general. We all
have very limited resources (mostly time). Do I want to spend my
valuable time learning one particular (likely proprietary) API? Or
should I instead spend it learning a much more general (and F/LOSS) tool
(like Selenium) which also has the benefit of being able to solve lots
of other problems, in addition to this particular one I am trying to
solve right now? Personally, <s>I think</s> I know that I prefer the
latter.

So that is why I have chosen to go this particular route.

I'd love to hear anyone's thoughts on any or all of the above. Please
also chime in if you have gotten stuck at any particular point along the
way, and maybe I (or others) can help you get un-stuck. Thanks for
sticking with me if you made it this far. :)

Cheers,

TRS-80

[0] https://github.com/captin411/ofxclient
[1] https://plaintextaccounting.org/#articles-blog-posts
[2] https://yingtongli.me/blog/2020/04/29/hbs-scrape.html
[3] https://www.selenium.dev/documentation/en/webdriver
[4] https://www.w3.org/TR/webdriver1

Justus Pendleton

Jul 16, 2020, 9:26:47 AM
to Beancount
I'm somewhat amused that you don't have time to spend five minutes once a month downloading CSV files but you have the many, many hours required to investigate, implement, and write about your alternative that saves five minutes a month :)

Your approach won't work with any institution that has 2FA and I can't imagine not having 2FA on my financial accounts.

Also, having spent many years dealing with substantial Selenium test suites, I found them extremely brittle, and they required a surprising amount of ongoing maintenance: failures due to timeouts from some DOM element taking too long to arrive, changes in the DOM breaking everything, etc. For a small personal project those may not be as frustrating as they were for a commercial software effort, though.

TRS-80

Jul 16, 2020, 1:17:46 PM
to bean...@googlegroups.com
> On 2020-07-16 09:26, Justus Pendleton wrote:
> I'm somewhat amused that you don't have time to spend five minutes
> once a month downloading CSV files but you have the many, many hours
> required to investigate, implement, and write about your alternative
> that saves five minutes a month :)

If we were analyzing things on a strictly "ROI" basis, I would
acknowledge your point, as the statement part is not so time consuming
as the transactions and the whole rest of the process, and therefore
based on such criteria we would certainly be getting into "diminishing
returns" territory.

However, I stopped using such criteria some years ago. I do things now
because I want to. I enjoy learning new tools and expanding my
skillset, and as I mentioned in OP I expect to be able to automate away
some other drudgery somewhere else in my life at some point using
Selenium WebDriver.

I also detest drudgery.

I do acknowledge that a different set of criteria and personal
proclivities will almost certainly result in different conclusions.

Put differently, as a friend told me in IRC recently, I am "always
nerding." :D

> Your approach won't work with any institution that has 2FA and I can't
> imagine not having 2FA on my financial accounts.

I could tell you that my bank login uses the last 4(8?) of debit card as
a secondary auth (as an alternative to receiving an SMS), but that is
only one particular bank and may even change in future, so overall your
point still stands.

Especially with banks, the whole security thing does make it much more
difficult. So here I plan to do the 90% thing. And by that I mean, I
will be sitting at my computer while this runs. So I can see what is
happening, notice if it breaks, etc. I plan to have Selenium open the
page and then wait while I log in using KeePassXC Auto Type feature.
This already works very well (including the secondary last x of debit
card) and also takes care of the security issue of storing sensitive
credentials. This is one idea I got from the guy's article in OP. Have
Selenium wait for me to log in, look for some element on the logged in
page like "Hi, TRS-80!" or whatever... And then continue.
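
The "wait for me to log in, then continue" step boils down to polling
for a condition with a timeout. Selenium's Python bindings ship a helper
for exactly this (`WebDriverWait(...).until(...)`), but the logic is
simple enough to sketch with the standard library alone. Everything
below is illustrative: `wait_until` is a hypothetical helper, and the
predicate passed to it stands in for whatever check finds the "Hi,
TRS-80!" element on the logged-in page.

```python
import time


def wait_until(predicate, timeout=300.0, interval=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Call predicate() repeatedly until it returns something truthy.

    Returns that truthy value, or raises TimeoutError once `timeout`
    seconds have passed without success. `clock` and `sleep` are
    injectable only to make the function easy to test.
    """
    deadline = clock() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        sleep(interval)
```

With a real browser session the predicate would be something like a
small function that looks up the greeting element and returns it (or
None while the login page is still showing). A generous timeout leaves
room for the manual password-manager step and any secondary auth prompt.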

As an aside, I just sat down this morning to implement this and realized
I want Selenium to open the browser instance on a certain workspace. I
use i3 wm, so this is very easy as there are Python bindings for the
i3-ipc interface. So all from within one Python script I should be
largely able to automate the whole process, including window management
if needed.

> Also, having spent many years dealing with substantial Selenium test
> suites, they are extremely brittle and required a surprising amount of
> ongoing maintenance. Failures due to timeouts from some DOM element
> taking too long to arrive, changes in the DOM breaking everything,
> etc. For a small personal project those may not be as frustrating as
> they were for a commercial software effort, though.

I appreciate the feedback from someone who has been there. This
certainly seems to be the general consensus. Right now I am new to it
and all starry eyed, perhaps after some time I will come to same
conclusions and throw in the towel.

From the little reading I have done so far, it does seem like this (at
least the "waiting for DOM element" thing) has been improved quite a lot
in the meantime. I don't know how long it has been since you used the
tools; maybe it does work better now, or quite possibly I am simply an
over-optimistic noob at this point. :)

However something I have been thinking about (already for quite some
time, at least 1-2 years) is that my primary bank is a smaller local
one, and as far as I can tell they have not (at least visually) changed
the website much, if at all, in that time. Of course I have not been
analyzing the underlying DOM during this time (until now). But the
feeling I get is that this is more of a problem the larger the bank?
Where perhaps they have whole teams of "front end" people just looking
for something to do...

I guess we will find out. :)

TRS-80

Martin Blais

Jul 16, 2020, 10:32:53 PM
to Beancount
TRS-80: I applaud you for automating your downloads using Selenium. It works really quite well and I'm using it myself to download ETF basket compositions (https://github.com/blais/baskets; warning: this isn't polished code). It works well, but I concur with Justus: if you don't run it very regularly you're bound to find out that it breaks as soon as even minor things on the websites change. It's a matter of constant gardening.

Justus; I agree with what you said. It's mostly not worth the effort (unless you're having fun doing it, or you want to update very frequently).

BTW, "baskets" is a project that would benefit from the participation, contribution and constant gardening by a multitude of people, and unlike banks, there's no personal password to downloading these portfolios and there are only a relatively small number of ETF issuers; many would benefit from having a unified source of portfolio compositions for ETFs. There are commercial services selling this and it's expensive. I think we could even extend the codebase to update an associated separate git repo with the latest values for a broad spectrum of these portfolios so that others wouldn't even have to run webdriver themselves (instead of running a database, just using files with a simple API to obtain the data in a common way across all issuers). Right now it downloads the portfolio files to a local cache; I should have had it write to a git repo instead.

Anyhow, I use this to deaggregate my ETF positions and be able to surface my specific exposure to a particular stock, across all positions, across all accounts. Beancount provides the input portfolio. And of course, I don't run it often enough... I had to make fixes to it a few days ago and it's still not fully operational on all my ETFs. So little time...
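
The deaggregation described above is, at its core, a weighted sum: each
ETF position's value is spread over the ETF's constituents according to
their weights, and the results are added up per stock. Here is a minimal
sketch of that arithmetic. It is not the code from the baskets
repository; the function name, the data layout, and the tickers in the
usage note are all hypothetical.

```python
from collections import defaultdict


def look_through_exposure(positions, compositions):
    """Sum the exposure to each underlying stock across all ETF positions.

    positions:    {etf: market value held, already summed over accounts}
    compositions: {etf: {stock: weight}}, weights as fractions of the ETF
    """
    exposure = defaultdict(float)
    for etf, value in positions.items():
        # ETFs with no known composition contribute nothing here.
        for stock, weight in compositions.get(etf, {}).items():
            exposure[stock] += value * weight
    return dict(exposure)
```

With made-up numbers: holding 1000 of ETFA (half AAA, half BBB) and 2000
of ETFB (one quarter AAA, three quarters CCC) gives an AAA exposure of
500 + 500 = 1000, with 500 in BBB and 1500 in CCC.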






Tono Riesco

Jul 17, 2020, 5:43:56 AM
to 'Patrick Ruckstuhl' via Beancount
Hi Martin,

As always, I’m surprised about your code! Thank you for sharing with us.

Looking at your baskets project, I tried to install it to learn how you did it, and I got an error during the install.

Looking inside the code, I saw that in setup.py (which gives me the error) you are referring to:

name="ameritrade"

You didn't mix up the projects, did you? ;-)


Regards.

Tono.

Martin Blais

Jul 17, 2020, 9:22:48 AM
to Beancount

Zahan M

Jul 17, 2020, 12:12:00 PM
to Beancount
I also did my best to eliminate the drudgery and ended up using Plaid for my import scripts: https://github.com/zahanm/collect-beans
They're a company that offers a financial API as a business model. If you sign up as a developer, you can probably make use of their "development" API for personal use for good.
The issue is that in order to provide this API, they basically store your credentials and talk to your bank directly. They also have to maintain the brittle scraping scripts, or negotiate a direct API, but that's their business. They're pretty widely used by services that try to link your financial accounts, so they should have a reasonable security story - but I still don't love the arrangement, and only do it because there isn't an alternative.

Zahan

Daniele Nicolodi

Jul 17, 2020, 12:26:47 PM
to bean...@googlegroups.com
On 17/07/2020 10:12, 'Zahan M' via Beancount wrote:
> I also did my best to eliminate the drudgery and ended up using Plaid
> for my import scripts: https://github.com/zahanm/collect-beans
> They're a company that offers a financial API as a business model. If
> you sign up as a developer, you can probably make use of their
> "development" API for personal use for good.
> The issue is that in order to provide this API, they basically store
> your credentials and talk to your bank directly.

A company that maintains a handy database of bank account credentials:
I bet no criminal is interested in probing their security :-)

I haven't checked their terms of service, but I am pretty sure they
decline any responsibility if your credentials are stolen (also, it
would be very hard to prove that they were indeed stolen from them if
this happens and whoever does it is just smart enough).

Unlike a bank, which is responsible and loses money if it gets hacked,
I don't think they are. Thus, their security posture is probably the
one of most companies out there dealing with customer data: "good
enough", where "enough" is usually "enough to not look completely
stupid if they hack us". And when they get hacked they will have a good
opportunity to sell you their premium "protection plan" and make some
extra money on top (see the Equifax case).

Your personal credentials stored on your own computers are generally OK
(unless you store credentials to accounts that handle millions),
because getting to them is a lot of work for a modest return. But put
many credentials in the same place and it becomes a completely
different game.

Cheers,
Dan

Chris Singley

Jul 17, 2020, 9:51:00 PM
to Beancount
TRS-80,

You may find it reassuring that every professional I know shares a more-or-less extreme bias toward automated workflows.

You might be interested in https://csingley.github.com/ofxtools (that's mine).  It's a much better OFX client than what you're using (if I do say so myself), and it can also parse the downloaded files to extract the transactions/balances.

Cheers.

TRS-80

Jul 18, 2020, 4:34:01 PM
to bean...@googlegroups.com
> On 2020-07-16 22:32, Martin Blais wrote:
> https://github.com/blais/baskets

Well, I suppose this thread was worth it, if only to flush this
heretofore unknown project out from the depths of Martin's secret
laboratory. :D

> Justus; I agree with what you said. It's mostly not worth the effort
> (unless you're having fun doing it, or you want to update very
> frequently).

Welp, as it turns out, that day I got on to doing other things, then
there were Ethernet wires to pull, shelves to put up, etc... So
apparently, even within the subset of things I "want" to do, I still
have not managed to get around to it... ;)

> BTW, "baskets" is a project that would benefit from the participation,
> contribution and constant gardening by a multitude of people, and
> unlike banks, there's no personal password to downloading these
> portfolios and there are only a relatively small number of ETF
> issuers; many would benefit from having a unified source of portfolio
> compositions for ETFs. There are commercial services selling this and
> it's expensive. I think we could even extend the codebase to update an
> associated separate git repo with the latest values for a broad
> spectrum of these portfolios so that others wouldn't even have to run
> webdriver themselves (instead of running a database, just using files
> with a simple API to obtain the data in a common way across all
> issuers). Right now it downloads the portfolio files to a local cache;
> I should have had it write to a git repo instead.

Not something I personally need right now, but FWIW I do agree with
your analysis.

TRS-80

Jul 18, 2020, 4:48:18 PM
to bean...@googlegroups.com
> On 2020-07-17 12:12, 'Zahan M' via Beancount wrote:
> I also did my best to eliminate the drudgery and ended up using Plaid
> for my import scripts: https://github.com/zahanm/collect-beans
> They're a company that offers a financial API as a business model. If
> you sign up as a developer, you can probably make use of their
> "development" API for personal use for good.

I did mention I tried Plaid at some point, although admittedly buried
in the wall of text.

Thanks for sharing your scripts though, at one point I would have
found them handy and hopefully they will be of use to someone else at
some point as well. At least we are getting a good thread going now a
bit about automation.

> The issue is that in order to provide this API, they basically store
> your credentials and talk to your bank directly. They also have to
> maintain the brittle scraping scripts, or negotiate a direct API, but
> that's their business. They're pretty widely used by services that try
> to link your financial accounts, so they should have a reasonable
> security story - but I still don't love the arrangement

Yeah, these were amongst my reasons for not going that route.

> and only do it because there isn't an alternative.

Well, the alternative was presented in OP. :) Pre-requisite being OFX
availability of course (maybe your bank doesn't offer it).

TRS-80

TRS-80

Jul 18, 2020, 5:11:45 PM
to bean...@googlegroups.com
> On 2020-07-17 21:51, Chris Singley wrote:
>
> You may find it reassuring that every professional I know shares a
> more-or-less extreme bias toward automated workflows.

Interdasting. I suppose I always knew this about hacker types, but
didn't realize it extended to the broader professional class.


> You might be interested in https://csingley.github.com/ofxtools
> (that's mine). It's a much better OFX client than what you're using
> (if I do say so myself), and it can also parse the downloaded files to
> extract the transactions/balances.

Wow, nice project you have going there! Now I'm really glad I made
this thread.

I've already been having a look around at your docs and stuff, and so
far (haven't used it yet) but already I can say I do like your stuff a
lot better. Much better documented, a broader base of contributors,
and therefore actually actively maintained! Because you know anything
dealing with OFX is going to break, sooner or later...

I am not sure how I missed you during my initial search. I like to
think my search-fu is usually pretty good, but all I kept coming
across were references to ofxclient, which is essentially an
unmaintained project, unfortunately.

TRS-80

Daniele Nicolodi

Jul 18, 2020, 6:33:24 PM
to bean...@googlegroups.com
On 17/07/2020 10:12, 'Zahan M' via Beancount wrote:
> I also did my best to eliminate the drudgery and ended up using Plaid
> for my import scripts: https://github.com/zahanm/collect-beans
> They're a company that offers a financial API as a business model. If
> you sign up as a developer, you can probably make use of their
> "development" API for personal use for good.
> The issue is that in order to provide this API, they basically store
> your credentials and talk to your bank directly.

By the way, how is this supposed to work with services that require
two-factor authentication? It is not at all a common thing in the US,
but in Europe almost all banks secure online access with two-factor
authentication (usually in the form of a one-time password generation
app or a one-time password generation token).

Cheers,
Dan

Martin Blais

Jul 18, 2020, 9:47:07 PM
to Beancount
On Sat, Jul 18, 2020 at 4:34 PM TRS-80 <trs...@isnotmyreal.name> wrote:
> On 2020-07-16 22:32, Martin Blais wrote:
> https://github.com/blais/baskets

> Well, I suppose this thread was worth it, if only to flush this
> heretofore unknown project out from the depths of Martin's secret
> laboratory.  :D
 
It's more like a damp cave with spider webs and dusty wine bottles

Zahan Malkani

Jul 19, 2020, 2:45:20 PM
to Digest recipients
> I tried Plaid at some point
> Pre-requisite being OFX availability

Ah, sorry - I skimmed and thought it was OFX and Selenium.

Yeah, thanks for the feedback. I agree with the security and proprietary-ness points, but decided the convenience was worth the trade-off for me.
I actually started out with OFX (it's still there in my scripts), and used it for about a year. But it got to a point where the majority of my accounts either didn't have OFX endpoints or had broken ones - and for the broken ones, the bank IT support didn't care. I had to pretend to be a Quicken customer to get them to pay attention.
(slight aside here) I wish OFX had some real commitment behind it - maybe if compliance were required by law, and it were modernised to have OAuth semantics (so that you're not forced into single-factor auth); but alas it doesn't look like anyone is going to make it happen - there's just no incentive for banks to expose the data of ours that they hold in custody.
At some point, I gave in, and I'm happy with the reliability of Plaid, and they cover all the institutions I deal with.


> work with services that require two factor authentication?

The way it works is that you authenticate with the bank, with single or two-factor auth, and Plaid then stores some sort of session token (either a cookie, or a token from the bank's API if one exists). If that token expires, you'll need to re-authenticate; in my experience I need to do that about every 6 months.


TRS-80

Jul 20, 2020, 11:15:25 AM
to bean...@googlegroups.com
After playing with OFX a bit, I realized there are some things I don't
like about it. As a reminder, I am coming from CSV import. Maybe
these are particular to my bank implementation or not, I don't know;
but here they are:

1. No separate transaction and posting dates like I was getting from
   CSV. Not a huge deal but sort of annoying.

2. The combined narrative field is obnoxious. Often lengthy, all
   caps, redundant, and full of meaningless (to me) numbers, etc...
   This is by far the bigger issue. Not unique to OFX but it seems
   even worse here than CSV was.

3. I am sure there are some others I am forgetting. But #2, above, is
   enough all by itself...
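
For what it is worth, part of the narrative-field complaint in #2 can be
attacked mechanically at import time. The sketch below is only an
illustration of the idea: the function name `clean_narration`, the
five-digit threshold, and the sample strings in the usage note are
assumptions, and what counts as "meaningless" will differ per bank.

```python
import re


def clean_narration(raw, min_digits=5):
    """Tidy a shouty bank narration string.

    Drops standalone digit runs of at least `min_digits` digits,
    collapses whitespace, removes immediately repeated words, and
    converts the result to title case.
    """
    # Long standalone numbers are usually reference or terminal IDs.
    text = re.sub(rf"\b\d{{{min_digits},}}\b", " ", raw)
    # Collapse the whitespace left behind.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove consecutive duplicate words ("HARDWARE HARDWARE").
    words = []
    for word in text.split(" "):
        if not words or words[-1].lower() != word.lower():
            words.append(word)
    return " ".join(words).title()
```

So a string like "POS DEBIT  ACME HARDWARE HARDWARE 000123456789
ANYTOWN" comes out as "Pos Debit Acme Hardware Anytown". Note that
`str.title()` is crude (it also capitalizes the letter after an
apostrophe), so a real version might want a gentler case conversion.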

As I am essentially re-starting from scratch (yet again), I have been
really re-thinking my approach the last few days, questioning some
assumptions, etc. And at this point I might finally be having some
sort of epiphany regarding the more "manual" approach which has always
been advocated by Martin, and that I thought I wanted to get away from
in my OP. As embarrassing as this may be to admit after essentially
making a thread completely in the other direction, I think I need to
come clean for the benefit of posterity and anyone following along at
home.

I won't bore everyone with a point by point list of all the reasons
why the "manual" approach is better, but suffice it to say there are
many.

Anyway, I have already been toying around with a workflow that begins
with taking a photo of my receipts, which I will detail in another post
to follow, as I think it is a novel approach that might be interesting
to others and so far represents my current thinking on the topic after a
number of iterations.

Cheers,
TRS-80