XMLParsedAsHTMLWarning during import of ofx from ofxget

Colton Crivelli

Jun 5, 2022, 9:10:35 PM
to Beancount
Hey all,

I'm getting the following warning:
venv/lib/python3.10/site-packages/bs4/builder/__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
  warnings.warn(


What I'm doing to get this:
  • Downloading account data using ofxget as described here
  • Importing that data using beancount-reds-importer (e.g. here)
Things I've tried or discovered:
  • I searched for every instance of `soup = BeautifulSoup ..` and found the main calls in ofx.py. I tried changing the parser in these calls from lxml to xml, which didn't resolve the warning
  • I made sure lxml is installed
  • I tried to suppress the warning with warnings.filterwarnings, but that didn't work either (not sure it would be the "right" thing anyway)
  • I found a PR in an unrelated repo where they solved it by suppressing the warning, here
  • I tried ofx data downloaded from both Fidelity Investments and Chase (not expecting this to be institution specific)
Questions I have:
  • The warning doesn't really help me understand which call into BeautifulSoup caused it. Any tips on how to track down where the issue is coming from? Maybe ofx.py isn't part of the issue at all
  • I think bean_extract is still working, but any suggestions on whether the warning should be ignored or resolved would also be appreciated
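
On the first question, one general way to find out which call emits a warning is to promote warnings to exceptions, so that Python prints a full traceback. The sketch below uses only the standard library. `DemoParserWarning` and `third_party_parse` are hypothetical stand-ins for the real bs4 warning and the library call that triggers it; they are not part of ofxparse or bs4.

```python
import traceback
import warnings


class DemoParserWarning(UserWarning):
    """Hypothetical stand-in for the warning class raised by the parser library."""


def third_party_parse(data):
    # Stand-in for a library call that warns somewhere deep inside its own code.
    warnings.warn("parsing XML with an HTML parser", DemoParserWarning)
    return data


def find_warning_origin(func, *args):
    """Run func with warnings promoted to errors and return the traceback text.

    Returns None if func completes without emitting any warning.
    """
    with warnings.catch_warnings():
        # "error" turns every warning into a raised exception.
        warnings.simplefilter("error")
        try:
            func(*args)
        except Warning:
            return traceback.format_exc()
    return None


if __name__ == "__main__":
    print(find_warning_origin(third_party_parse, "<OFX></OFX>"))
```

The same effect is available without editing any code by starting the interpreter with `python -W error`, which makes the first warning abort the run with a traceback that names the calling module.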

Red S

Jun 6, 2022, 1:33:18 AM
to Beancount
Hmm, I haven't come across this issue so far.

It's the ofxparse library that uses BS4. I'd ask there. Indeed, they did decide to parse this as HTML even though it's XML, but that code has worked fine for years now. What platform are you using?

I'd also consider filtering out via the shell, if everything else works fine:
bean-extract [blah blah...] 2> >(grep -v XMLParsedAsHTMLWarning >&2)
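
An in-process alternative to the shell filter is to install a warnings filter before any parsing happens, for example at the top of the import configuration file. This is a minimal sketch that matches on the warning text quoted earlier in the thread, so it does not need to import anything from bs4. The function name `silence_xml_as_html_warning` is made up for illustration. A filter like this only helps if it runs before the parser is called, which may or may not be why the earlier filterwarnings attempt had no effect.

```python
import warnings


def silence_xml_as_html_warning():
    """Ignore warnings whose text mentions parsing XML with an HTML parser.

    The pattern is a regular expression matched against the start of the
    warning message, hence the leading ".*".
    """
    warnings.filterwarnings(
        "ignore",
        message=".*XML document using an HTML parser",
    )


if __name__ == "__main__":
    silence_xml_as_html_warning()
    # This one is suppressed by the filter above.
    warnings.warn(
        "It looks like you're parsing an XML document using an HTML parser."
    )
    # Unrelated warnings are still shown.
    warnings.warn("some other warning")
```

Recent Beautiful Soup releases also expose the warning class itself, so `warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning)` after `from bs4 import XMLParsedAsHTMLWarning` is a more targeted option when that import is available.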

Colton Crivelli

Jun 6, 2022, 1:27:17 PM
to Beancount
Your question about my platform got me thinking.

I set up a new venv using python3.8 (instead of 3.10) and ran without any warnings. Haven't looked into why that might be yet.

Some tangential things I've run into:
  • your very helpful template/reference script here runs into a python 3.10 specific deprecation warning mentioned here. They want you to use get_running_loop() instead of get_event_loop(). More discussion here. I'm not asking for a fix or help here, just sharing
  • the original reason why I moved to python3.10 is that my platform is arm64e/macOS. In short, if you are using smart-importer on arm64e with python 3.8 (or earlier) you'll end up with scikit-learn built for x86 and you'll be unable to import it. There's a lot of talk about a way to get an arm build of scikit-learn using conda, but it's a pain and I would not recommend it. Another option is to install everything for x86 and use Rosetta (e.g. `arch -x86_64 ./import.sh`). The last option is using python3.10, which appears to pull in everything you need to run natively with smart-importer
So I think I have two options: use Rosetta and x86 for everything with python 3.10, or explore running natively with python 3.10 and getting fixes for the python3.10-specific issues.
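
For the asyncio deprecation mentioned above, the usual pattern is to start the loop with `asyncio.run()` and call `asyncio.get_running_loop()` only from inside a coroutine. The sketch below illustrates that pattern in isolation. The `download` coroutine and the account names are hypothetical and do not reflect the actual download script.

```python
import asyncio


async def download(name, delay=0.01):
    """Hypothetical stand-in for one account download."""
    # Inside a coroutine a loop is always running, so this call is safe
    # and does not trigger the Python 3.10 deprecation warning.
    loop = asyncio.get_running_loop()
    start = loop.time()
    await asyncio.sleep(delay)  # placeholder for the real network call
    return name, loop.time() - start


async def main(names):
    # Run all downloads concurrently; results keep the input order.
    return await asyncio.gather(*(download(n) for n in names))


if __name__ == "__main__":
    # asyncio.run creates the event loop, runs main, and closes the loop.
    for name, elapsed in asyncio.run(main(["fidelity", "chase"])):
        print(f"{name}: {elapsed:.3f}s")
```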

Red S

Jun 6, 2022, 7:07:58 PM
to Beancount
Nice. Thanks for jotting down the issues on arm64, python3.10, and such! Sounds to me like you can safely ignore/grep out that warning until it's fixed upstream. Filing an issue with ofxparse would be good.

Related: 'ofx-summarize' is a command included in beancount-reds-importers (bleeding edge) that you can use to inspect any ofx, get a quick and dirty summary, and explore via a pdb shell.

Thanks for the note about asyncio as well. I made the change and cleaned up the script. It is also now installed as a command ("bean-download") as a part of beancount-reds-importers (bleeding edge only, for now).