how to explicitly specify using lxml for mixed xml/html

William Tanksley, Jr

Jul 25, 2022, 12:52:38 PM

to beautifulsoup

Hey, folks! I used an older version of BeautifulSoup with features="lxml", and all was well. Using "html.parser" has never worked, and that's not a problem.

Now I need to upgrade my Python version, and the current BeautifulSoup dumps a warning to stderr about using an html parser for XML but seems to work otherwise. How can I make it work like it used to -- using lxml, but for html, without complaining?

I tried following the warning's advice (it wants me to switch to features="xml"), but that breaks -- the document is partially HTML, so it doesn't get any of the XML contents.

These are ODM files, by the way. No doubt there is a better way, but the way I'm using has worked for me for a long time.

leonardr

Jul 26, 2022, 11:11:45 AM

to beautifulsoup

Hello,

It sounds like what you're doing is working, and the warning just doesn't apply to you. Most people who use an HTML parser to parse XML will have better luck with an XML parser, partly because the HTML parsers convert all tag names to lowercase (see Bug 1939121 for the background on this change). But if it's working for you there's no need to change anything.

You can filter out the warning with code like this:

import warnings
from bs4.builder import XMLParsedAsHTMLWarning

# Run this once, before constructing any BeautifulSoup objects.
warnings.filterwarnings('ignore', category=XMLParsedAsHTMLWarning)
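The filter above applies to the whole process. If you would rather silence the warning only around the parse call, something like the following might work. This is a minimal sketch using only the standard library; the helper name `ignoring` and the `markup` variable in the comment are made up for illustration and are not part of Beautiful Soup.

```python
import warnings
from contextlib import contextmanager


@contextmanager
def ignoring(category):
    """Suppress warnings of `category` only inside the `with` block."""
    with warnings.catch_warnings():
        # catch_warnings() restores the previous filter state on exit,
        # so the "ignore" rule does not leak into the rest of the program.
        warnings.simplefilter("ignore", category=category)
        yield


# Intended use with Beautiful Soup (assumes bs4 and lxml are installed,
# and that `markup` holds the document text):
#
#     from bs4 import BeautifulSoup
#     from bs4.builder import XMLParsedAsHTMLWarning
#
#     with ignoring(XMLParsedAsHTMLWarning):
#         soup = BeautifulSoup(markup, features="lxml")
```

Other warning categories still get through inside the block, and the filter is gone once the block exits, so an unexpected occurrence of the same warning elsewhere in the program would remain visible.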