Beautiful Soup 4.13.0 beta 2

leonardr

Mar 20, 2024, 11:04:57 AM

to beautifulsoup

Hi, everyone,

For the past few months I've been working on adding type hints to the Beautiful Soup code base. This process exposed a number of very small inconsistencies which couldn't be fixed without changing behavior. Even in an install base the size of Beautiful Soup's, these inconsistencies were so small that I don't think many have ever noticed them, but there are enough of them that I've decided to run the next release through a beta process instead of just putting it out.

I'd also like to get feedback from people who use type hints when developing with Beautiful Soup, since I only use them for type checking the code base. There might be some changes to the type hint system it would be good to make before doing a full 4.13.0 release.

The current beta release is 4.13.0b2 (b1 is only on test PyPI, and doesn't work due to a dependency error; don't install it). To use it, you have to explicitly ask pip to install that version, e.g. from the command line:

pip install beautifulsoup4==4.13.0b2

or by putting something like this in your requirements.txt:

beautifulsoup4==4.13.0b2
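
To confirm which version pip actually installed, one option is to ask the packaging metadata. This is a small sketch using only the standard library (Python 3.8 or later); the helper name `installed_version` is made up for illustration, and `beautifulsoup4` is the distribution name on PyPI.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints the installed version string, or None if the package is missing.
print(installed_version("beautifulsoup4"))
```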

The 4.13 branch also has a new feature (the ElementFilter class) which makes it easy to swap in your own matching logic for Beautiful Soup's default logic when calling methods like find(). The API for that class isn't finalized, so if this sounds interesting to you, the beta period is a good time to give feedback. The history of this feature can be found in bug #2047713 and the implementation is at the top of bs4/filter.py.

I'm hoping to do a full release of 4.13.0 before PyCon US in mid-May, where we will be celebrating the 20th anniversary of Beautiful Soup (please come if you can!), but I'm not going to rush it.

The changelog for the 4.13.0b2 release is below. Note that adopting 4.13.0b2 may cause your code to start giving lots of DeprecationWarnings. Most of these will be from methods that were described as deprecated in the documentation but didn't issue DeprecationWarning when you used them.
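
One way to see how many of these warnings your own code triggers is to record them while running a representative piece of it. The following is a rough sketch using only the standard `warnings` module; `collect_deprecations` and `legacy_call` are hypothetical names, and `legacy_call` merely stands in for code that calls a deprecated Beautiful Soup method (its message text is invented).

```python
import warnings
from typing import Callable, List


def collect_deprecations(func: Callable[[], object]) -> List[str]:
    """Run func and return the text of every DeprecationWarning it issued."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, even ones Python would normally show once.
        warnings.simplefilter("always")
        func()
    return [
        str(w.message)
        for w in caught
        if issubclass(w.category, DeprecationWarning)
    ]


def legacy_call() -> None:
    # Stand-in for your own code, e.g. something that calls soup.findAll().
    warnings.warn("findAll is deprecated (stand-in message)", DeprecationWarning)


print(collect_deprecations(legacy_call))
```

Passing a function that exercises your real parsing code in place of `legacy_call` should give a list you can work through one entry at a time.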

Leonard

---

= 4.13.0b2 (20240320)

This release introduces Python type hints to all public classes and
methods in Beautiful Soup. The addition of these type hints exposed a
large number of very small inconsistencies in the code, which I've
fixed, but the result is a larger-than-usual number of deprecations
and changes that may break backwards compatibility.

# Deprecation notices

These things now give DeprecationWarnings when you try to use them,
and are scheduled to be removed in Beautiful Soup 4.15.0.

* Every deprecated method, attribute and class from the 3.0 and 2.0
major versions of Beautiful Soup. These have been deprecated for a
very long time, but they didn't issue DeprecationWarning when you
tried to use them. Now they do, and they're all going away soon.

This mainly refers to methods and attributes with camelCase names,
for example: renderContents, replaceWith, replaceWithChildren,
findAll, findAllNext, findAllPrevious, findNext, findNextSibling,
findNextSiblings, findParent, findParents, findPrevious,
findPreviousSibling, findPreviousSiblings, getText, nextSibling,
previousSibling, isSelfClosing, fetchNextSiblings,
fetchPreviousSiblings, fetchPrevious, fetchParents, findChild,
findChildren, childGenerator,
nextGenerator, nextSiblingGenerator, previousGenerator,
previousSiblingGenerator, recursiveChildGenerator, and
parentGenerator.

This also includes the BeautifulStoneSoup class.

* The SAXTreeBuilder class, which was never officially supported or tested.

* The private class method BeautifulSoup._decode_markup(), which has not
been used inside Beautiful Soup for many years.

* The first argument to BeautifulSoup.decode has been changed from
pretty_print:bool to indent_level:int, to match the signature of
Tag.decode. Using a bool will still work but will give you a
DeprecationWarning.

* SoupStrainer.text and SoupStrainer.string are both deprecated, since
a single item can't capture all the possibilities of a SoupStrainer
designed to match strings.

* SoupStrainer.search_tag(). It was never a documented method, but if
you use it, you should start using SoupStrainer.allow_tag_creation()
instead.

* The soup:BeautifulSoup argument to the TreeBuilderForHtml5lib
constructor is now required, not optional. It's unclear why it was
optional in the first place, so if you discover you need this,
contact us for possible un-deprecation.
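
For the camelCase names, the usual fix is to switch to the snake_case spelling. A short sketch of a few such replacements, assuming the standard-library `html.parser` backend and a made-up snippet of markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>big</b> <i>world</i></p>", "html.parser")

tag_names = [tag.name for tag in soup.find_all(True)]  # instead of findAll
text = soup.get_text()                                 # instead of getText
bold = soup.find("b")
italic = bold.find_next_sibling("i")                   # instead of findNextSibling
paragraph = bold.find_parent("p")                      # instead of findParent

print(tag_names, text, italic.name, paragraph.name)
```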

# Compatibility notices

* This version drops support for Python 3.6. The minimum supported
Python version for Beautiful Soup is now Python 3.7.

* Deprecation warnings have been added for all deprecated methods and
attributes (see above). Going forward, deprecated names will be
removed two feature releases or one major release after the
deprecation warning is added.

* If Tag.get_attribute_list() is used to access an attribute that's not set,
the return value is now an empty list rather than [None].

* If you pass an empty list as the attribute value when searching the
tree, you will now find all tags which have that attribute set to a value in
the empty list--that is, you will find nothing. This is consistent with other
situations where a list of acceptable values is provided. Previously, an
empty list was treated the same as None and False, and you would have
found the tags which did not have that attribute set at all. [bug=2045469]

* For similar reasons, if you pass in limit=0 to a find() method, you
will now get zero results. Previously, you would get all matching results.

* When using one of the find() methods or creating a SoupStrainer,
if you specify the same attribute in ``attrs`` and in the
keyword arguments, you'll end up with two different ways to match that
attribute. Previously the value in keyword arguments would override the
value in ``attrs``.

* All exceptions were moved to the bs4.exceptions module, and all
warnings to the bs4._warnings module (named so as not to shadow
Python's built-in warnings module). All warnings and exceptions are
exported from the bs4 module, which is probably the safest place to
import them from in your own code.

* As a side effect of this, the string constant
BeautifulSoup.NO_PARSER_SPECIFIED_WARNING was moved to
GuessedAtParserWarning.MESSAGE.

* append(), extend(), insert(), and unwrap() were moved from PageElement to
Tag. Those methods manipulate the 'contents' collection, so they would
only have ever worked on Tag objects.

* The BeautifulSoupHTMLParser constructor now requires a BeautifulSoup
object as its first argument. This almost certainly does not affect
you, since you probably use HTMLParserTreeBuilder, not
BeautifulSoupHTMLParser directly.

* The TreeBuilderForHtml5lib methods fragmentClass(), getFragment(),
and testSerializer() now raise NotImplementedError. These methods
are called only by html5lib's test suite, and Beautiful Soup isn't
integrated into that test suite, so this code was long since unused and
untested.

These methods are _not_ deprecated, since they are methods defined by
html5lib. They may one day have real implementations, as part of a future
effort to integrate Beautiful Soup into html5lib's test suite.

* AttributeValueWithCharsetSubstitution.encode() is renamed to
substitute_encoding, to avoid confusion with the very different str.encode().

* Using PageElement.replace_with() to replace an element with itself
returns the element instead of None.

* All TreeBuilder constructors now take the empty_element_tags
argument. The sets of tags found in HTMLTreeBuilder.empty_element_tags and
HTMLTreeBuilder.block_elements are now in
HTMLTreeBuilder.DEFAULT_EMPTY_ELEMENT_TAGS and
HTMLTreeBuilder.DEFAULT_BLOCK_ELEMENTS, to avoid confusing them with
instance variables.

* The unused constant LXMLTreeBuilderForXML.DEFAULT_PARSER_CLASS
has been removed.

* Some of the arguments in the methods of LXMLTreeBuilderForXML
have been renamed for consistency with the names lxml uses for those
arguments in the superclass. This won't affect you unless you were
calling methods like LXMLTreeBuilderForXML.start() directly.

* In particular, the arguments to LXMLTreeBuilderForXML.prepare_markup
have been changed to match the arguments to the superclass,
TreeBuilder.prepare_markup. Specifically, document_declared_encoding
now appears before exclude_encodings, not after. If you were calling
this method yourself, I recommend switching to using keyword
arguments instead.
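
Code that has to run on both 4.12 and the 4.13 beta can sidestep a few of the behavior changes listed above by being explicit. The sketch below is one possible approach, not an official recommendation; `attribute_values` is a hypothetical helper, the markup is made up, and `html.parser` is assumed as the backend.

```python
from typing import List

from bs4 import BeautifulSoup, Tag


def attribute_values(tag: Tag, name: str) -> List[str]:
    """Values of an attribute as a list; an unset attribute gives [].

    Dropping None makes older releases (which return [None]) and the
    beta (which returns []) behave the same.
    """
    return [v for v in tag.get_attribute_list(name) if v is not None]


soup = BeautifulSoup('<p class="a b" id="x">one</p><p>two</p>', "html.parser")
first, second = soup.find_all("p")

print(attribute_values(first, "class"))   # multi-valued attribute
print(attribute_values(second, "class"))  # attribute not set

# Pass limit=None (the default) rather than 0 when you want every match.
all_paragraphs = soup.find_all("p", limit=None)
```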

# New features

* The new ElementFilter class encapsulates Beautiful Soup's rules
about matching elements and deciding which parts of a document to
parse. It's easy to override those rules with subclassing or
function composition. The SoupStrainer class, which contains all the
matching logic you're familiar with from the find_* methods, is now
a subclass of ElementFilter.

* The new PageElement.filter() method provides a fully general way of
finding elements in a Beautiful Soup parse tree. You can specify a
function to iterate over the tree and an ElementFilter to determine
what matches.

* The new_tag() method now takes a 'string' argument. This allows you to
set the string contents of a Tag when creating it. Patch by Chris
Papademetrious. [bug=2044599]

* The NavigableString class now has a .string property which returns the
string itself. This makes it easier to iterate over a mixed list
of Tag and NavigableString objects. [bug=2044794]

* Defined a new warning class, UnusualUsageWarning, which is a superclass
for all of the warnings issued when Beautiful Soup notices something
unusual but not guaranteed to be wrong, like markup that looks like
a URL (MarkupResemblesLocatorWarning) or XML being run through an HTML
parser (XMLParsedAsHTMLWarning).

The text of these warnings has been revamped to explain in more
detail what is going on, how to check if you've made a mistake,
and how to make the warning go away if you are acting deliberately.

If these warnings are interfering with your workflow, or simply
annoying you, you can filter all of them by filtering
UnusualUsageWarning, without worrying about losing the warnings
Beautiful Soup issues when there *definitely* is a problem you
need to correct.
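
The filtering described above relies on how Python's `warnings` module treats class hierarchies: ignoring a superclass also ignores its subclasses. The sketch below uses stand-in classes with the same names so it runs without the beta installed; with 4.13.0b2 the real classes would be imported from `bs4` instead. `noisy_parse` is a hypothetical function that only simulates the warnings.

```python
import warnings


# Stand-ins that mirror the hierarchy described in the changelog.
class UnusualUsageWarning(UserWarning):
    pass


class MarkupResemblesLocatorWarning(UnusualUsageWarning):
    pass


class XMLParsedAsHTMLWarning(UnusualUsageWarning):
    pass


def noisy_parse() -> None:
    # Simulates a parse that triggers two "unusual usage" warnings
    # and one warning about a definite problem.
    warnings.warn("markup looks like a URL", MarkupResemblesLocatorWarning)
    warnings.warn("XML run through an HTML parser", XMLParsedAsHTMLWarning)
    warnings.warn("something definitely wrong", UserWarning)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # One filter on the superclass silences every subclass.
    warnings.filterwarnings("ignore", category=UnusualUsageWarning)
    noisy_parse()

remaining = [str(w.message) for w in caught]
print(remaining)
```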

# Improvements

* decompose() was moved from Tag to its superclass PageElement, since
there's no reason it won't also work on NavigableString objects.

* Emit an UnusualUsageWarning if the user tries to search for an attribute
called _class; they probably mean "class_". [bug=2025089]

* The MarkupResemblesLocatorWarning issued when the markup resembles a
filename is now issued less often, due to improvements in detecting
markup that's unlikely to be a filename. [bug=2052988]

* Emit a warning if a document is parsed using a SoupStrainer that's
set up to filter everything. In these cases, filtering everything is
the most consistent thing to do, but there was no indication that
this was happening, so the behavior may have seemed mysterious.

* When using one of the find() methods or creating a SoupStrainer, you can
now pass a list containing any accepted kind of object (strings, regular
expressions, etc.) for any of the arguments. Previously you could only pass
in a list of strings.

* A SoupStrainer can now filter tag creation based on a tag's
namespaced name. Previously only the unqualified name could be used.

* Added the correct stacklevel to another instance of the
XMLParsedAsHTMLWarning. [bug=2034451]

# Bug fixes

* Fixed an error in the lookup table used when converting
ISO-Latin-1 to ASCII, which no one should do anyway.

* Corrected the markup that's output in the unlikely event that you
encode a document to a Python internal encoding (like "palmos")
that's not recognized by the HTML or XML standard.

* UnicodeDammit.markup is now always a bytestring representing the
*original* markup (sans BOM), and UnicodeDammit.unicode_markup is
always the converted Unicode equivalent of the original
markup. Previously, UnicodeDammit.markup was treated inconsistently
and would often end up containing Unicode. UnicodeDammit.markup was
not a documented attribute, but if you were using it, you probably
want to switch to using .unicode_markup instead.

Chris Papademetrious

Mar 24, 2024, 5:57:07 PM

to beautifulsoup

Hi Leonard,

Thanks for making the 4.13.0b2 package available for easier testing!

I ran our content processing pipeline on about 50k HTML files and the results from 4.12.2 and 4.13.0b2 are identical. I haven't adapted my code to make use of the new ElementFilter feature yet, but I'll do that next.

The type hints add useful information in VSCode. What method of runtime type-checking are you using? I'd like to run my code with the same settings to see if I catch any invalid behaviors.

One thing I noticed in VSCode is that for methods that can return a PageElement (a Tag *or* a NavigableString), methods specific to Tag or NavigableString elements are not shown for those values. For example, consider the following code:

import bs4
soup = bs4.BeautifulSoup('<div></div>', 'lxml')
for aaa in soup.find_all(True, recursive=False):
    for bbb in aaa.find_all(True, recursive=False):
        for ccc in bbb.find_all(True, recursive=False):
            for ddd in ccc.find_all(True, recursive=False):
                ddd.decompose()
            ccc.decompose()
        bbb.decompose()
    aaa.decompose()

VSCode knows that soup is a BeautifulSoup object, and so it shows the find_all() usage in the tool tip of soup.find_all().

VSCode knows that aaa is a PageElement object, and so it shows the decompose() usage in the tool tip of aaa.decompose() (because all PageElement objects have a decompose() method).

But, VSCode does not show the find_all() usage in the tool tip of aaa.find_all(), bbb.find_all(), or ccc.find_all() because the find_all() method exists only for Tag objects, not for PageElement objects.

Likewise, VSCode does not show the decompose() usage in the tool tip of bbb.decompose(), ccc.decompose(), or ddd.decompose() because bbb, ccc, and ddd are typed as Any, since the type of the collection they iterate over is unknown.

If I explicitly resolve the ambiguity wherever a PageElement is returned:

import bs4
soup = bs4.BeautifulSoup('<div></div>', 'lxml')
for aaa in soup.find_all(True, recursive=False):
    aaa = bs4.Tag(aaa)
    for bbb in aaa.find_all(True, recursive=False):
        bbb = bs4.Tag(bbb)
        for ccc in bbb.find_all(True, recursive=False):
            ccc = bs4.Tag(ccc)
            for ddd in ccc.find_all(True, recursive=False):
                ddd = bs4.Tag(ddd)
                ddd.decompose()
            ccc.decompose()
        bbb.decompose()
    aaa.decompose()

then all the method usages become known.
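
Calling `bs4.Tag(...)` invokes the `Tag` constructor rather than performing a cast, so for code that is meant to run, another way to resolve the ambiguity is to narrow the type with an `isinstance` check, which static type checkers generally understand. A minimal sketch, assuming `html.parser` and made-up markup; `child_tag_names` is a hypothetical helper:

```python
from typing import List

from bs4 import BeautifulSoup, Tag


def child_tag_names(parent: Tag) -> List[str]:
    """Names of the direct children of parent that are tags."""
    names = []
    for child in parent.children:
        # After this check a type checker can treat child as a Tag,
        # so Tag-only methods such as find_all() are known.
        if isinstance(child, Tag):
            names.append(child.name)
    return names


soup = BeautifulSoup("<div><p>a</p>text<b>c</b></div>", "html.parser")
div = soup.find("div")
assert isinstance(div, Tag)  # narrows the result of find() as well
print(child_tag_names(div))
```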

If I go into BS4's find_all() method definition and temporarily hack it like this:

def find_all(
    self,
    name:_FindMethodName=None,
    attrs:_StrainableAttributes={},
    recursive:bool=True,
    string:Optional[_StrainableString]=None,
    limit:Optional[int]=None,
    _stacklevel:int=2,
    **kwargs:_StrainableAttribute) -> ResultSet[Tag | NavigableString]:

then the methods become known in the original code (*without* the explicit casting). Perhaps when multiple possible return types are defined like this, you get the union of the possible methods offered for the resulting values?
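
The effect of that annotation change can be reproduced with stand-in classes. In this sketch `PageElement`, `Tag`, and `NavigableString` are simplified placeholders rather than the bs4 classes, and `first_child_as_base` and `first_child_as_union` are hypothetical functions. How an editor presents completions for a union is tool-specific, but a type checker can narrow a union to one member with `isinstance`.

```python
from typing import List, Union


class PageElement:
    def decompose(self) -> None:
        pass


class NavigableString(PageElement):
    def __init__(self, text: str) -> None:
        self.text = text


class Tag(PageElement):
    def __init__(self, name: str, contents: List[PageElement]) -> None:
        self.name = name
        self.contents = contents

    def find_all(self) -> List["Tag"]:
        return [c for c in self.contents if isinstance(c, Tag)]


def first_child_as_base(tag: Tag) -> PageElement:
    # Declared as the base class: only PageElement methods are visible.
    return tag.contents[0]


def first_child_as_union(tag: Tag) -> Union[Tag, NavigableString]:
    # Declared as a union: tools know the concrete possibilities.
    child = tag.contents[0]
    assert isinstance(child, (Tag, NavigableString))
    return child


root = Tag("div", [Tag("p", []), NavigableString("text")])
child = first_child_as_union(root)
if isinstance(child, Tag):
    print(child.find_all())  # narrowed to Tag here
```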

- Chris

Chris Papademetrious

May 21, 2024, 9:08:23 AM

to beautifulsoup

Hi Leonard,

I know the previous email was a bit long... If you haven't used VSCode before, I can create a small testcase and share the step-by-step process to install VSCode and reproduce the observation (and possible enhancement?).

- Chris

leonardr

May 22, 2024, 2:31:11 PM

to beautifulsoup

Chris,

It's all right, I get what's going on. It's just that changing this throughout element.py causes a whole lot of type checking failures that are tough to deal with.

I've pushed a branch 4.13-more-specific-than-pageelement to the Git repository, where I changed all of the return signatures of the find* methods, without changing the other stuff that causes the type checking failures. Try it out and see if VSCode behaves more like you're expecting. In the meantime I'm looking to see if there's a better way to say "Tag and NavigableString are the only supported direct subclasses of PageElement."

Leonard

Chris Papademetrious

May 23, 2024, 7:44:42 AM

to beautifulsoup

Hi Leonard,

The 4.13-more-specific-than-pageelement branch resolves all the relevant unknown-method issues for find* methods - awesome! It was nice to see so many variables change from white (unknown methods) to yellow (known methods) in the VSCode editor.

I noticed a similar situation with generators:

import bs4
soup = bs4.BeautifulSoup('<div></div>', 'lxml')

for aaa in soup.children:
    for bbb in aaa.children:
        for ccc in bbb.children:
            for ddd in ccc.children:
                ddd.decompose()
            ccc.decompose()
        bbb.decompose()
    aaa.decompose()

but mimicking your type change fixed that too - the variables showed both Tag and NavigableString (and str!) methods.

Of course, it sounds like a proper fix is more involved. Hopefully you find something clever and maintainable for it.

One more thing I noticed is that when I define my own Tag subclass with BeautifulSoup(element_classes=...) (which we now do), those extended methods aren't known, but honestly that is not worth considering right now; I only note it for completeness.