Hi, everyone,
For the past few months I've been working on adding type hints to the Beautiful Soup code base. This process exposed a number of very small inconsistencies which couldn't be fixed without changing behavior. Even in an install base the size of Beautiful Soup's, these inconsistencies were so small that I don't think many have ever noticed them, but there are enough of them that I've decided to run the next release through a beta process instead of just putting it out.
I'd also like to get feedback from people who use type hints when developing with Beautiful Soup, since I only use them for type checking the code base. There might be some changes to the type hint system it would be good to make before doing a full 4.13.0 release.
The current beta release is 4.13.0b2 (b1 is only on test PyPI, and doesn't work due to a dependency error; don't install it). To use it, you have to explicitly ask pip to install that version, e.g. from the command line:
pip install beautifulsoup4==4.13.0b2
or by putting something like this in your requirements.txt:
beautifulsoup4==4.13.0b2
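Once it's installed, it can be worth confirming which version Python is actually picking up, since pip only selects a pre-release when explicitly asked. Here's a minimal sketch using only the standard library (importlib.metadata needs Python 3.8 or later); the helper name installed_bs4_version is made up for this example:

```python
from importlib import metadata


def installed_bs4_version():
    """Return the installed beautifulsoup4 version string, or None if absent."""
    try:
        # "beautifulsoup4" is the distribution name on PyPI; the import name is "bs4".
        return metadata.version("beautifulsoup4")
    except metadata.PackageNotFoundError:
        return None


version = installed_bs4_version()
if version is None:
    print("beautifulsoup4 is not installed in this environment")
else:
    print(f"beautifulsoup4 {version} is installed")
```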
The 4.13 branch also has a new feature (the ElementFilter class) which makes it easy to swap in your own matching logic for Beautiful Soup's default logic when calling methods like find(). The API for that class isn't finalized, so if this sounds interesting to you, the beta period is a good time to give feedback. The history of this feature can be found in bug #2047713 and the implementation is at the top of bs4/filter.py.
I'm hoping to do a full release of 4.13.0 before PyCon US in mid-May, where we will be celebrating the 20th anniversary of Beautiful Soup (please come if you can!), but I'm not going to rush it.
The changelog for the 4.13.0b2 release is below. Note that adopting 4.13.0b2 may cause your code to start giving lots of DeprecationWarnings. Most of these will be from methods that were described as deprecated in the documentation but didn't issue DeprecationWarning when you used them.
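If you'd like to track down those call sites in your own code, one option is to record DeprecationWarnings, or turn them into errors, while running your tests. Here's a minimal sketch using Python's standard warnings module; old_style_call is a hypothetical stand-in for any deprecated method, not part of Beautiful Soup:

```python
import warnings


def old_style_call():
    # Hypothetical stand-in for a deprecated Beautiful Soup method.
    warnings.warn("old_style_call is deprecated", DeprecationWarning, stacklevel=2)
    return "result"


# Record every DeprecationWarning so the call sites can be listed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", DeprecationWarning)
    old_style_call()
messages = [str(w.message) for w in caught if issubclass(w.category, DeprecationWarning)]
print(messages)

# Or escalate the warning to an exception, e.g. during a test run.
raised = False
with warnings.catch_warnings():
    warnings.simplefilter("error", DeprecationWarning)
    try:
        old_style_call()
    except DeprecationWarning:
        raised = True
print(raised)
```

Running a script with python -W error::DeprecationWarning applies the same escalation to a whole run without touching the code.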
Leonard
---
= 4.13.0b2 (20240320)
This release introduces Python type hints to all public classes and
methods in Beautiful Soup. The addition of these type hints exposed a
large number of very small inconsistencies in the code, which I've
fixed, but the result is a larger-than-usual number of deprecations
and changes that may break backwards compatibility.
# Deprecation notices
These things now give DeprecationWarnings when you try to use them,
and are scheduled to be removed in Beautiful Soup 4.15.0.
* Every deprecated method, attribute and class from the 3.0 and 2.0
major versions of Beautiful Soup. These have been deprecated for a
very long time, but they didn't issue DeprecationWarning when you
tried to use them. Now they do, and they're all going away soon.
This mainly refers to methods and attributes with camelCase names,
for example: renderContents, replaceWith, replaceWithChildren,
findAll, findAllNext, findAllPrevious, findNext, findNextSibling,
findNextSiblings, findParent, findParents, findPrevious,
findPreviousSibling, findPreviousSiblings, getText, nextSibling,
previousSibling, isSelfClosing, fetchNextSiblings,
fetchPreviousSiblings, fetchPrevious,
fetchParents, findChild, findChildren, childGenerator,
nextGenerator, nextSiblingGenerator, previousGenerator,
previousSiblingGenerator, recursiveChildGenerator, and
parentGenerator.
This also includes the BeautifulStoneSoup class.
* The SAXTreeBuilder class, which was never officially supported or tested.
* The private class method BeautifulSoup._decode_markup(), which has not
been used inside Beautiful Soup for many years.
* The first argument to BeautifulSoup.decode has been changed from
pretty_print:bool to indent_level:int, to match the signature of
Tag.decode. Using a bool will still work but will give you a
DeprecationWarning.
* SoupStrainer.text and SoupStrainer.string are both deprecated, since
a single item can't capture all the possibilities of a SoupStrainer
designed to match strings.
* SoupStrainer.search_tag(). It was never a documented method, but if
you use it, you should start using SoupStrainer.allow_tag_creation()
instead.
* The soup:BeautifulSoup argument to the TreeBuilderForHtml5lib
constructor is now required, not optional. It's unclear why it was
optional in the first place, so if you discover you need this,
contact us for possible un-deprecation.
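
For the camelCase names, the fix is usually a mechanical rename to the snake_case equivalent. A short sketch, assuming the built-in html.parser and a made-up document; it uses only the modern names, which are already available in current Beautiful Soup 4 releases:

```python
from bs4 import BeautifulSoup

markup = '<p>One <a href="/a">first</a> and <a href="/b">second</a></p>'
soup = BeautifulSoup(markup, "html.parser")

# findAll(...) becomes find_all(...)
links = soup.find_all("a")
hrefs = [a["href"] for a in links]

# getText() becomes get_text()
text = soup.p.get_text()

# nextSibling becomes next_sibling
after_first = links[0].next_sibling

# replaceWith(...) becomes replace_with(...)
links[1].replace_with("2nd")
rewritten = soup.p.get_text()

print(hrefs, text, repr(after_first), rewritten)
```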
# Compatibility notices
* This version drops support for Python 3.6. The minimum supported
Python version for Beautiful Soup is now Python 3.7.
* Deprecation warnings have been added for all deprecated methods and
attributes (see above). Going forward, deprecated names will be
removed two feature releases or one major release after the
deprecation warning is added.
* If Tag.get_attribute_list() is used to access an attribute that's not set,
the return value is now an empty list rather than [None].
* If you pass an empty list as the attribute value when searching the
tree, you will now find all tags which have that attribute set to a value in
the empty list--that is, you will find nothing. This is consistent with other
situations where a list of acceptable values is provided. Previously, an
empty list was treated the same as None and False, and you would have
found the tags which did not have that attribute set at all. [bug=2045469]
* For similar reasons, if you pass in limit=0 to a find() method, you
will now get zero results. Previously, you would get all matching results.
* When using one of the find() methods or creating a SoupStrainer,
if you specify the same attribute in both ``attrs`` and the
keyword arguments, you'll end up with two different ways to match that
attribute. Previously the value in keyword arguments would override the
value in ``attrs``.
* All exceptions were moved to the bs4.exceptions module, and all
warnings to the bs4._warnings module (named so as not to shadow
Python's built-in warnings module). All warnings and exceptions are
exported from the bs4 module, which is probably the safest place to
import them from in your own code.
* As a side effect of this, the string constant
BeautifulSoup.NO_PARSER_SPECIFIED_WARNING was moved to
GuessedAtParserWarning.MESSAGE.
* append(), extend(), insert(), and unwrap() were moved from PageElement to
Tag. Those methods manipulate the 'contents' collection, so they would
only have ever worked on Tag objects.
* The BeautifulSoupHTMLParser constructor now requires a BeautifulSoup
object as its first argument. This almost certainly does not affect
you, since you probably use HTMLParserTreeBuilder, not
BeautifulSoupHTMLParser directly.
* The TreeBuilderForHtml5lib methods fragmentClass(), getFragment(),
and testSerializer() now raise NotImplementedError. These methods
are called only by html5lib's test suite, and Beautiful Soup isn't
integrated into that test suite, so this code was long since unused and
untested.
These methods are _not_ deprecated, since they are methods defined by
html5lib. They may one day have real implementations, as part of a future
effort to integrate Beautiful Soup into html5lib's test suite.
* AttributeValueWithCharsetSubstitution.encode() is renamed to
substitute_encoding(), to avoid confusion with the very different str.encode().
* Using PageElement.replace_with() to replace an element with itself
returns the element instead of None.
* All TreeBuilder constructors now take the empty_element_tags
argument. The sets of tags found in HTMLTreeBuilder.empty_element_tags and
HTMLTreeBuilder.block_elements are now in
HTMLTreeBuilder.DEFAULT_EMPTY_ELEMENT_TAGS and
HTMLTreeBuilder.DEFAULT_BLOCK_ELEMENTS, to avoid confusing them with
instance variables.
* The unused constant LXMLTreeBuilderForXML.DEFAULT_PARSER_CLASS
has been removed.
* Some of the arguments in the methods of LXMLTreeBuilderForXML
have been renamed for consistency with the names lxml uses for those
arguments in the superclass. This won't affect you unless you were
calling methods like LXMLTreeBuilderForXML.start() directly.
* In particular, the arguments to LXMLTreeBuilderForXML.prepare_markup
have been changed to match the arguments to the superclass,
TreeBuilder.prepare_markup. Specifically, document_declared_encoding
now appears before exclude_encodings, not after. If you were calling
this method yourself, I recommend switching to using keyword
arguments instead.
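
Some of these changes can be absorbed with code that behaves the same before and after the upgrade. The sketch below (made-up markup, built-in html.parser) drops None from the result of get_attribute_list(), so an unset attribute yields an empty list either way, and passes limit=None rather than limit=0 when every match is wanted:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="x">a</p><p class="lead big">b</p>', "html.parser")

# limit=None means "no limit" in both old and new releases.
first, second = soup.find_all("p", limit=None)

# Older releases return [None] for an unset attribute; 4.13 returns [].
# Filtering out None gives the same answer on both.
first_classes = [c for c in first.get_attribute_list("class") if c is not None]
second_classes = [c for c in second.get_attribute_list("class") if c is not None]

print(first_classes, second_classes)
```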
# New features
* The new ElementFilter class encapsulates Beautiful Soup's rules
about matching elements and deciding which parts of a document to
parse. It's easy to override those rules with subclassing or
function composition. The SoupStrainer class, which contains all the
matching logic you're familiar with from the find_* methods, is now
a subclass of ElementFilter.
* The new PageElement.filter() method provides a fully general way of
finding elements in a Beautiful Soup parse tree. You can specify a
function to iterate over the tree and an ElementFilter to determine
what matches.
* The new_tag() method now takes a 'string' argument. This allows you to
set the string contents of a Tag when creating it. Patch by Chris
Papademetrious. [bug=2044599]
* The NavigableString class now has a .string property which returns the
string itself. This makes it easier to iterate over a mixed list
of Tag and NavigableString objects. [bug=2044794]
* Defined a new warning class, UnusualUsageWarning, which is a superclass
for all of the warnings issued when Beautiful Soup notices something
unusual but not guaranteed to be wrong, like markup that looks like
a URL (MarkupResemblesLocatorWarning) or XML being run through an HTML
parser (XMLParsedAsHTMLWarning).
The text of these warnings has been revamped to explain in more
detail what is going on, how to check if you've made a mistake,
and how to make the warning go away if you are acting deliberately.
If these warnings are interfering with your workflow, or simply
annoying you, you can filter all of them by filtering
UnusualUsageWarning, without worrying about losing the warnings
Beautiful Soup issues when there *definitely* is a problem you
need to correct.
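
Filtering by the superclass goes through Python's ordinary warnings machinery. The sketch below uses local stand-in classes that mirror the hierarchy described above, so it runs without the beta installed. The stand-ins, and the choice of UserWarning as their base, are assumptions for illustration; with 4.13 installed you would import the real classes from bs4 instead.

```python
import warnings


# Stand-ins for illustration only; the real classes live in bs4.
class UnusualUsageWarning(UserWarning):
    pass


class MarkupResemblesLocatorWarning(UnusualUsageWarning):
    pass


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Ignoring the superclass also ignores every subclass.
    warnings.simplefilter("ignore", category=UnusualUsageWarning)
    warnings.warn("markup looks like a URL", MarkupResemblesLocatorWarning)
    warnings.warn("this method is going away", DeprecationWarning)

# Only the warning outside the UnusualUsageWarning family is recorded.
remaining = [w.category.__name__ for w in caught]
print(remaining)
```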
# Improvements
* decompose() was moved from Tag to its superclass PageElement, since
there's no reason it won't also work on NavigableString objects.
* Emit an UnusualUsageWarning if the user tries to search for an attribute
called _class; they probably mean "class_". [bug=2025089]
* The MarkupResemblesLocatorWarning issued when the markup resembles a
filename is now issued less often, due to improvements in detecting
markup that's unlikely to be a filename. [bug=2052988]
* Emit a warning if a document is parsed using a SoupStrainer that's
set up to filter everything. In these cases, filtering everything is
the most consistent thing to do, but there was no indication that
this was happening, so the behavior may have seemed mysterious.
* When using one of the find() methods or creating a SoupStrainer, you can
pass a list of any accepted objects (strings, regular expressions, etc.) for
any of the search criteria. Previously you could only pass in a list of strings.
* A SoupStrainer can now filter tag creation based on a tag's
namespaced name. Previously only the unqualified name could be used.
* Added the correct stacklevel to another instance of the
XMLParsedAsHTMLWarning. [bug=2034451]
# Bug fixes
* Fixed an error in the lookup table used when converting
ISO-Latin-1 to ASCII, which no one should do anyway.
* Corrected the markup that's output in the unlikely event that you
encode a document to a Python internal encoding (like "palmos")
that's not recognized by the HTML or XML standard.
* UnicodeDammit.markup is now always a bytestring representing the
*original* markup (sans BOM), and UnicodeDammit.unicode_markup is
always the converted Unicode equivalent of the original
markup. Previously, UnicodeDammit.markup was treated inconsistently
and would often end up containing Unicode. UnicodeDammit.markup was
not a documented attribute, but if you were using it, you probably
want to switch to using .unicode_markup instead.
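
For code that only needs the decoded text, reading .unicode_markup looks roughly like this. A minimal sketch with a made-up byte string; the encoding is suggested explicitly so the result doesn't depend on any optional character-detection library:

```python
from bs4 import UnicodeDammit

raw = "Sacr\u00e9 bleu!".encode("utf-8")

# The second argument lists encodings to try before any guessing happens.
dammit = UnicodeDammit(raw, ["utf-8"])

# unicode_markup is the converted str version of the input bytes.
decoded = dammit.unicode_markup
print(decoded)
```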