Beautiful Soup 4.13.0 beta 2

leonardr

Mar 20, 2024, 11:04:57 AM
to beautifulsoup
Hi, everyone,

For the past few months I've been working on adding type hints to the Beautiful Soup code base. This process exposed a number of very small inconsistencies which couldn't be fixed without changing behavior. Even in an install base the size of Beautiful Soup's, these inconsistencies were so small that I don't think many have ever noticed them, but there are enough of them that I've decided to run the next release through a beta process instead of just putting it out.

I'd also like to get feedback from people who use type hints when developing with Beautiful Soup, since I only use them for type checking the code base. There might be some changes to the type hint system it would be good to make before doing a full 4.13.0 release.

The current beta release is 4.13.0b2 (b1 is only on test PyPI, and doesn't work due to a dependency error; don't install it). To use it, you have to explicitly ask pip to install that version e.g. from the command line:

pip install beautifulsoup4==4.13.0b2

or by putting something like this in your requirements.txt:

beautifulsoup4==4.13.0b2

The 4.13 branch also has a new feature (the ElementFilter class) which makes it easy to swap in your own matching logic for Beautiful Soup's default logic when calling methods like find(). The API for that class isn't finalized, so if this sounds interesting to you, the beta period is a good time to give feedback. The history of this feature can be found in bug #2047713 and the implementation is at the top of bs4/filter.py.
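Roughly speaking, the idea is that you can wrap an arbitrary predicate in an ElementFilter and use it where you would normally pass a tag name or attribute value. Here is a minimal sketch of the kind of usage intended; since the API isn't finalized, details such as the match_function argument name and passing the filter directly to find_all() may change, so check bs4/filter.py for the current implementation:

from bs4 import BeautifulSoup, Tag
from bs4.filter import ElementFilter

soup = BeautifulSoup("<p class='a'>one</p><p>two</p>", "html.parser")

# Match only tags that have no attributes at all.
def no_attributes(element):
    return isinstance(element, Tag) and not element.attrs

plain_tags = ElementFilter(match_function=no_attributes)

print(soup.find_all(plain_tags))   # [<p>two</p>]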

I'm hoping to do a full release of 4.13.0 before PyCon US in mid-May, where we will be celebrating the 20th anniversary of Beautiful Soup (please come if you can!). But I'm not going to rush it.

The changelog for the 4.13.0b2 release is below. Note that adopting 4.13.0b2 may cause your code to start giving lots of DeprecationWarnings. Most of these will be from methods that were described as deprecated in the documentation but didn't issue DeprecationWarning when you used them.
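If you want to track those calls down quickly, plain Python is enough; for example, you can turn DeprecationWarning into an error while exercising your code:

import warnings

with warnings.catch_warnings():
    warnings.simplefilter("error", DeprecationWarning)
    run_your_pipeline()   # stand-in for your own entry point

or run your script with "python -W error::DeprecationWarning script.py".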

Leonard

---

= 4.13.0b2 (20240320)

This release introduces Python type hints to all public classes and
methods in Beautiful Soup. The addition of these type hints exposed a
large number of very small inconsistencies in the code, which I've
fixed, but the result is a larger-than-usual number of deprecations
and changes that may break backwards compatibility.

# Deprecation notices

These things now give DeprecationWarnings when you try to use them,
and are scheduled to be removed in Beautiful Soup 4.15.0.

* Every deprecated method, attribute and class from the 3.0 and 2.0
  major versions of Beautiful Soup. These have been deprecated for a
  very long time, but they didn't issue DeprecationWarning when you
  tried to use them. Now they do, and they're all going away soon.

  This mainly refers to methods and attributes with camelCase names,
  for example: renderContents, replaceWith, replaceWithChildren,
  findAll, findAllNext, findAllPrevious, findNext, findNextSibling,
  findNextSiblings, findParent, findParents, findPrevious,
  findPreviousSibling, findPreviousSiblings, getText, nextSibling,
  previousSibling, isSelfClosing, fetchNextSiblings,
  fetchPreviousSiblings, fetchPrevious, fetchParents, findChild,
  findChildren, childGenerator, nextGenerator, nextSiblingGenerator,
  previousGenerator, previousSiblingGenerator,
  recursiveChildGenerator, and parentGenerator.

  This also includes the BeautifulStoneSoup class. (A brief
  before-and-after example of the renaming follows this list of
  notices.)

* The SAXTreeBuilder class, which was never officially supported or tested.

* The private class method BeautifulSoup._decode_markup(), which has not
  been used inside Beautiful Soup for many years.

* The first argument to BeautifulSoup.decode has been changed from
  pretty_print:bool to indent_level:int, to match the signature of
  Tag.decode. Using a bool will still work but will give you a
  DeprecationWarning.

* SoupStrainer.text and SoupStrainer.string are both deprecated, since
  a single item can't capture all the possibilities of a SoupStrainer
  designed to match strings.

* SoupStrainer.search_tag(). It was never a documented method, but if
  you use it, you should start using SoupStrainer.allow_tag_creation()
  instead.

* The soup:BeautifulSoup argument to the TreeBuilderForHtml5lib
  constructor is now required, not optional. It's unclear why it was
  optional in the first place, so if you discover you need this,
  contact us for possible un-deprecation.
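To make the renaming concrete, a typical before-and-after looks like
this (a small sketch; the snake_case names have been available for
years, so this change can be made even before upgrading):

from bs4 import BeautifulSoup
soup = BeautifulSoup("<a href='#'>link</a>", "html.parser")

# Old spellings, which now issue DeprecationWarning:
links = soup.findAll("a")
text = links[0].getText()

# Current equivalents:
links = soup.find_all("a")
text = links[0].get_text()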

# Compatibility notices

* This version drops support for Python 3.6. The minimum supported
  Python version for Beautiful Soup is now 3.7.

* Deprecation warnings have been added for all deprecated methods and
  attributes (see above). Going forward, deprecated names will be
  removed two feature releases or one major release after the
  deprecation warning is added.

* If Tag.get_attribute_list() is used to access an attribute that's not set,
  the return value is now an empty list rather than [None].

* If you pass an empty list as the attribute value when searching the
  tree, you will now find all tags which have that attribute set to a value in
  the empty list--that is, you will find nothing. This is consistent with other
  situations where a list of acceptable values is provided. Previously, an
  empty list was treated the same as None and False, and you would have
  found the tags which did not have that attribute set at all. [bug=2045469]

* For similar reasons, if you pass in limit=0 to a find() method, you
  will now get zero results. Previously, you would get all matching
  results. (A short example of this and the empty-list change above
  follows these notices.)

* When using one of the find() methods or creating a SoupStrainer,
  if you specify the same attribute value in ``attrs`` and the
  keyword arguments, you'll end up with two different ways to match that
  attribute. Previously the value in keyword arguments would override the
  value in ``attrs``.

* All exceptions were moved to the bs4.exceptions module, and all
  warnings to the bs4._warnings module (named so as not to shadow
  Python's built-in warnings module). All warnings and exceptions are
  exported from the bs4 module, which is probably the safest place to
  import them from in your own code.

* As a side effect of this, the string constant
  BeautifulSoup.NO_PARSER_SPECIFIED_WARNING was moved to
  GuessedAtParserWarning.MESSAGE. (A short example follows these
  notices.)

* append(), extend(), insert(), and unwrap() were moved from PageElement to
  Tag. Those methods manipulate the 'contents' collection, so they would
  only have ever worked on Tag objects.

* The BeautifulSoupHTMLParser constructor now requires a BeautifulSoup
  object as its first argument. This almost certainly does not affect
  you, since you probably use HTMLParserTreeBuilder, not
  BeautifulSoupHTMLParser directly.

* The TreeBuilderForHtml5lib methods fragmentClass(), getFragment(),
  and testSerializer() now raise NotImplementedError. These methods
  are called only by html5lib's test suite, and Beautiful Soup isn't
  integrated into that test suite, so this code was long since unused and
  untested.

  These methods are _not_ deprecated, since they are methods defined by
  html5lib. They may one day have real implementations, as part of a future
  effort to integrate Beautiful Soup into html5lib's test suite.

* AttributeValueWithCharsetSubstitution.encode() is renamed to
  substitute_encoding, to avoid confusion with the very different
  str.encode().

* Using PageElement.replace_with() to replace an element with itself
  returns the element instead of None.

* All TreeBuilder constructors now take the empty_element_tags
  argument. The sets of tags found in HTMLTreeBuilder.empty_element_tags and
  HTMLTreeBuilder.block_elements are now in
  HTMLTreeBuilder.DEFAULT_EMPTY_ELEMENT_TAGS and
  HTMLTreeBuilder.DEFAULT_BLOCK_ELEMENTS, to avoid confusing them with
  instance variables.

* The unused constant LXMLTreeBuilderForXML.DEFAULT_PARSER_CLASS
  has been removed.

* Some of the arguments in the methods of LXMLTreeBuilderForXML
  have been renamed for consistency with the names lxml uses for those
  arguments in the superclass. This won't affect you unless you were
  calling methods like LXMLTreeBuilderForXML.start() directly.

* In particular, the arguments to LXMLTreeBuilderForXML.prepare_markup
  have been changed to match the arguments to the superclass,
  TreeBuilder.prepare_markup. Specifically, document_declared_encoding
  now appears before exclude_encodings, not after. If you were calling
  this method yourself, I recommend switching to using keyword
  arguments instead.
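As a concrete illustration of the two search-related changes above
(the empty-list behavior and limit=0), here is a small sketch:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p id='a'>one</p><p>two</p>", "html.parser")

# An empty list of acceptable values now matches nothing.
soup.find_all("p", attrs={"id": []})   # now [] ; previously found <p>two</p>

# limit=0 now means "return zero results", not "no limit".
soup.find_all("p", limit=0)            # now [] ; previously found both <p> tags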
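As for the relocated exceptions and warnings, here is a small sketch
using names mentioned in these notices:

from bs4 import GuessedAtParserWarning, MarkupResemblesLocatorWarning

# The text formerly in BeautifulSoup.NO_PARSER_SPECIFIED_WARNING now
# lives on the warning class itself:
print(GuessedAtParserWarning.MESSAGE)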

# New features

* The new ElementFilter class encapsulates Beautiful Soup's rules
  about matching elements and deciding which parts of a document to
  parse. It's easy to override those rules with subclassing or
  function composition. The SoupStrainer class, which contains all the
  matching logic you're familiar with from the find_* methods, is now
  a subclass of ElementFilter.

* The new PageElement.filter() method provides a fully general way of
  finding elements in a Beautiful Soup parse tree. You can specify a
  function to iterate over the tree and an ElementFilter to determine
  what matches.

* The new_tag() method now takes a 'string' argument. This allows you to
  set the string contents of a Tag when creating it. Patch by Chris
  Papademetrious. [bug=2044599] (A short example of this and the
  .string property below follows this list.)

* The NavigableString class now has a .string property which returns the
  string itself. This makes it easier to iterate over a mixed list
  of Tag and NavigableString objects. [bug=2044794]

* Defined a new warning class, UnusualUsageWarning, which is a superclass
  for all of the warnings issued when Beautiful Soup notices something
  unusual but not guaranteed to be wrong, like markup that looks like
  a URL (MarkupResemblesLocatorWarning) or XML being run through an HTML
  parser (XMLParsedAsHTMLWarning).

  The text of these warnings has been revamped to explain in more
  detail what is going on, how to check if you've made a mistake,
  and how to make the warning go away if you are acting deliberately.

  If these warnings are interfering with your workflow, or simply
  annoying you, you can filter all of them by filtering
  UnusualUsageWarning, without worrying about losing the warnings
  Beautiful Soup issues when there *definitely* is a problem you
  need to correct.
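To silence only this family of warnings while keeping the reports of
definite problems, the standard warnings machinery is enough (a small
sketch):

import warnings
from bs4 import UnusualUsageWarning

# Ignore only the "unusual but possibly deliberate" warnings.
warnings.filterwarnings("ignore", category=UnusualUsageWarning)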
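And here is a small sketch of the two string-related additions above
(the 'string' argument to new_tag() and NavigableString.string):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<div>text <b>bold</b> more</div>", "html.parser")

# new_tag() can now set the string contents at creation time:
tag = soup.new_tag("p", string="Hello")   # <p>Hello</p>

# NavigableString.string returns the string itself, so a mixed run of
# Tag and NavigableString children can be handled uniformly:
for child in soup.div.children:
    print(child.string)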

# Improvements

* decompose() was moved from Tag to its superclass PageElement, since
  there's no reason it won't also work on NavigableString objects.

* Emit an UnusualUsageWarning if the user tries to search for an attribute
  called _class; they probably mean "class_". [bug=2025089]

* The MarkupResemblesLocatorWarning issued when the markup resembles a
  filename is now issued less often, due to improvements in detecting
  markup that's unlikely to be a filename. [bug=2052988]

* Emit a warning if a document is parsed using a SoupStrainer that's
  set up to filter everything. In these cases, filtering everything is
  the most consistent thing to do, but there was no indication that
  this was happening, so the behavior may have seemed mysterious.

* When using one of the find() methods or creating a SoupStrainer, you can
  pass a list of any accepted object (strings, regular expressions, etc.) for
  any of the objects. Previously you could only pass in a list of strings.
  (A short example follows this list.)

* A SoupStrainer can now filter tag creation based on a tag's
  namespaced name. Previously only the unqualified name could be used.

* Added the correct stacklevel to another instance of the
  XMLParsedAsHTMLWarning. [bug=2034451]
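As an example of the broader list support described above (a small
sketch):

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h1>a</h1><h2>b</h2><a href='#'>c</a>", "html.parser")

# A list may now mix strings, regular expressions, and other accepted
# objects; previously only a list of strings was allowed.
soup.find_all(["a", re.compile("h[12]")])   # matches <a>, <h1>, and <h2>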

# Bug fixes

* Fixed an error in the lookup table used when converting
  ISO-Latin-1 to ASCII, which no one should do anyway.

* Corrected the markup that's output in the unlikely event that you
  encode a document to a Python internal encoding (like "palmos")
  that's not recognized by the HTML or XML standard.

* UnicodeDammit.markup is now always a bytestring representing the
  *original* markup (sans BOM), and UnicodeDammit.unicode_markup is
  always the converted Unicode equivalent of the original
  markup. Previously, UnicodeDammit.markup was treated inconsistently
  and would often end up containing Unicode. UnicodeDammit.markup was
  not a documented attribute, but if you were using it, you probably
  want to switch to using .unicode_markup instead.

Chris Papademetrious

Mar 24, 2024, 5:57:07 PM
to beautifulsoup
Hi Leonard,

Thanks for making the 4.13.0b2 package available for easier testing!

I ran our content processing pipeline on ~50k HTML files and the results from 4.12.2 and 4.13.0b2 are identical. I haven't adapted my code to make use of the new ElementFilter feature yet, but I'll do that next.

The type hints add useful information in VSCode. What method of runtime type-checking are you using? I'd like to run my code with the same settings to see if I catch any invalid behaviors.

One thing I noticed in VSCode is that for methods whose return type is PageElement (which might be a Tag *or* a NavigableString), the methods specific to Tag or NavigableString are not offered on the returned values. For example, consider the following code:

import bs4
soup = bs4.BeautifulSoup('<div></div>', 'lxml')
for aaa in soup.find_all(True, recursive=False):
    for bbb in aaa.find_all(True, recursive=False):
        for ccc in bbb.find_all(True, recursive=False):
            for ddd in ccc.find_all(True, recursive=False):
                ddd.decompose()
            ccc.decompose()
        bbb.decompose()
    aaa.decompose()


VSCode knows that soup is a BeautifulSoup object, and so it shows the find_all() usage in the tool tip of soup.find_all().

VSCode knows that aaa is a PageElement object, and so it shows the decompose() usage in the tool tip of aaa.decompose() (because all PageElement objects have a decompose() method).

But, VSCode does not show the find_all() usage in the tool tip of aaa.find_all(), bbb.find_all(), or ccc.find_all() because the find_all() method exists only for Tag objects, not for PageElement objects.

Likewise, VSCode does not show the decompose() usage in the tool tip of bbb.decompose(), ccc.decompose(), or ddd.decompose(), because bbb, ccc, and ddd are typed as Any; the type of the collection they iterate over is unknown.

If I explicitly resolve the ambiguity wherever a PageElement is returned:

import typing

import bs4
soup = bs4.BeautifulSoup('<div></div>', 'lxml')
for aaa in soup.find_all(True, recursive=False):
    aaa = typing.cast(bs4.Tag, aaa)  # tell the type checker this is a Tag
    for bbb in aaa.find_all(True, recursive=False):
        bbb = typing.cast(bs4.Tag, bbb)
        for ccc in bbb.find_all(True, recursive=False):
            ccc = typing.cast(bs4.Tag, ccc)
            for ddd in ccc.find_all(True, recursive=False):
                ddd = typing.cast(bs4.Tag, ddd)
                ddd.decompose()
            ccc.decompose()
        bbb.decompose()
    aaa.decompose()

then all the method usages become known.

If I go into BS4's find_all() method definition and temporarily hack it like this:

def find_all(
        self,
        name:_FindMethodName=None,
        attrs:_StrainableAttributes={},
        recursive:bool=True,
        string:Optional[_StrainableString]=None,
        limit:Optional[int]=None,
        _stacklevel:int=2,
        **kwargs:_StrainableAttribute) -> ResultSet[Tag | NavigableString]:

then the methods become known in the original code (*without* the explicit casting). Perhaps when the return type is a union like this, the editor offers the union of the methods available on the possible types?

 - Chris


Chris Papademetrious

May 21, 2024, 9:08:23 AM
to beautifulsoup
Hi Leonard,

I know the previous email was a bit long... If you haven't used VSCode before, I can create a small test case and share the step-by-step process to install VSCode and reproduce the observation (and possible enhancement?).

 - Chris



leonardr

May 22, 2024, 2:31:11 PM
to beautifulsoup
Chris,

It's all right; I get what's going on. It's just that changing this throughout element.py causes a whole lot of type checking failures that are tough to deal with.

I've pushed a branch 4.13-more-specific-than-pageelement to the Git repository, where I changed all of the return signatures of the find* methods, without changing the other stuff that causes the type checking failures. Try it out and see if VSCode behaves more like you're expecting. In the meantime I'm looking to see if there's a better way to say "Tag and NavigableString are the only supported direct subclasses of PageElement."

Leonard

Chris Papademetrious

May 23, 2024, 7:44:42 AM
to beautifulsoup
Hi Leonard,

The 4.13-more-specific-than-pageelement branch resolves all the relevant unknown-method issues for find* methods - awesome! It was nice to see so many variables change from white (unknown methods) to yellow (known methods) in the VSCode editor.

I noticed a similar situation with generators:

import bs4
soup = bs4.BeautifulSoup('<div></div>', 'lxml')
for aaa in soup.children:
    for bbb in aaa.children:
        for ccc in bbb.children:
            for ddd in ccc.children:
                ddd.decompose()
            ccc.decompose()
        bbb.decompose()
    aaa.decompose()

but mimicking your type change fixed that too - the variables showed both Tag and NavigableString (and str!) methods.

Of course, it sounds like a proper fix is more involved. Hopefully you find something clever and maintainable for it.

One more thing I noticed: when I define my own Tag subclass with BeautifulSoup(element_classes=...) (which we now do), the methods added by that subclass aren't known either. Honestly, that's not worth considering right now; I only note it for completeness.

 - Chris