Beautiful Soup 4.10.0

417 views

Skip to first unread message

leonardr

unread,

Sep 7, 2021, 8:33:39 PM9/7/21

to beautifulsoup

I just released version 4.10.0 of Beautiful Soup. Although there are a number of changes and improvements, which are all mentioned in the changelog below, the biggest change is that Beautiful Soup no longer supports Python 2. Beautiful Soup 4.9.3 is the last released version to support Python 2.

Since the official sunset of Beautiful Soup's Python 2 support at the end of 2020, I've continued doing development on a Python 2 baseline, basically waiting for some other part of the Python ecosystem to change and make that impractical. That change has now happened.

Version 58.0.0 of setuptools dropped support for the "2to3" script I use to convert the Python 2 code to Python 3. Rather than try to port Beautiful Soup to a unified code base, or disallow people from installing Beautiful Soup through new versions of setuptools, I converted the Python 2 code to Python 3 one last time and committed the conversion to the repository.

Here's the complete changelog for 4.10.0:

= 4.10.0 (20210907)

* This is the first release of Beautiful Soup to only support Python
3. I dropped Python 2 support to maintain support for newer versions
(58 and up) of setuptools. See:
https://github.com/pypa/setuptools/issues/2769 [bug=1942919]

* The behavior of methods like .get_text() and .strings now differs
depending on the type of tag. The change is visible with HTML tags
like <script>, <style>, and <template>. Starting in 4.9.0, methods
like get_text() returned no results on such tags, because the
contents of those tags are not considered 'text' within the document
as a whole.

But a user who calls script.get_text() is working from a different
definition of 'text' than a user who calls div.get_text()--otherwise
there would be no need to call script.get_text() at all. In 4.10.0,
the contents of (e.g.) a <script> tag are considered 'text' during a
get_text() call on the tag itself, but not considered 'text' during
a get_text() call on the tag's parent.

Because of this change, calling get_text() on each child of a tag
may now return a different result than calling get_text() on the tag
itself. That's because different tags now have different
understandings of what counts as 'text'. [bug=1906226] [bug=1868861]

* NavigableString and its subclasses now implement the get_text()
method, as well as the properties .strings and
.stripped_strings. These methods will either return the string
itself, or nothing, so the only reason to use this is when iterating
over a list of mixed Tag and NavigableString objects. [bug=1904309]

* The 'html5' formatter now treats attributes whose values are the
empty string as HTML boolean attributes. Previously (and in other
formatters), an attribute value must be set as None to be treated as
a boolean attribute. In a future release, I plan to also give this
behavior to the 'html' formatter. Patch by Isaac Muse. [bug=1915424]

* The 'replace_with()' method now takes a variable number of arguments,
and can be used to replace a single element with a sequence of elements.
Patch by Bill Chandos. [rev=605]

* Corrected output when the namespace prefix associated with a
namespaced attribute is the empty string, as opposed to
None. [bug=1915583]

* Performance improvement when processing tags that speeds up overall
tree construction by 2%. Patch by Morotti. [bug=1899358]

* Corrected the use of special string container classes in cases when a
single tag may contain strings with different containers; such as
the <template> tag, which may contain both TemplateString objects
and Comment objects. [bug=1913406]

* The html.parser tree builder can now handle named entities
found in the HTML5 spec in much the same way that the html5lib
tree builder does. Note that the lxml HTML tree builder doesn't handle
named entities this way. [bug=1924908]

* Added a second way to pass specify encodings to UnicodeDammit and
EncodingDetector, based on the order of precedence defined in the
HTML5 spec, starting at:
https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding

Encodings in 'known_definite_encodings' are tried first, then
byte-order-mark sniffing is run, then encodings in 'user_encodings'
are tried. The old argument, 'override_encodings', is now a
deprecated alias for 'known_definite_encodings'.

This changes the default behavior of the html.parser and lxml tree
builders, in a way that may slightly improve encoding
detection but will probably have no effect. [bug=1889014]

* Improve the warning issued when a directory name (as opposed to
the name of a regular file) is passed as markup into the BeautifulSoup
constructor. [bug=1913628]