Beautiful Soup 4.2.0

Beautiful Soup 4.2.0 Leonard Richardson 5/14/13 6:56 AM
I just released Beautiful Soup 4.2.0. The main new feature is a
rewrite of the CSS selector code, which greatly expands the set of
selectors you can use with select(). There are also new
troubleshooting features and lots of bug fixes. The complete changelog
is below.

This version changes the way some documents are presented, hopefully
unequivocally for the better. In particular, the contents of <script>
and <style> tags no longer undergo entity substitution.
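
A small sketch of what that looks like in practice. It assumes a
current bs4 install and the standard-library "html.parser" backend;
the markup and variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# Made-up markup: "<" and "&" appear inside <script>, and an entity
# appears inside ordinary paragraph text.
markup = '<script>if (a < b && b < c) { run(); }</script><p>AT&amp;T</p>'
soup = BeautifulSoup(markup, "html.parser")

# Script contents are written back out as-is: no &lt; and no &amp;.
script_html = str(soup.script)

# Ordinary text still undergoes entity substitution on output.
paragraph_html = str(soup.p)

print(script_html)
print(paragraph_html)
```

The same should hold for <style> contents; XML documents are not
affected by this change.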

I'm prepared to do a bugfix release later this week, but hopefully
this version will improve life for everyone.


* The select() method now supports a much wider variety of CSS
  selectors.

 - Added support for the adjacent sibling combinator (+) and the
   general sibling combinator (~). Tests by "liquider". [bug=1082144]

 - The combinators (>, +, and ~) can now combine with any supported
   selector, not just one that selects based on tag name.

 - Added limited support for the "nth-of-type" pseudo-class. Code
   by Sven Slootweg. [bug=1109952]

* The BeautifulSoup class is now aliased to "_s" and "_soup", making
  it quicker to type the import statement in an interactive session:

  from bs4 import _s
  from bs4 import _soup

  The alias may change in the future, so don't use this in code you're
  going to run more than once.

* Added the 'diagnose' submodule, which includes several useful
  functions for reporting problems and doing tech support.

  - diagnose(data) tries the given markup on every installed parser,
    reporting exceptions and displaying successes. If a parser is not
    installed, diagnose() mentions this fact.

  - lxml_trace(data, html=True) runs the given markup through lxml's
    XML parser or HTML parser, and prints out the parser events as
    they happen. This helps you quickly determine whether a given
    problem occurs in lxml code or Beautiful Soup code.

  - htmlparser_trace(data) is the same thing, but for Python's
    built-in HTMLParser class.

* In an HTML document, the contents of a <script> or <style> tag will
  no longer undergo entity substitution by default. XML documents work
  the same way they did before. [bug=1085953]

* Methods like get_text() and properties like .strings now only give
  you strings that are visible in the document--no comments or
  processing commands. [bug=1050164]

* The prettify() method now leaves the contents of <pre> tags
  alone. [bug=1095654]

* Fix a bug in the html5lib treebuilder which sometimes created
  disconnected trees. [bug=1039527]

* Fix a bug in the lxml treebuilder which crashed when a tag included
  an attribute from the predefined "xml:" namespace. [bug=1065617]

* Fix a bug by which keyword arguments to find_parent() were not
  being passed on. [bug=1126734]

* Stop a crash when unwisely messing with a tag that's been
  decomposed. [bug=1097699]

* Now that lxml's segfault on invalid doctype has been fixed, fixed a
  corresponding problem on the Beautiful Soup end that was previously
  invisible. [bug=984936]

* Fixed an exception when an overspecified CSS selector didn't match
  anything. Code by Stefaan Lippens. [bug=1168167]
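
Several of the changelog entries above can be tried out with a short
script. This is only a sketch, assuming a current bs4 install with
CSS selector support and the standard-library "html.parser" backend.
The markup, ids, and variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
markup = """
<div id="outer" class="box">
  <h2 id="title">Parsers</h2>
  <p class="first">lxml</p>
  <p>html5lib</p>
  <p>html.parser<!-- built in --></p>
  <pre>line 1
   line 2</pre>
</div>
"""
soup = BeautifulSoup(markup, "html.parser")

# Adjacent sibling combinator (+): the <p> right after the <h2>.
adjacent = [p.get_text() for p in soup.select("h2 + p")]

# General sibling combinator (~) combined with an id selector rather
# than a tag name: every <p> that follows #title under the same parent.
general = [p.get_text() for p in soup.select("#title ~ p")]

# The nth-of-type pseudo-class: the second <p> among its siblings.
second = [p.get_text() for p in soup.select("p:nth-of-type(2)")]

# get_text() returns only visible strings, so the comment is skipped.
visible = soup.select("p:nth-of-type(3)")[0].get_text()

# prettify() re-indents the document but leaves <pre> contents alone.
pretty = soup.prettify()

# Keyword arguments to find_parent() act as attribute filters.
parent = soup.pre.find_parent("div", id="outer")

print(adjacent, general, second, visible, parent["id"])
```
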
Re: Beautiful Soup 4.2.0 Aaron DeVore 5/14/13 4:03 PM
The element "\ngit\n" has its next_element attribute set to None. It should be set to the <em> tag instead. I haven't figured out exactly where the root source of the problem is. The <em> tag is definitely there; it's just not linked up.

-Aaron DeVore

On Tue, May 14, 2013 at 12:37 PM, Andrew Roberts wrote:
I can't log in to Launchpad, but I found a bug. It's illustrated in this simple gist:

Re: Beautiful Soup 4.2.0 Leonard Richardson 5/15/13 6:09 AM

> The element "\ngit\n" has its next_element attribute set to None. It should
> be set to the <em> tag instead. I haven't figured out exactly where the root
> source of the problem is. The <em> tag is definitely there; it's just not
> linked up.

This sort of thing happens a lot with the html5lib treebuilder, which
a) moves the tree around a lot while constructing it, and b) has
historically done this using its own code, taken from html5lib,
instead of standard Beautiful Soup code.

Other bugs of this sort:

Do you want to take this? Otherwise I'll look at it. I think the key
is to follow "\ngit\n" and the following <em> tag from their creation.
At some point, the <em> tag is supposed to set
self.previous_element.next_element = self, and that doesn't happen.

I wouldn't be surprised if the same thing happened with the "\nand\n"
that separates the two <strong> tags. That would be the place where
the bug causes the specific problem that Andrew reported.

Re: Beautiful Soup 4.2.0 leonardr 5/20/13 8:35 AM
This will be fixed in the next release: