|Beautiful Soup 4.2.0||Leonard Richardson||5/14/13 6:56 AM|
I just released Beautiful Soup 4.2.0. The main new feature is a
rewrite of the CSS selector code, which greatly expands the set of
selectors you can use with select(). There are also new
troubleshooting features and lots of bug fixes. The complete changelog
is below.
This version changes the way some documents are presented, hopefully
unequivocally for the better. In particular, the contents of <script>
and <style> tags no longer undergo entity substitution.
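
To illustrate the <script>/<style> change, here is a minimal sketch. It assumes a current bs4 install and Python's built-in "html.parser"; the sample markup and variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: a script containing a bare "<", and a
# paragraph containing the same character written as an entity.
markup = "<script>if (a < b) { run(); }</script><p>a &lt; b</p>"
soup = BeautifulSoup(markup, "html.parser")

# The script body should come back as written, with no entity substitution.
script_html = str(soup.script)

# Ordinary text still gets entity substitution on output.
paragraph_html = str(soup.p)

print(script_html)
print(paragraph_html)
```

Ordinary text nodes are unaffected; only the contents of <script> and <style> are left alone.
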
I'm prepared to do a bugfix release later this week, but hopefully
this version will improve life for everyone.
* The Tag.select() method now supports a much wider variety of CSS
selectors.
- Added support for the adjacent sibling combinator (+) and the
general sibling combinator (~). Tests by "liquider". [bug=1082144]
- The combinators (>, +, and ~) can now combine with any supported
selector, not just one that selects based on tag name.
- Added limited support for the "nth-of-type" pseudo-class. Code
by Sven Slootweg. [bug=1109952]
* The BeautifulSoup class is now aliased to "_s" and "_soup", making
it quicker to type the import statement in an interactive session:
from bs4 import _s
from bs4 import _soup
The alias may change in the future, so don't use this in code you're
going to run more than once.
* Added the 'diagnose' submodule, which includes several useful
functions for reporting problems and doing tech support.
- diagnose(data) tries the given markup on every installed parser,
reporting exceptions and displaying successes. If a parser is not
installed, diagnose() mentions this fact.
- lxml_trace(data, html=True) runs the given markup through lxml's
XML parser or HTML parser, and prints out the parser events as
they happen. This helps you quickly determine whether a given
problem occurs in lxml code or Beautiful Soup code.
- htmlparser_trace(data) is the same thing, but for Python's
built-in HTMLParser class.
* In an HTML document, the contents of a <script> or <style> tag will
no longer undergo entity substitution by default. XML documents work
the same way they did before. [bug=1085953]
* Methods like get_text() and properties like .strings now only give
you strings that are visible in the document--no comments or
processing commands. [bug=1050164]
* The prettify() method now leaves the contents of <pre> tags alone.
* Fix a bug in the html5lib treebuilder which sometimes created
disconnected trees. [bug=1039527]
* Fix a bug in the lxml treebuilder which crashed when a tag included
an attribute from the predefined "xml:" namespace. [bug=1065617]
* Fix a bug by which keyword arguments to find_parent() were not
being passed on. [bug=1126734]
* Stop a crash when unwisely messing with a tag that's been
* Now that lxml's segfault on invalid doctype has been fixed, fixed a
corresponding problem on the Beautiful Soup end that was previously
* Fixed an exception when an overspecified CSS selector didn't match
anything. Code by Stefaan Lippens. [bug=1168167]
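
The select() additions listed in the changelog above can be tried out roughly like this. It is a minimal sketch, assuming bs4 is installed and using Python's built-in "html.parser"; the markup and variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical document used only to exercise the selectors.
html = """
<ul>
  <li class="first">one</li>
  <li>two</li>
  <li>three</li>
</ul>
<h2 id="title">Title</h2>
<p>intro</p>
<div>box</div>
<p>outro</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Adjacent sibling combinator (+): the <p> immediately after #title.
adjacent = [tag.get_text() for tag in soup.select("#title + p")]

# General sibling combinator (~): every later <p> sibling of #title.
general = [tag.get_text() for tag in soup.select("#title ~ p")]

# nth-of-type pseudo-class: the second <li> among its siblings.
second = [tag.get_text() for tag in soup.select("li:nth-of-type(2)")]

# A combinator joined to a class selector rather than a tag name.
first_child = [tag.get_text() for tag in soup.select("ul > .first")]

print(adjacent, general, second, first_child)
```
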
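
To give a feel for what "printing out the parser events" means for the trace functions in the changelog above, here is a sketch built directly on Python's html.parser.HTMLParser. It is not the implementation of htmlparser_trace(); the EventTracer class is a hypothetical stand-in that records events in a list instead of printing them.

```python
from html.parser import HTMLParser


class EventTracer(HTMLParser):
    """Hypothetical helper: records each parser event as a tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


tracer = EventTracer()
tracer.feed("<p>Hello <b>world</b></p>")
tracer.close()

for event in tracer.events:
    print(event)
```

If the raw events already look wrong at this level, the problem is likely in the parser rather than in Beautiful Soup's tree-building code.
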
|Re: Beautiful Soup 4.2.0||Andrew Roberts||5/14/13 12:37 PM|
I can't log in to Launchpad, but I found a bug. It's illustrated in this simple gist:
|Re: Beautiful Soup 4.2.0||Aaron DeVore||5/14/13 4:03 PM|
The element "\ngit\n" has its next_element attribute set to None. It should be set to the <em> tag instead. I haven't figured out exactly where the root source of the problem is. The <em> tag is definitely there; it's just not linked up.
|Re: Beautiful Soup 4.2.0||Leonard Richardson||5/15/13 6:09 AM|
This sort of thing happens a lot with the html5lib treebuilder, which
a) moves the tree around a lot while constructing it, and b) has
historically done this using its own code, taken from html5lib,
instead of standard Beautiful Soup code.
Other bugs of this sort:
Do you want to take this? Otherwise I'll look at it. I think the key
is to follow "\ngit\n" and the following <em> tag from their creation.
At some point, the <em> tag is supposed to set
self.previous_element.next_element = self, and that doesn't happen.
I wouldn't be surprised if the same thing happened with the "\nand\n"
that separates the two <strong> tags. That would be the place where
the bug causes the specific problem that Andrew reported.
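
One way to hunt for this kind of broken linkage is to compare the next_element chain against the order implied by .contents. The sketch below is an assumption-laden illustration, not Beautiful Soup code: document_order() and find_broken_links() are hypothetical helpers, and the symptom is simulated by hand on a tree built with "html.parser" rather than reproduced with html5lib.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def document_order(node):
    """Yield descendants by walking .contents, ignoring next_element."""
    for child in node.contents:
        yield child
        if isinstance(child, Tag):
            yield from document_order(child)


def find_broken_links(soup):
    """Return elements whose next_element disagrees with document order."""
    order = list(document_order(soup))
    broken = []
    for current, following in zip(order, order[1:]):
        if (current.next_element is not following
                or following.previous_element is not current):
            broken.append(current)
    return broken


soup = BeautifulSoup("<p>one <em>two</em> three</p><p>four</p>", "html.parser")
healthy = find_broken_links(soup)

# Simulate the reported symptom: the string just before <em> loses
# its next_element pointer even though <em> is still in the tree.
text_before_em = soup.em.previous_element
text_before_em.next_element = None
broken = find_broken_links(soup)

print(healthy, broken)
```

The walk deliberately avoids .descendants and find(), since those rely on next_element and would stop early at the broken link.
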
|Re: Beautiful Soup 4.2.0||leonardr||5/20/13 8:35 AM|
This will be fixed in the next release: