Beautiful Soup 4.2.0

Leonard Richardson

May 14, 2013, 9:56:20 AM

to beautifulsoup

I just released Beautiful Soup 4.2.0. The main new feature is a
rewrite of the CSS selector code, which greatly expands the set of
selectors you can use with select(). There are also new
troubleshooting features and lots of bug fixes. The complete changelog
is below.

This version changes the way some documents are presented, hopefully
unequivocally for the better. In particular, the contents of <script>
and <style> tags no longer undergo entity substitution.

I'm prepared to do a bugfix release later this week, but hopefully
this version will improve life for everyone.

Leonard

* The Tag.select() method now supports a much wider variety of CSS
selectors.

- Added support for the adjacent sibling combinator (+) and the
general sibling combinator (~). Tests by "liquider". [bug=1082144]

- The combinators (>, +, and ~) can now combine with any supported
selector, not just one that selects based on tag name.

- Added limited support for the "nth-of-type" pseudo-class. Code
by Sven Slootweg. [bug=1109952]
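
A minimal sketch of the selectors described above, assuming the built-in "html.parser" backend. The sample markup and variable names are invented for illustration and are not part of the release.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the release notes.
html = """
<div id="main">
  <h1>Title</h1>
  <p class="intro">First</p>
  <p>Second</p>
  <p>Third</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Adjacent sibling combinator: the <p> immediately after the <h1>.
adjacent = [p.get_text() for p in soup.select("h1 + p")]

# General sibling combinator: every <p> sibling that follows the <h1>.
following = [p.get_text() for p in soup.select("h1 ~ p")]

# Combinators also work with non-tag selectors such as IDs and classes.
child = [p.get_text() for p in soup.select("#main > .intro")]

# nth-of-type: the second <p> among its siblings.
second = [p.get_text() for p in soup.select("p:nth-of-type(2)")]

print(adjacent, following, child, second)
```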

* The BeautifulSoup class is now aliased to "_s" and "_soup", making
it quicker to type the import statement in an interactive session:

from bs4 import _s
or
from bs4 import _soup

The alias may change in the future, so don't use this in code you're
going to run more than once.

* Added the 'diagnose' submodule, which includes several useful
functions for reporting problems and doing tech support.

- diagnose(data) tries the given markup on every installed parser,
reporting exceptions and displaying successes. If a parser is not
installed, diagnose() mentions this fact.

- lxml_trace(data, html=True) runs the given markup through lxml's
XML parser or HTML parser, and prints out the parser events as
they happen. This helps you quickly determine whether a given
problem occurs in lxml code or Beautiful Soup code.

- htmlparser_trace(data) is the same thing, but for Python's
built-in HTMLParser class.
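
A rough sketch of calling two of these helpers, assuming only that bs4 itself is installed. lxml_trace() is left out because it needs lxml. The captured() helper and the sample markup are hypothetical additions for the example; both diagnostic functions print their findings rather than returning them, so the helper collects stdout.

```python
import contextlib
import io

from bs4.diagnose import diagnose, htmlparser_trace

# Hypothetical, deliberately sloppy markup.
markup = "<p>Unclosed <b>bold<p>Next paragraph"


def captured(func, data):
    """Run func(data) and return whatever it printed to stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        func(data)
    return buffer.getvalue()


# Try the markup on every installed parser and collect the report.
report = captured(diagnose, markup)

# Collect the raw events emitted by Python's built-in HTMLParser.
events = captured(htmlparser_trace, markup)

print(report)
print(events)
```

In an interactive session you would normally skip the capture step and call diagnose(markup) directly.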

* In an HTML document, the contents of a <script> or <style> tag will
no longer undergo entity substitution by default. XML documents work
the same way they did before. [bug=1085953]
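
To illustrate the difference, here is a small sketch assuming the "html.parser" backend. The markup is made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same "<" character appears in ordinary text
# (as an entity) and inside a <script> tag (as raw JavaScript).
markup = "<p>a &lt; b</p><script>if (a < b && c) { run(); }</script>"
soup = BeautifulSoup(markup, "html.parser")

# Ordinary text still gets entity substitution on output.
paragraph = str(soup.p)

# Script contents are written out exactly as they were parsed.
script = str(soup.script)

print(paragraph)
print(script)
```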

* Methods like get_text() and properties like .strings now only give
you strings that are visible in the document--no comments or
processing instructions. [bug=1050164]
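
A short sketch of this behavior, assuming the "html.parser" backend and an invented one-line document. The comment is still in the tree; it is only skipped when collecting text.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup with a comment in the middle of the text.
soup = BeautifulSoup("<p>Visible<!-- hidden note --> text</p>", "html.parser")

# Only the visible strings are returned.
text = soup.get_text()
strings = list(soup.strings)

# The comment can still be found by asking for it explicitly.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

print(repr(text), strings, comments)
```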

* The prettify() method now leaves the contents of <pre> tags
alone. [bug=1095654]
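
As a quick check of this, the sketch below (assuming the "html.parser" backend and made-up markup) confirms that the whitespace inside <pre> survives prettify() untouched.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the indentation inside <pre> is significant.
original = "line 1\n    line 2"
soup = BeautifulSoup("<div><pre>" + original + "</pre></div>", "html.parser")

pretty = soup.prettify()

# The surrounding <div> is re-indented, but the <pre> text is not.
preserved = original in pretty

print(pretty)
```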

* Fix a bug in the html5lib treebuilder which sometimes created
disconnected trees. [bug=1039527]

* Fix a bug in the lxml treebuilder which crashed when a tag included
an attribute from the predefined "xml:" namespace. [bug=1065617]

* Fix a bug by which keyword arguments to find_parent() were not
being passed on. [bug=1126734]
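
A small sketch of what the fix enables, assuming the "html.parser" backend. The nested markup and the id values are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two nested <div> tags.
markup = '<div id="outer"><div id="inner"><b>text</b></div></div>'
soup = BeautifulSoup(markup, "html.parser")

# Without keyword arguments, the nearest matching parent is returned.
nearest = soup.b.find_parent("div")["id"]

# With a keyword argument, the filter is applied and the search skips
# past the inner <div>.
filtered = soup.b.find_parent("div", id="outer")["id"]

print(nearest, filtered)
```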

* Stop a crash when unwisely messing with a tag that's been
decomposed. [bug=1097699]

* Now that lxml's segfault on invalid doctype has been fixed, fixed a
corresponding problem on the Beautiful Soup end that was previously
invisible. [bug=984936]

* Fixed an exception when an overspecified CSS selector didn't match
anything. Code by Stefaan Lippens. [bug=1168167]

Aaron DeVore

May 14, 2013, 7:03:06 PM

to beauti...@googlegroups.com

The element "\ngit\n" has its next_element attribute set to None. It should be set to the <em> tag instead. I haven't figured out exactly where the root source of the problem is. The <em> tag is definitely there; it's just not linked up.

-Aaron DeVore

On Tue, May 14, 2013 at 12:37 PM, Andrew Roberts <adro...@gmail.com> wrote:

I can't log in to Launchpad, but I found a bug. It's illustrated in this simple gist:

https://gist.github.com/aroberts/5578807

Cheers,

Andrew

Leonard Richardson

May 15, 2013, 9:09:48 AM

to beauti...@googlegroups.com

Aaron,

> The element "\ngit\n" has its next_element attribute set to None. It should
> be set to the <em> tag instead. I haven't figured out exactly where the root
> source of the problem is. The <em> tag is definitely there; it's just not
> linked up.

This sort of thing happens a lot with the html5lib treebuilder, which
a) moves the tree around a lot while constructing it, and b) has
historically done this using its own code, taken from html5lib,
instead of standard Beautiful Soup code.

Other bugs of this sort:

https://bugs.launchpad.net/beautifulsoup/+bug/1039527
https://bugs.launchpad.net/beautifulsoup/+bug/1019603
https://bugs.launchpad.net/beautifulsoup/+bug/943246

Do you want to take this? Otherwise I'll look at it. I think the key
is to follow "\ngit\n" and the following <em> tag from their creation.
At some point, the <em> tag is supposed to set
self.previous_element.next_element = self, and that doesn't happen.
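
One possible way to hunt for a break like this, sketched under the assumption that .contents is still intact even when the next_element chain is not. The helper names document_order() and broken_links() are hypothetical, and the markup only imitates the structure from the report. The example parses with "html.parser", which links the tree correctly, and then simulates the break by hand.

```python
from bs4 import BeautifulSoup, Tag


def document_order(node):
    """Yield every descendant in document order, using only .contents."""
    for child in node.contents:
        yield child
        if isinstance(child, Tag):
            yield from document_order(child)


def broken_links(soup):
    """Return (element, expected_next) pairs whose next_element is wrong."""
    order = list(document_order(soup))
    return [
        (current, following)
        for current, following in zip(order, order[1:])
        if current.next_element is not following
    ]


# Hypothetical markup imitating the "\ngit\n" text followed by an <em> tag.
soup = BeautifulSoup("<p>\ngit\n<em>docs</em></p>", "html.parser")
healthy = broken_links(soup)

# Simulate the reported symptom: the text node loses its forward link.
soup.p.contents[0].next_element = None
simulated = broken_links(soup)

print(healthy, simulated)
```

To check the actual report, the same helper could be run on the gist's markup parsed with "html5lib", if that parser is installed.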

I wouldn't be surprised if the same thing happened with the "\nand\n"
that separates the two <strong> tags. That would be the place where
the bug causes the specific problem that Andrew reported.

Leonard

leonardr

May 20, 2013, 11:35:27 AM

to beauti...@googlegroups.com, leon...@segfault.org

This will be fixed in the next release:

https://bugs.launchpad.net/beautifulsoup/+bug/1182089

Leonard