Beautiful Soup 4.7.0


leonardr

Dec 31, 2018, 1:21:18 PM
to beautifulsoup
We're saying goodbye to 2018 with a major release of Beautiful Soup!

This release fixes one of the primary recent complaints about the project: its lackluster support for the CSS selector syntax. The homemade selector implementation (originally based on Simon Willison's soupselect project from the early 2010s) has been replaced with a dependency on Isaac Muse's SoupSieve project, which has a nearly complete implementation of CSS selectors. Beautiful Soup 4 started off by replacing a homemade HTML parser with dependencies on a variety of external parsers, so this is very much in the spirit of the project.

Isaac also did some very good work in chasing down some more problems with the html5lib tree builder, problems which led to trees that appeared well-structured but which started falling apart if you modified the tree.

We've also got a few minor features and bug fixes, which you can read about in the full changelog.

You can get the new release at PyPI or direct from the Beautiful Soup web site.

Leonard

Malik Rumi

Dec 31, 2018, 1:47:12 PM
to beauti...@googlegroups.com
Congratulations! This is great news. I just took a quick scan of the changelog. I get that we have to use SoupSieve going forward; what is not clear is how scripts written for versions before 4.7 will work with 4.7. Will they have to be rewritten to work with SoupSieve?

“None of you has faith until he loves for his brother or his neighbor what he loves for himself.”


--
You received this message because you are subscribed to the Google Groups "beautifulsoup" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beautifulsou...@googlegroups.com.
To post to this group, send email to beauti...@googlegroups.com.
Visit this group at https://groups.google.com/group/beautifulsoup.
For more options, visit https://groups.google.com/d/optout.

facelessuser

Dec 31, 2018, 2:20:53 PM
to beautifulsoup
All tests that passed with the old select implementation also pass with the new one. The difference is that the range of accepted selectors is far greater, and you are no longer limited in selector complexity.

If you were doing some hackery with select's undocumented argument:

_candidate_generator

Then you may need to do some rewriting. This should be expected when tinkering with undocumented things.
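
As an illustration of the point above, here is a minimal sketch, assuming Beautiful Soup 4.7 or later with SoupSieve installed. The markup and variable names are made up for the example. The first selector is the simple kind older scripts typically relied on; the other two use `:not()` and `:last-child` as examples of the richer syntax that select() now accepts.

```python
from bs4 import BeautifulSoup

# Invented markup, used only for this example.
html = """
<ul id="menu">
  <li class="item">one</li>
  <li class="item active">two</li>
  <li class="item">three</li>
  <li>four</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# A simple selector of the kind existing scripts already use.
simple = [li.get_text() for li in soup.select("ul#menu li.item")]

# More complex selectors: negation and a structural pseudo-class.
not_active = [li.get_text() for li in soup.select("li.item:not(.active)")]
last_child = [li.get_text() for li in soup.select("#menu > li:last-child")]

print(simple, not_active, last_child)
```

The calling convention is unchanged: select() still takes a selector string and returns a list of tags.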

facelessuser

Dec 31, 2018, 2:35:32 PM
to beautifulsoup
If anyone has any questions or comments (good or bad) about the new SoupSieve library, I'd love to hear them.  SoupSieve was no small feat to get implemented with the wide range of support. It is something I've been using in one of my projects in an immature form for a while now.

I'm hoping everyone finds it useful.  I was careful to implement it in a way that should not disrupt old scripts. If anything, it should only enhance the experience of using select with Beautiful Soup.

As I am but a man reading the sometimes convoluted CSS spec, there may be some quirks in my implementations.  Feel free to bring your bugs and requests over to the SoupSieve issue tracker: https://github.com/facelessuser/soupsieve/issues.

Keep in mind, if you have some complicated parsing you can also use SoupSieve's API directly: https://facelessuser.github.io/soupsieve/api/. It allows matching a tag against a selector, filtering a tag's children by selector, getting the closest matching ancestor, etc.
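
A short sketch of calling SoupSieve directly, assuming a SoupSieve release that provides `match`, `filter`, and `closest`. The markup is invented for the example, and `closest` is shown returning the nearest matching ancestor.

```python
import soupsieve as sv
from bs4 import BeautifulSoup

# Invented markup, used only for this example.
html = '<div id="box"><p class="a">x</p><p class="b"><span>y</span></p></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.div
span = soup.span

# match: does this one tag satisfy the selector?
div_is_box = sv.match("div#box", div)

# filter: given a tag, keep only its direct children that match.
b_children = sv.filter("p.b", div)

# closest: walk up from the tag to the nearest matching ancestor.
nearest_div = sv.closest("div", span)

print(div_is_box, b_children, nearest_div is div)
```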

Malik Rumi

Dec 31, 2018, 2:43:13 PM
to beauti...@googlegroups.com
Thanks. I'm sure it was long hard work, so know that it is appreciated. 


leonardr

Dec 31, 2018, 3:14:57 PM
to beautifulsoup
Thanks for providing those details, Isaac. I think SoupSieve is a major accomplishment and will benefit a lot of people.

The other situation I'd mention where upgrading to 4.7.0 could break a script is if you install Beautiful Soup by unzipping the source tarball (which won't automatically install SoupSieve), rather than installing it through pip (which will). If you do this, Beautiful Soup will issue a warning when you import it, but nothing bad will happen unless you actually call the select() method -- at that point you'll get a NotImplementedError.
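
The "warn early, fail only on use" behavior described above can be sketched as follows. This is an illustration of the general pattern, not Beautiful Soup's actual code: `SelectableDocument` and its `backend` parameter are hypothetical names made up for the example.

```python
import importlib.util
import warnings


class SelectableDocument:
    """Hypothetical stand-in for a parsed document; not Beautiful Soup's real class."""

    def __init__(self, backend="soupsieve"):
        # Look the backend up without importing it; None means it is missing.
        self._has_backend = importlib.util.find_spec(backend) is not None
        if not self._has_backend:
            # Warn up front, but let everything other than select() keep working.
            warnings.warn(f"{backend} is not installed; select() will not work.")

    def select(self, selector):
        if not self._has_backend:
            # Fail only when the missing feature is actually used.
            raise NotImplementedError("CSS selectors need the selector backend.")
        return []  # placeholder: a real implementation would delegate to the backend
```

A script that never calls select() is unaffected by the missing dependency, which matches the situation described above.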

Leonard

chad....@gmail.com

Jan 2, 2019, 6:31:46 PM
to beautifulsoup
Just a quick note - renaming NEWS.txt to CHANGELOG has broken all existing links to the changelog, from the Beautiful Soup homepage, the PyPI project page, etc.

Malik Rumi

Jan 3, 2019, 11:16:54 AM
to beauti...@googlegroups.com
Hey Leonardr:

I don't want to put too much on your plate, but is there an updated test suite for bs4? I found one for bs3 on your site, https://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.0.8/BeautifulSoupTests.py  but it is very much out of date.


facelessuser

Jan 3, 2019, 11:27:34 AM
to beautifulsoup
Since I recently provided a number of commits to the bs4 project, I can probably answer this.  The project has a number of unit tests that can be run to ensure sound operation. The tests are usually run via:

python -m unittest discover

To run on Python 3, you have to run 2to3 on the project and then run the unit tests.  Though, in the future, it would be nice to just abstract the differences between PY2 and PY3 to remove the need for running 2to3. Maybe if I get some time moving forward and Leonardr is agreeable to it, I can do the work.

Isaac

Malik Rumi

Jan 3, 2019, 11:32:25 AM
to beauti...@googlegroups.com
Thanks, Isaac. I love the fast replies!


facelessuser

Jan 3, 2019, 11:41:41 AM
to beautifulsoup
No problem.  It helps break up my work day :).

Alex Krupp

Jan 6, 2019, 12:30:26 AM
to beauti...@googlegroups.com
I found something that works in 4.6.3, but not 4.7.0, which I think is a bug:

from bs4 import BeautifulSoup

text = '<br>'
soup = BeautifulSoup(text, "lxml")

for br_tag in soup.find_all('br'):
    br_tag.insert_before(soup.new_tag('br'))

In 4.6.3 I get `<html><body><br/><br/></body></html>`, whereas in 4.7.0 I get `ValueError: Can't insert an element before itself.`

It looks like there is also a minor API change that didn't make it into the changelog. In 4.6.3 I was using this hacky code snippet as a navigation-safe way of getting the class of a tag as a string:

`tag.get('class', [''])[0]`

In 4.6.3 that worked, whereas in 4.7.0 I'm getting `IndexError: list index out of range`. And it looks like now I can just do `tag.get('class', '')`.

Alex



facelessuser

Jan 6, 2019, 1:39:49 AM
to beautifulsoup
The self check is definitely a bug.  insert_before and insert_after can now take multiple arguments. In the original proposal, we looped through the arguments, checking whether the tag to be inserted **is** self. The check was then moved out of the loop (which makes sense) so it could be done once, but it now checks whether self is **in** args, which is a test of **equality**, and that is not what we want.  I can probably submit a patch for that.
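
To make the distinction concrete, here is a small sketch. `Node` is a hypothetical stand-in that compares equal by content, which is the property that matters here; it is not Beautiful Soup's Tag class, and the variable names are made up for the example.

```python
class Node:
    """Hypothetical stand-in for a tag that compares equal by content."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return isinstance(other, Node) and self.name == other.name

    def __hash__(self):
        return hash(self.name)


existing = Node("br")   # the element already in the tree
fresh = Node("br")      # a brand-new element with the same content
args = (fresh,)

# `in` falls back to ==, so two distinct but equal nodes look like "the same".
equality_hit = existing in args

# Checking identity asks the question that was intended: is it this exact object?
identity_hit = any(arg is existing for arg in args)

print(equality_hit, identity_hit)
```

With an equality-based check, inserting a fresh `<br>` before an existing `<br>` is mistaken for inserting an element before itself, which is the error reported above.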

As for the second difference, I'm not really sure. It may be a side effect of this commit: https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/revision/484. It specifically removes the chance of getting empty classes due to the way things were split.  I imagine this only occurs for cases like this?

>>> from bs4 import BeautifulSoup
>>> text = '<div class=""></div>'
>>> soup = BeautifulSoup(text, "lxml")

In this case, BS4 now stores an empty list for class, while in the past it would store [''], which is a little confusing. I'm assuming that this is all it is, though I'd have to see what the actual element you are calling this on looked like.  But yeah, for those who've been working around this quirk of BS4, a note in the changelog may be helpful.
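
If that guess is right, the IndexError follows from how a default value works: it is used only when the key is missing, not when the stored value is empty. A minimal sketch, using plain dicts as hypothetical stand-ins for a tag's attributes and assuming the tag's get() behaves like dict.get(); `first_class` is a made-up helper name.

```python
# Plain dicts stand in for a tag's attributes (hypothetical data).
old_style = {"class": [""]}   # empty class stored as a list holding one empty string
new_style = {"class": []}     # empty class stored as an empty list
no_class = {}                 # attribute absent altogether

# The default is ignored because the key exists, so [0] hits an empty list.
try:
    new_style.get("class", [""])[0]
    raised = False
except IndexError:
    raised = True


def first_class(attrs):
    """Return the first class name, or '' when there is none."""
    classes = attrs.get("class") or [""]
    return classes[0]


print(raised, first_class(old_style), first_class(new_style), first_class(no_class))
```

Falling back with `or` covers both the missing attribute and the empty list, so the helper gives the same answer under either storage convention.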

facelessuser

Jan 6, 2019, 1:54:38 AM
to beautifulsoup
Thinking more about the class issue...I guess if people really relied on the old behavior, maybe if a class attribute has an empty class, we could just return ['']?  I don't know what is more intuitive, or if it makes sense to do it this way for backwards compatibility.  I guess leonardr would have to make that call if he wanted to go that route.



leonardr

Jan 6, 2019, 6:38:14 PM
to beautifulsoup
I don't think we can go back to returning [''] if a class attribute has an empty value. That sounds like a special case of bug #1787453. The empty string should parse to an empty list of space-separated tokens.
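
As a sketch of how the two results can arise, compare two common ways of tokenizing a space-separated value. This only illustrates the difference between `['']` and `[]`; it is not necessarily how either version of Beautiful Soup is implemented, and the function names are made up for the example.

```python
import re


def split_on_whitespace(value):
    """Split at runs of whitespace; an empty string yields one empty token."""
    return re.split(r"\s+", value)


def find_tokens(value):
    """Collect runs of non-whitespace; an empty string yields no tokens."""
    return re.findall(r"\S+", value)


print(split_on_whitespace(""), find_tokens(""))
print(split_on_whitespace("a  b"), find_tokens("a  b"))
```

Both approaches agree on ordinary values and differ only at the edges, such as the empty string.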

It is annoying that there's not an easy way to get a string value for an attribute; we could add a dictionary that acts like Tag.attrs but which always outputs strings.

Leonard

facelessuser

Jan 6, 2019, 6:49:09 PM
to beautifulsoup
I've always kind of wanted a way to get the actual string for a class, not just a list. Even when you do a ' '.join() on it, it still might not match what it was in the HTML. If you are doing some selector matching like '[class^="   someclass"]', you can't actually get the original string back.  So I fully support a way to get the string.
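
A tiny sketch of why the round trip is lossy, using plain strings rather than a parsed document. The attribute value is invented for the example.

```python
# The attribute value as it might appear in the HTML (hypothetical).
original = "   someclass  other"

# Splitting into class tokens discards the exact whitespace.
tokens = original.split()
rejoined = " ".join(tokens)

# A prefix test in the spirit of [class^="   someclass"].
prefix = "   someclass"
matches_original = original.startswith(prefix)
matches_rejoined = rejoined.startswith(prefix)

print(repr(rejoined), matches_original, matches_rejoined)
```

Once the value has been split into tokens, the leading and repeated spaces are gone, so no amount of joining can recover the string the selector was written against.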

facelessuser

Jan 6, 2019, 7:11:21 PM
to beautifulsoup
FYI, if anyone runs into a select error that matches this https://github.com/facelessuser/soupsieve/pull/69, a fix has been merged into BeautifulSoup to no longer index unprefixed namespaces.

Details:

Unprefixed namespaces were being passed to Soup Sieve with a key of None, when it should have been an empty string (at least for Soup Sieve's API).  I actually created a pull request to allow Soup Sieve to handle the `None` key, but thinking about it more, it seemed we could run into all sorts of issues if we continued indexing unprefixed namespaces in Beautiful Soup. You can only track one in the dictionary, and they shouldn't always be treated as default namespaces.  The user should probably specify default namespaces as it can affect the implied namespace of certain selectors.

The actual prefix used in CSS selectors doesn't even need to match what is in the XML document, as it is the namespace that is actually checked, not the prefix (except in the case of attributes).
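
The same principle can be seen outside Beautiful Soup. The sketch below uses the standard library's `xml.etree.ElementTree` instead of select(), purely as an analogy: the query brings its own prefix-to-namespace mapping, and only the namespace URI has to agree with the document. The URI and element names are invented for the example.

```python
import xml.etree.ElementTree as ET

NS = "http://example.com/ns"  # hypothetical namespace URI

# Same namespace, two spellings: unprefixed (default) and with the prefix "a".
default_doc = ET.fromstring(f'<root xmlns="{NS}"><item>1</item></root>')
prefixed_doc = ET.fromstring(f'<a:root xmlns:a="{NS}"><a:item>2</a:item></a:root>')

# The query uses its own prefix "x"; only the URI it maps to has to match.
query_namespaces = {"x": NS}
from_default = [el.text for el in default_doc.findall("x:item", query_namespaces)]
from_prefixed = [el.text for el in prefixed_doc.findall("x:item", query_namespaces)]

print(from_default, from_prefixed)
```

Neither document uses the prefix "x", yet both queries find their element, because the lookup is by namespace rather than by prefix.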