Parsing issue with recent version (soupsieve exception)

1,222 views
Skip to first unread message

Simon Conseil

unread,
Jan 11, 2019, 7:39:17 AM1/11/19
to beautifulsoup
Hi,

Since the update to the latest version of beautifulsoup4, I'm getting a parsing error with the response from a web service (eso.org via the astroquery package, https://astroquery.readthedocs.io/en/latest/eso/eso.html).

beautifulsoup4                    4.7.1          
soupsieve                         1.7            

It works if I downgrade to beautifulsoup4-4.6.3.

Traceback (most recent call last):
 
File "/home/conseil/miniconda3/envs/py37/bin/musered", line 11, in <module>
    load_entry_point
('musered', 'console_scripts', 'musered')()
 
File "/home/conseil/lib/musered/musered/__main__.py", line 339, in main
    cli
(prog_name='musered')
 
File "/home/conseil/miniconda3/envs/py37/lib/python3.7/site-packages/click/core.py", line 764, in __call__
   
return self.main(*args, **kwargs)
 
File "/home/conseil/miniconda3/envs/py37/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv
= self.invoke(ctx)
 
File "/home/conseil/miniconda3/envs/py37/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
   
return _process_result(sub_ctx.command.invoke(sub_ctx))
 
File "/home/conseil/miniconda3/envs/py37/lib/python3.7/site-packages/click/core.py", line 956, in invoke
   
return ctx.invoke(self.callback, **ctx.params)
 
File "/home/conseil/miniconda3/envs/py37/lib/python3.7/site-packages/click/core.py", line 555, in invoke
   
return callback(*args, **kwargs)
 
File "/home/conseil/miniconda3/envs/py37/lib/python3.7/site-packages/click/decorators.py", line 27, in new_func
   
return f(get_current_context().obj, *args, **kwargs)
 
File "/home/conseil/lib/musered/musered/scripts/retrieve_data.py", line 109, in retrieve_data
   
Eso.retrieve_data(table['DP.ID'], **kw)
 
File "/home/conseil/lib/astroquery/astroquery/eso/core.py", line 767, in retrieve_data
    link
= root.select('a[href$=/script]')[0]
 
File "/home/conseil/miniconda3/envs/py37/lib/python3.7/site-packages/bs4/element.py", line 1376, in select
   
return soupsieve.select(selector, self, namespaces, limit, **kwargs)
 
File "/home/conseil/miniconda3/envs/py37/lib/python3.7/site-packages/soupsieve/__init__.py", line 108, in select
   
return compile(select, namespaces, flags).select(tag, limit)
 
File "/home/conseil/miniconda3/envs/py37/lib/python3.7/site-packages/soupsieve/__init__.py", line 59, in compile
   
return cp._cached_css_compile(pattern, namespaces, flags)
 
File "/home/conseil/miniconda3/envs/py37/lib/python3.7/site-packages/soupsieve/css_parser.py", line 183, in _cached_css_compile
   
CSSParser(pattern, flags).process_selectors(),
 
File "/home/conseil/miniconda3/envs/py37/lib/python3.7/site-packages/soupsieve/css_parser.py", line 854, in process_selectors
   
return self.parse_selectors(self.selector_iter(self.pattern), index, flags)
 
File "/home/conseil/miniconda3/envs/py37/lib/python3.7/site-packages/soupsieve/css_parser.py", line 713, in parse_selectors
    key
, m = next(iselector)
 
File "/home/conseil/miniconda3/envs/py37/lib/python3.7/site-packages/soupsieve/css_parser.py", line 841, in selector_iter
   
raise SyntaxError(msg)
SyntaxError: Malformed attribute selector at position 1



facelessuser

unread,
Jan 11, 2019, 8:33:29 AM1/11/19
to beautifulsoup
According to CSS spec rules you need to quote the value.

facelessuser

unread,
Jan 11, 2019, 8:57:40 AM1/11/19
to beautifulsoup
I should probably be less terse :).

The old selector code was replaced by the third party Soup Sieve library. It aimed to fix the previous select methods poor selection support.  But beyond adding support for as complex selectors as you can think up and supporting many more selectors up through the CSS4 working draft, there is one difference that I see people running into, and it is with not quoting the attribute selector when they should.

Soup Sieve follows the CSS spec for CSS selectors, while the old select method would allow different bad selectors, some with surprising results.

1. You cannot treat a selector as relative without using the :scope pseudo-class:

# bad
soup
.select('> div')
# good
soup
.select(':scope > div')


2. Attributes values must be an identifier token or a string token.  In short, if you tried to use the same syntax to specify a type like "div", and it would fail, you must quote it.

# bad
soup
.select('[href=/value]'
soup
.select('[href={}]']

# good
soup
.select('[href=value]')
soup
.select('[href="/value"]')
soup
.select('[href="{}"]')


3. I've actually seen this in the wild.  If you are going to specify a `.` or a `#` make sure an actual class or id identifier is following it:

# bad
soup
.select('td.+.class')

# good
soup
.select('td+.class')
soup
.select('td.class')


----

I short, while the old selector method allowed some of these, it shouldn't have.  The Soup Sieve parser was written to do its best to follow the CSS spec, and this will not be changing.  So if your scripts exhibit malformed CSS selectors (according to the CSS spec), a SyntaxError will be raised.

Simon Conseil

unread,
Jan 11, 2019, 11:46:36 AM1/11/19
to beautifulsoup
Thanks for the explanations :)

This seems all reasonable, but it is also a bit harsh to get an obscure exception for something that was working previously. 
If it is feasible having a deprecation warning or at least a better exception message would be appreciable. I will report this one to astroquery but I guess that there are not the only one to use this non standard syntax.

facelessuser

unread,
Jan 11, 2019, 12:42:30 PM1/11/19
to beautifulsoup
Soup Sieve did not intentionally introduce this issue.  The old code that Soup Sieve replaced was hard to read and understand exactly everything that it supported and did not support.  Additionally, Beautiful Soup's test suite did not cover this case either. So these recent syntax errors are known to me only recently.

I don't think "Malformed attribute selector" is particularly obscure as in this case it is on the nose, but maybe not as specific as value should be quoted, but that is due to the current resolution of of how selectors are tokenized. Attributes are not tokenized into individual parts, at least not until a valid attribute selector is found. Attribute selectors are captured as a whole, valid selector. So currently, if the attribute selector doesn't match, we see an invalid `[` character, and since only attribute selectors start this way, we report a "Malformed attribute selector".

As for providing a deprecation path for invalid CSS syntax, I'm not entirely sure how I feel about it, but if you create an issue over at https://github.com/facelessuser/soupsieve, I will consider it. I've gone through great trouble to specifically follow the spec so I can just point people at the spec instead of documenting the nuances of the appropriate syntax, but maybe. To some extent, with a big transition as this is, these kinds of differences can be expected. Maybe in this case it makes sense, I will have to think about it, as I can see both sides of this argument.

Thanks for the feekback

leonardr

unread,
Jan 11, 2019, 12:54:57 PM1/11/19
to beautifulsoup
From a Beautiful Soup perspective, this is a regression -- a certain string used to work as input into select() and now it doesn't. From a Soup Sieve perspective there's no regression.

So how about putting the deprecation code inside Beautiful Soup rather than Soup Sieve? Either something generic that intercepts a SyntaxError and explains that maybe this worked before but things have changed; or something that uses regular expressions to check for the three specific cases you mentioned.

Maybe Soup Sieve could also give more helpful explanations of common syntax errors, but that would be a separate project, and not limited to the three incorrect syntaxes that worked in the old implementation of Tag.select().

Leonard

facelessuser

unread,
Jan 11, 2019, 1:24:32 PM1/11/19
to beautifulsoup
Recent iterations have added more helpful errors, but maybe not helpful enough.

Deprecating the `.` and `#` issue isn't hard, I could just raise a warning and ignore the bad character for now.  I personally view this issue as a bug as it doesn't feel like a choice by design as much as omission, and bad patterns, but I could still technically address this without too much issue.

Deprecating the attribute quote issue also isn't hard.  I simply need a more relaxed attribute pattern that I check after the strict one fails.  Simply raise an deprecation warning and parse it as usual.

As for deprecating the last issue in regards to relative, relational combinators, this one may not be as straight forward.  It isn't hard to inject ":scope" in these cases, but I would have to see how the old selector method handled leading whitespace: "  div".  If it treated that as a relational combinator, I simply cannot account for that in all the places it goes awry. Whitespace has no significance except between two selectors, and in the case of ":has()" it is assumed to be looking for a descendant unless ">", "+", or "~" is explicitly set, leading whitespace technically means nothing to the parser here, it is more the absence of other combinators that it triggers off of. I'll have to look into this.

It may be hard for BS4 to simply catch the exception and simply format it to not fail, as these patterns can be quite complex, but I could add a flag to SoupSieve that allows it to run in a...less strict state.  We could raise the error on the first attempt, and BS4 could use the less strict flag on the second attempt.

I guess had this been a BS5 release, I could argue that people are going to have to deal with it, but with a BS4.7 release it is harder to argue that...so maybe I should just add the deprecations...even if I die a little on the inside :).

facelessuser

unread,
Jan 11, 2019, 1:36:55 PM1/11/19
to beautifulsoup
Checking the old code real quick, I don't think it had a strong tie to the scope of the tag before it.  I think leading whitespace combinator should feel the same regardless of whether I insert `:scope` or not.  I think all of this is doable in a "quirks" mode.  Maybe we'll just allow a flag called QUIRKS. BS4 can run the selector, and if it fails, catch the error and save it and run it with QUIRKS. If it succeeds, then raise the error as a warning, or raise a more probably warning if Soup Sieve is not clear enough.  If it doesn't succeed the second time, just let the exception bubble up.

leonardr

unread,
Jan 11, 2019, 2:05:46 PM1/11/19
to beautifulsoup
I am actually closer to the 'deal with it' camp. This strikes me as analogous to the situation at https://groups.google.com/d/msg/beautifulsoup/C_2J31u6WdE/I_qFCg1CFwAJ, where a user was relying on the fact that an empty class attribute used to translate to [""] instead of []. In that case we treated deviation from the spec as a bug, even though some code depended on the deviation.

If it's easy to detect/repair an invalid selector, either on the Beautiful Soup end or in Soup Sieve as part of a QUIRKS mode, then let's go for it.

Leonard


facelessuser

unread,
Jan 11, 2019, 2:17:09 PM1/11/19
to beautifulsoup
I'll look it over this weekend and see how well it works for some of these cases. I'm in no big hurry.  Out of all the issues, the attribute issue is maybe the most understandable. The others feel more like exploitation of a lax algorithm. 

facelessuser

unread,
Jan 11, 2019, 2:53:44 PM1/11/19
to beautifulsoup
FYI, the old attribute regex looked like this:

# ^([a-zA-Z0-9][-.a-zA-Z0-9:_]*)\[(\w+)([=~\|\^\$\*]?)=?"?([^\]"]*)"?\]$

The value allowed anything that wasn't a double quote or closing square bracket. Whitespace? check. Even newlines? check. Anything.  I will not mimic this to that extend for this syntax as case insensitive and sensitive flags, as outlined in CSS4, simply will not work if I do:

[attribute=value i][attribute2=value s]

SoupSieve only allows whitespace if it immediately follows a hex escape as CSS allows this in order to signify that the parser should stop accumulating hex characters in. If you were relying on spaces in an unquoted value, and it is now broken, I say deal with it :). With that said though, I can at least accommodate a great deal of relaxing of the pattern here which should cover most cases that people would break this. Either that or, if I allowed whitespace, you would not be able to use the "i" and "s" flags unless the value was quoted. There would have to be some kind of compromise to support this kind of relaxed syntax.

If anything, I hope this illustrates better why Soup Sieve tries to follow the spec so closely. Additions to the spec in later revisions may depend on previously outlined requirements.

facelessuser

unread,
Jan 11, 2019, 3:01:11 PM1/11/19
to beautifulsoup
Link to the Soup Sieve issue covering this task. It is marked as "maybe" as it is yet to be determined if this will actually be released. If you are curious, or want to follow the progress, check or subscribe to the issue below:

facelessuser

unread,
Jan 11, 2019, 5:53:50 PM1/11/19
to beautifulsoup
I don't think we are going to support this case:

"td.+.class"

Looking at old BS4.6 code, I was wrong with my assumption of what it was doing. It appears that the old algorithm would just look for `.` and then split it and take whatever was at that index.  So in the case above, it would grab `+` as the class. To mimic this with Soup Sieve, we would have to change our approach because allowing anything for a class would break down our ability to parse things.  What would stop someone from having a class like this:

"td.[.class"

In Soup Sieve, you can still have weird classes, but you have to do them proper, via escapes or hex escapes, and this ability is already present:

r"td.\+.class"

Unfortunately, I have to draw a line somewhere. The old algorithm was quite naive in its approach. There are so many cases it did not account for. To mimic its behavior 100% is impossible and would only hurt our ability to use more complex selectors.

I will commit to handling relaxed attribute values ([attr=/value])

I will also commit to handling implicit :scope (> div) assuming I don't run into any corner cases that I can't code around, but implicit scope will only be handled outside of  pseudo-classes, you will not be able to use it in `:is(> div)`, instead you'd have to use it as `> :is(div)` or as :is(:scope > div). This is because quirks mode is meant to mitigate old patterns (which don't even support :is()) and also because it becomes more difficult to track when :scope can properly be used as you descend into nested pseudo classes that also are using relational combinators. So only allowing them at the root level can help us ensure we are inserting :scope appropriately.


facelessuser

unread,
Jan 18, 2019, 10:55:11 AM1/18/19
to beautifulsoup
I finally got around to releasing Soup Sieve 1.7.2 which introduces the aforementioned "quirks" mode. So, when the next BS4 is released, if Soup Sieve 1.7.2 is installed, it will allow a few of the old quirks, but with warnings. Quirks mode is not documented in the public API, and will not be either. It is meant only as a path to alert, instead of failing, during the transitional period. If using the Soup Sieve API directly, you should not rely on "quirks" mode. Eventually, "quirks" mode will be removed.

Reply all
Reply to author
Forward
0 new messages