Find grand-child with double colon + "name" in name?


Heck Lennon

May 25, 2024, 7:24:38 AM
to beautifulsoup
Hello,

I need to check whether the element extensions/gte:name exists.

What would be a good way to do it?

==========
soup = BeautifulSoup(open("test.gpx", 'r'),features="xml")

segments = soup.find_all("trkseg")
for segment in segments:
    """
    <extensions>
    <gte:name>#1</gte:name>
    <gte:color>#fbaf00</gte:color>
    </extensions>
    """
    #BAD name = segment.find("/extensions/gte:name")
    #BAD name = segment.find("extensions/gte:name")
    #BAD name = segment.select_one("extensions.gte:name")
    #NotImplementedError: ':name' pseudo-class is not implemented at this time
    name = segment.select_one("extensions > gte:name")

    name = segment.find("extensions") #OK
    if name is not None:
        print("Found:",name.text)
    else:
        print("Name not found")
==========

Thank you.

Heck Lennon

May 25, 2024, 9:21:19 AM
to beautifulsoup
I found a work-around:

=========
for segment in segments:
    """
    <extensions>
    <gte:name>#1</gte:name>
    <gte:color>#fbaf00</gte:color>
    </extensions>
    """

    """
    name = segment.find(r"gte\:name")
    if name:

        print("Found:",name.text)
    else:
        print("Name not found")
    """

    extension = segment.find("extensions").find(re.compile("name$"))
    if extension:
        print("Found:",extension.text)

    else:
        print("Name not found")

    break
=========

Chris Papademetrious

May 25, 2024, 10:54:01 AM
to beautifulsoup
Hi there,

According to "Namespaces in CSS selectors" in the Beautiful Soup documentation, you want to use the namespace CSS selector syntax:

name = segment.select_one(r"extensions > gte|name")

 - Chris

Isaac Muse

May 26, 2024, 9:43:58 AM
to beautifulsoup
Yep, the CSS selector library follows CSS syntax conventions.

Heck Lennon

May 26, 2024, 11:30:09 AM
to beautifulsoup
Thanks for the info.

But it still chokes on the ":" sign (it's not a "|"):

#NotImplementedError: ':name' pseudo-class is not implemented at this time
name = segment.select_one(r"extensions > gte:name")

Isaac Muse

May 26, 2024, 12:10:38 PM
to beautifulsoup
I'm the author of the CSS library being used. If you provide a minimal reproducible example, I can guide you further when I'm in front of a computer again, but without one, I cannot help you. 

Remember, you have to obey CSS syntax for CSS selectors. That includes CSS escaping if you want a literal colon. If you are trying to access namespaces, you have to use namespace syntax, and you may need to pass in a namespaces map if Beautiful Soup is not able to capture the namespaces automatically. Lastly, not all HTML parsers are namespace-aware.
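The namespaces map mentioned above could look roughly like this. It is only a sketch under assumptions: the gte prefix is declared in the document with a made-up URI (http://example.com/gte is hypothetical), lxml is installed so that the "xml" feature works, and the alias g in the map is arbitrary, since matching is done on the namespace URI rather than on the prefix text.

```python
from bs4 import BeautifulSoup

# Hypothetical document: the gte prefix is declared on the root element,
# which the GPX file discussed in this thread does not do.
GPX = """<gpx version="1.1" xmlns:gte="http://example.com/gte">
<trk><name>Blah</name><trkseg>
<extensions><gte:name>Some name</gte:name></extensions>
</trkseg></trk>
</gpx>"""

soup = BeautifulSoup(GPX, "xml")

# Explicit namespaces map: selector alias -> namespace URI.
namespaces = {"g": "http://example.com/gte"}

# "g|name" only matches <name> elements that live in the mapped namespace.
found = soup.select_one("extensions > g|name", namespaces=namespaces)
print(found.text if found is not None else "Name not found")

# The track's own <name> has no namespace, so it is not matched.
plain = soup.select_one("trk > g|name", namespaces=namespaces)
```

If the prefix is not declared in the file at all, there is no namespace to map, so this approach only helps once the document itself is well-formed in that respect.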

Heck Lennon

May 26, 2024, 12:17:43 PM
to beautifulsoup
Here goes. It could be something simple to experts:

=========
import sys
import os
import re
from bs4 import BeautifulSoup, Comment
from bs4.builder import LXMLTreeBuilderForXML

"""
input file:
<?xml version="1.0" encoding="UTF-8"?>
<gpx version="1.1">
<trk>
<name>Blah</name>
<trkseg>
<trkpt lat="45.776910" lon="3.083850">
</trkpt>
<trkpt lat="45.777160" lon="3.083830">
</trkpt>
<extensions>
<gte:name>Some name</gte:name>
</extensions>
</trkseg>
</trk>
</gpx>
"""

soup = BeautifulSoup(open(item, 'r'),features="xml")

segments = soup.find_all("trkseg")
for segment in segments:
    #CSS
    #extension = segment.select_one(r"extensions > gte|name")

    #NotImplementedError: ':name' pseudo-class is not implemented at this time
    #extension = segment.select_one(r"extensions > gte:name")

    extension = segment.find("extensions").find(re.compile("name$"))
    if extension:
        print("Found:",extension.text)
    else:
        print("Name not found")
=========

Isaac Muse

May 26, 2024, 12:29:19 PM
to beautifulsoup
Because I won't be near a computer until later today, and I don't see any declared namespaces, let me point you at the escape documentation: 
https://facelessuser.github.io/soupsieve/api/#soupsieveescape.

It also links to the CSS spec's rules on escapes.

leonardr

May 26, 2024, 1:18:38 PM
to beautifulsoup
As Isaac says, the issue is that the colon character has special meaning in CSS selector syntax, and it needs to be escaped to be interpreted as part of the tag name.

You can escape the tag name with the css.escape method Isaac linked to, like so:

extension = segment.select_one("extensions > " + segment.css.escape("gte:name"))

Or you can escape the colon character yourself:

extension = segment.select_one(r"extensions > gte\:name")
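For anyone who wants to try the escaping approach in isolation, here is a minimal sketch. It assumes a parser that keeps the colon in the tag name (Python's built-in html.parser is used here for that reason) and calls soupsieve.escape directly. The escape function treats its whole argument as a single identifier, so it is applied to the tag name only and the rest of the selector is built around it.

```python
from bs4 import BeautifulSoup
import soupsieve as sv

# html.parser is used on the assumption that it keeps "gte:name" as the
# literal tag name; an XML parser may handle the prefix differently.
MARKUP = "<extensions><gte:name>Some name</gte:name></extensions>"
soup = BeautifulSoup(MARKUP, "html.parser")

# Escape the identifier only, then build the full selector around it.
tag_name = sv.escape("gte:name")
selector = f"extensions > {tag_name}"
print(selector)

found = soup.select_one(selector)
print(found.text if found is not None else "Name not found")

# Escaping a whole selector also escapes the combinator and the spaces,
# which turns it into one long (and different) identifier.
whole = sv.escape("extensions > gte:name")
```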

That said, you got an error message that points down the wrong path: it makes it sound like there's a pseudo-class called ":name" which Soup Sieve doesn't support. But actually there is no such pseudo-class defined in the CSS spec.

Isaac, would it be a lot of work to give the current error message for known pseudo-classes that are unsupported for whatever reason (like, let's say, ':blank'), and another error for pseudo-classes that are completely unknown? The second error would allow for both possibilities but highlight the more likely chance that the problem is a missing escape.

Leonard

Isaac Muse

May 26, 2024, 2:15:11 PM
to beautifulsoup

I'll have to take a look. I believe we have insight into what is unsupported vs. what is invalid. Though at any time what was once invalid could become unsupported.

Isaac Muse

May 26, 2024, 11:04:04 PM
to beautifulsoup
Looking over the code, it seems we use `f"'{pseudo}' pseudo-class is not implemented at this time"` for invalid pseudo-class names in general. It is impossible for us to know if a pseudo-class is a real one or not because this can change. We could keep a list of currently known pseudo classes that we have not implemented (that would be a fairly small list), but that can become invalid at any time.

I think a more practical solution is to simply change the message to something like `f"'{pseudo}' is not a valid pseudo-class"`. This implies a couple of things. One is that the syntax is recognized as an attempt to specify a pseudo-class, and the second is that Soup Sieve does not recognize the pseudo-class. This is better because it makes no claim about whether the pseudo-class is a real but unsupported pseudo-class or simply a bad pseudo-class name. This keeps it generic without implying anything about the specified pseudo-class name.
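Whatever wording the message ends up with, the failure surfaces when the selector is compiled, before any matching happens. A minimal sketch of what a caller sees, with one assumption stated loudly: which exception class is raised (and its text) may differ between Soup Sieve versions, so both NotImplementedError and SelectorSyntaxError are caught here. The helper try_select is hypothetical and only exists for this illustration.

```python
from bs4 import BeautifulSoup
from soupsieve import SelectorSyntaxError

soup = BeautifulSoup(
    "<extensions><gte:name>Some name</gte:name></extensions>", "html.parser"
)

def try_select(selector):
    """Return (element, error_message); exactly one of the two is None
    when the selector matches or fails to compile."""
    try:
        return soup.select_one(selector), None
    except (NotImplementedError, SelectorSyntaxError) as exc:
        # Exception class and wording are version-dependent assumptions.
        return None, str(exc)

# Unescaped colon: ":name" is parsed as a pseudo-class and rejected.
bad_element, bad_error = try_select("extensions > gte:name")
print(bad_error)

# Escaped colon: the colon is part of the tag name, so the selector compiles.
good_element, good_error = try_select(r"extensions > gte\:name")
print(good_element)
```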

Heck Lennon

May 27, 2024, 4:21:29 AM
to beautifulsoup
Thanks for the help.

The two suggestions return nothing:

=======
extension = segment.find("extensions").find(re.compile("name$"))
#NO RESULT extension = segment.select_one(r"extensions > gte\:name")
#NO RESULT extension = segment.select_one(segment.css.escape(r"extensions > gte:name"))

if extension:
    print("Found:",extension.text)
else:
    print("Name not found")
=======

Isaac Muse

May 27, 2024, 9:11:48 AM
to beautifulsoup

Okay, I know why it is not finding anything. This is related to how the XML parser works. Notice that you use the namespace prefix gte, but it is defined nowhere in your XML. When the XML parser finds this prefix but sees that it is not defined, it throws it away. You are left with a tag name of name but no namespace prefix of gte. This means you can only find the tag name with name, because gte is dropped.

from bs4 import BeautifulSoup

XML1 = """
<?xml version="1.0" encoding="UTF-8"?>
<gpx version="1.1">
<extensions>
<gte:name>Some name</gte:name>
</extensions>
</gpx>
"""

soup = BeautifulSoup(XML1, "xml")
name = soup.select_one('name')
print(f"prefix: {getattr(name, 'prefix', '')}, name: {name.name}")

# Find namespace: will fail
print(soup.select_one("gte|*"))

# Results
# prefix: None, name: name
# None

Now, let's define gte as a proper namespace. Since the namespace is now declared, the XML parser will process gte properly and keep it as the prefix for the name tag. That element will have a name of name and a namespace prefix of gte.

from bs4 import BeautifulSoup

XML2 = """
<?xml version="1.0" encoding="UTF-8"?>
<gpx version="1.1">
<extensions xmlns:gte="http://me.com/namespaces/gte">
<gte:name>Some name</gte:name>
</extensions>
</gpx>
"""

soup = BeautifulSoup(XML2, "xml")
name = soup.select_one('name')
print(f"prefix: {getattr(name, 'prefix', '')}, name: {name.name}")

# Find namespace: will succeed
print(soup.select_one("gte|*"))

# Results
# prefix: gte, name: name
# <gte:name>Some name</gte:name>

Isaac Muse

May 27, 2024, 9:19:02 AM
to beautifulsoup

The escape doesn’t work because the name is stripped of gte in XML mode. So why did we suggest that? Well, if we use LXML in HTML mode, it will leave the prefix for abnormal, unsupported namespaces.

This time, we take the same example, but we use lxml to put it in LXML’s HTML mode. Now the element will have the name gte:name because there is no associated namespace. Using CSS escapes, we can target the name.

from bs4 import BeautifulSoup

XML1 = """
<?xml version="1.0" encoding="UTF-8"?>
<gpx version="1.1">
<extensions>
<gte:name>Some name</gte:name>
</extensions>
</gpx>
"""

soup = BeautifulSoup(XML1, "lxml")
name = soup.select_one(r'gte\:name')
print(f"prefix: {getattr(name, 'prefix', '')}, name: {name.name}")
print(name)

# Results
# prefix: None, name: gte:name
# <gte:name>Some name</gte:name>

Heck Lennon

May 27, 2024, 9:38:02 AM
to beautifulsoup
Makes sense.

It works, although using "lxml" triggers a warning:

============
soup = BeautifulSoup(open(item, 'r'), "lxml")

segments = soup.find_all("trkseg")
for segment in segments:
    #Python312\Lib\site-packages\bs4\builder\__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
    name = segment.select_one(r'gte\:name')
    if name:
        print(f"prefix: {getattr(name, 'prefix', '')}, name: {name.name}")
        print(name,name.text)
    else:
        print("Name not found")
============

Isaac Muse

May 27, 2024, 9:46:06 AM
to beautifulsoup
Yes, and the warning is valid: it looks like you are parsing an ordinary XML document in HTML mode. But it is equally true that in XML mode the document uses an undeclared namespace prefix, which leads to the other issues.
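Staying in XML mode is also possible once the dropped prefix is taken into account. A minimal sketch, assuming lxml is installed and that, as described earlier in the thread, the XML parser discards the undeclared gte prefix so the element is simply named name. The child combinator keeps the track's own name element from matching.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the GPX sample from this thread; gte is still undeclared.
GPX = """<gpx version="1.1">
<trk>
<name>Blah</name>
<trkseg>
<trkpt lat="45.776910" lon="3.083850"></trkpt>
<extensions>
<gte:name>Some name</gte:name>
</extensions>
</trkseg>
</trk>
</gpx>"""

soup = BeautifulSoup(GPX, "xml")

results = []
for segment in soup.find_all("trkseg"):
    # The prefix is undeclared, so the selector uses the bare local name.
    extension = segment.select_one("extensions > name")
    if extension is not None:
        print("Found:", extension.text)
        results.append(extension.text)
    else:
        print("Name not found")
```

This avoids the XMLParsedAsHTMLWarning, at the cost of no longer being able to tell gte:name apart from an unprefixed name inside extensions.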

Heck Lennon

May 27, 2024, 9:51:44 AM
to beautifulsoup
All good. Thanks everyone.

leonardr

May 27, 2024, 10:37:14 AM
to beautifulsoup
> Looking over the code, it seems we use `f"'{pseudo}' pseudo-class is not implemented at this time"` for invalid pseudo-class names in general. It is impossible for us to know if a pseudo-class is a real one or not because this can change. We could keep a list of currently known pseudo classes that we have not implemented (that would be a fairly small list), but that can become invalid at any time.
>
> I think a more practical solution is to simply change the message to something like `f"'{pseudo}' is not a valid pseudo-class"`. This implies a couple of things. One is that the syntax is recognized as an attempt to specify a pseudo-class, and the second is that Soup Sieve does not recognize the pseudo-class. This is better because it makes no claim about whether the pseudo-class is a real but unsupported pseudo-class or simply a bad pseudo-class name. This keeps it generic without implying anything about the specified pseudo-class name.


`f"'{pseudo}' is not a valid pseudo-class"` is definitely a better error message. If this were Beautiful Soup code I'd spell out the options to give the user an extra nudge, but it's your call.

Leonard