Getting div elements with custom attributes using nokogiri


Mansi Barodia

Feb 28, 2017, 9:05:08 AM
to nokogiri-talk
I have an HTML document with the following body:

<body>
   
<div class="myclass" dd:meta1="meta data 1" dd:meta2="CD5503253E54"></div>
   
<div class="myclass" dd:meta1="meta data 11"></div>
</body>

Using Nokogiri, I want to get all the div elements that have a dd:meta2 attribute, so for the body above I should get just one div. I wrote some logic to get the div elements, but I am getting an error that looks related to the ':' in the attribute name.

My logic:

page = Nokogiri::HTML(html_string)
meta_data_divs = page.css('body').css('div[dd:meta2]')

Error:

unexpected ':' after '#<Nokogiri::CSS::Node:0x007fac6b986d58>'

How to handle the ':' in the attribute?

Mike Dalessio

Feb 28, 2017, 9:46:20 PM
to nokogiri-talk
Hi Mansi,

This is a GREAT question. Thank you for asking!

The short answer is that you may not be able to do this with either a CSS or an XPath query. It has to do with the underlying parser's handling of the ":" character in HTML attributes, because the ":" character is used to indicate namespacing in XML.

Let's start by explaining the error you're seeing ... which requires an explanation of how Nokogiri implements CSS parsing.

Nokogiri parses CSS queries and converts them into the equivalent XPath queries. It does this because the underlying parsers (libxml2/Xerces) implement XPath but don't support CSS. The code, if you care to check it out, is here: https://github.com/sparklemotion/nokogiri/tree/master/lib/nokogiri/css

The error message you're seeing is the CSS parser complaining that it doesn't understand the ":". Usually, the ":" character is used to indicate pseudo-selectors in CSS, but we could (probably?) do some work and get the CSS parser to handle it differently in an attribute. In that case, the generated XPath would be something like:

css 'div[dd:meta2]' → xpath './/div[@dd:meta2]'

and this is where it gets really hairy. This isn't a valid XPath query; or at least, libxml2 throws up on it. Here's what happens if you execute this search:

page.css('body').xpath('.//div[@dd:meta2]')
... ERROR: Undefined namespace prefix: .//div[@dd:meta2] (Nokogiri::XML::XPath::SyntaxError)

That is, libxml2 thinks we're searching for an attribute with a namespace of `dd`. And there doesn't appear to be any way to escape it. All of these variations also fail, for a variety of reasons:

.//div[@dd\:meta2]
.//div[@dd\\:meta2]
.//div[attribute::dd:meta2]
.//div[attribute::dd\:meta2]
.//div[attribute::dd\\:meta2]

So, the unfortunate advice I have for you right now is: either change your attribute names to not contain a ":" character, or else do this in Ruby space, like this:

page.css('div').select { |node| node.attributes["dd:meta2"] }

You could get fancy and write a custom XPath function to do this within an XPath query, but in this case performance will likely be the same (or worse) and the code would be far more complicated.

Hope this helps,
-m



