Nokogiri::HTML::DocumentFragment.parse(...) removes namespaces


Justin

Apr 14, 2011, 1:24:39 AM
to nokogiri-talk
I'm using Nokogiri to parse an HTML snippet so that I can remove
empty attributes.

clean_me = "<div id=\"something\" class=\"\" style=\"\"><fb:like show_faces=\"true\" width=10\"></fb:like></div>"
parsed = Nokogiri::HTML::DocumentFragment.parse(clean_me)
self.clean_empty_child_attributes(parsed)
parsed.to_html

The problem I am having is that the 'fb:' is removed and the width=10
gets foobared so the result looks something like this:
"<div id=\"something\"><like show_faces=\"true\"></like>10\"&gt;</div>"

There was a github issue posted with a similar problem but I couldn't
find it in the discussion. Here's a link:
https://github.com/tenderlove/nokogiri/issues/436

Can anyone point out what I'm doing wrong?

Thanks,
Justin

Mike Dalessio

Apr 14, 2011, 11:00:27 AM
to nokogi...@googlegroups.com
On Thu, Apr 14, 2011 at 1:24 AM, Justin <justi...@gmail.com> wrote:
> I'm using Nokogiri to parse an HTML snippet so that I can remove
> empty attributes.
>
> clean_me = "<div id=\"something\" class=\"\" style=\"\"><fb:like show_faces=\"true\" width=10\"></fb:like></div>"
> parsed = Nokogiri::HTML::DocumentFragment.parse(clean_me)
> self.clean_empty_child_attributes(parsed)
> parsed.to_html
>
> The problem I am having is that the 'fb:' is removed and the width=10
> gets foobared so the result looks something like this:
> "<div id=\"something\"><like show_faces=\"true\"></like>10\"&gt;</div>"

Look at `parsed.errors` in the above case:

    clean_me = "<div id=\"something\" class=\"\" style=\"\"><fb:like show_faces=\"true\" width=10\"></fb:like></div>"
    parsed = Nokogiri::HTML::DocumentFragment.parse(clean_me)
    puts parsed.to_html
    # => <div id="something" class="" style=""><like show_faces="true" width='10"'></like></div>
    
    puts parsed.errors
    # => Namespace prefix fb is not defined
    #    Tag fb:like invalid
    
Nokogiri has a good point: the namespace prefix `fb` is not declared anywhere, so it is discarded. Also, namespaced tags aren't valid HTML.

Let's start over and try something different. Let's parse with XML and then parse your fragment within the context of the document:

    clean_me = "<div id=\"something\" class=\"\" style=\"\"><fb:like show_faces=\"true\" width=10\"></fb:like></div>"

    doc = Nokogiri::XML %Q(<html xmlns:fb='http://flavorjon.es/'><body></body></html>)
    body = doc.at_css("body")
    
    body.parse clean_me
    puts doc.errors
    # => AttValue: " or ' expected
    #    attributes construct error
    #    Couldn't find end of Start Tag like line 1
    #    Opening and ending tag mismatch: div line 1 and fb:like
    #    chunk is not well balanced

Whoops! Looks like your example has broken markup, which the in-context parser doesn't autocorrect, meaning that we'll fall back to non-contextual parsing. Let's fix the markup problem (the missing opening quote on the width value) and try again:

    clean_me = "<div id=\"something\" class=\"\" style=\"\"><fb:like show_faces=\"true\" width=\"10\"></fb:like></div>"

    doc = Nokogiri::XML %Q(<html xmlns:fb='http://flavorjon.es/'><body></body></html>)
    body = doc.at_css("body")
    
    frag = body.parse clean_me
    puts doc.errors
    # => 

    puts frag.to_xml
    # => <div id="something" class="" style="">
    #      <fb:like show_faces="true" width="10"/>
    #    </div>

Boom.

---
mike dalessio / @flavorjones


Justin

Apr 14, 2011, 9:30:03 PM
to nokogiri-talk
Thanks Mike. Your solution works perfectly. I was originally hoping
that nokogiri would clean up the bad html - width=10" - but it makes
sense that it returns an error.

Thanks,
Justin


Mike Dalessio

Apr 15, 2011, 8:35:59 AM
to nokogi...@googlegroups.com


On Apr 14, 2011 9:30 PM, "Justin" <justi...@gmail.com> wrote:
>
> Thanks Mike. Your solution works perfectly. I was originally hoping
> that nokogiri would clean up the bad html - width=10" - but it makes
> sense that it returns an error.

This is a shortcoming in libxml2: the in-context fragment parser does not correct markup. Ordinarily libxml2 handles this sort of thing just fine.
