Remove node but leave inner text/html?

Scott Newman

Dec 7, 2010, 5:39:48 PM
to nokogiri-talk
In a blob of XML or HTML, I'd like to remove some (not all) tags but
leave the tag's contents in place. For example, in the line below, I
want to remove the <strong> tags but leave the contents of the tag in
place, and leave the <em> tags alone:

Original:
<body>This is great but has <strong>a bunch</strong> of bold tags
<em>we</em> don't want to see.</body>

Desired:
<body>This is great but has a bunch of bold tags <em>we</em> don't
want to see.</body>

Below is a sample of what I've been playing with. I need to replace
"<strong>a bunch</strong>" with "a bunch".

require 'rubygems'
require 'nokogiri'

raw_xml = <<-eos
<body>This is great but has <strong>a bunch</strong> of bold tags
<em>we</em>
don't want to see.</body>
eos

doc = Nokogiri::XML(raw_xml)

doc.xpath(".//strong").each do |node|
  guts = node.inner_html # This will get "a bunch"
  node.remove # Now it's all gone
  # How do I get guts back into my doc?
end

print doc.to_html

Mike Dalessio

Dec 7, 2010, 5:44:59 PM
to nokogi...@googlegroups.com
Howdy!

Here's the most efficient way to do it:

    doc.xpath(".//strong").each do |node|
        replacement_killer = Nokogiri::XML::Text.new(node.to_s, node.document)
        node.add_next_sibling replacement_killer
        node.remove
    end

-m

---
mike dalessio / @flavorjones


Mike Dalessio

Dec 7, 2010, 6:22:40 PM
to nokogi...@googlegroups.com
Nope, I take that back, I forgot we implemented Node#replace:

     doc.xpath(".//strong").each do |node|
       node.replace Nokogiri::XML::Text.new(node.to_s, node.document)
     end

Ahh, much nicer.

Scott Newman

Dec 7, 2010, 6:51:24 PM
to nokogi...@googlegroups.com

Thanks, Mike. I think I'm close, but I'm having a bit of trouble. I've actually extracted the body I'm trying to work with out of a nasty XML document, and after I've obtained it, I want to parse it. I don't have a complete doc; it's just a bunch of <p> tags in a string (see the 'body' variable in my snippet below).

I took a shot at parsing it, but I'm having two problems: 

1) the <inlineTag> nodes aren't being removed
2) I can't figure out how to get my string back when done without it trying to put HTML or XML declaration tags around it. 

For #1, I tried parsing it with Nokogiri::XML, Nokogiri::HTML, and Nokogiri::Slop. I have a feeling that I don't want to replace the nodes with a Nokogiri::XML::Text object?

For #2, I've tried returning doc with doc.to_s, doc.to_html, and doc.to_xml. All seem to wrap it in something.

Thank you very much!


----------------------------------

require 'rubygems'
require 'nokogiri'

# This actually came from something extracted from a document so it's not a complete doc
body = <<-eos

<p><inlineTag name="subhead">January 1:</inlineTag> <inlineTag name="body">Event 1.</inlineTag>
<strong>Title 1. </strong> This is the first paragraph&#xAD;with entites&#xAD; that we have.
We also have <a href="#">links</a></p>
<p><inlineTag name="subhead">January 2:</inlineTag> <inlineTag name="body">Event 2.</inlineTag>
<strong>Title 2. </strong>This is the second paragraph&#xAD;with entites&#xAD; that we have.
We also have <a href="#">more links</a></p>

eos

# What I'm trying to get:
#
# <p>January 1: Event 1. <strong>Title 1.</strong> This is the first paragraph&#xAD;with entites&#xAD; 
# that we have.We also have <a href="#">links</a></p>
# <p>January 2: Event 2. <strong>Title 2.</strong> This is the second paragraph&#xAD;with entites&#xAD; 
# that we have.We also have <a href="#">more links</a></p>


doc = Nokogiri::XML(body)
doc.search(".//inlineTag").each do |node|
    node.replace Nokogiri::XML::Text.new(node.to_s, node.document)
end

# I don't want <?xml> or <html> tags, I just want my <p> tags
sanitized_body = doc.to_s

Mike Dalessio

Dec 8, 2010, 11:42:03 PM
to nokogi...@googlegroups.com
Ah, this code will work for you:

doc = Nokogiri::XML::DocumentFragment.parse(body)
doc.search(".//inlineTag").each do |node|
  node.replace Nokogiri::XML::Text.new(node.inner_html, node.document)
end

The first issue was my fault -- you should use node.inner_html, not node.to_s.

The second issue is solved by treating it as a DocumentFragment, which can have multiple root nodes, and won't generate doctype declarations.