Remove node but leave inner text/html?

Scott Newman

Dec 7, 2010, 5:39:48 PM

to nokogiri-talk

In a blob of XML or HTML, I'd like to remove some (not all) tags but
leave the tag's contents in place. For example, in the line below, I
want to remove the <strong> tags but leave their contents in
place, and leave the other tags alone:

Original:
<body>This is great but has <strong>a bunch</strong> of bold tags
we don't want to see.</body>

Desired:
<body>This is great but has a bunch of bold tags we don't
want to see.</body>

Below is a sample of what I've been playing with. I need to replace
"<strong>a bunch</strong>" with "a bunch".

require 'rubygems'
require 'nokogiri'

raw_xml = <<-eos
<body>This is great but has <strong>a bunch</strong> of bold tags
we
don't want to see.</body>
eos

doc = Nokogiri::XML(raw_xml)

doc.xpath(".//strong").each do |node|
  guts = node.inner_html # This will get "a bunch"
  node.remove # Now it's all gone
  # How do I get guts back into my doc?
end

print doc.to_html

Mike Dalessio

Dec 7, 2010, 5:44:59 PM

to nokogi...@googlegroups.com

Howdy!

Here's the most efficient way to do it:

doc.xpath(".//strong").each do |node|
  replacement_killer = Nokogiri::XML::Text.new(node.to_s, node.document)
  node.add_next_sibling replacement_killer
  node.remove
end

-m

---

mike dalessio / @flavorjones

Mike Dalessio

Dec 7, 2010, 6:22:40 PM

to nokogi...@googlegroups.com

Nope, I take that back, I forgot we implemented Node#replace:

doc.xpath(".//strong").each do |node|
  node.replace Nokogiri::XML::Text.new(node.to_s, node.document)
end

Ahh, much nicer.

Scott Newman

Dec 7, 2010, 6:51:24 PM

to nokogi...@googlegroups.com

Thanks, Mike. I think I'm close, but I'm having a bit of trouble. I've actually extracted the body I'm trying to work with out of a nasty XML document, and after I've obtained it, I want to parse it. I don't have a complete doc, it's just a bunch of tags in a string. (see the 'body' variable in my snippet below)

I took a shot at parsing it, but I'm having two problems:

1) the <inlineTag> nodes aren't being removed

2) I can't figure out how to get my string back when done without it trying to put HTML or XML declaration tags around it.

For #1, I tried parsing it with Nokogiri::XML, Nokogiri::HTML, and Nokogiri::Slop. I have a feeling that I don't want to replace the nodes with a Nokogiri::XML::Text object?

For #2, I've tried returning doc with doc.to_s, doc.to_html, and doc.to_xml. All seem to wrap it in something.

Thank you very much!

----------------------------------

require 'rubygems'
require 'nokogiri'

# This actually came from something extracted from a document so it's not a complete doc
body = <<-eos
<inlineTag name="subhead">January 1:</inlineTag> <inlineTag name="body">Event 1.</inlineTag>
Title 1. This is the first paragraphwith entites that we have.
We also have <a href="#">links</a>
<inlineTag name="subhead">January 2:</inlineTag> <inlineTag name="body">Event 2.</inlineTag>
Title 2. This is the second paragraphwith entites that we have.
We also have <a href="#">more links</a>
eos

# What I'm trying to get:
#
# January 1: Event 1. Title 1. This is the first paragraphwith entites
# that we have.We also have <a href="#">links</a>
#
# January 2: Event 2. Title 2. This is the second paragraphwith entites
# that we have.We also have <a href="#">more links</a>

doc = Nokogiri::XML(body)

doc.search(".//inlineTag").each do |node|
  node.replace Nokogiri::XML::Text.new(node.to_s, node.document)
end

# I don't want <?xml> or <html> tags, I just want my tags
sanitized_body = doc.to_s

Mike Dalessio

Dec 8, 2010, 11:42:03 PM

to nokogi...@googlegroups.com

Ah, this code will work for you:

doc = Nokogiri::XML::DocumentFragment.parse(body)

doc.search(".//inlineTag").each do |node|
  node.replace Nokogiri::XML::Text.new(node.inner_html, node.document)
end

The first issue was my fault -- you should use node.inner_html, not node.to_s.

The second issue is solved by treating it as a DocumentFragment, which can have multiple root nodes, and won't generate doctype declarations.
