Nokogiri "replace" strips node.content of HTML

Jez Gomez

Mar 15, 2011, 8:40:26 AM
to nokogiri-talk
I would like to remove a tag from some HTML without stripping the
remaining content of any markup. For example, I have a file,
test.html:

<p class="P1"><span class="T2">Some text, goes to uppercase</span>
<p class="P4"><span class="T4"> </span><span class="T3">other text</span>
<span class="T5">italics</span><span class="T3">‘more text with UTF-8 ’</span>
</p></p>

I would like to get the following output:

SOME TEXT, GOES TO UPPERCASE
other text
<em>italics</em> ‘more text with UTF-8 ’

My code is:

require 'nokogiri'

f = File.open('raw/test.html', "r")
doc = Nokogiri::XML::DocumentFragment.parse(f.read.encode('UTF-8'))
f.close

doc.css("span.T2").each do |span|
  span.replace span.content.upcase
end
doc.css("span.T5").each do |span|
  span.replace "<em>" + span.content + "</em>"
end
doc.css("span").each do |span|
  span.replace span.content
end
doc.css("p").each do |p|
  p.replace Nokogiri::XML::Text.new(p.inner_html, p.document)
end

f = File.open('processed/test.html', "w")
f.write(doc)
f.close

However, the output I get is:

SOME TEXT, GOES TO UPPERCASE
&lt;p class="P4"&gt;
other text
&lt;em&gt;italics &lt;/em&gt;&amp;#x2018;more text with UTF-8
&amp;#x2019;
&amp;#x2018;our common mother&amp;#x2019;
&lt;/p&gt;

In summary, I would like to preserve the encoding (i.e., literal UTF-8
characters rather than HTML entities) and keep the new markup (<em> tags).

Many thanks in advance.

Mike Dalessio

Mar 16, 2011, 8:57:59 AM
to nokogi...@googlegroups.com, Jez Gomez
Greetings.

You are passing "<em>...</em>" to Text.new. Therefore you create a text node that properly escapes the text you sent it.

If you want to interpret p.inner_html as html (not text), then just do this:

    doc.css("p").each do |p|
      p.replace p.inner_html
    end

HTH.

---
mike dalessio / @flavorjones


Jez Gomez

Mar 16, 2011, 11:12:49 AM
to nokogiri-talk
Thanks Mike,

I eventually solved the problem as follows, after installing the
htmlentities gem (see http://htmlentities.rubyforge.org/):

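require 'nokogiri'
require 'htmlentities'

coder = HTMLEntities.new

f = File.open('raw/test.html', "r")
doc = Nokogiri::XML::DocumentFragment.parse(f.read.encode('UTF-8'))
f.close

# Unwrap the paragraphs first, keeping their inner markup.
doc.css("p").each do |p|
  p.replace p.inner_html
end
doc.css("span.T2").each do |span|
  span.replace span.content.upcase
end
doc.css("span.T5").each do |span|
  span.replace "<em>" + span.content + "</em>"
end
# Unwrap the remaining spans, again keeping their inner markup.
doc.css("span").each do |span|
  span.replace span.inner_html
end

f = File.open('processed/test.html', "w")
# Decode the serialized fragment so entities become literal characters.
f.write(coder.decode(doc.to_s))
f.close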