Nokogiri "replace" strips node.content of HTML

Jez Gomez

Mar 15, 2011, 8:40:26 AM

to nokogiri-talk

I would like to remove a tag from some HTML without stripping the
remaining content of any markup. For example, I have a file,
test.html:

Some text, goes to uppercase
 other text
italics‘more text with UTF-8
’


I would like to get the following output:

SOME TEXT, GOES TO UPPERCASE
other text
italics ‘more text with UTF-8 ’

My code is:

require 'nokogiri'

f = File.open('raw/test.html', "r")
doc = Nokogiri::XML::DocumentFragment.parse(f.read.encode('UTF-8'))
f.close

doc.css("span.T2").each do |span|
  span.replace span.content.upcase
end
doc.css("span.T5").each do |span|
  span.replace ""+span.content+""
end
doc.css("span").each do |span|
  span.replace span.content
end
doc.css("p").each do |p|
  p.replace Nokogiri::XML::Text.new(p.inner_html, p.document)
end

f = File.open('processed/test.html', "w")
f.write(doc)
f.close

However, the output I get is:

SOME TEXT, GOES TO UPPERCASE

other text
italics &#x2018;more text with UTF-8
&#x2019;
&#x2018;our common mother&#x2019;


In summary, I would like to preserve the encoding (i.e., UTF-8 characters
rather than HTML entities) and keep the new markup (the tags added around
the T5 spans).

Many thanks in advance.

Mike Dalessio

Mar 16, 2011, 8:57:59 AM

to nokogi...@googlegroups.com, Jez Gomez

Greetings.

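You are passing the string from p.inner_html to Text.new. Therefore you create a text node, and a text node properly escapes the text you sent it: any markup in that string is serialized as entities rather than as tags.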

If you want to interpret p.inner_html as html (not text), then just do this:

doc.css("p").each do |p|
  p.replace p.inner_html
end

HTH.

---

mike dalessio / @flavorjones

Jez Gomez

Mar 16, 2011, 11:12:49 AM

to nokogiri-talk

Thanks Mike,

I eventually solved the problem as follows
(installed htmlentities gem - see http://htmlentities.rubyforge.org/):

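require 'nokogiri'
require 'htmlentities'

coder = HTMLEntities.new

f = File.open('raw/test.html', "r")
doc = Nokogiri::XML::DocumentFragment.parse(f.read.encode('UTF-8'))
f.close

# Replace each <p> and <span> with its inner markup (parsed as markup, not text).
doc.css("p").each do |p|
  p.replace p.inner_html
end
doc.css("span").each do |span|
  span.replace span.inner_html
end

# Decode the entities back into UTF-8 characters before writing.
f = File.open('processed/test.html', "w")
f.write(coder.decode(doc.to_s))
f.close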
