When you ask for text, entities are converted to the actual characters
they represent. If you ask for inner_html, you get the markup back with
the entities still escaped. Here is an example:
require 'nokogiri'

f = Nokogiri.HTML('<body>hello&nbsp;world</body>')
node = f.css('body')
p node.text       # the entity is converted to the character it represents
p node.inner_html # the markup, with the entity still escaped
Hope that helps!
--
Aaron Patterson
http://tenderlovemaking.com/
I'm really not sure what to tell you. I can't reproduce the problem.
Here is a screencast of what I did:
http://www.youtube.com/watch?v=YEY_xixuOes
Am I doing something differently than you are? Can you try the code
that I ran in the video?
AHA! It's the Ruby version. The behavior differs under 1.9, though I'm not sure why yet.
I'm investigating, but this looks like a bug.
Okay, I can explain the behavior now. Basically, the problem boils
down to encoding.
In Ruby 1.9, we examine the encoding of the string you feed to
Nokogiri. If the input string is tagged UTF-8, the document is assumed
to be a UTF-8 document. When you output the document, since "&nbsp;"
can be represented as a UTF-8 character (the non-breaking space,
U+00A0), it is output as that character rather than as the entity.
In 1.8, strings carry no encoding information, so we assume binary
encoding and let libxml2 detect the document's encoding itself.
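The string-encoding distinction this relies on can be sketched in plain Ruby, with no Nokogiri in the mix (variable names here are my own); the key point is that force_encoding only relabels the bytes, it does not transcode them:

```ruby
# Plain-Ruby sketch of the encoding distinction described above.
html = '<body>hello&nbsp;world</body>'
p html.encoding.name         # "UTF-8" in a UTF-8 source file

# force_encoding relabels the bytes as binary without transcoding them
binary = html.dup.force_encoding('ASCII-8BIT')
p binary.encoding.name       # "ASCII-8BIT"
p binary.bytes == html.bytes # true: same bytes, different label
```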
If you set the encoding of the input string to binary, you will get
the entities back. Here is some code to demo:
require 'nokogiri'

html = '<body>hello&nbsp;world</body>'

f = Nokogiri.HTML(html)
node = f.css('body')
p node.inner_html # under 1.9, the entity comes back as the UTF-8 character

# Relabel the bytes as binary so libxml2 detects the encoding itself
f = Nokogiri.HTML(html.force_encoding('ASCII-8BIT'))
node = f.css('body')
p node.inner_html # the entity is preserved
I posted a youtube video too! :-)
http://www.youtube.com/watch?v=X2SzhXAt7V4