Preserving HTML entities


Alex Dunae

Nov 4, 2010, 4:12:07 PM
to nokogiri-talk
I've spent the morning trying to find a way for Nokogiri to preserve
HTML entities when parsing HTML.

&nbsp; and friends are all converted to actual characters; I need them
to be preserved. I've tried various combinations of the
XML::ParseOptions without success. Is there any way to preserve the
entities?

Thanks in advance for any help.

Aaron Patterson

Nov 4, 2010, 4:32:11 PM
to nokogi...@googlegroups.com

When you ask for "text", it is converted to the actual characters. If
you ask for inner_html, it will return the escaped values. Here is an
example:

require 'nokogiri'

f = Nokogiri.HTML('<body>hello &nbsp; world</body>')
node = f.css('body')

# text converts entities to the actual characters
p node.text

# inner_html returns the escaped values
p node.inner_html

Hope that helps!

--
Aaron Patterson
http://tenderlovemaking.com/

Alex Dunae

Nov 10, 2010, 12:50:28 PM
to nokogiri-talk
Thanks for your reply, Aaron. Running that code through irb,
both .text and .inner_html return the same thing:

=> "hello   world"


I've tried this on two machines:
$ nokogiri -v
---
warnings: []

nokogiri: 1.4.3.1
ruby:
  version: 1.9.1
  platform: i386-darwin10
  engine: ruby
libxml:
  binding: extension
  compiled: 2.7.6
  loaded: 2.7.6



$ nokogiri -v
---
warnings: []

ruby:
  engine: mri
  version: 1.8.7
  platform: x86_64-linux
libxml:
  loaded: 2.6.32
  binding: extension
  compiled: 2.6.32
nokogiri: 1.4.3.1


Any insight would be greatly appreciated.




Aaron Patterson

Nov 10, 2010, 1:00:38 PM
to nokogi...@googlegroups.com

I'm really not sure what to tell you. I can't reproduce the problem.
Here is a screencast of what I did:

http://www.youtube.com/watch?v=YEY_xixuOes

Am I doing something different than you? Can you try the code that I
did in the video?

Alex Dunae

Nov 10, 2010, 1:20:47 PM
to nokogiri-talk
I think we're doing the same thing: http://www.youtube.com/watch?v=ZJvLh1pcmbo




Aaron Patterson

Nov 10, 2010, 1:25:36 PM
to nokogi...@googlegroups.com
On Wed, Nov 10, 2010 at 10:20 AM, Alex Dunae <al...@dunae.ca> wrote:
> I think we're doing the same thing: http://www.youtube.com/watch?v=ZJvLh1pcmbo

AHA! It's the ruby version. Seems different with 1.9, though I'm not sure why.

I'm investigating, but this seems like a bug.

Alex Dunae

Nov 10, 2010, 1:39:21 PM
to nokogiri-talk
I experimented a bit more and have finally found a combination that
works.

The setup that did work was MRI 1.8.7 on Mac (run under RVM):
$ nokogiri -v
---
warnings: []

ruby:
  engine: mri
  version: 1.8.7
  platform: i686-darwin10.4.0
libxml:
  loaded: 2.7.6
  binding: extension
  compiled: 2.7.6
nokogiri: 1.4.3.1


Setups that did not work:

$ nokogiri -v
---
warnings: []

nokogiri: 1.4.3.1
ruby:
  version: 1.9.2
  platform: x86_64-darwin10.4.0

Aaron Patterson

Nov 11, 2010, 2:10:14 PM
to nokogi...@googlegroups.com
On Wed, Nov 10, 2010 at 10:39 AM, Alex Dunae <al...@dunae.ca> wrote:
> I experimented a bit more and have finally found a combination that
> works.

Okay, I can explain the behavior now. Basically, the problem boils
down to encoding.

In Ruby 1.9, we examine the encoding of the string you're feeding to
Nokogiri. If the input string is "utf-8", the document is assumed to
be a UTF-8 document. When you output the document, since "&nbsp;" can
be represented as a UTF-8 character, it is output as that UTF-8
character.

In 1.8, since we cannot detect the encoding of the document, we assume
binary encoding and allow libxml2 to detect the encoding.

If you set the encoding of the input document to binary, it will give
you back the entities you want. Here is some code to demo:

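require 'nokogiri'

html = '<body>hello &nbsp; world</body>'

# When the input string is UTF-8, the document is assumed to be UTF-8,
# so &nbsp; is output as the actual character.
f = Nokogiri.HTML(html)
node = f.css('body')
p node.inner_html

# When the input string is binary, libxml2 detects the encoding itself
# and the entity is preserved in the output.
f = Nokogiri.HTML(html.encode('ASCII-8BIT'))
node = f.css('body')
p node.inner_html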

I posted a youtube video too! :-)

http://www.youtube.com/watch?v=X2SzhXAt7V4
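
For anyone who wants to see the encoding tag itself, here is a minimal sketch
using only core String methods on Ruby 1.9 or later. It does not call Nokogiri
at all, so it shows only what the string is tagged as, not what any particular
Nokogiri or libxml2 version does with it. The variable names are illustrative,
and String#b assumes Ruby 2.0 or later.

```ruby
# Sketch only: inspect and change the encoding tag Ruby attaches to a string.
html = '<body>hello &nbsp; world</body>'.encode('UTF-8')

# Transcoding to binary works here because the markup is pure ASCII.
binary = html.encode('ASCII-8BIT')

puts html.encoding == Encoding::UTF_8
puts binary.encoding == Encoding::ASCII_8BIT
puts html.bytes == binary.bytes

# A string containing non-ASCII characters cannot be transcoded to binary.
accented = "caf\u00e9 &nbsp;"
begin
  accented.encode('ASCII-8BIT')
  transcoded = true
rescue Encoding::UndefinedConversionError
  transcoded = false
end
puts transcoded

# String#b returns a copy of the same bytes tagged as binary instead.
retagged = accented.b
puts retagged.encoding == Encoding::ASCII_8BIT
```

If the markup may contain non-ASCII characters, re-tagging with String#b (or
force_encoding on a copy) avoids the conversion error that encode raises.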

Alex Dunae

Nov 15, 2010, 1:52:31 PM
to nokogiri-talk
Thanks for digging into this, Aaron. Your "customer service" is truly
fantastic. Much appreciated.
