Output encoding behavior with libxml2 2.6 vs 2.7


JRF

Apr 7, 2010, 10:55:06 AM
to nokogiri-talk
With nokogiri-1.4.1, I'm seeing odd behavior on a (CentOS) system with
libxml2 2.6.26: an input document with UTF-8 encoding and data beyond
ASCII has all of that data replaced by numeric character entity
references in the output. With libxml2 2.7 the UTF-8 input makes it to
the output without conversion.

require 'nokogiri'

simple = <<EOS
<?xml version="1.0" encoding="UTF-8"?>
<simple>
<data>Café</data>
</simple>
EOS

doc = Nokogiri::XML::Document.parse(simple, nil, 'UTF-8')
puts doc.serialize(:encoding => 'UTF-8')

The output is:

<?xml version="1.0" encoding="UTF-8"?>
<simple>
<data>Caf&#xE9;</data>
</simple>
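For what it's worth, the libxml2 2.6 output is ugly but lossless: the
entity &#xE9; and the literal "é" denote the same Unicode codepoint
(U+00E9), so a conforming parser reads both documents identically. A
small pure-Ruby check of that equivalence (not Nokogiri-specific; the
sample string is just for illustration):

```ruby
# "é" is codepoint U+00E9; the serializer merely escaped it as a
# numeric character reference instead of emitting the UTF-8 bytes.
char = "Café"[-1]                   # the literal character "é"
codepoint = char.ord                # 0xE9
ncr = format("&#x%X;", codepoint)   # the escaped form libxml2 2.6 emits
puts ncr                            # => &#xE9;
```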

I can solve this by linking against a newer libxml2 version, but I had
trouble verifying that because other gems have their own libxml2
dependencies, and they all have to be re-linked against the same
version. If a workaround for this is possible in Nokogiri, that would
be vastly preferable to coercing an entire gem universe and its
C-library dependencies onto an alternative libxml2 version.

Thanks,

-john

Aaron Patterson

Apr 7, 2010, 11:33:46 AM
to nokogi...@googlegroups.com

Yes, this is a bug in libxml2. Unfortunately we don't have time to
code around every bug in every version of libxml2, so your best bet is
to upgrade. I hope that doesn't sound too harsh an answer, but we
just don't have time. :-(

--
Aaron Patterson
http://tenderlovemaking.com/

JRF

Apr 7, 2010, 1:43:02 PM
to nokogiri-talk

On Apr 7, 8:33 am, Aaron Patterson <aaron.patter...@gmail.com> wrote:

> Yes, this is a bug in libxml2.  Unfortunately we don't have time to
> code around every bug in every version of libxml2, so your best bet is
> to upgrade.  I hope that doesn't sound too harsh an answer, but we
> just don't have time.  :-(

No, that is a perfectly reasonable answer, even if it would make *my*
life easier to have a workaround. Refusing to work around bugs is an
important strategy I enthusiastically support in the pursuit of
software quality.

-john

Marian Steinbach

Apr 12, 2010, 5:43:07 AM
to nokogiri-talk
Hi! I am seeing odd differences in encoding interpretation between my
Linux server and my Mac OS X 10.6.3 dev box. This thread made me aware
that libxml2 might be the reason.

On the server I am running Debian lenny with libxml2 version 2.6.32.
The Nokogiri scripts I wrote and tested there work fine for me. It
seems as if HTML input in ISO-8859-1 encoding is automatically
converted to UTF-8 by Nokogiri, right? I process both UTF-8 and
ISO-8859-1 input, and my output is always UTF-8 and looks as expected.

Now I have pulled that code to my dev machine (as stated, Mac OS X
Snow Leopard 10.6.3); I currently don't know which libxml2 version I
am running there. I run into all sorts of character-set issues: the
scripts break with fatal errors. My first impression is that there is
a problem with an input file in ISO-8859-1.

Does Nokogiri require me to tell it which character set an input file
uses? Might this trouble be due to different libxml2 versions?
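One libxml2-version-independent option, when you already know the
source encoding, is to transcode the raw bytes to UTF-8 yourself
before handing them to the parser, so nothing has to guess. A minimal
sketch using only plain Ruby strings (the "Caf\xE9" sample bytes are
an assumption for illustration, not from anyone's actual input):

```ruby
# Normalize known ISO-8859-1 bytes to UTF-8 up front. 0xE9 is "é"
# in Latin-1; after transcoding it becomes the two-byte UTF-8 form.
latin1 = "Caf\xE9".force_encoding("ISO-8859-1")
utf8   = latin1.encode("UTF-8")
puts utf8           # => Café
puts utf8.bytesize  # => 5 ("é" is two bytes in UTF-8)
```

The transcoded string can then be parsed as UTF-8 regardless of which
libxml2 the machine links against.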

Originally I wanted to continue writing and testing locally, but if
that's not possible, I could just continue to run the scripts on the
server.

Thanks!

Marian
