require 'nokogiri'

simple = <<EOS
<?xml version="1.0" encoding="UTF-8"?>
<simple>
<data>Café</data>
</simple>
EOS
doc = Nokogiri::XML::Document.parse(simple, nil, 'UTF-8')
puts doc.serialize(:encoding => 'UTF-8')
The output is:
<?xml version="1.0" encoding="UTF-8"?>
<simple>
<data>Café</data>
</simple>
I can solve this by linking to a newer libxml2 version, but I had
trouble verifying this because other gems carry their own libxml2
dependencies, and all of them have to be re-linked against the same
version. If a workaround for this is possible in Nokogiri, that would
be vastly preferable to coercing an entire gem universe and its
C-library dependencies to use an alternative libxml2 version.
Thanks,
-john
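
When several gems are linked against different copies of libxml2, it can help to print which version Nokogiri was compiled against and which one it actually loaded. The following is only a rough sketch: `libxml_versions` is a hypothetical helper name, and it assumes a libxml2-backed Nokogiri build, where `Nokogiri::VERSION_INFO` carries a 'libxml' entry (that entry is absent on JRuby).

```ruby
# Hypothetical helper: report the libxml2 versions Nokogiri knows about.
# Returns nil when the nokogiri gem is not installed.
def libxml_versions
  require 'nokogiri'
  # The 'libxml' entry is missing on non-libxml2 builds such as JRuby,
  # so fall back to an empty hash.
  info = Nokogiri::VERSION_INFO['libxml'] || {}
  { compiled: info['compiled'], loaded: info['loaded'] }
rescue LoadError
  nil
end

versions = libxml_versions
if versions
  puts "compiled against: #{versions[:compiled]}"
  puts "loaded at runtime: #{versions[:loaded]}"
else
  puts 'nokogiri is not installed here'
end
```

If the two values differ, the library picked up at runtime is probably not the one the gem was built against.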
Yes, this is a bug in libxml2. Unfortunately we don't have time to
code around every bug in every version of libxml2, so your best bet is
to upgrade. I hope that doesn't sound too harsh an answer, but we
just don't have time. :-(
--
Aaron Patterson
http://tenderlovemaking.com/
On Apr 7, 8:33 am, Aaron Patterson <aaron.patter...@gmail.com> wrote:
> Yes, this is a bug in libxml2. Unfortunately we don't have time to
> code around every bug in every version of libxml2, so your best bet is
> to upgrade. I hope that doesn't sound too harsh an answer, but we
> just don't have time. :-(
No, that is a perfectly reasonable answer, even if it would make *my*
life easier to have a workaround. Refusing to work around bugs is an
important strategy I enthusiastically support in the pursuit of
software quality.
-john
On the server I am running Debian lenny and libxml2 version 2.6.32.
The Nokogiri scripts I wrote and tested there work fine for me. It
seems as if HTML input in encoding "ISO-8859-1" is automatically
converted to UTF-8 in Nokogiri, right? I process both UTF-8 and
ISO-8859-1 input, and my output is always UTF-8 and looks as
expected.
Now I have pulled that code to my dev machine (as stated, Mac OS X Snow
Leopard 10.6.3), and I don't currently know which libxml2 version I am
running there. I run into all sorts of character set issues. The
scripts break with fatal errors. My first impression is that there is
a problem with an input file in ISO-8859-1.
Does Nokogiri require me to tell which character set an input file
has? Might this trouble be due to different libxml versions?
Originally I wanted to continue writing and testing locally, but if
that's not possible, I could just continue to run the scripts on the
server.
Thanks!
Marian
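
One way to take automatic charset detection out of the picture is to transcode the input to UTF-8 yourself before parsing, when the source charset is known. This is a minimal sketch using only Ruby's standard String methods: `to_utf8` is a hypothetical helper name, and the sample bytes stand in for data read from an ISO-8859-1 file.

```ruby
# Hypothetical helper: reinterpret raw bytes as the given source charset
# and transcode them to UTF-8. Raises if the bytes cannot be converted.
def to_utf8(raw, source_encoding)
  raw.dup.force_encoding(source_encoding).encode('UTF-8')
end

# "Caf\xE9" is "Café" in ISO-8859-1; .b marks the string as raw bytes,
# as it would be after a binary read from a file.
latin1 = "Caf\xE9".b
utf8 = to_utf8(latin1, 'ISO-8859-1')

puts utf8.encoding         # UTF-8
puts utf8.valid_encoding?  # true
```

The resulting string can then be handed to the parser with the encoding stated explicitly, for example `Nokogiri::HTML(utf8, nil, 'UTF-8')`, so the result should not depend on what a particular libxml2 version guesses.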