Re: Encoding difference between XML fragment and HTML fragment


北市真

Nov 9, 2012, 11:26:43 AM
to nokogi...@googlegroups.com
Hello,

I have run into the same problem when handling Japanese characters.
How about calling #to_xml with the :encoding option?

    doc.to_xml(encoding: 'UTF-8') # => decoded whole XML
    td = (doc/'td').first
    td.children.first # => 2012年4月23日至25日
    td.children.first.to_xml(encoding: 'UTF-8') # => 2012年4月23日至25日

Or you can use #content:

    td.content # => 2012年4月23日至25日

I hope this helps.

Hi,

I am confused about how encoding differs depending on whether the same markup is parsed as an XML fragment or as an HTML fragment.

Here is the snippet:

    html = "<h1>Training_zh_cn</h1><table><tr><th>Date</th><th>Name</th><th>Location</th><th>Info</th></tr>
    <tr><td>2012年4月23日至25日</td><td>ScrumMaster认证培训</td><td>香港</td><td>详情参见</td></tr>
    <tr><td>2012年5月14日至15日</td><td>ScrumMaster认证培训</td><td>中国杭州</td><td>详情参见</td></tr></table>"

When using:

    doc = Nokogiri::XML.fragment(html)

The Chinese characters (e.g. 2012年4月23日至25日) are escaped as numeric character references: 2012&#x5E74;4&#x6708;23&#x65E5;&#x81F3;25&#x65E5;

When using:

    doc = Nokogiri::HTML.fragment(html)

The Chinese characters are output as-is, e.g. 2012年4月23日至25日

I do need to use an XML fragment rather than an HTML fragment because of CDATA (not shown in the snippet above), but now I am stuck on handling the Chinese characters. Your help is much appreciated!

Thanks a lot!
Yi

Yi Lv

Nov 9, 2012, 8:43:56 PM
to nokogi...@googlegroups.com
Hi,

    doc.to_xml(encoding: 'UTF-8') # => decoded whole XML
    td = (doc/'td').first
    td.children.first # => 2012&#x5E74;4&#x6708;23&#x65E5;&#x81F3;25&#x65E5;
    td.children.first.to_xml(encoding: 'UTF-8') # => 2012年4月23日至25日

It seems that line 1 does not change the encoding of doc, as the result of line 3 shows. Line 4 does work, and line 1 does not seem to be necessary.

doc is an instance of Nokogiri::XML::DocumentFragment, and I checked http://nokogiri.org/Nokogiri/XML/DocumentFragment.html and found:

- Class Nokogiri::XML::DocumentFragment inherits from Nokogiri::XML::Node
- to_xml(*args) Convert this DocumentFragment to xml See Nokogiri::XML::NodeSet#to_xml

I am puzzled about whether DocumentFragment is a Node or a NodeSet, and how the two differ with respect to to_xml(encoding: 'UTF-8').

Thanks very much!
Yi

Yi Lv

Nov 9, 2012, 9:10:34 PM
to nokogi...@googlegroups.com
I have figured this out: doc.to_xml(encoding: 'UTF-8') works as well. doc itself remains the same, but the returned string now has the correct encoding. Thanks! Yi

s2_it

Nov 9, 2012, 10:12:16 PM
to nokogi...@googlegroups.com
Hi,

>> doc is an instance of Nokogiri::XML::DocumentFragment, and I checked http://nokogiri.org/Nokogiri/XML/DocumentFragment.html and found:
>>
>> - Class Nokogiri::XML::DocumentFragment inherits from Nokogiri::XML::Node
>> - to_xml(*args) Convert this DocumentFragment to xml See Nokogiri::XML::NodeSet#to_xml
>>
>> I am puzzled whether DocumentFragment is a node or nodeset, and how does it differ in terms of to_xml(encoding: 'UTF-8').
As you say, DocumentFragment inherits from Node,
but DocumentFragment#to_xml behaves like NodeSet#to_xml (which calls
#to_xml on each child),
because a fragment usually contains several child nodes. I guess that is
why the documentation refers to the latter.

> I have figured this out, doc.to_xml(encoding: 'UTF-8') works as well. doc remains the same, but the output string has the correct encoding now. Thanks! Yi
Yes, doc is not changed, because #to_xml is a non-destructive method.
It would have been clearer if I had written "puts doc.to_xml(encoding: 'UTF-8')".
Sorry that my description was not clear enough.
I am glad to hear that you have resolved your problem.

KITAITI Makoto



Yi Lv

Nov 11, 2012, 7:12:17 AM
to nokogi...@googlegroups.com, s2...@yahoo.co.jp
Thanks for the further clarification and confirmation! Really appreciated. Yi