[ruby-core:25702] [Bug #2130] incorrect UTF8 encoding in CGI.unescapeHTML

Larry Kyrala

Sep 21, 2009, 2:17:31 PM

to ruby...@ruby-lang.org

Bug #2130: incorrect UTF8 encoding in CGI.unescapeHTML
http://redmine.ruby-lang.org/issues/show/2130

Author: Larry Kyrala
Status: Open, Priority: Normal
ruby -v: ruby 1.8.6 (2009-06-08 patchlevel 369) [x86_64-linux]

In CGI.unescapeHTML() in cgi.rb, hexadecimal numeric character references (such as &#xE9;) are translated as follows
(from http://stdlib.rubyonrails.org/libdoc/cgi/rdoc/classes/CGI.html#M000105):

when /\A#x([0-9a-f]+)\z/ni then
  if $1.hex < 256
    $1.hex.chr
  else
    if $1.hex < 65536 and ($KCODE[0] == ?u or $KCODE[0] == ?U)
      [$1.hex].pack("U")

The second line should be:
if $1.hex < 128

in order to conform with standards.
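
To make the proposed change concrete, here is a minimal standalone sketch. It is not the cgi.rb code: the helper name unescape_hex_reference is hypothetical, and it assumes UTF-8 output is always wanted, whereas the real code also consults $KCODE.

```ruby
# Hypothetical helper, not part of cgi.rb: decode the hex digits of a
# numeric character reference (the "e9" in "&#xe9;") into a string.
# Assumption: UTF-8 output is always wanted; the real code also checks $KCODE.
def unescape_hex_reference(hex_digits)
  code_point = hex_digits.hex
  if code_point < 128
    # ASCII range: a single byte is already valid UTF-8.
    code_point.chr
  else
    # Everything else goes through pack("U"), which emits multi-byte UTF-8.
    [code_point].pack("U")
  end
end
```

With the threshold at 128, unescape_hex_reference("e9") yields the two bytes C3 A9 rather than the lone byte E9.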

Explanation:
The input to unescapeHTML() is assumed to be valid HTML, and the output is apparently intended to be a valid UTF-8 Ruby string (hence the use of Array#pack("U")). However, for hex values 80-FF the pack call is bypassed ($1.hex < 256 above), so each of these characters is emitted as a single raw byte via Integer#chr, which is not valid UTF-8.

According to the HTML 4.01 spec, numeric character references from &#x80; to &#xFF; are valid HTML, since they conform to the "ISO 10646 hexadecimal character number H" form. Such a reference names a code point, not a byte: a single byte above 7F is not a valid UTF-8 sequence, so the code point must be converted to its two-byte equivalent as per the UTF-8 specification (U+H). (Note that a lone byte in the range 80-FF is also not well-formed in a UTF-8 encoded XML document, since the XML spec requires an entity's bytes to be legal in its declared encoding.)
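
The difference between the two branches can be seen directly in the bytes they produce. The sketch below is illustration only: the helper names are hypothetical, and it assumes Ruby 1.9 or later for the encoding API (force_encoding, valid_encoding?), which 1.8.6 does not have.

```ruby
# Hypothetical helpers for illustration only (assumes Ruby 1.9+).
def as_single_byte(code_point)
  code_point.chr            # mirrors the "$1.hex < 256" branch
end

def as_utf8(code_point)
  [code_point].pack("U")    # mirrors the pack("U") branch
end

def valid_utf8?(string)
  # Reinterpret the same bytes as UTF-8 and ask whether they are well-formed.
  string.dup.force_encoding("UTF-8").valid_encoding?
end

code_point = 0xE9  # U+00E9, as named by a reference such as &#xE9;
[as_single_byte(code_point), as_utf8(code_point)].each do |s|
  puts "#{s.unpack('H*').first} valid UTF-8: #{valid_utf8?(s)}"
end
```

The first string is the single byte e9, which fails the validity check; the second is the two-byte sequence c3a9, which passes.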

Background:
I found this error while debugging a Java-based web service that returns HTML-escaped entities. The bug is partly in the web service (since the web service is XML-based, not HTML-based), but it led me to find the CGI.unescapeHTML bug while trying to implement a workaround. This is a borderline pedantic issue, but I figured it might help other people having this problem. Also, I might have made a mistake somewhere in the interpretation or the intent of the code, so feel free to comment. Thanks!

References:
http://www.w3.org/TR/html401/charset.html#h-5.3.1
http://www.w3.org/TR/2008/REC-xml-20081126/#sec-external-ent
http://en.wikipedia.org/wiki/UTF-8#Description
http://en.wikipedia.org/wiki/ISO_10646
http://corelib.rubyonrails.org/classes/Array.html#M000460
http://stdlib.rubyonrails.org/libdoc/cgi/rdoc/classes/CGI.html#M000105

----------------------------------------
http://redmine.ruby-lang.org

Larry Kyrala

Sep 21, 2009, 2:24:19 PM

to ruby...@ruby-lang.org

Issue #2130 has been updated by Larry Kyrala.

A friend pointed me to the HTMLEntities gem as a workaround. Notice that the HTMLEntities#decode method works because it essentially runs every entity, named or numeric, through Array#pack("U"):

# File lib/htmlentities.rb, line 45
def decode(source)
  return source.to_s.gsub(named_entity_regexp) {
    (cp = map[$1]) ? [cp].pack('U') : $&
  }.gsub(/&#([0-9]{1,7});|&#x([0-9a-f]{1,6});/i) {
    $1 ? [$1.to_i].pack('U') : [$2.to_i(16)].pack('U')
  }
end
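
For anyone who wants to try the numeric half of that approach without the gem, here is a simplified standalone sketch. It is not the gem's code: decode_numeric_references and NUMERIC_REFERENCE are hypothetical names, and named entities such as &eacute; are deliberately left untouched.

```ruby
# Simplified, hypothetical stand-in for the numeric half of the decode
# method quoted above. Named entities are not handled here.
NUMERIC_REFERENCE = /&#([0-9]{1,7});|&#x([0-9a-f]{1,6});/i

def decode_numeric_references(source)
  source.to_s.gsub(NUMERIC_REFERENCE) do
    decimal, hex = $1, $2
    code_point = decimal ? decimal.to_i : hex.to_i(16)
    # pack("U") always emits the UTF-8 encoding of the code point,
    # so values 80-FF become two bytes instead of one.
    [code_point].pack("U")
  end
end
```

For example, decode_numeric_references("caf&#xE9;") and decode_numeric_references("caf&#233;") both produce the same UTF-8 string, because neither path has a single-byte shortcut for values below 256.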

FYI. Thanks!

References:
http://htmlentities.rubyforge.org/
http://htmlentities.rubyforge.org/doc/classes/HTMLEntities.html#M000004
----------------------------------------
http://redmine.ruby-lang.org/issues/show/2130

Larry Kyrala

Sep 22, 2009, 12:37:46 PM

to ruby...@ruby-lang.org

Issue #2130 has been updated by Larry Kyrala.

More context about how I discovered this: I was passing the output of CGI.unescapeHTML() to ActiveSupport::Multibyte::Chars.g_unpack() and received the following exception:
(ActiveSupport::Multibyte::EncodingError) "malformed UTF-8 character"

Investigating this problem led to finding the bug above.
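
The same kind of failure can be reproduced without ActiveSupport. The sketch below uses only core Ruby; decodes_as_utf8? is a hypothetical helper name, and it relies on String#unpack("U*") raising ArgumentError when it meets a malformed sequence.

```ruby
# Hypothetical helper: true if the bytes of +string+ decode as UTF-8.
# String#unpack("U*") raises ArgumentError on a malformed sequence.
def decodes_as_utf8?(string)
  string.unpack("U*")
  true
rescue ArgumentError
  false
end
```

A lone byte such as 0xE9.chr is rejected, while [0xE9].pack("U") decodes cleanly, which matches the behaviour described in this report.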

Takeyuki Fujioka

Nov 12, 2009, 10:23:24 AM

to ruby...@ruby-lang.org

Issue #2130 has been updated by Takeyuki Fujioka.

Status changed from Open to Closed

fixed in r25232
