> Seems that RFC 3986 has not been implemented correctly in
> Emacs. IMHO that is an important hole you have found there. The
> standard requires that all unreserved characters be encoded/decoded
> as UTF8 bytes.
If you are referring to the following part of RFC 3986, it doesn't say
anything about existing URI schemes (as opposed to "a new URI
scheme"), about components that do NOT represent textual data, or,
even for textual data, about data NOT consisting of characters from
the Universal Character Set.
When a new URI scheme defines a component that represents textual
data consisting of characters from the Universal Character Set
[UCS], the data should first be encoded as octets according to the
UTF-8 character encoding [STD63]; then only those octets that do not
correspond to characters in the unreserved set should be percent-
encoded.
(See also http://lists.gnu.org/archive/html/emacs-devel/2006-08/msg00065.html)
Though returning a multibyte string decoded as UTF-8 would be useful
in many cases, I think an "unhex"ing function should also offer a way
to return a unibyte string.
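The unibyte/multibyte distinction being requested here corresponds to the
two stages of RFC 3986 decoding: percent-unescaping to raw octets, then
interpreting those octets as UTF-8. A language-neutral sketch (Python's
urllib.parse is used here purely as an analogy; it says nothing about what
Emacs itself does):

```python
from urllib.parse import unquote, unquote_to_bytes

# Stage 1, the "unibyte" result: just the percent-decoded octets.
raw = unquote_to_bytes("%E3%82%A2")
print(raw)                      # b'\xe3\x82\xa2'

# Stage 2, the "multibyte" result: octets interpreted as UTF-8.
text = raw.decode("utf-8")
print(text)                     # KATAKANA LETTER A

# unquote performs both stages at once.
print(unquote("%E3%82%A2") == text)
```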
YAMAMOTO Mitsuharu
mitu...@math.s.chiba-u.ac.jp
2010/5/24 YAMAMOTO Mitsuharu <mitu...@math.s.chiba-u.ac.jp>:
>>>>>> On Sun, 23 May 2010 01:46:54 +0200, José A. Romero L. <escher...@gmail.com> said:
(...)
> If you are referring to the following part of RFC 3986, it doesn't say
> anything about existing URI schemes (as opposed to "a new URI
> scheme"), about components that do NOT represent textual data, or,
> even for textual data, about data NOT consisting of characters from
> the Universal Character Set.
You are right. The standard *doesn't say anything* about existing URI
schemes on that matter. Thus the question is rather whether to make
the language more or less useful, especially in the light of the
fragment you've just quoted:
> When a new URI scheme defines a component that represents textual
> data consisting of characters from the Universal Character Set
> [UCS], the data should first be encoded as octets according to the
> UTF-8 character encoding [STD63]; then only those octets that do not
> correspond to characters in the unreserved set should be percent-
> encoded.
and the example that immediately follows:
(...) For example, the character A would be represented as "A",
the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
as "%C3%80", and the character KATAKANA LETTER A would be represented
as "%E3%82%A2".
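The RFC's three examples can be checked mechanically. A quick
illustration in Python (chosen only because urllib.parse percent-encodes
per RFC 3986 with UTF-8 by default; this implies nothing about the Emacs
implementation under discussion):

```python
from urllib.parse import quote

# RFC 3986, section 2.5: encode the character as UTF-8 octets, then
# percent-encode every octet outside the unreserved set.
print(quote("A"))       # unreserved, left as-is: A
print(quote("\u00C0"))  # LATIN CAPITAL LETTER A WITH GRAVE: %C3%80
print(quote("\u30A2"))  # KATAKANA LETTER A: %E3%82%A2
```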
>
> (See also http://lists.gnu.org/archive/html/emacs-devel/2006-08/msg00065.html)
>
> Though returning a multibyte string decoded as UTF-8 would be useful
> for many cases, I think some "unhex"ing function should also provide a
> functionality to return a unibyte string.
(...)
That's perfectly valid. OTOH some other "unhex"-ing function (or even
the same one) could also provide the functionality to return a
multibyte string, and even allow choosing the character encoding (UCS
or not) for the resulting string. After all, don't you think there
should be a better way to decode a Katakana A than a kludge like this?:
    (decode-coding-string
     (apply 'unibyte-string
            (string-to-list
             (url-unhex-string "%E3%82%A2")))
     'utf-8)
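Letting the caller choose the coding system matters in practice: the
same percent-escaped octets decode to different text under different
encodings. A small sketch of the problem (again in Python, as an
analogy only, not a statement about url-unhex-string):

```python
from urllib.parse import unquote

# %E9 is the lone octet 0xE9: a valid Latin-1 character, but an
# incomplete UTF-8 sequence.
print(unquote("%E9", encoding="latin-1"))  # LATIN SMALL LETTER E WITH ACUTE
print(unquote("%E9"))  # UTF-8 (the default) cannot decode it and
                       # substitutes U+FFFD REPLACEMENT CHARACTER
```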
Cheers,
--
José A. Romero L.
escher...@gmail.com
"We who cut mere stones must always be envisioning cathedrals."
(Quarry worker's creed)
Yes, as I see it, that's definitely it.
> And please don't keep removing the debbugs address from the Cc list.
> Your messages aren't going to the bug tracker if you do that.
(...)
Oops, sorry, I didn't notice it before -- won't happen again.
Cheers,
--
José A. Romero L.
escher...@gmail.com
>> If this is just about `url-unhex-string', the obvious solution would be
>> to add a CODING-SYSTEM parameter to that function.
>
> Yes, as I see it, that's definitely it.
I think that's a reasonable thing to add, but Emacs is in a feature
freeze, so it'll probably have to wait until after Emacs 24 has been
released. I'll mark the bug report as "pending".
Cool, thanks a lot :)
Cheers,
--
José A. Romero L.
escher...@gmail.com