bug#6252: Emacs does not implement URL (aka "percent") decoding correctly.

YAMAMOTO Mitsuharu

May 23, 2010, 11:33:46 PM

to José A. Romero L., 62...@debbugs.gnu.org

>>>>> On Sun, 23 May 2010 01:46:54 +0200, José A. Romero L. <escher...@gmail.com> said:

> Seems that RFC 3986 has not been implemented correctly in
> Emacs. IMHO that is an important hole you have found there. The
> standard requires that all unreserved characters be encoded/decoded
> as UTF8 bytes.

If you are referring to the following part of RFC 3986, it doesn't say
anything about existing URI schemes (as opposed to "a new URI
scheme"), those defining a component that does NOT represent textual
data, or even for textual data, those NOT consisting of characters
from the Universal Character Sets.

    When a new URI scheme defines a component that represents textual
    data consisting of characters from the Universal Character Set
    [UCS], the data should first be encoded as octets according to the
    UTF-8 character encoding [STD63]; then only those octets that do not
    correspond to characters in the unreserved set should be
    percent-encoded.

Though returning a multibyte string decoded as UTF-8 would be useful
for many cases, I think some "unhex"ing function should also provide a
functionality to return a unibyte string.

YAMAMOTO Mitsuharu
mitu...@math.s.chiba-u.ac.jp

José A. Romero L.

May 25, 2010, 4:56:36 AM

to 62...@debbugs.gnu.org

(sorry, forgot to fwd this to the bugtrack)
---------- Forwarded message ----------
From: José A. Romero L. <escher...@gmail.com>
Date: 2010/5/24
Subject: Re: bug#6252: Emacs does not implement URL (aka "percent")
decoding correctly.
To: YAMAMOTO Mitsuharu <mitu...@math.s.chiba-u.ac.jp>

2010/5/24 YAMAMOTO Mitsuharu <mitu...@math.s.chiba-u.ac.jp>:

>>>>>> On Sun, 23 May 2010 01:46:54 +0200, José A. Romero L. <escher...@gmail.com> said:

(...)

> If you are referring to the following part of RFC 3986, it doesn't say
> anything about existing URI schemes (as opposed to "a new URI
> scheme"), those defining a component that does NOT represent textual
> data, or even for textual data, those NOT consisting of characters
> from the Universal Character Sets.

You are right. The standard *doesn't say anything* about existing URI
schemes on that matter. Thus the question is rather whether to
make the language more or less useful, especially in light of the
fragment you've just quoted:

> When a new URI scheme defines a component that represents textual
> data consisting of characters from the Universal Character Set
> [UCS], the data should first be encoded as octets according to the
> UTF-8 character encoding [STD63]; then only those octets that do not
> correspond to characters in the unreserved set should be percent-
> encoded.

and the example that immediately follows:

    (...) For example, the character A would be represented as "A",
    the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
    as "%C3%80", and the character KATAKANA LETTER A would be represented
    as "%E3%82%A2".

>
> (See also http://lists.gnu.org/archive/html/emacs-devel/2006-08/msg00065.html)
>
> Though returning a multibyte string decoded as UTF-8 would be useful
> for many cases, I think some "unhex"ing function should also provide a
> functionality to return a unibyte string.

(...)

That's perfectly valid. OTOH some other "unhex"-ing function (or even
the same one) could also return a multibyte string, and even let the
caller choose the character encoding (UCS or not) of the resulting
string. After all, don't you think there should be a better way to
decode a Katakana A than using a kludge like this?:

(decode-coding-string
 (apply 'unibyte-string
        (string-to-list
         (url-unhex-string "%E3%82%A2")))
 'utf-8)

Cheers,
--

José A. Romero L.
escher...@gmail.com

"We who cut mere stones must always be envisioning cathedrals."
(Quarry worker's creed)

Lars Magne Ingebrigtsen

Sep 21, 2011, 4:17:52 PM

to José A. Romero L., 62...@debbugs.gnu.org

José A. Romero L. <escher...@gmail.com> writes:

> On May 18, 20:14, Xah Lee <xah...@gmail.com> wrote:
>
>> is there emacs lisp function that decode the url percent encoding?
>> e.g. http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem
>> should become
>> http://en.wikipedia.org/wiki/Sylvester–Gallai_theorem
>> that's a EN DASH (unicode 8211, #o20023, #x2013).
>> I know there's a
>> (require 'gnus-util)
>> gnus-url-unhex-string
>> but that just unhex, and generate gibberish if the url contain unicode
>> chars.
> (...)

>
> Seems that RFC 3986 has not been implemented correctly in Emacs. IMHO
> that is an important hole you have found there. The standard requires
> that all unreserved characters be encoded/decoded as UTF8 bytes. Even
> though the encoding part looks OK (in url-util.el), the decoding does
> not go that last mile to interpret the decoded bytes as UTF-8.

I'm not quite sure I understand what the problem is. Do you have a test
case that illustrates what url.el does wrong?

--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog http://lars.ingebrigtsen.no/

Lars Magne Ingebrigtsen

Sep 22, 2011, 3:38:31 AM

to José A. Romero L., 62...@debbugs.gnu.org

José A. Romero L. <escher...@gmail.com> writes:

> in short, there seems to be currently no way to perform the opposite
> of url-hexify-string for UTF-8 encoded strings:
>
> (url-unhex-string (url-hexify-string "ä"))
> => "Ã¤"

`url-unhex-string' can't know what encoding the %xx-encoded bytes are in,
can it? The local part of a URL can use a different encoding, I think.

But is that the test case for the bug? I thought somebody had problems
retrieving something...

Lars Magne Ingebrigtsen

Sep 23, 2011, 4:34:00 AM

to José A. Romero L., 62...@debbugs.gnu.org

José A. Romero L. <escher...@gmail.com> writes:

>>> (url-unhex-string (url-hexify-string "ä"))
>>> => "Ã¤"

[...]

> Well, if you write a script that transforms URLs to/from strings
> (especially round-trip) you will probably encounter problems
> retrieving stuff from the web if you're not aware of this issue.

So this bug report is purely about the return value of
`url-unhex-string'? At the beginning it sounded as if url.el had
problems fetching something.

If this is just about `url-unhex-string', the obvious solution would be
to add a CODING-SYSTEM parameter to that function.

And please don't keep removing the debbugs address from the Cc list.
Your messages aren't going to the bug tracker if you do that.

José A. Romero L.

Sep 23, 2011, 7:12:11 AM

to Lars Magne Ingebrigtsen, 62...@debbugs.gnu.org

2011/9/23 Lars Magne Ingebrigtsen <la...@gnus.org>:
(...)

> If this is just about `url-unhex-string', the obvious solution would be
> to add a CODING-SYSTEM parameter to that function.

Yes, as I see it, that's definitely it.

> And please don't keep removing the debbugs address from the Cc list.
> Your messages aren't going to the bug tracker if you do that.

(...)

Oops, sorry, I didn't notice it before -- won't happen again.

Cheers,
--

José A. Romero L.
escher...@gmail.com

Lars Magne Ingebrigtsen

Sep 25, 2011, 6:16:14 PM

to José A. Romero L., 62...@debbugs.gnu.org

José A. Romero L. <escher...@gmail.com> writes:

>> If this is just about `url-unhex-string', the obvious solution would be
>> to add a CODING-SYSTEM parameter to that function.
>
> Yes, as I see it, that's definitely it.

I think that's a reasonable thing to add, but Emacs is in a feature
freeze, so it'll probably have to wait until after Emacs 24 has been
released. I'll mark the bug report as "pending".

José A. Romero L.

Sep 25, 2011, 6:25:39 PM

to Lars Magne Ingebrigtsen, 62...@debbugs.gnu.org

2011/9/26 Lars Magne Ingebrigtsen <la...@gnus.org>:
(...)

> I think that's a reasonable thing to add, but Emacs is in a feature
> freeze, so it'll probably have to wait until after Emacs 24 has been
> released. I'll mark the bug report as "pending".

(...)

Cool, thanks a lot :)

Cheers,
--

José A. Romero L.
escher...@gmail.com

Lars Magne Ingebrigtsen

Apr 9, 2012, 10:14:58 PM

to José A. Romero L., 62...@debbugs.gnu.org

Lars Magne Ingebrigtsen <la...@gnus.org> writes:

> I think that's a reasonable thing to add, but Emacs is in a feature
> freeze, so it'll probably have to wait until after Emacs 24 has been
> released. I'll mark the bug report as "pending".

I've now added an optional coding-system parameter to the function in
the Emacs trunk.
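
For readers following the thread, here is a rough sketch of the behavior under discussion: percent-decode to raw octets first, then decode those octets only if the caller names a coding system. This is not the change that was committed; the function name `my-percent-decode' and its argument list are hypothetical, and the input is assumed to be plain ASCII, as in a well-formed URI.

```elisp
;; Hypothetical helper for illustration only; not part of url-util.el.
(defun my-percent-decode (str &optional coding-system)
  "Percent-decode STR, which is assumed to contain only ASCII characters.
With CODING-SYSTEM nil, return the raw octets as a unibyte string.
Otherwise decode those octets with CODING-SYSTEM (e.g. `utf-8')."
  (let ((i 0)
        (len (length str))
        (bytes '()))
    (while (< i len)
      (if (eq (string-match "%[0-9A-Fa-f][0-9A-Fa-f]" str i) i)
          ;; A valid %XX escape starts here: collect the octet it names.
          (progn
            (push (string-to-number (substring str (1+ i) (+ i 3)) 16)
                  bytes)
            (setq i (+ i 3)))
        ;; Any other character is copied through unchanged.
        (push (aref str i) bytes)
        (setq i (1+ i))))
    (let ((octets (apply #'unibyte-string (nreverse bytes))))
      (if coding-system
          (decode-coding-string octets coding-system)
        octets))))

;; The RFC 3986 example: three octets that form one UTF-8 character.
(my-percent-decode "%E3%82%A2" 'utf-8)              ; => "ア"
;; The same octets read as Latin-1 give the output seen in the report.
(my-percent-decode "%C3%A4" 'latin-1)               ; => "Ã¤"
;; Without a coding system the result stays a unibyte string.
(multibyte-string-p (my-percent-decode "%C3%A4"))   ; => nil
```

Keeping the octet step separate from the character-decoding step covers both requests made above: a unibyte result for components that are not UTF-8 text, and a decoded multibyte string when the encoding is known.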