function to decode url percent encoding

Xah Lee

May 18, 2010, 2:12:36 PM

Is there a function that decodes URL percent encoding?

e.g.
http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem

should become

http://en.wikipedia.org/wiki/Sylvester–Gallai_theorem

That's an EN DASH, Unicode code point 8211 (U+2013).

but that just unhexes, and generates gibberish if the url contains unicode
chars.

thanks.

Xah
∑ http://xahlee.org/

☄

Xah Lee

May 18, 2010, 2:26:44 PM

A section was missing from my previous post...

I know there's a

(require 'gnus-util)
gnus-url-unhex-string

but that just unhexes, and generates gibberish if the url contains
unicode chars.

Some study shows that “%E2%80%93” stands for the hexadecimal bytes E2 80 93,
which is the byte sequence of the en dash char in UTF-8 encoding.

So, I guess I could parse the url, interpret the %XX strings as
UTF-8 hex bytes, then turn them back into unicode chars. Any idea if
there's a built-in function that helps with this?
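
The approach described above (treat each %XX as one byte, then decode the whole byte sequence as UTF-8) can be sketched by hand as follows. This is only an illustration under assumptions: `my-percent-decode-utf8` is a made-up name, not a built-in, and the sketch assumes the url was percent-encoded from UTF-8.

```elisp
(defun my-percent-decode-utf8 (url)
  "Return URL with %XX escapes decoded, reading the bytes as UTF-8.
Hypothetical helper for illustration; not part of Emacs."
  (let* ((src (encode-coding-string url 'utf-8)) ; unibyte, so `aref' yields bytes
         (len (length src))
         (i 0)
         (bytes '()))
    (while (< i len)
      (if (and (eq (aref src i) ?%)
               (<= (+ i 3) len)
               (string-match-p "\\`[0-9A-Fa-f][0-9A-Fa-f]\\'"
                               (substring src (+ i 1) (+ i 3))))
          ;; "%E2" becomes the single byte #xE2
          (progn
            (push (string-to-number (substring src (+ i 1) (+ i 3)) 16) bytes)
            (setq i (+ i 3)))
        ;; any other byte is copied unchanged
        (push (aref src i) bytes)
        (setq i (1+ i))))
    ;; Reassemble the bytes and decode them as one UTF-8 sequence.
    (decode-coding-string (apply #'unibyte-string (nreverse bytes)) 'utf-8)))

(my-percent-decode-utf8
 "http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem")
```

Evaluating the last form should give back the Wikipedia url with a real en dash in it. A malformed escape such as a lone "%" is copied through unchanged, which is a choice made for this sketch rather than anything the thread prescribes.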

Xah

On May 18, 11:12 am, Xah Lee <xah...@gmail.com> wrote:
> is there a function that decode the url percent encoding?
>
> e.g. http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem

Stefan Monnier

May 18, 2010, 5:08:41 PM

> So, i guess i could parse the url then interpret the %x string as
> utf-8 hex bytes then turn them back to unicode chars. Any idea if
> there's built in function that helps this?

There should be one somewhere in the URL package, although I don't know
if it does work right. If you find it and discover it doesn't work right,
please report it as a bug.

Stefan

Andreas Politz

May 18, 2010, 5:19:20 PM

Stefan Monnier <mon...@iro.umontreal.ca> writes:

It's `url-unhex-string', but it does not work, I guess.

(url-unhex-string (url-hexify-string "ä"))

=> "Ã¤"
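
For what it's worth, the shape of that output has a simple explanation: "ä" is the two bytes C3 A4 in UTF-8, and those same two bytes read as Latin-1 are the characters "Ã" and "¤". A small sketch that reproduces the effect without the URL package (`my-utf8-read-as-latin-1` is a hypothetical name used only for this illustration):

```elisp
(defun my-utf8-read-as-latin-1 (s)
  "Encode S as UTF-8, then misread the resulting bytes as Latin-1.
Hypothetical helper; it only demonstrates the mix-up."
  (decode-coding-string (encode-coding-string s 'utf-8) 'latin-1))

(my-utf8-read-as-latin-1 "ä")
;; => "Ã¤"
```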

-ap

Stefan Monnier

May 19, 2010, 10:55:48 AM

>>> So, i guess i could parse the url then interpret the %x string as
>>> utf-8 hex bytes then turn them back to unicode chars. Any idea if
>>> there's built in function that helps this?
>> There should be one somewhere in the URL package, although I don't know
>> if it does work right. If you find it and discover it doesn't work right,
>> please report it as a bug.

> It's `url-unhex-string', but it does not work, I guess.

> (url-unhex-string (url-hexify-string "ä"))

Indeed, we have some problems here:
1- a bug in the implementation makes it unwittingly decode the bytes as
latin-1.
2- the function actually does not decode the result.

Point 1 will be fixed in Emacs-23.3.
In the meantime you can revert the accidental encoding:

(encode-coding-string (url-unhex-string (url-hexify-string "ä")) 'latin-1)

As for point 2 you can do that manually after the call:

(decode-coding-string (url-unhex-string (url-hexify-string "ä")) 'utf-8)

or if you need to work around point 1 as well (but note that if point
1 is fixed, then the below won't work right):

(decode-coding-string (encode-coding-string
                       (url-unhex-string (url-hexify-string "ä"))
                       'latin-1)
                      'utf-8)

As for whether point 2 should be fixed or not, I'm not completely sure
yet (I'd tend to say yes, tho).
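
The two workarounds above could be folded into one helper that only undoes the latin-1 step when it seems to be needed. This is a sketch, not the eventual fix: `my-url-decode-utf8` is a hypothetical name, and using `multibyte-string-p` to tell the two behaviours apart is an assumption about how the buggy and fixed versions differ.

```elisp
(require 'url-util)

(defun my-url-decode-utf8 (url)
  "Unhex URL and decode the resulting bytes as UTF-8.
Hypothetical helper; assumes URL was percent-encoded from UTF-8."
  (let ((raw (url-unhex-string url)))
    (decode-coding-string
     ;; Assumption: a multibyte result means the bytes came back as
     ;; characters, so turn them back into bytes first.
     (if (multibyte-string-p raw)
         (encode-coding-string raw 'latin-1)
       raw)
     'utf-8)))

(my-url-decode-utf8 (url-hexify-string "ä"))
```

The round trip in the last form should give back "ä". Input that mixes %XX escapes with unescaped characters outside latin-1 is not handled by this sketch.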

Stefan

Xah Lee

May 24, 2010, 10:33:54 AM

On May 19, 7:55 am, Stefan Monnier <monn...@iro.umontreal.ca> wrote:
> >>> So, i guess i could parse the url then interpret the %x string as
> >>> utf-8 hex bytes then turn them back to unicode chars. Any idea if
> >>> there's built in function that helps this?
> >> There should be one somewhere in the URL package, although I don't know
> >> if it does work right. If you find it and discover it doesn't work right,
> >> please report it as a bug.
> > It's `url-unhex-string', but it does not work, I guess.
> >
> > (url-unhex-string (url-hexify-string "ä"))
>
> Indeed, we have some problems here:
> 1- a bug in the implementation makes it unwittingly decode the bytes as
> latin-1.
> 2- the function actually does not decode the result.
>
> Point 1 will be fixed in Emacs-23.3.
> In the meantime you can revert the accidental encoding:
>
> (encode-coding-string (url-unhex-string (url-hexify-string "ä")) 'latin-1)
>
> As for point 2 you can do that manually after the call:
>
> (decode-coding-string (url-unhex-string (url-hexify-string "ä")) 'utf-8)
>
> or if you need to work around point 1 as well (but note that if point
> 1 is fixed, then the below won't work right):
>
> (decode-coding-string (encode-coding-string
>                        (url-unhex-string (url-hexify-string "ä"))
>                        'latin-1)
>                       'utf-8)
>
> As for whether point 2 should be fixed or not, I'm not completely sure
> yet (I'd tend to say yes, tho).
>
> Stefan

Thanks all for the answers.

José A Romero L has filed a bug report, #6252.

http://groups.google.com/group/gnu.emacs.bug/browse_frm/thread/908f7f39589a9014#
http://groups.google.com/group/gnu.emacs.bug/browse_frm/thread/e997a39309022a3d#

(I can't find the bug on the official site
http://emacsbugs.donarmstrong.com/cgi-bin/pkgreport.cgi?package=emacs
)

There are some temporary solutions in José's post and Stefan Monnier's
post.

Xah
∑ http://xahlee.org/

☄

Xah Lee

May 24, 2010, 1:55:00 PM

I did some more testing on this.

The issue, first of all, seems to be which characters should be
percent encoded.

I did some testing with browsers: IE, Firefox, Safari, Opera. All
behaved a bit differently in what they think should be percent
encoded.

• URL Percent Encoding and Unicode
http://xahlee.org/js/url_encoding_unicode.html

text version excerpt follows

---------------------
Summary

Here's a summary of the behavior, as it appears from the above tests:

* Firefox (v 3.6.3) is the most aggressive in turning characters
in the url into percent encoded form.
* Google Chrome (4.1.249.1064 (45376)) will change unicode chars
into percent encoded form, but not parenthesis chars.
* Safari (4.0.5 (531.22.7)) is better, in that it simply shows the
characters as is, as much as it can.
* Opera (v 10.10, build 1893) is the best: it shows unicode,
parens, and en-dash as is.
* IE (8.0.6001.18904) seems to take the approach of not doing
anything to the url. Whatever you paste in remains unchanged.

--------------

Xah
∑ http://xahlee.org/

☄

Jason Rumney

May 24, 2010, 8:26:15 PM

On May 24, 10:33 pm, Xah Lee <xah...@gmail.com> wrote:
>
> (i cant find the bug in the official site
> http://emacsbugs.donarmstrong.com/cgi-bin/pkgreport.cgi?package=emacs
> )

Try http://debbugs.gnu.org/emacs