function to decode url percent encoding

Xah Lee

May 18, 2010, 2:12:36 PM
is there a function that decodes the url percent encoding?

e.g.
http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem

should become

http://en.wikipedia.org/wiki/Sylvester–Gallai_theorem

that's an EN DASH, unicode code point 8211 (U+2013).

but that just unhexes, and generates gibberish if the url contains unicode
chars.

thanks.

Xah
http://xahlee.org/


Xah Lee

May 18, 2010, 2:26:44 PM
some missing section from my previous post...

I know there's a

(require 'gnus-util)
gnus-url-unhex-string

but that just unhexes, and generates gibberish if the url contains
unicode chars.

some study shows that the “%E2%80%93” is the hexadecimal bytes E2 80 93,
which is the byte sequence of the en dash char in utf-8 encoding.

So, i guess i could parse the url then interpret the %x string as
utf-8 hex bytes then turn them back to unicode chars. Any idea if
there's built in function that helps this?
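
The approach described above (collect the %XX escapes as raw bytes, then decode the whole byte sequence as utf-8) could be sketched roughly as follows. This is only an illustration: `my-percent-decode-utf8' is a hypothetical name, not a built-in function, and the sketch assumes the escapes always encode utf-8.

```elisp
;; Hypothetical helper, not part of Emacs: decode %XX escapes as UTF-8.
(defun my-percent-decode-utf8 (url)
  "Return URL with its %XX escapes decoded as UTF-8 bytes."
  (let ((i 0)
        (len (length url))
        (bytes '()))
    (while (< i len)
      (let ((ch (aref url i)))
        (if (and (eq ch ?%)
                 (<= (+ i 3) len)
                 (string-match-p "\\`[0-9A-Fa-f]\\{2\\}\\'"
                                 (substring url (+ i 1) (+ i 3))))
            ;; "%E2" becomes the single byte #xE2.
            (progn
              (push (string-to-number (substring url (+ i 1) (+ i 3)) 16)
                    bytes)
              (setq i (+ i 3)))
          ;; Any other character contributes its own UTF-8 bytes.
          (dolist (b (append (encode-coding-string (string ch) 'utf-8) nil))
            (push b bytes))
          (setq i (1+ i)))))
    ;; Reassemble the bytes and decode them in one step.
    (decode-coding-string (apply #'unibyte-string (nreverse bytes)) 'utf-8)))

(my-percent-decode-utf8
 "http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem")
```

Collecting all the bytes first and decoding once at the end matters because a single character such as the en dash spans several escapes.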

Xah


On May 18, 11:12 am, Xah Lee <xah...@gmail.com> wrote:
> is there a function that decode the url percent encoding?
>

> e.g. http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem

Stefan Monnier

May 18, 2010, 5:08:41 PM
> So, i guess i could parse the url then interpret the %x string as
> utf-8 hex bytes then turn them back to unicode chars. Any idea if
> there's built in function that helps this?

There should be one somewhere in the URL package, although I don't know
if it does work right. If you find it and discover it doesn't work right,
please report it as a bug.


Stefan

Andreas Politz

May 18, 2010, 5:19:20 PM
Stefan Monnier <mon...@iro.umontreal.ca> writes:

It's `url-unhex-string', but it does not work, I guess.

(url-unhex-string (url-hexify-string "ä"))

=> "ä"

-ap

Stefan Monnier

May 19, 2010, 10:55:48 AM
>>> So, i guess i could parse the url then interpret the %x string as
>>> utf-8 hex bytes then turn them back to unicode chars. Any idea if
>>> there's built in function that helps this?
>> There should be one somewhere in the URL package, although I don't know
>> if it does work right. If you find it and discover it doesn't work right,
>> please report it as a bug.
> It's `url-unhex-string', but it does not work, I guess.
> (url-unhex-string (url-hexify-string "ä"))

Indeed, we have some problems here:
1- a bug in the implementation makes it unwittingly decode the bytes as
latin-1.
2- the function actually does not decode the result.

Point 1 will be fixed in Emacs-23.3.
In the meantime you can revert the accidental encoding:

(encode-coding-string (url-unhex-string (url-hexify-string "ä")) 'latin-1)

As for point 2 you can do that manually after the call:

(decode-coding-string (url-unhex-string (url-hexify-string "ä")) 'utf-8)

or if you need to work around point 1 as well (but note that if point
1 is fixed, then the below won't work right):

(decode-coding-string (encode-coding-string
                       (url-unhex-string (url-hexify-string "ä"))
                       'latin-1)
                      'utf-8)

As for whether point 2 should be fixed or not, I'm not completely sure
yet (I'd tend to say yes, tho).
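
Applied to the URL from the original question, the manual decoding step for point 2 might look like the sketch below. It assumes an Emacs where point 1 is already fixed, so that `url-unhex-string' hands back the raw bytes; on an affected version the latin-1 step shown above would be needed as well. The wrapper name `my-url-unhex-utf8' is hypothetical.

```elisp
(require 'url-util)

;; Hypothetical wrapper; assumes url-unhex-string returns raw bytes
;; (that is, point 1 is fixed in the running Emacs).
(defun my-url-unhex-utf8 (str)
  "Unhex STR with `url-unhex-string', then decode the result as UTF-8."
  (decode-coding-string (url-unhex-string str) 'utf-8))

(my-url-unhex-utf8
 "http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem")
```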


Stefan

Xah Lee

May 24, 2010, 10:33:54 AM
On May 19, 7:55 am, Stefan Monnier <monn...@iro.umontreal.ca> wrote:
> >>> So, i guess i could parse the url then interpret the %x string as
> >>> utf-8 hex bytes then turn them back to unicode chars. Any idea if
> >>> there's built in function that helps this?
> >> There should be one somewhere in the URL package, although I don't know
> >> if it does work right.  If you find it and discover it doesn't work right,
> >> please report it as a bug.
> > It's `url-unhex-string', but it does not work, I guess.
> > (url-unhex-string (url-hexify-string "ä"))

>
> Indeed, we have some problems here:
> 1- a bug in the implementation makes it unwittingly decode the bytes as
>    latin-1.
> 2- the function actually does not decode the result.
>
> Point 1 will be fixed in Emacs-23.3.
> In the meantime you can revert the accidental encoding:
>
>   (encode-coding-string (url-unhex-string (url-hexify-string "ä")) 'latin-1)

>
> As for point 2 you can do that manually after the call:
>
>   (decode-coding-string (url-unhex-string (url-hexify-string "ä")) 'utf-8)

>
> or if you need to work around point 1 as well (but note that if point
> 1 is fixed, then the below won't work right):
>
>   (decode-coding-string (encode-coding-string
>                          (url-unhex-string (url-hexify-string "ä"))
>                          'latin-1)
>                         'utf-8)
>
> As for whether point 2 should be fixed or not, I'm not completely sure
> yet (I'd tend to say yes, tho).
>
>         Stefan

Thanks all for the answers.

José A Romero L has filed a bug report. #6252

http://groups.google.com/group/gnu.emacs.bug/browse_frm/thread/908f7f39589a9014#
http://groups.google.com/group/gnu.emacs.bug/browse_frm/thread/e997a39309022a3d#

(i can't find the bug in the official site
http://emacsbugs.donarmstrong.com/cgi-bin/pkgreport.cgi?package=emacs
)

there's some temp solutions from Jose's post and Stefan Monnier's
post.

Xah
http://xahlee.org/


Xah Lee

May 24, 2010, 1:55:00 PM
did some more testing about this.

the first issue seems to be which characters should be percent
encoded.

i did some testing with browsers, IE, Firefox, Safari, Opera, all
behaved a bit differently in what they think should be percent
encoded.

• URL Percent Encoding and Unicode
http://xahlee.org/js/url_encoding_unicode.html

text version excerpt follows

---------------------
Summary

Here's a summary of the behavior observed in the tests above:

* Firefox (v 3.6.3) is the most aggressive in turning characters
in the url into the percent encoded form.
* Google Chrome (4.1.249.1064 (45376)) will change unicode chars
into percent encoded form, but not parenthesis chars.
* Safari (4.0.5 (531.22.7)) is better, in that it simply shows the
characters as is, as much as it can.
* Opera (v 10.10, build 1893) is the best, it shows unicode,
parens and the en dash as is.
* IE (8.0.6001.18904) seems to take the approach of not doing
anything to the url. Whatever you paste in remains unchanged.

--------------

Xah
http://xahlee.org/


Jason Rumney

May 24, 2010, 8:26:15 PM
On May 24, 10:33 pm, Xah Lee <xah...@gmail.com> wrote:
>
> (i cant find the bug in the official site
> http://emacsbugs.donarmstrong.com/cgi-bin/pkgreport.cgi?package=emacs
> )

Try http://debbugs.gnu.org/emacs