is there a function that decodes URL percent encoding?
e.g.
http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem
should become
http://en.wikipedia.org/wiki/Sylvester–Gallai_theorem
That's an EN DASH, Unicode 8211 (U+2013).
I know there's gnus-url-unhex-string, but that just unhexes, and
generates gibberish if the URL contains Unicode chars.
thanks.
Xah
∑ http://xahlee.org/
☄
I know there's a
(require 'gnus-util)
gnus-url-unhex-string
but that just unhexes, and generates gibberish if the URL contains
Unicode chars.
Some study shows that the “%E2%80%93” is the hexadecimal bytes E2 80 93,
which is the byte sequence of the en dash char in the UTF-8 encoding.
So, I guess I could parse the URL, interpret each %xx string as a
UTF-8 byte, then turn the bytes back into Unicode chars. Any idea if
there's a built-in function that helps with this?
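A minimal sketch of that idea, for illustration only. It assumes the escapes always carry UTF-8 bytes, and `my-url-percent-decode' is a hypothetical name, not something that ships with Emacs. It collects the bytes one by one, then decodes the whole byte string as UTF-8.

```elisp
(defun my-url-percent-decode (url)
  "Decode %XX escapes in URL, treating the resulting bytes as UTF-8.
Hypothetical helper for illustration; not part of Emacs."
  (let* ((src (encode-coding-string url 'utf-8)) ; literal non-ASCII chars become bytes too
         (len (length src))
         (i 0)
         (bytes '()))
    (while (< i len)
      (if (and (eq (aref src i) ?%)
               (<= (+ i 3) len)
               (string-match-p "\\`[0-9A-Fa-f]\\{2\\}\\'"
                               (substring src (1+ i) (+ i 3))))
          ;; A well-formed %XX escape: keep the byte it stands for.
          (progn
            (push (string-to-number (substring src (1+ i) (+ i 3)) 16) bytes)
            (setq i (+ i 3)))
        ;; Anything else (including a stray %) is copied through as is.
        (push (aref src i) bytes)
        (setq i (1+ i))))
    ;; Reassemble the bytes and interpret them as UTF-8.
    (decode-coding-string (apply #'unibyte-string (nreverse bytes)) 'utf-8)))

(my-url-percent-decode
 "http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem")
;; => "http://en.wikipedia.org/wiki/Sylvester–Gallai_theorem"
```

Escapes that are not valid UTF-8 are not handled specially here; they would come out as whatever `decode-coding-string' makes of the stray bytes.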
Xah
On May 18, 11:12 am, Xah Lee <xah...@gmail.com> wrote:
> is there a function that decode the url percent encoding?
>
> e.g. http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem
There should be one somewhere in the URL package, although I don't know
if it does work right. If you find it and discover it doesn't work right,
please report it as a bug.
Stefan
It's `url-unhex-string', but it does not work, I guess.
(url-unhex-string (url-hexify-string "ä"))
=> "ä"
-ap
Indeed, we have some problems here:
1- a bug in the implementation makes it unwittingly decode the bytes as
latin-1.
2- the function actually does not decode the result.
Point 1 will be fixed in Emacs-23.3.
In the meantime you can revert the accidental encoding:
(encode-coding-string (url-unhex-string (url-hexify-string "ä")) 'latin-1)
As for point 2 you can do that manually after the call:
(decode-coding-string (url-unhex-string (url-hexify-string "ä")) 'utf-8)
or if you need to work around point 1 as well (but note that if point
1 is fixed, then the below won't work right):
(decode-coding-string (encode-coding-string
                       (url-unhex-string (url-hexify-string "ä"))
                       'latin-1)
                      'utf-8)
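One possible way around that caveat, as a sketch only: undo the latin-1 step just when the result actually came back as a multibyte string. This assumes the fixed function returns a unibyte string for plain ASCII input; `my-url-unhex-utf8' is a hypothetical name, and the sketch has not been checked against every Emacs version.

```elisp
(require 'url-util)

(defun my-url-unhex-utf8 (str)
  "Unhex STR with `url-unhex-string' and decode the bytes as UTF-8.
Hypothetical wrapper for illustration; not part of Emacs."
  (let ((raw (url-unhex-string str)))
    (decode-coding-string
     ;; Assumption: a multibyte result means the bytes were turned into
     ;; latin-1 characters (point 1), so turn them back into bytes first.
     (if (multibyte-string-p raw)
         (encode-coding-string raw 'latin-1)
       raw)
     'utf-8)))

(my-url-unhex-utf8 "Sylvester%E2%80%93Gallai_theorem")
```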
As for whether point 2 should be fixed or not, I'm not completely sure
yet (I'd tend to say yes, tho).
Stefan
Thanks all for the answers.
José A Romero L has filed a bug report. #6252
http://groups.google.com/group/gnu.emacs.bug/browse_frm/thread/908f7f39589a9014#
http://groups.google.com/group/gnu.emacs.bug/browse_frm/thread/e997a39309022a3d#
(I can't find the bug on the official site
http://emacsbugs.donarmstrong.com/cgi-bin/pkgreport.cgi?package=emacs
)
There are some temporary workarounds in José's post and Stefan
Monnier's post.
Xah
The first issue seems to be which characters should be percent
encoded.
I did some testing with browsers: IE, Firefox, Safari, and Opera all
behave a bit differently in what they think should be percent
encoded.
• URL Percent Encoding and Unicode
http://xahlee.org/js/url_encoding_unicode.html
text version excerpt follows
---------------------
Summary
Here's a summary of the behavior observed in the tests above:
* Firefox (v 3.6.3) is the most aggressive in turning characters
  in the URL into the percent encoded form.
* Google Chrome (4.1.249.1064 (45376)) will change Unicode chars
  into percent encoded form, but not parentheses.
* Safari (4.0.5 (531.22.7)) is better, in that it simply shows the
  characters as is, as much as it can.
* Opera (v 10.10, build 1893) is the best: it shows Unicode,
  parentheses, and the en dash as is.
* IE (8.0.6001.18904) seems to take the approach of not doing
  anything to the URL. Whatever you paste in remains unchanged.
--------------
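For comparison, here is a small sketch that prints what Emacs's own `url-hexify-string' does with a few sample strings. The samples are made up for illustration. The set of characters left unescaped has differed between Emacs versions, so no output is listed for the parentheses case.

```elisp
(require 'url-util)

;; Print each sample next to its percent encoded form.
(dolist (s '("Sylvester\u2013Gallai_theorem" ; en dash
             "f(x)"                          ; parentheses
             "a b"))                         ; space
  (message "%s -> %s" s (url-hexify-string s)))
```

Running it in each Emacs version you care about shows the same kind of disagreement the browsers have.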
Xah