
percent encoding URLs


charlemagn...@gmail.com

Nov 14, 2016, 7:36:37 PM
I work with many different languages and would like a reliable url encode/decode function that is pure awk.

For example given the name "Władysław" (Polish) with these environment settings

LANGUAGE=en_US.UTF-8
LC_COLLATE=en_US.UTF-8
LC_ALL=en_US.UTF-8

And using some standard awk encode/decode functions:

http://www.shelldorado.com/scripts/cmds/urlencode
https://www.rosettacode.org/wiki/URL_encoding#AWK

It doesn't work: the "ł" is not recognized, apparently because the environment is not Polish.

So I call Python or PHP and that works, but it creates an external dependency to the awk program.

Is there a solution to do encoding/decoding across all languages purely in awk? I think I know the answer ("no") but wanted to verify before spending any more time. Maybe there is some shell and environment trickery.

Marc de Bourget

Nov 15, 2016, 4:13:17 AM
In my opinion, AWK needs its own encoding for UTF-8 etc. like Python does.

Andre Majorel

Nov 16, 2016, 3:06:51 PM
On 2016-11-15, charlemagn...@gmail.com <charlemagn...@gmail.com> wrote:

> I work with many different languages and would like a reliable
> url encode/decode function that is pure awk.
>
> For example given the name "Władysław" (Polish) with these
> environment settings
>
> LANGUAGE=en_US.UTF-8
> LC_COLLATE=en_US.UTF-8
> LC_ALL=en_US.UTF-8
>
> And using some standard awk encode/decode functions:
>
> http://www.shelldorado.com/scripts/cmds/urlencode
> https://www.rosettacode.org/wiki/URL_encoding#AWK
>
> It doesn't work because the environment is not Polish, the
> "ł" is not recognized.
>
> So I call Python or PHP and that works, but it creates an
> external dependency to the awk program.
>
> Is there a solution to do encoding/decoding across all
> languages purely in awk? I think I know the answer ("no") but
> wanted to verify before spending any more time. Maybe there is
> some shell and environment trickery.

How does it "not work"? What do you get and what would you like
to get instead?

--
André Majorel http://www.teaser.fr/~amajorel/
"Pauvreté n'est pas vice ! Parbleu ! Un vice est agréable."
-- _Passe-temps_

charlemagn...@gmail.com

Nov 16, 2016, 7:59:06 PM
On Wednesday, November 16, 2016 at 3:06:51 PM UTC-5, Andre Majorel wrote:

> How does it "not work"? What do you get and what would you like
> to get instead?

Test string: ł

urlencodeawk() = %00
urlencodepython() = %C5%82

Python encodes correctly

----

#
# Credit: Rosetta Code, May 2015
#
function urlencodeawk(str,    c, len, res, i, ord) {

    for (i = 0; i <= 255; i++)
        ord[sprintf("%c", i)] = i
    len = length(str)
    res = ""
    for (i = 1; i <= len; i++) {
        c = substr(str, i, 1);
        if (c ~ /[0-9A-Za-z]/)
            res = res c
        else
            res = res "%" sprintf("%02X", ord[c])
    }
    return res
}

#
# url-encode via Python
# Credit: https://askubuntu.com/questions/53770/how-can-i-encode-and-decode-percent-encoded-strings-on-the-command-line
#
function urlencodepython(str,    command, safe) {

    safe = str
    gsub(/'/, "'\"'\"'", safe)  # make safe for shell
    gsub(/’/, "'\"’\"'", safe)

    # Python 2 syntax (print statement, urllib.quote)
    command = "python -c \"import urllib, sys; print urllib.quote(sys.argv[1])\" '" safe "'"
    return sys2var(command)
}

function sys2var(command,    fish, scale, ship) {

    # command = command " 2>/dev/null"
    while ( (command | getline fish) > 0 ) {
        if ( ++scale == 1 )
            ship = fish
        else
            ship = ship "\n" fish
    }
    close(command)
    return ship
}

function testurlendecode(str) {

    print "Test string: " str
    print ""
    print "urlencodeawk() = " urlencodeawk(str)
    print "urlencodepython() = " urlencodepython(str)

}

BEGIN {

    testurlendecode("ł")

}

Andre Majorel

Nov 17, 2016, 3:54:44 AM
On 2016-11-17, charlemagn...@gmail.com <charlemagn...@gmail.com> wrote:
> On Wednesday, November 16, 2016 at 3:06:51 PM UTC-5, Andre Majorel wrote:
>
>> How does it "not work"? What do you get and what would you like
>> to get instead?
>
> Test string: ł
>
> urlencodeawk() = %00
> urlencodepython() = %C5%82
>
> Python encodes correctly

It prints "%00" for "\xc5\x82"? That's a good one.

urlencodeawk() works here, but I'm not using UTF-8. I suspect
what's happening is that, because your locale indicates UTF-8
encoding, substr() treats str as a string of Unicode characters
instead of a string of bytes. Therefore substr("\xc5\x82", 1, 1)
is not "\xc5" but U+0142, and ord[U+0142] does not exist. The
missing element evaluates to an empty string, which "%02X"
formats as "00", hence "%00".
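
The byte-versus-character distinction described above can be checked from the command line. The following is only a small sketch, assuming a POSIX shell with `od` and any common awk; the variable name `s` is illustrative. It shows that "ł" is stored as the two bytes 0xC5 0x82 and that awk counts two units when the locale is forced to C. In a UTF-8 locale, gawk would be expected to report a length of 1 for the same string, which is why the 0..255 lookup table finds no entry for it.

```shell
#!/bin/sh
# Sketch: "ł" is two bytes in UTF-8, and awk sees two units in the C locale.
s=$(printf '\305\202')          # the UTF-8 bytes 0xC5 0x82, i.e. "ł"

# Dump the raw bytes in hex.
printf '%s' "$s" | od -An -tx1

# In the C locale, length() and substr() operate on bytes, so this prints 2.
LC_ALL=C awk 'BEGIN { print length(ARGV[1]) }' "$s"
```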

Possible avenues:
- forcing the locale to C,
- using Gawk's --characters-as-bytes option,
- fixing the %-encoding function to look for code points greater
than 0x7f, work out what their UTF-8 encoding is and dump that
instead (but this is not completely reliable as it assumes
that the original byte sequence was canonical UTF-8, which is
likely but not guaranteed).
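
The third avenue, working out the UTF-8 encoding of a code point, comes down to a little arithmetic. The sketch below is an illustration only, not code from the thread: the name `utf8_percent` is hypothetical, it takes a numeric code point rather than a character, and the step of getting from a character to its code point inside awk is deliberately left out. It uses division and modulo instead of bit operations so that it does not depend on gawk extensions.

```shell
#!/bin/sh
# Sketch: percent-encode the UTF-8 form of a numeric Unicode code point.
# utf8_percent is a hypothetical helper name, not part of any awk.
utf8_percent() {
    awk -v cp="$1" '
    function utf8_percent(cp) {
        if (cp < 128)        # 1 byte:  0xxxxxxx
            return sprintf("%%%02X", cp)
        if (cp < 2048)       # 2 bytes: 110xxxxx 10xxxxxx
            return sprintf("%%%02X%%%02X",
                           192 + int(cp / 64), 128 + cp % 64)
        if (cp < 65536)      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return sprintf("%%%02X%%%02X%%%02X",
                           224 + int(cp / 4096),
                           128 + int(cp / 64) % 64, 128 + cp % 64)
        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return sprintf("%%%02X%%%02X%%%02X%%%02X",
                       240 + int(cp / 262144),
                       128 + int(cp / 4096) % 64,
                       128 + int(cp / 64) % 64, 128 + cp % 64)
    }
    BEGIN { print utf8_percent(cp) }'
}

utf8_percent 322    # U+0142, the letter "ł"
```

For U+0142 (decimal 322) this gives %C5%82, the same result as the Python call quoted earlier. As noted in the list, the result is the canonical encoding of the code point, which need not match the bytes of the original input.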

charlemagn...@gmail.com

Nov 17, 2016, 1:03:00 PM
On Thursday, November 17, 2016 at 3:54:44 AM UTC-5, Andre Majorel wrote:

>Possible avenues :
>- forcing the locale to C,
>- using Gawk's --characters-as-bytes option,
>- fixing the %-encoding function

I haven't tried the first or third option, because the second works:

awk -b -f urlencode.awk

Test string: ł

urlencodeawk() = %C5%82
urlencodepython() = %C5%82

That is great. Thank you very much. I'm going to run more in-depth testing of URLs but hopefully this will be the solution, and a simple one.
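
For awks that lack gawk's -b option, the first avenue (forcing the locale to C) can be sketched along the same lines. This is a hedged illustration, not tested against every awk: the wrapper names `urlencode`, `urldecode` and `hexval` are hypothetical, the encoder keeps the thread's rule of leaving only ASCII letters and digits unescaped, and a decoded %00 is not handled specially.

```shell
#!/bin/sh
# Sketch: byte-wise percent encoding/decoding by running awk in the C locale.
# urlencode, urldecode and hexval are hypothetical names for this example.

urlencode() {
    LC_ALL=C awk '
    BEGIN {
        # Map every single-byte character (1..255) to its numeric value.
        for (i = 1; i <= 255; i++)
            ord[sprintf("%c", i)] = i
        str = ARGV[1]
        len = length(str)
        res = ""
        for (i = 1; i <= len; i++) {
            c = substr(str, i, 1)           # one byte in the C locale
            if (c ~ /^[0-9A-Za-z]$/)
                res = res c
            else
                res = res sprintf("%%%02X", ord[c])
        }
        print res
    }' "$1"
}

urldecode() {
    LC_ALL=C awk '
    # Convert a string of hex digits to a number without gawk extensions.
    function hexval(h,    i, n) {
        n = 0
        h = toupper(h)
        for (i = 1; i <= length(h); i++)
            n = n * 16 + index("0123456789ABCDEF", substr(h, i, 1)) - 1
        return n
    }
    BEGIN {
        str = ARGV[1]
        res = ""
        # Replace each %XX escape with the byte it names, left to right.
        while (match(str, /%[0-9A-Fa-f][0-9A-Fa-f]/)) {
            res = res substr(str, 1, RSTART - 1) \
                      sprintf("%c", hexval(substr(str, RSTART + 1, 2)))
            str = substr(str, RSTART + 3)
        }
        print res str
    }' "$1"
}

urlencode "ł"
urldecode "%C5%82"
```

The C locale matters for the decoder as well: in a UTF-8 locale, gawk's sprintf("%c", 197) produces the character U+00C5 encoded as two bytes rather than the single byte 0xC5.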

Kaz Kylheku

Nov 17, 2016, 3:05:18 PM
$ txr -p '(url-encode "ł")'
"%C5%82"

Not influenced by any ISO C/Unix locale garbage:

$ LANG=hottentot_ZA.UTF-13 txr -p '(url-encode "ł")'
"%C5%82"