CGI.pm url_encoding problem

BÁRTHÁZI András

Apr 18, 2005, 5:16:25 AM

to perl6-c...@perl.org

Hi!

This is the code:

use CGI;
set_url_encoding('utf-8');

The problem is that "use CGI" automagically initializes the parameters
*before* I set the encoding of them, so set_url_encoding will run too late.

Any ideas?

Bye,
Andras

Stevan Little

Apr 18, 2005, 8:52:28 AM

to BÁRTHÁZI András, perl6-c...@perl.org

Andras,

Well once we have a proper "use", we should be able to set the encoding
at compile time. But until then, I see a few possible options:

- setting the url encoding forces a re-encoding of any parameters
already encoded.

This means extra work if you change the encoding, but it will only
happen once.

- moving the decoding process to be "on demand" when fetching the params

This would slow down the param() function, but would mean you only
decoded exactly what you needed and nothing more.

Either one is a simple change.

- Stevan

BÁRTHÁZI András

Apr 18, 2005, 9:16:02 AM

to Stevan Little, perl6-c...@perl.org

Stevan,

> Well once we have a proper "use", we should be able to set the encoding
> at compile time. But until then, I see a few possible options:

I think it would be nice to find another solution.

> - setting the url encoding forces a re-encoding of any parameters
> already encoded.
>
> This means extra work if you change the encoding, but it will only
> happen once.

That can't work (or only with a big overhead), because POST parameters
come from STDIN, which can be read only once. To do it, you would have
to store the whole input, which can be large.

> - moving the decoding process to be "on demand" when fetching the params
>
> This would slow down the param() function, but would mean you only decoded
> exactly what you needed and nothing more.

That sounds good, and I have another idea. What if the first param()
call triggered the decoding of all the parameters? It adds no
overhead, because you have to do that work anyway to get a parameter,
and it is an improvement: if you never query a parameter (you include
CGI.pm just to print header(), etc.), then no processing or decoding
happens at all.
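
A minimal sketch of this "decode everything on the first param() call"
idea, written in Python rather than Perl 6. The names LazyParams and
read_raw are hypothetical and are not part of CGI.pm; only
set_url_encoding and param mirror names used in this thread.

```python
from urllib.parse import parse_qs


class LazyParams:
    """Hypothetical stand-in for a CGI module's parameter store."""

    def __init__(self, read_raw, encoding="iso-8859-1"):
        self._read_raw = read_raw  # callable returning the raw query string or POST body
        self.encoding = encoding   # can still be changed until the first param() call
        self._decoded = None       # filled in on first access
        self.reads = 0             # counts how often the raw input was read

    def set_url_encoding(self, encoding):
        # Changing the encoding after decoding would need the raw input again,
        # which a one-shot stream such as STDIN cannot provide.
        if self._decoded is not None:
            raise RuntimeError("parameters have already been decoded")
        self.encoding = encoding

    def param(self, name):
        if self._decoded is None:
            raw = self._read_raw()  # the input is consumed exactly once
            self.reads += 1
            self._decoded = parse_qs(raw, encoding=self.encoding)
        values = self._decoded.get(name)
        return values[0] if values else None


# Loading the module does no work; the encoding is set before the first lookup.
params = LazyParams(lambda: "name=%C3%A1rv%C3%ADz&n=1")
params.set_url_encoding("utf-8")
print(params.param("name"))
```

A script that only prints headers never calls param(), so under these
assumptions it never reads or decodes the input.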

Bye,
Andras

Randal L. Schwartz

Apr 18, 2005, 9:10:39 AM

to perl6-c...@perl.org

>>>>> "BÁRTHÁZI" == BÁRTHÁZI András <and...@barthazi.hu> writes:

BÁRTHÁZI> use CGI;
BÁRTHÁZI> set_url_encoding('utf-8');

BÁRTHÁZI> The problem is that "use CGI" automagically initializes the parameters
BÁRTHÁZI> *before* I set the encoding of them, so set_url_encoding will run too
BÁRTHÁZI> late.

Did I miss the memo where anything outside the list of valid
URI characters needed to be hexified, hence there's no need
for such a URL encoding scheme? Where is this memo?

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<mer...@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!

BÁRTHÁZI András

Apr 18, 2005, 10:44:06 AM

to Randal L. Schwartz, perl6-c...@perl.org

Hi,

Randal L. Schwartz wrote:
>>>>>>"BÁRTHÁZI" == BÁRTHÁZI András <and...@barthazi.hu> writes:
>
>>>Did I miss the memo where anything outside the list of valid
>>>URI characters needed to be hexified, hence there's no need
>>>for such a URL encoding scheme? Where is this memo?
>
> BÁRTHÁZI> Can you say that again in other words? Neither Stevan nor I
> BÁRTHÁZI> understand.
>
> URLs are only 7 bit ASCII, according to the RFCs. Did I miss a new RFC
> where non-7-bit URLs are permitted? If so, please point to that.

You are right: only 7-bit ASCII is allowed in URLs. But you can store
any character in a URL if you encode it with "URL encoding". For
example, "á" in UTF-8 is encoded as "%C3%A1".
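
To make the example concrete, here is a short sketch in Python (not
Perl 6) using the standard library's urllib.parse.quote. It shows that
the encoded form contains only ASCII characters:

```python
from urllib.parse import quote

# Percent-encode the UTF-8 bytes of the character "á" (U+00E1).
encoded = quote("\u00e1", encoding="utf-8")

print(encoded)            # the form that travels in the URL
print(encoded.isascii())  # whether every character on the wire is ASCII
```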

RFC 1738 [1], section 2.2, covers this (though only for iso-8859-1
encoding). Or you can read a short tutorial about it at Blooberry [2].
Don't tell me you've never heard of this before. :)

Anyway, this is not just about URL encoding (the URL and the GET
parameters); POST parameters work the same way.

Bye,
Andras

[1] http://www.rfc-editor.org/rfc/rfc1738.txt
[2] http://www.blooberry.com/indexdot/html/topics/urlencoding.htm

BÁRTHÁZI András

Apr 18, 2005, 10:25:57 AM

to Randal L. Schwartz, perl6-c...@perl.org

Randal,

> BÁRTHÁZI> use CGI;
> BÁRTHÁZI> set_url_encoding('utf-8');
>
> BÁRTHÁZI> The problem is that "use CGI" automagically initializes the parameters
> BÁRTHÁZI> *before* I set the encoding of them, so set_url_encoding will run too
> BÁRTHÁZI> late.
>
> Did I miss the memo where anything outside the list of valid
> URI characters needed to be hexified, hence there's no need
> for such a URL encoding scheme? Where is this memo?

Can you say that again in other words? Neither Stevan nor I
understand.

Bye,
Andras

BÁRTHÁZI András

Apr 18, 2005, 10:50:54 AM

to Mark A. Biggar, Randal L. Schwartz, perl6-c...@perl.org

Hi,

> I believe that the standard for URL's calls for always encoding in utf-8
> but that all non-ascii bytes (bytes with the high bit set) are to be
> further encoded using %xx hex notation. So the URL is always
> transmitted as an ascii string, but is easily converted into a utf-8
> string simply by converting the %xx codes back into binary bytes. Thus
> firewalls and proxies need only deal with ascii.

You're right, except for one thing: when the standard was created, there
was no UTF-8 encoding, so it can't be the default. I think the standard
says nothing about how non-ASCII characters are encoded (iso-8859-*,
utf-8, or something else). And I am sure that browsers send back
non-ASCII characters in the same encoding as the page containing the
form, so UTF-8 is not the default; there is no default.
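
The practical consequence can be sketched in Python (not Perl 6) with
the standard urllib.parse functions. The character chosen, "ő", is an
illustrative assumption. The same character produces different
percent-encoded bytes depending on the page encoding, and decoding with
the wrong guess silently yields a different character:

```python
from urllib.parse import quote, unquote

char = "\u0151"  # "ő", a letter present in iso-8859-2 but not in iso-8859-1

# What a browser would send for a form on an iso-8859-2 page vs. a utf-8 page.
as_latin2 = quote(char, encoding="iso-8859-2")
as_utf8 = quote(char, encoding="utf-8")

# The server must know which encoding was used; guessing wrong corrupts the text.
wrong_guess = unquote(as_latin2, encoding="iso-8859-1")
right_guess = unquote(as_latin2, encoding="iso-8859-2")

print(as_latin2, as_utf8, wrong_guess, right_guess)
```

This is why the encoding has to be known before the parameters are
decoded.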

Bye,
Andras

Randal L. Schwartz

Apr 18, 2005, 10:29:12 AM

to BÁRTHÁZI András, perl6-c...@perl.org

>>>>> "BÁRTHÁZI" == BÁRTHÁZI András <and...@barthazi.hu> writes:

>> Did I miss the memo where anything outside the list of valid
>> URI characters needed to be hexified, hence there's no need
>> for such a URL encoding scheme? Where is this memo?

BÁRTHÁZI> Can you say that again in other words? Neither Stevan nor I
BÁRTHÁZI> understand.

URLs are only 7 bit ASCII, according to the RFCs. Did I miss a new RFC
where non-7-bit URLs are permitted? If so, please point to that.


Mark A. Biggar

Apr 18, 2005, 10:38:28 AM

to BÁRTHÁZI András, Randal L. Schwartz, perl6-c...@perl.org

BÁRTHÁZI András wrote:

I believe that the standard for URL's calls for always encoding in utf-8
but that all non-ascii bytes (bytes with the high bit set) are to be
further encoded using %xx hex notation. So the URL is always
transmitted as an ascii string, but is easily converted into a utf-8
string simply by converting the %xx codes back into binary bytes. Thus
firewalls and proxies need only deal with ascii.
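
The two-step conversion described here can be sketched in Python (not
Perl 6) with the standard urllib.parse.unquote_to_bytes. The sample
string and variable names are illustrative, and the final decode step
assumes the sender really used UTF-8:

```python
from urllib.parse import unquote_to_bytes

wire = "name=%C3%A1"  # what is transmitted: plain ASCII

key, _, value = wire.partition("=")

# Step 1: turn the %xx codes back into binary bytes.
raw = unquote_to_bytes(value)

# Step 2: interpret those bytes as UTF-8 (an assumption about the sender).
text = raw.decode("utf-8")

print(key, raw, text)
```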

--
ma...@biggar.org
mark.a...@comcast.net