Re: pugs CGI.pm

Stevan Little

Apr 13, 2005, 9:52:41 AM

to perl6-c...@perl.org, BÁRTHÁZI András

Andras,

I am CC-ing this to perl6-compiler in the hope that people smarter than I
can better answer this question.

On Apr 13, 2005, at 9:20 AM, BÁRTHÁZI András wrote:
> I'm trying to create a small web application, and hacking parameter
> handling now.
>
> As Pugs works in UTF-8, my page is coded in UTF-8, too (and there are
> some other reasons, too). When I try to send an accented character to
> the server as a parameter, for example the euro character, I get back a
> UTF-8 coded character:
>
> ...?test=%E2%82%AC
>
> It's OK, but when my code (and CGI.pm as well) try to decode it, it
> will give back three characters and not just one.
>
> The problem is with this line in sub url_decode():
>
> $decoded ~~ s:perl5:g/%([\da-fA-F][\da-fA-F])/{chr(hex($1))}/;
>
> Have any idea, how to solve it? I think I should transform this code
> to recognize multi-bytes, decode the character value, and after it use
> chr on this value. Or is there a way to do it by not creating
> character by chr(), but a byte with another function?

To be honest, my experience with multi-byte character sets is very
limited (my first real exposure is on the Pugs project). However, I
think/hope that maybe the chr() builtin will eventually be able to
handle multi-bytes itself. In the (non-working) port of CGI-Lite
(http://tpe.freepan.org/repos/iblech/CGI-Lite/lib/CGI/Lite.pm), I saw
code which did this:

/%(<[\da-fA-F]>**{2})/{chr :16($1)}/

Of course it was followed by this comment "# XXX -- correct?" so it may
not be anything official yet.

That is my best guess (and it is not very good), so I will leave you in
the capable hands of the perl6-compiler crew.

- Stevan

Ingo Blechschmidt

Apr 13, 2005, 10:45:23 AM

to perl6-c...@perl.org

Hi,

Stevan Little wrote:
> On Apr 13, 2005, at 9:20 AM, BÁRTHÁZI András wrote:
>> The problem is with this line in sub url_decode():
>>
>> $decoded ~~ s:perl5:g/%([\da-fA-F][\da-fA-F])/{chr(hex($1))}/;
>>
>> Have any idea, how to solve it? I think I should transform this code
>> to recognize multi-bytes, decode the character value, and after it
>> use chr on this value. Or is there a way to do it by not creating
>> character by chr(), but a byte with another function?

> handle multi-bytes itself. In the (non-working) port of CGI-Lite
> (http://tpe.freepan.org/repos/iblech/CGI-Lite/lib/CGI/Lite.pm), I saw
> code which did this:
>
> /%(<[\da-fA-F]>**{2})/{chr :16($1)}/
>
> Of course it was followed by this comment "# XXX -- correct?" so it
> may not be anything official yet.

the "XXX -- correct" refers to the :16 (IIRC, Larry said on p6l that he
liked that, but I wasn't able to find it in the Synopses).

BTW, Pugs' chr does understand input > 255 correctly:
pugs> ord "€"
8364
pugs> chr 8364
'€'

$decoded does contain valid UTF-8, the problem is Pugs' print/say
builtin -- compare:
$ perl -we 'print "\xE2\x82\xAC\n"'
€
$ perl -we 'die "\xE2\x82\xAC"'
€ at -e line 1.
$ pugs -we 'say "\xE2\x82\xAC"'
â¬ # garbage!
$ pugs -we 'die "\xE2\x82\xAC"'
€ # perfectly fine €-sign!
Val (VList [VStr "\226\130\172"])

--Ingo

--
Linux, the choice of a GNU | Black holes result when God divides the
generation on a dual AMD | universe by zero.
Athlon! |

BÁRTHÁZI András

Apr 13, 2005, 12:39:27 PM

to Ingo Blechschmidt, perl6-c...@perl.org

Hi!

> the "XXX -- correct" refers to the :16 (IIRC, Larry said on p6l that he
> liked that, but I wasn't able to find it in the Synopses).
>
> BTW, Pugs' chr does understand input > 255 correctly:
> pugs> ord "€"
> 8364
> pugs> chr 8364
> '€'

Yes, I know it.

> $decoded does contain valid UTF-8, the problem is Pugs' print/say
> builtin -- compare:

It's interesting, and it can be the problem, but I think, the CGI.pm way
is not the good solution to decode the URL encoded string: if you say
chr(0xE2)~chr(0x82)~chr(0xA2), then they are 3 characters, and chr(0xE2)
is a 2-byte character in UTF-8 (on an iso-8859-1 terminal the
output may look right, but the internal storage and handling aren't). I mean,
if you want to handle the string in memory and query its
length, then this way you get 3, but the right answer is 1.

So, if there isn't a trick there (for example a function called "byte"
that is usable like "chr"), then CGI.pm has to recognize %E2%82%AC as one
character and decode it by evaluating chr(8364).

Additionally, detecting character boundaries is not so easy, because a
character can be 2-4 bytes long, and two or more characters can be next to
each other.

Bye,
Andras

Mark A Biggar

Apr 13, 2005, 12:57:20 PM

to BÁRTHÁZI András, Ingo Blechschmidt, perl6-c...@perl.org

The standard for URLs uses a double encoding: A URL is coded in UTF-8 and then all bytes with high bits set are written in the %xx format. Therefore, if you just convert each %xx to the proper byte, the result is a valid UTF-8 string. You don't need to worry about multi-byte codes, if UTF-8 is the result you want.
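
A minimal sketch of the two-step decoding described above, written in Python purely for illustration (the thread itself concerns Pugs, and the name `url_decode` here is only borrowed from the CGI.pm sub under discussion, not taken from its code): each %xx becomes one raw byte, and only afterwards is the whole byte string interpreted as UTF-8.

```python
import re

# Matches one percent-escape such as %E2 (operating on bytes, not characters).
_PERCENT_ESCAPE = re.compile(rb"%([0-9A-Fa-f]{2})")

def url_decode(encoded):
    """Illustrative helper: percent-decode to bytes, then decode as UTF-8."""
    # Step 1: replace each %xx with the single byte it names.
    raw = _PERCENT_ESCAPE.sub(
        lambda m: bytes([int(m.group(1), 16)]),
        encoded.encode("ascii"),
    )
    # Step 2: interpret the complete byte string as UTF-8 in one go.
    return raw.decode("utf-8")

value = url_decode("%E2%82%AC")
print(len(value))  # one character (the euro sign), not three
```

Python's standard library offers `urllib.parse.unquote`, which follows the same bytes-first approach.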

--
Mark Biggar
ma...@biggar.org
mark.a...@comcast.net
mbi...@paypal.com

Ingo Blechschmidt

Apr 13, 2005, 1:10:54 PM

to perl6-c...@perl.org

Hi,

BÁRTHÁZI András wrote:
> It's interesting, and it can be the problem, but I think, the CGI.pm
> way is not the good solution to decode the URL encoded string: if you
> say chr(0xE2)~chr(0x82)~chr(0xA2), then they are 3 characters, and

s:g/A2/AC/?

I think we've discovered a bug in Pugs, but as I don't know that much
about UTF-8, I'd like to see the following confirmed first :).
# This is what *should* happen:
my $x = chr(0xE2)~chr(0x82)~chr(0xAC);
say $x.bytes; # 3
say $x.chars; # 1

# This is what currently happens:
my $x = chr(0xE2)~chr(0x82)~chr(0xAC);
say $x.bytes; # 6
say $x.chars; # 3

Comparison with perl5:
$ perl -MEncode -we '
my $x = decode "utf-8", chr(0xE2).chr(0x82).chr(0xAC);
print length $x;
'
1 # (chars)

$ perl -we '
my $x = chr(0xE2).chr(0x82).chr(0xAC);
print length $x;
'
3 # (bytes)

--Ingo

--
Linux, the choice of a GNU | The computer revolution is over. The
generation on a dual AMD | computers won. -- Eduard Bloch <e...@gmx.de>
Athlon! |

Nathan Gray

Apr 13, 2005, 11:08:12 AM

to Stevan Little, perl6-c...@perl.org, BÁRTHÁZI András

On Wed, Apr 13, 2005 at 09:52:41AM -0400, Stevan Little wrote:
> On Apr 13, 2005, at 9:20 AM, BÁRTHÁZI András wrote:
> >As Pugs works in UTF-8, my page is coded in UTF-8, too (and there are
> >some other reasons, too). When I try to send an accented character to
> >the server as a parameter, for example the euro character, I get back a
> >UTF-8 coded character:
> >
> > ...?test=%E2%82%AC
> >
> >It's OK, but when my code (and CGI.pm as well) try to decode it, it
> >will give back three characters and not just one.
> >
> >The problem is with this line in sub url_decode():
> >
> > $decoded ~~ s:perl5:g/%([\da-fA-F][\da-fA-F])/{chr(hex($1))}/;
> >
> >Have any idea, how to solve it? I think I should transform this code
> >to recognize multi-bytes, decode the character value, and after it use
> >chr on this value. Or is there a way to do it by not creating
> >character by chr(), but a byte with another function?
>
> To be honest, my experience with multi-byte character sets is very
> limited (my first real exposure is on the Pugs project). However, I
> think/hope that maybe the chr() builtin will eventually be able to
> handle multi-bytes itself. In the (non-working) port of CGI-Lite
> (http://tpe.freepan.org/repos/iblech/CGI-Lite/lib/CGI/Lite.pm), I saw
> code which did this:
>
> /%(<[\da-fA-F]>**{2})/{chr :16($1)}/
>
> Of course it was followed by this comment "# XXX -- correct?" so it may
> not be anything official yet.

The trick is that URL encoding encodes bytes, not characters:

http://www.w3.org/TR/html4/appendix/notes.html#non-ascii-chars

So in the regex we have to determine whether we are unencoding a
single-byte or multi-byte character.

Both

s:perl5:g/%([\da-fA-F][\da-fA-F])/{chr(hex($1))}/

and

/%(<[\da-fA-F]>**{2})/{chr :16($1)}/

read in a single byte and pass it to chr(). I do not have enough
experience with multi-byte characters to know when a byte can be
recognized as the first byte of a multi-byte character, and thus grab
the next byte before passing to chr().

-kolibrie

Mark A Biggar

Apr 13, 2005, 1:33:25 PM

to Ingo Blechschmidt, perl6-c...@perl.org

No, the bug is using chr() to convert the byte: chr() appears to be defined as taking a Unicode codepoint and returning a UTF-8 character (which will be multibyte if the arg is >127), not as taking an int and returning an 8-bit char with the same value. If this were perl 5, I'd say you really wanted to use pack instead. We really need both conversion functions, and chr() can't be both.
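
The distinction between the two conversions can be sketched in Python, whose separate `str` and `bytes` types make it explicit. This is only an illustration of the idea, not a proposal for the Pugs builtins; all variable names are made up for the example.

```python
code_point = 0xE2

# "chr" semantics: a character with that code point (U+00E2).
as_character = chr(code_point)
# Encoded as UTF-8, that single character occupies two bytes: C3 A2.
as_utf8 = as_character.encode("utf-8")

# "byte" semantics: one 8-bit value, with no character meaning attached yet.
as_byte = bytes([code_point])

# Three characters built with chr(): 3 characters, 6 bytes once UTF-8 encoded.
three_chars = chr(0xE2) + chr(0x82) + chr(0xAC)

# Three raw bytes, decoded together as UTF-8: a single euro sign.
euro = bytes([0xE2, 0x82, 0xAC]).decode("utf-8")

print(len(as_utf8), len(as_byte), len(three_chars.encode("utf-8")), len(euro))
```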

--
Mark Biggar
ma...@biggar.org
mark.a...@comcast.net
mbi...@paypal.com

BÁRTHÁZI András

Apr 13, 2005, 1:42:51 PM

to Ingo Blechschmidt, perl6-c...@perl.org

Hi,

>>It's interesting, and it can be the problem, but I think, the CGI.pm
>>way is not the good solution to decode the URL encoded string: if you
>>say chr(0xE2)~chr(0x82)~chr(0xA2), then they are 3 characters, and
>
> s:g/A2/AC/?

Yes, never mind that.

First, I should tell you that I'm no master of encodings;
I just have some experience and I try to think logically.

> I think we've discovered a bug in Pugs, but as I don't know that much
> about UTF-8, I'd like to see the following confirmed first :).
> # This is what *should* happen:
> my $x = chr(0xE2)~chr(0x82)~chr(0xAC);
> say $x.bytes; # 3
> say $x.chars; # 1

I don't agree.

> # This is what currently happens:
> my $x = chr(0xE2)~chr(0x82)~chr(0xAC);
> say $x.bytes; # 6
> say $x.chars; # 3

I think this is the correct behaviour.

chr(0xE2)=chr(226) is a valid character in Unicode; it's â, I think.
When I write chr(...), it has to mean that I'm talking about a
*character*, and not a byte. If I'm talking about character #226,
then its internal representation will be 0x00E2 (I don't know why not
0x000000E2, but it's not so important). If I mean that, and I
concatenate three characters (not three bytes), then the result will be three
characters and six bytes.

> Comparison with perl5:
> $ perl -MEncode -we '
> my $x = decode "utf-8", chr(0xE2).chr(0x82).chr(0xAC);
> print length $x;
> '
> 1 # (chars)
>
> $ perl -we '
> my $x = chr(0xE2).chr(0x82).chr(0xAC);
> print length $x;
> '
> 3 # (bytes)

Your example is about the same thing I'm talking about: if you just
concatenate characters, then their length will be 3 *characters* and
*not bytes*, as you're writing in the second perl5 example. If you
decode these three characters, then they can be converted to one
character.

Bye,
Andras

Roie Marianer

Apr 13, 2005, 9:57:43 AM

to perl6-c...@perl.org

> I think we've discovered a bug in Pugs, but as I don't know that much
> about UTF-8, I'd like to see the following confirmed first :).
> # This is what *should* happen:
> my $x = chr(0xE2)~chr(0x82)~chr(0xAC);
> say $x.bytes; # 3
> say $x.chars; # 1
>
> # This is what currently happens:
> my $x = chr(0xE2)~chr(0x82)~chr(0xAC);
> say $x.bytes; # 6
> say $x.chars; # 3

That doesn't make sense. If you read the first statement "my $x=..." out loud,
you'll see it says "character 0xE2, then character 0x82, then character
0xAC". Three characters. On the other hand,

my $x = chr(0x20AC); # Look ma, Unicode!
say $x.bytes; #3
say $x.chars; #1

--
-Roie
v2sw6+7CPhw5ln5pr4/6$ck2ma8+9u7/8LSw2l6Fi2e2+8t4TNDSb8/4Aen4+7g5Za22p7/8
[ http:www.hackerkey.com ]

Ingo Blechschmidt

Apr 13, 2005, 2:23:17 PM

to perl6-c...@perl.org

Hi,

Roie Marianer wrote:
>> # This is what *should* happen:
>> my $x = chr(0xE2)~chr(0x82)~chr(0xAC);
>> say $x.bytes; # 3
>> say $x.chars; # 1
>>
>> # This is what currently happens:
>> my $x = chr(0xE2)~chr(0x82)~chr(0xAC);
>> say $x.bytes; # 6
>> say $x.chars; # 3
>
> That doesn't make sense. If you read the first statement "my $x=..."
> out loud, you'll see it says "character 0xE2, then character 0x82,
> then character 0xAC". Three characters. On the other hand,
>
> my $x = chr(0x20AC); # Look ma, Unicode!
> say $x.bytes; #3
> say $x.chars; #1

ah! That makes perfect sense, thanks for clarifying matters! :)

Ok, then it seems we need to have a builtin, such that:
new_builtin(0xE2) ~ new_builtin(0x82) ~ new_builtin(0xAC) eq
"\xE2\x82\xAC"

--Ingo

--
Linux, the choice of a GNU | "The future is here. It's just not widely
generation on a dual AMD | distributed yet." -- William Gibson
Athlon! |

BÁRTHÁZI András

Apr 13, 2005, 2:36:07 PM

to Nathan Gray, Stevan Little, perl6-c...@perl.org

Hi,

> So in the regex we have to determine whether we are unencoding a
> single-byte or multi-byte character.

> read in a single byte and pass it to chr(). I do not have enough

> experience with multi-byte characters to know when a byte can be
> recognized as the first byte of a multi-byte character, and thus grab
> the next byte before passing to chr().

From RFC-2279 [1], with my comments:

0000 0000-0000 007F    0xxxxxxx
0-127                  [0-7][0-9A-F]

0000 0080-0000 07FF    110xxxxx 10xxxxxx
128-2047               [C-D][0-9A-F] [8-B][0-9A-F]*1

0000 0800-0000 FFFF    1110xxxx 10xxxxxx 10xxxxxx
2048-65535             [E][0-9A-F] [8-B][0-9A-F]*2

0001 0000-001F FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
65536-2097151          [F][0-7] [8-B][0-9A-F]*3

0020 0000-03FF FFFF    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
2097152-67108863       [F][8-B] [8-B][0-9A-F]*4

0400 0000-7FFF FFFF    1111110x 10xxxxxx 10xxxxxx [..] 10xxxxxx
67108864-2147483647    [F][C-D] [8-B][0-9A-F]*5

That is what we should write an algorithm for. Character boundaries can be
detected easily: a multi-byte UTF-8 character always starts with a byte in
0xC0-0xFD, followed by one to five bytes in 0x80-0xBF.
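
The RFC 2279 table can be turned into a lead-byte lookup along these lines. This is a rough Python sketch with hypothetical function names, not the code that was later posted to the list. It follows the six-byte forms of RFC 2279 as quoted here; the later RFC 3629 restricts UTF-8 to at most four bytes.

```python
def sequence_length(lead):
    """Length in bytes of the UTF-8 sequence starting with `lead` (RFC 2279)."""
    if lead <= 0x7F:
        return 1          # 0xxxxxxx: plain ASCII
    if 0xC0 <= lead <= 0xDF:
        return 2          # 110xxxxx
    if 0xE0 <= lead <= 0xEF:
        return 3          # 1110xxxx
    if 0xF0 <= lead <= 0xF7:
        return 4          # 11110xxx
    if 0xF8 <= lead <= 0xFB:
        return 5          # 111110xx
    if 0xFC <= lead <= 0xFD:
        return 6          # 1111110x
    # 0x80-0xBF are continuation bytes; 0xFE and 0xFF never occur in UTF-8.
    raise ValueError("not a valid lead byte: 0x%02X" % lead)

def split_characters(data):
    """Split a byte string into one chunk of bytes per encoded character."""
    chunks = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunk = data[i:i + n]
        # Every byte after the lead byte must be a continuation byte.
        if len(chunk) != n or any(not 0x80 <= b <= 0xBF for b in chunk[1:]):
            raise ValueError("truncated or malformed sequence at offset %d" % i)
        chunks.append(chunk)
        i += n
    return chunks

print(split_characters(b"a\xe2\x82\xac\xc3\xa9"))
```

Because the lead byte alone determines the sequence length, adjacent multi-byte characters separate cleanly without any lookahead.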

Bye,
Andras

[1] http://www.faqs.org/rfcs/rfc2279.html

Jonathan Scott Duff

Apr 13, 2005, 2:52:33 PM

to Ingo Blechschmidt, perl6-c...@perl.org

On Wed, Apr 13, 2005 at 08:23:17PM +0200, Ingo Blechschmidt wrote:
> ah! That makes perfect sense, thanks for clarifying matters! :)
>
> Ok, then it seems we need to have a builtin, such that:
> new_builtin(0xE2) ~ new_builtin(0x82) ~ new_builtin(0xAC) eq
> "\xE2\x82\xAC"

Hmm. Looks like you've just found that built-in ;-)

"\xE2" ~ "\x82" ~ "\xAC"

Though if "chr" is a vowel-challenged version of "char", then surely
"byte" (or "bte" if you're insane :) should be the name of that new
built-in. If it's even to have a name. I'm perfectly happy with
interpolation if that's a way that works.

-Scott
--
Jonathan Scott Duff
du...@pobox.com

BÁRTHÁZI András

Apr 13, 2005, 2:51:36 PM

to Ingo Blechschmidt, perl6-c...@perl.org

Hi,

> ah! That makes perfect sense, thanks for clarifying matters! :)
>
> Ok, then it seems we need to have a builtin, such that:
> new_builtin(0xE2) ~ new_builtin(0x82) ~ new_builtin(0xAC) eq
> "\xE2\x82\xAC"

I think - conceptually - it cannot be done, because you cannot store a
byte in a character string, and ~ is for concatenating character
strings, not byte strings. In fact, you can do it, because Pugs' (and as
I know Parrot's) internal string representation is UTF-8 (but what about
other compiler destinations, like machine code, JVM, .Net?), and you can
put bytes into it. But I think it would be a bad decision to do so,
because what if in the future you would like to change this behaviour?
The system should be totally transparent and virtual, and it sounds like
a hack to me.

I think it should be done in CGI.pm.

Bye,
Andras

Dan Sugalski

Apr 13, 2005, 3:21:01 PM

to BÁRTHÁZI András, Ingo Blechschmidt, perl6-c...@perl.org

At 8:51 PM +0200 4/13/05, BÁRTHÁZI András wrote:
>Hi,
>
>>ah! That makes perfect sense, thanks for clarifying matters! :)
>>
>>Ok, then it seems we need to have a builtin, such that:
>> new_builtin(0xE2) ~ new_builtin(0x82) ~ new_builtin(0xAC) eq
>> "\xE2\x82\xAC"
>
>I think - conceptually - it cannot be done,
>because you cannot store a byte in a character
>string, and ~ is for concatenating character
>strings, not byte strings. In fact, you can do
>it, because Pugs' (and as I know Parrot's)
>internal string representation is UTF-8

Parrot's not UTF-8 internally. It can do UTF-8 if
it must, but we prefer not, since UTF-8 sucks in
so very many ways.

Parrot's encoding-neutral. You can (or will, when
I finish some library code) be able to mix
unicode, Latin-3, Shift-JIS, EBCDIC, and EUC-KR
string data in a program if you wanted. (Though
I'd generally recommend against it)
--
Dan

--------------------------------------it's like this-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk

BÁRTHÁZI András

Apr 13, 2005, 3:34:24 PM

to Dan Sugalski, Ingo Blechschmidt, perl6-c...@perl.org

Hi,

>>> ah! That makes perfect sense, thanks for clarifying matters! :)
>>>
>>> Ok, then it seems we need to have a builtin, such that:
>>> new_builtin(0xE2) ~ new_builtin(0x82) ~ new_builtin(0xAC) eq
>>> "\xE2\x82\xAC"
>>
>> I think - conceptually - it cannot be done, because you cannot store a
>> byte in a character string, and ~ is for concatenating character
>> strings, not byte strings. In fact, you can do it, because Pugs' (and
>> as I know Parrot's) internal string representation is UTF-8
>
> Parrot's not UTF-8 internally. It can do UTF-8 if it must, but we prefer
> not, since UTF-8 sucks in so very many ways.
>
> Parrot's encoding-neutral. You can (or will, when I finish some library
> code) be able to mix unicode, Latin-3, Shift-JIS, EBCDIC, and EUC-KR
> string data in a program if you wanted. (Though I'd generally recommend
> against it)

So, then here's a solution:
http://barthazi.hu/decode.pugs

It wasn't heavily tested (euro sign, all the Hungarian letters and some
others work), but I think it can work in all possible situations.

Bye,
Andras

Stevan Little

Apr 13, 2005, 5:30:47 PM

to BÁRTHÁZI András, perl6-c...@perl.org

Andras,

On Apr 13, 2005, at 3:34 PM, BÁRTHÁZI András wrote:
> So, then here's a solution:
> http://barthazi.hu/decode.pugs
>
> It wasn't heavily tested (euro sign, all the Hungarian letters and
> some others work), but I think it can work in all possible situations.

Let me start by saying this would be an excellent addition to CGI.pm!
And I will gladly give you commit rights to do so. Since I am not
familiar with this stuff I would prefer not to do it myself.

However, I would ask that you also include some unit tests to confirm
that the code is working in all (or at least as many as possible) cases
and does not break anything. And ideally you would be able to devise
some kind of test that would avoid the 5 extra reg-exps if they were
not needed. It could even be some kind of configuration variable, maybe
like this:

use v6;
require CGI;
use_multibyte_encoding();
....

I would also like to see an encode() version of this as well (I know
that the current version will not work correctly since the dec2hex hack
only handles 0 - 128).

Let me know what you would like to do.

- Stevan

Ovid

Apr 13, 2005, 5:49:25 PM

to perl6-c...@perl.org

--- Stevan Little <ste...@iinteractive.com> wrote:
> Andras,
>
> On Apr 13, 2005, at 3:34 PM, BÁRTHÁZI András wrote:
> > So, then here's a solution:
> > http://barthazi.hu/decode.pugs
> >
> > It wasn't heavily tested (euro sign, all the Hungarian letters and
> > some others work), but I think it can work in all possible
> situations.
>
> Let me start by saying this would be an excellent addition to CGI.pm!

As the other guy who's been hacking on CGI.pm, I second this. I'd
rather not add it myself because I don't know the problem space well
enough.

Cheers,
Ovid

--
If this message is a response to a question on a mailing list, please send
follow up questions to the list.

Web Programming with Perl -- http://users.easystreet.com/ovid/cgi_course/

Roie Marianer

Apr 14, 2005, 1:26:59 AM

to perl6-c...@perl.org

On Wednesday 13 April 2005 9:23 pm, Ingo Blechschmidt wrote:
> Ok, then it seems we need to have a builtin, such that:
> new_builtin(0xE2) ~ new_builtin(0x82) ~ new_builtin(0xAC) eq
> "\xE2\x82\xAC"

Wouldn't it make more sense to have a decode_utf8 function such that

decode_utf8(0xE2, 0x82, 0xAC) eq chr(0x20AC)

No reason why this function can't also handle multiple characters

decode_utf8((0xE2, 0x82, 0xAC) xx 3) eq chr(0x20AC) x 3

but probably can't handle partial characters

decode_utf8(0xE2, 0x82) # error

This function doesn't even have to be built in, by the way.
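
As a rough illustration of the proposed behaviour, here is a Python sketch of such a function. The name `decode_utf8` is taken from the examples above; everything else is an assumption, and a real Pugs version would of course look different.

```python
def decode_utf8(*byte_values):
    """Turn a sequence of byte values (0-255) into a character string."""
    # bytes() rejects values outside 0-255; strict decoding rejects
    # malformed or incomplete UTF-8 sequences.
    return bytes(byte_values).decode("utf-8")

# One character from three bytes.
assert decode_utf8(0xE2, 0x82, 0xAC) == chr(0x20AC)

# Several characters in one call.
assert decode_utf8(*([0xE2, 0x82, 0xAC] * 3)) == chr(0x20AC) * 3

# A partial character is an error.
try:
    decode_utf8(0xE2, 0x82)
except UnicodeDecodeError as err:
    print("error:", err.reason)
```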