On Fri, 8 Jan 2010, PerlFAQ Server wrote in comp.lang.perl.misc:
> 9.16: How do I decode a CGI form?
>
> (contributed by brian d foy)
>
> Use the CGI.pm module that comes with Perl. It's quick, it's easy, and
> it actually does quite a bit of work to ensure things happen correctly.
> It handles GET, POST, and HEAD requests, multipart forms, multivalued
> fields, query string and message body combinations, and many other
> things you probably don't want to think about.
It even works with forms that submit their input in UTF-8, which is
needed for languages other than English. However, the documentation warns
against using this feature and recommends fiddling around with encode and
decode instead, thus deprecating one of the really useful features of the
CGI.pm module.
More details are in my recent post to the newsgroup comp.lang.perl.modules.
--
Helmut Richter
I think I made a mistake when testing. The CGI.pm module does *not* work
with forms in UTF-8, and the "-utf8" pragma has no effect, but at least no
detrimental effect on uploaded files. If you are not using a language with
a restricted character set, you have to know what CGI.pm does (finding the
bytes representing the parameters in the filled-out forms) and what it
does not (interpreting them as characters).
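A minimal sketch of the difference between bytes and characters, using only the core Encode module. The variable names are made up for illustration, and the hard-coded byte string is merely a stand-in for what an undecoded form field containing "München" would hold:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for a raw form value: "München" as UTF-8 bytes.
my $bytes = "M\xC3\xBCnchen";

# Decoding turns the byte sequence into a string of characters.
my $text = decode('UTF-8', $bytes);

# The "ü" occupies two bytes but is a single character.
printf "bytes: %d, characters: %d\n", length($bytes), length($text);
```

The two lengths differ because the undecoded value counts the two bytes of "ü" separately.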
--
Helmut Richter
I should add an example script:
#! /usr/local/bin/perl

# The script asks for a file and a "location" and repeats the question until
# "München" is given as location. The file should be interpreted as binary;
# the test uses a .jpg file.

use utf8;
use strict;
use CGI qw(-utf8);
use CGI::Carp qw(fatalsToBrowser);
use Encode;

binmode (STDOUT, ":utf8");

my $cgi = CGI->new;
my $fh;

if ($cgi->param('location') eq 'München') {
    cgi_action();
} else {
    cgi_form();
};

# Print the form; the location field defaults to "Nürnberg".
sub cgi_form {
    print <<END;
Content-Type: text/html; charset=UTF-8

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Test for UTF-8 and binary input</title>
<meta http-equiv="Content-type" content="text/html;charset=UTF-8">
</head>
<body>
<h1>Test for UTF-8 and binary input</h1>
<form action="http://www.lrz-muenchen.de/cgi/richter/test-utf.html"
method="post" accept-charset="UTF-8" enctype="multipart/form-data">
<p>Select a file:<br>
<input name="uploaded_file" type="file" size="50">
</p>
<p>Location:
END
    print $cgi->textfield ('location', 'Nürnberg', 30, 30);
    print <<END;
</p>
<p><input type="submit" value="Formulardaten absenden"></p>
</form>
</body>
</html>
END
};

# Copy the uploaded file to disk and print a confirmation page.
sub cgi_action {
    open (OUT, ">/afs/lrz/info/www/CGI/richter/upload-test-copy")
        || die "unable to open OUT";
    $fh = $cgi->upload('uploaded_file');
    while (<$fh>) {
        print (OUT $_) || die "unable to print";
    };
    print <<END;
Content-Type: text/html; charset=UTF-8

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Felder für Datei-Upload definieren</title>
</head>
<body>
<h1>Thank you!</h1>
</body>
</html>
END
};
I tried changing the use of the -utf8 pragma (no effect whatsoever) and
of ":utf8" on STDOUT (without it, the first output is garbled; with it,
the first output is fine but every subsequent one is garbled).
If someone wants to run this code, be sure that the text in the file is
indeed UTF-8; otherwise "use utf8;" makes the perl interpreter parse
the source code incorrectly.
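One plausible explanation for the garbled subsequent output (an assumption, not verified against the script itself) is that a bytestring printed through a ":utf8" layer gets encoded a second time. The sketch below shows that effect in isolation; bytes_written is a hypothetical helper that prints into an in-memory buffer instead of STDOUT:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Print a string through a :utf8 layer into an in-memory buffer and
# return the bytes that would reach the browser.
sub bytes_written {
    my ($string) = @_;
    my $buffer = '';
    open(my $out, '>:utf8', \$buffer) or die "cannot open buffer: $!";
    print {$out} $string;
    close($out) or die "cannot close buffer: $!";
    return $buffer;
}

my $bytes = "M\xC3\xBCnchen";          # bytestring, as received from the form
my $text  = decode('UTF-8', $bytes);   # textstring

# The textstring is encoded once; the bytestring is encoded a second time.
printf "textstring: %vX\n", bytes_written($text);
printf "bytestring: %vX\n", bytes_written($bytes);
```

In the second line of output the two bytes of "ü" have each been expanded into a two-byte sequence, which a browser shows as two wrong characters.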
--
Helmut Richter
I think you're trying to demonstrate a bug in CGI.pm, and you're doing
that quite well, but here are my thoughts about your program:
First, you can't post to an HTML file:
http://www.lrz-muenchen.de/cgi/richter/test-utf.html . I changed that
when I ran your program.
Second, multipart/form-data is supposed to be for binary files, and as
such is not suitable for utf8 data /unless/ you wish to handle the
conversion to/from utf8 yourself. I don't think that CGI.pm is
misbehaving too badly.
FWIW, I placed your program with my changes here:
http://home.earthlink.net/~mumia.w.18.spam/docs/test-binary.txt
Despite appearances in the browser window, the file is utf8.
> I think you're trying to demonstrate a bug in CGI.pm, and you're doing that
> quite well,
Up to now, I was not able to say what exactly my problem is. This is in
part due to the fact that the CGI.pm documentation fails to say what
*exactly* CGI.pm does with UTF-8, so that I cannot check the behaviour
against the documentation. Let me try again.
The documentation http://perldoc.perl.org/CGI.html says:
| -utf8
|
| This makes CGI.pm treat all parameters as UTF-8 strings. Use this with care,
| as it will interfere with the processing of binary uploads. It is better to
| manually select which fields are expected to return utf-8 strings and
| convert them using code like this:
|
|   use Encode;
|   my $arg = decode utf8=>param('foo');
What does the first sentence mean: "This makes CGI.pm treat all parameters
as UTF-8 strings"? What is "treat as" in this context?
The only way I find for giving it a meaningful interpretation is the
following.
a. The handling of UTF-8 characters is documented in
http://perldoc.perl.org/perlunitut.html . The main lesson to be learnt
is that the writer of any perl program is responsible for knowing which
of his strings are bytestrings and which are textstrings. If he does not know, he
might try to output textstrings to a binary file (does not work for
"wide" characters), or output bytestrings to a UTF-8 encoded file
(renders "wide" characters as a sequence of two or more characters), or
compare a textstring with a bytestring (unpredictable whether equal
strings will be recognised as equal).
b. Hence, it is vital that the user of the CGI.pm module know whether the
result of param('param_name') is a bytestring or a textstring. If
bytestring, it has to be output with STDOUT as binary; if textstring,
as UTF-8 (that is, after binmode (STDOUT, ":utf8")). If bytestring, it
can be compared only with other bytestrings whose encoding is known to
be the same; if textstring, it can only be compared with other
textstrings. If bytestring, a default setting of a field in the form
must also be a bytestring; if textstring, it must be a textstring. If
one of the input string (the value of param() at the start) and the
output string (the default setting of the field in the form) is a
textstring and the other is a bytestring, then the user input will not
properly be reused in a subsequent iteration of the form.
c. The documentation does not say what kind of strings the param()
function yields or expects. It is plausible that they are bytestrings
in the absence of any specification, but what if the -utf8 pragma is
specified?
d. The first possible interpretation is that param() still yields
bytestrings even with that pragma. Then the user of CGI.pm has to take
care that all other strings with which the parameter is defaulted or
compared (as described in item (b) above) are bytestrings as well. But
then the pragma simply has no effect at all.
e. The second possible interpretation is that, with the -utf8 pragma,
param() delivers the form input data as textstrings. Then they can also
be defaulted to textstrings, compared with other textstrings, and
output to a new form provided that STDOUT is in :utf8 mode. This would
not only be a reasonable behaviour but an extremely useful one.
Therefore I find this the most plausible interpretation.
f. Up to now, we have not considered uploading binary files. If behaviour
(e) is intended, which we do not know due to lack of unambiguous
documentation, then one could think of setting :utf8 mode also on
STDIN. However, this would be a very bad idea, as then the entire input
would have to be UTF-8 which is not the case for embedded binary files.
An alternative implementation would be to first extract uploaded files
from the input data, and then interpret the remainder as UTF-8 data
(which can be guaranteed if properly specified in the accept-charset
option of the <form> tag). Again: This would not only be a reasonable
behaviour but an extremely useful one. Therefore I find this the most
plausible interpretation.
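The comparison problem from items (a) and (b) can be sketched without CGI.pm at all. This is only an illustration using the core Encode module; $raw is a hypothetical stand-in for an undecoded value as it would arrive from the form:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $literal = "M\x{FC}nchen";               # textstring: 7 characters
my $raw     = encode('UTF-8', $literal);    # bytestring: 8 bytes

# Comparing a bytestring with a textstring fails for non-ASCII content.
print "raw eq literal:     ",
      ($raw eq $literal ? "equal" : "different"), "\n";

# After decoding, both sides are textstrings and compare as expected.
print "decoded eq literal: ",
      (decode('UTF-8', $raw) eq $literal ? "equal" : "different"), "\n";
```

The same reasoning applies to default values: a default must be of the same kind as the value that comes back from the form.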
That a call to decode() for decoding input data is mentioned in the
documentation as an *alternative* to the -utf8 pragma is another hint that
behaviour (e) is intended, and that the documentation warns against
simultaneous usage of -utf8 and binary upload is a hint in the same
direction, suggesting that the implementation is used which I called "a
very bad idea" under item (f).
Now what actually happens is:
1. The strings yielded by param() are unusable as textstrings. They are
UTF-8 bytestrings, exactly as if the -utf8 pragma had not been
specified.
2. I was not able to produce any problems in the uploaded binary files
with or without the -utf8 pragma. The warning is futile.
The module thus works quite differently from what I expect from the
documentation. To sum up, the documentation says it does something useful
with UTF-8 data but may garble binary data. In fact, it does nothing at
all with UTF-8 data, and because it does nothing, it does not do any harm.
The documentation is not clear enough to tell whether it is a deficiency
in the documentation or the program. The documentation should specify
exactly what to expect, and the actual behaviour is far from optimal (so
fixing only the documentation is not the right solution).
> First, you can't post to an HTML file:
> http://www.lrz-muenchen.de/cgi/richter/test-utf.html . I changed that when I
> ran your program.
It's just the name of the file; it is the perl script I included. The
Apache where I ran it accepted the file as CGI script despite its name
(based on the directory where it is).
> Second, multipart/form-data is supposed to be for binary files, and as such is
> not suitable for utf8 data /unless/ you wish to handle the conversion to/from
> utf8 yourself. I don't think that CGI.pm is misbehaving too badly.
Yes, this is what the documentation says, but that is not true, as
explained above in detail. And it *should* not be true: the CGI.pm module
has access to the entire input string *before* any code conversions are
done -- *this* is the right place for decoding the data, not in the perl
script on a parameter by parameter basis.
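A rough sketch of what decoding in one central place could look like, assuming the raw field values and the names of the upload fields are already known. The hashes and the decode_fields helper are hypothetical and are not part of CGI.pm:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical raw form data: field name => undecoded bytes.
my %raw = (
    location      => "M\xC3\xBCnchen",
    uploaded_file => "\xFF\xD8\xFF\xE0",   # first bytes of a JPEG file
);

# Fields that carry binary uploads and must be left untouched.
my %is_upload = (uploaded_file => 1);

# Decode every text field in one place; pass upload fields through as bytes.
sub decode_fields {
    my ($fields, $uploads) = @_;
    my %decoded;
    for my $name (keys %$fields) {
        $decoded{$name} = $uploads->{$name}
            ? $fields->{$name}
            : decode('UTF-8', $fields->{$name});
    }
    return \%decoded;
}

my $form = decode_fields(\%raw, \%is_upload);
printf "location has %d characters\n", length $form->{location};
printf "upload has %d bytes\n", length $form->{uploaded_file};
```

The point of the design is that the script itself never sees an undecoded text field, while binary data is never run through a decoder.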
Thank you for looking into it.
--
Helmut Richter
After reading your message and testing a little more, I'm a little more
confused than before ;-)
I upgraded to CGI.pm 3.48; I had no idea that I wasn't using the correct
version; CGI.pm 3.29 (Debian Lenny) doesn't complain if you give it an
unused/invalid option:
use CGI qw/-utf44/; # There is no utf44; there is no complaint either.
Strangely, after I upgraded to 3.48, your program worked perfectly. All
I need do is type in "München" for the location, and the file is
accepted and is not corrupted. From my point of view, CGI.pm 3.48 does
option "e" described above--possibly with some magic to exclude binary
files from the utf8 conversion.
The changes I made to your program were modest (no real changes). I
placed a copy here:
http://home.earthlink.net/~mumia.w.18.spam/docs/try-binary1.txt
Perhaps it's a version/library problem; this is my environment:
O/S: Debian Lenny i386
CGI.pm: 3.48
FCGI.pm: 0.67
Apache2: 2.2.9
Firefox 3.5.6 (x86/Linux)
My environment is fully UTF-8: console, Xorg, everything I could set.
> I upgraded to CGI.pm 3.48; I had no idea that I wasn't using the correct
> version; CGI.pm 3.29 (Debian Lenny) doesn't complain if you give it an
> unused/invalid option:
>
> use CGI qw/-utf44/; # There is no utf44; there is no complaint either.
I even have 3.15 on the computer where the script runs (SuSE 10). There is
no mention of "utf8" in the source, and probably not in 3.29 either.
I will install the new module in another place and then test again.
> Strangely, after I upgraded to 3.48, your program worked perfectly. All I need
> do is type in "München" for the location, and the file is accepted and is not
> corrupted. From my point of view, CGI.pm 3.48 does option "e" described
> above--possibly with some magic to exclude binary files from the utf8
> conversion.
If so, the documentation could do with a little update. I'll make a
suggestion as soon as I have tested.
> The changes I made to your program were modest (no real changes). I placed a
> copy here:
> http://home.earthlink.net/~mumia.w.18.spam/docs/try-binary1.txt
Could you please leave it there for a while? Or should I get my own copy?
Thank you.
--
Helmut Richter
I can leave it there.
> I upgraded to CGI.pm 3.48; I had no idea that I wasn't using the correct
> version; CGI.pm 3.29 (Debian Lenny) doesn't complain if you give it an
> unused/invalid option:
>
> use CGI qw/-utf44/; # There is no utf44; there is no complaint either.
There is no complaint because CGI.pm has a feature that turns
unrecognized symbols in the import list into new HTML generation
subroutines:
use CGI qw(-utf44);
print utf44( 'Foo' );
This gives you output with your new utf44 tag:
<utf44>Foo</utf44>
This feature, whether you like it or not, was there to support future
browser extensions.
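The general technique of turning unknown function names into tag generators can be sketched with Perl's AUTOLOAD mechanism. This is only an illustration under that assumption: TagMaker is a hypothetical package, and the code is not taken from CGI.pm:

```perl
use strict;
use warnings;

package TagMaker;

our $AUTOLOAD;

# Any call to an undefined function in this package lands here. The
# function name becomes the tag name, the arguments become the content.
sub AUTOLOAD {
    my $tag = $AUTOLOAD;
    $tag =~ s/.*:://;              # strip the package qualifier
    return if $tag eq 'DESTROY';   # never treat object destruction as a tag
    return "<$tag>@_</$tag>";
}

package main;

print TagMaker::utf44('Foo'), "\n";
```

With such a catch-all in place, a misspelled name cannot be reported as an error, because every name is a valid tag.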