
UTF-8 mail encoding procedure?


Tuxedo

Sep 15, 2021, 3:22:41 PM
Hello,

How can I process the input of an HTML form in UTF-8 with CGI and pass it
through MIME::Lite's sending procedure intact?

I would like all contents, including mail headers (the Subject, Reply-To and
From headers), to be UTF-8 compatible. So far, I have only managed to print a
form's input to the browser, not to encode it correctly through the mail
procedure.

For example, "Commenter" (as in the 'From' string below) could become
"Σχολιαστής" if a commenter entered his or her name in localised characters
in the relevant form field.

--------- commenter.cgi ----------

#!/usr/bin/perl -w

use CGI;
use MIME::Lite;
use Encode qw(encode encode_utf8 );
use utf8;

$query = new CGI;

$comments = $query->param('comments');

# If I collect a UTF-8 subject line from the form it becomes
# gobbledygook once mailed

$subject_line = $query->param('subject');

# but if I define a UTF-8 character string here it works
# in the subject line of the resulting mail

# $subject_line = "μερικές ελληνικές λέξεις";


MIME::Lite->send ("sendmail", "/usr/bin/sendmail -t -oi");

$msg = MIME::Lite->new (
From => "\"Commenter\" <no-reply\@example.com>",
To => 'comm...@example.com',
Type =>'multipart/mixed',
Subject => encode( 'MIME-Header', $subject_line)
);

$body = "Comments: $comments";

$msg->attach (Type =>'text/plain; carset=utf-8', Data => $body);

$msg->send ();

# either way, UTF-8 input prints correctly in the browser

print "Content-type: text/html\n\n";
print $comments;

------------- comment.html ------------

<form ENCTYPE="multipart/form-data" method="post" action="comment.cgi">

<input type="text" name="subject">

<textarea name="comments"></textarea>

<input type="submit" name="send" value="Submit">

</form>


--------------

Many thanks for any tips on the correct UTF-8 mail process.

Tuxedo

Eli the Bearded

Sep 15, 2021, 6:34:32 PM
In comp.lang.perl.misc, Tuxedo <tux...@mailinator.net> wrote:
> --------- commenter.cgi ----------
>
> #!/usr/bin/perl -w
>
> use CGI;
> use MIME::Lite;
> use Encode qw(encode encode_utf8 );
> use utf8;

Versions? I have Perl v5.24.3 handy.

> $query = new CGI;
>
> $comments = $query->param('comments');
>
> # If I collect a UTF-8 charset subject line it becomes
> # goobledegook once mailed
>
> $subject_line = $query->param('subject');
>
> # but if I define a UTF-8 character string here it works
> # in the subject line of the resulting mail
>
> # $subject_line = "μερικές ελληνικές λέξεις";

That's curious. I'd look at what encoding your query string has.

> MIME::Lite->send ("sendmail", "/usr/bin/sendmail -t -oi");
>
> $msg = MIME::Lite->new (
> From => "\"Commenter\" <no-reply\@example.com>",
> To => 'comm...@example.com',
> Type =>'multipart/mixed',
> Subject => encode( 'MIME-Header', $subject_line)
> );

My Encode module does not document a 'MIME-Header' encoding. I use
MIME::EncWords for that.

use MIME::EncWords qw( encode_mimeword );

...
Subject => encode_mimeword( $subject_line, 'B', 'UTF-8')

Where "B" is for base64 and 'Q' would be 'quoted-printable'.

> $body = "Comments: $comments";
>
> $msg->attach (Type =>'text/plain; carset=utf-8', Data => $body);
>
> $msg->send ();
>
> # either way, utf-8 character input print in the browser
>
> print "Content-type: text/html\n\n";
> print $comments;


I also use taint checking on CGI. You'll need to clean up the PATH,
etc, for that.

Elijah
------
didn't check versions of the modules

Eli the Bearded

Sep 15, 2021, 6:36:31 PM
In comp.lang.perl.misc, Tuxedo <tux...@mailinator.net> wrote:
> --------- commenter.cgi ----------
>
> #!/usr/bin/perl -w
>
> use CGI;
> use MIME::Lite;
> use Encode qw(encode encode_utf8 );
> use utf8;

Versions? I have Perl v5.24.3 handy.

> $query = new CGI;
>
> $comments = $query->param('comments');
>
> # If I collect a UTF-8 charset subject line it becomes
> # goobledegook once mailed
>
> $subject_line = $query->param('subject');
>
> # but if I define a UTF-8 character string here it works
> # in the subject line of the resulting mail
>
> # $subject_line = "Î¼ÎµÏ Î¹ÎºÎ­Ï‚ ελληνικές λέξεις";

That's curious. I'd look at what encoding your query string has.

> MIME::Lite->send ("sendmail", "/usr/bin/sendmail -t -oi");
>
> $msg = MIME::Lite->new (
> From => "\"Commenter\" <no-reply\@example.com>",
> To => 'comm...@example.com',
> Type =>'multipart/mixed',
> Subject => encode( 'MIME-Header', $subject_line)
> );

My Encode module does not document a 'MIME-Header' encoding. I use
MIME::EncWords for that.

use MIME::EncWords qw( encode_mimeword );

...
Subject => encode_mimeword( $subject_line, 'B', 'UTF-8')

Where "B" is for base64 and 'Q' would be 'quoted-printable'.

> $body = "Comments: $comments";
>
> $msg->attach (Type =>'text/plain; carset=utf-8', Data => $body);
>
> $msg->send ();
>
> # either way, utf-8 character input print in the browser
>
> print "Content-type: text/html\n\n";
> print $comments;


Tuxedo

Sep 16, 2021, 12:45:16 AM
Eli the Bearded wrote:

> In comp.lang.perl.misc, Tuxedo <tux...@mailinator.net> wrote:
>> --------- commenter.cgi ----------
>>
>> #!/usr/bin/perl -w
>>
>> use CGI;
>> use MIME::Lite;
>> use Encode qw(encode encode_utf8 );
>> use utf8;
>
> Versions? I have Perl v5.24.3 handy.

Perl itself is v5.10.1 and it can't easily be updated on the target machine.

MIME::Lite is an ancient 1.147 version. The other modules are core modules
in the Perl 5.10.1 version as far as I know.

MIME::Lite lives in an external directory, and I haven't updated it because
that would require updating its dependency modules, which I think in turn
requires installing further dependencies.

>
>> $query = new CGI;
>>
>> $comments = $query->param('comments');
>>
>> # If I collect a UTF-8 charset subject line it becomes
>> # goobledegook once mailed
>>
>> $subject_line = $query->param('subject');
>>
>> # but if I define a UTF-8 character string here it works
>> # in the subject line of the resulting mail
>>
>> # $subject_line = "Î¼ÎµÏ Î¹ÎºÎ­Ï‚ ελληνικές λέξεις";
>

It just reads some Greek words. Odd that it does not show at your end. My news
reader and news posting window are set to UTF-8.

I try again: μερικές ελληνικές λέξεις. Does anyone see the Greek UTF-8
characters?

> That's curious. I'd look at what encoding your query string has.
>
>> MIME::Lite->send ("sendmail", "/usr/bin/sendmail -t -oi");
>>
>> $msg = MIME::Lite->new (
>> From => "\"Commenter\" <no-reply\@example.com>",
>> To => 'comm...@example.com',
>> Type =>'multipart/mixed',
>> Subject => encode( 'MIME-Header', $subject_line)
>> );
>
> My Encode module does not document a 'MIME-Header' encoding. I use
> MIME:EncWords for that.
>
> use MIME::EncWords qw( encode_mimeword );
>
> ...
> Subject => encode_mimeword( $subject_line, 'B', 'UTF-8')
>
> Where "B" is for base64 and 'Q' woult be 'quoted-printable'.

Thanks for the tips. I will give MIME::EncWords a try.

Another tool I've used for something similar is
Email::MIME::RFC2047::Encoder

>
>> $body = "Comments: $comments";
>>
>> $msg->attach (Type =>'text/plain; carset=utf-8', Data => $body);
>>
>> $msg->send ();
>>
>> # either way, utf-8 character input print in the browser
>>
>> print "Content-type: text/html\n\n";
>> print $comments;
>
>
> I also use taint checking on CGI. You'll need to clean up the PATH,
> etc, for that.

I'm not sure where the UTF-8 conversion fails in the mail or CGI.

Otto J. Makela

Sep 16, 2021, 4:40:35 AM
Tuxedo <tux...@mailinator.net> wrote:

> Mime-Version: 1.0
> Content-Type: text/plain; charset="UTF-8"
> Content-Transfer-Encoding: 8Bit
[...]
>
> It just reads some greek words. Odd that does not show in your end.
> My news reader and news posting window is set to UTF-8.
>
> I try again: μερικές ελληνικές λέξεις. Does anyone see the Greek UTF-8
> characters?

It is correctly formatted, and does show correctly here.
--
/* * * Otto J. Makela <o...@iki.fi> * * * * * * * * * */
/* Phone: +358 40 765 5772, ICBM: N 60 10' E 24 55' */
/* Mail: Mechelininkatu 26 B 27, FI-00100 Helsinki */
/* * * Computers Rule 01001111 01001011 * * * * * * */

Otto J. Makela

Sep 16, 2021, 4:47:59 AM
Tuxedo <tux...@mailinator.net> wrote:

> Mime-Version: 1.0
> Content-Type: text/plain; charset="UTF-8"
> Content-Transfer-Encoding: 8Bit
> User-Agent: KNode/4.14.10
[...]
> # $subject_line = "μερικές ελληνικές λέξεις";

Eli the Bearded <*@eli.users.panix.com> wrote:

> Mime-Version: 1.0
> Content-Type: text/plain; charset="UTF-8"
> User-Agent: Vectrex rn 2.1 (beta)
[...]
> In comp.lang.perl.misc, Tuxedo <tux...@mailinator.net> wrote:
>> # $subject_line = "μεÏικές ελληνικές λέξεις";
>
> That's curious. I'd look at what encoding your query string has.

I saw Tuxedo's UTF-8 characters correctly, it seems somewhere on
the way to you their encoding was borken?

Eli the Bearded

Sep 16, 2021, 3:09:57 PM
In comp.lang.perl.misc, Tuxedo <tux...@mailinator.net> wrote:
> It just reads some greek words. Odd that does not show in your end. My news
> reader and news posting window is set to UTF-8.

I saw the Greek originally, but I had an editor hiccup that clearly
screwed that up. Sorry. That's also why I picked B instead of Q for
the encoding. Q is best suited for mostly ASCII content like French
or German.

> >> $msg = MIME::Lite->new (
> >> From => "\"Commenter\" <no-reply\@example.com>",
> >> To => 'comm...@example.com',
> >> Type =>'multipart/mixed',
> >> Subject => encode( 'MIME-Header', $subject_line)
> >> );
> >
> > My Encode module does not document a 'MIME-Header' encoding. I use
> > MIME:EncWords for that.
> >
> > use MIME::EncWords qw( encode_mimeword );
> >
> > ...
> > Subject => encode_mimeword( $subject_line, 'B', 'UTF-8')
> >
> > Where "B" is for base64 and 'Q' woult be 'quoted-printable'.
>
> Thanks for the tips. I will give MIME::EncWords a try.
>
> Another tool I've used for something a similar is
> Email::MIME::RFC2047::Encoder

I don't know that one, but from the name it's doing the same thing.
RFC-2047 defines "MIME encoded words" for putting non-ASCII content
into 7-bit clean mail headers.

> I'm not sure where the UTF-8 conversion fails in the mail or CGI.

Try adding some logging. Sometimes for CGI stuff I find it easiest
to open my own log file and write to that.

I see from other follow-ups this is Perl 5.10.x. I have a 5.10.1 here,
and I tried the code, but I don't have MIME::Lite or MIME::EncWords
for that install.

Elijah
------
only willing to try so hard to duplicate an environment

Eric Pozharski

Sep 17, 2021, 5:33:24 AM
with <shth5p$caq$1...@solani.org> Tuxedo wrote:
> Hello,
>
> How can I process the input of an HTML form in UTF-8 with CGI and pass it
> through a MIME-Lite's sending procedure intact?

*SKIP*
> use Encode qw(encode encode_utf8 );
> use utf8;
>
> $query = new CGI;
>
> $comments = $query->param('comments');
>
> # If I collect a UTF-8 charset subject line it becomes
> # goobledegook once mailed
>
> $subject_line = $query->param('subject');
>
> # but if I define a UTF-8 character string here it works
> # in the subject line of the resulting mail
>
> # $subject_line = "μερικές ελληνικές λέξεις";

This suggests (because 'use utf8') MIME-Lite is fine. Anyway,

>
> MIME::Lite->send ("sendmail", "/usr/bin/sendmail -t -oi");
>
> $msg = MIME::Lite->new (
> From => "\"Commenter\" <no-reply\@example.com>",
> To => 'comm...@example.com',
> Type =>'multipart/mixed',
> Subject => encode( 'MIME-Header', $subject_line)

Wow! Encode can do 'MIME-Header'?! I see the light!

*SKIP*
> $msg->attach (Type =>'text/plain; carset=utf-8', Data => $body);

That looks like copy-paste, but "carset"?

Anyway, as you see for yourself: if you pass non-latin1 contents
properly stored in Perl's internal encoding (due to 'use utf8') to
MIME-Lite (which is Perl's internal encoding aware, apparently) you are
fine. I don't remember 5.10 now (and digging through Changes isn't
feasible), *if* your Perl were younger (like 5.14) I'd suggest to replace 'use
utf8' with 'use feature qw/ unicode_strings /' instead (but it might be
not an option).

Anyway, I suggest, (unless you absolutely need 'use utf8' for something)
drop 'use utf8' and add 'use Encode qw/ decode_utf8 /'. What you need
is *decoding* strings that come out of CGI.pm. Apparently, CGI.pm
doesn't decode whatever comes from network, turns out that's you who has
to do it (decoding). Better yet, 'use Encode qw/ decode /', figure out
what encoding was with the request that CGI.pm dealt with and then
decode properly (there are more encodings outside than just UTF-8).

*CUT*

--
Torvalds' goal for Linux is very simple: World Domination
Stallman's goal for GNU is even simpler: Freedom

Tuxedo

Sep 17, 2021, 10:00:21 AM
Thanks for the tips and feedback.

I'm trying to generate a message in UTF-8 as submitted via a form and pass it
through a mail sending procedure to my own address. The UTF-8 could appear in
the mail headers (the name part of From and To, and the Subject) as well as
anywhere in the message body.

UTF-8 may also be needed in the (From) email address part to be compatible
with IDN strings. I think the input will need to be converted into Punycode
domain representations to work via any email module and while passing
through email address syntax checking.

While returning valid UTF-8 via CGI and into a browser is simple, I'm not
sure which mail sending module may best serve the purpose.

After all, MIME::Lite is deprecated. Alternatively, I've used Mail::Sender
for a different application in the past, but it is now also deprecated.
Both offer easy ways to include inline attachments, HTML and plain text
alternatives in case it will be needed. I will try with Mail::Sender unless
someone has another recommendation.

Tuxedo

Tuxedo

Sep 17, 2021, 10:20:29 AM
Eric Pozharski wrote:

...

>
> *SKIP*
>> $msg->attach (Type =>'text/plain; carset=utf-8', Data => $body);
>
> That looks like copy-paste, but "carset"?

I'm not sure where I got that from but yes, it's likely copy-paste :-)

>
> Anyway, as you see for yourself: if you pass non-latin1 contents
> properly stored in Perl's internal encoding (due 'use utf8') to
> MIME-Lite (which is Perl's internal encoding aware, apparently) you are
> fine. I don't remember 5.10 now (and digging through Changes isn't
> feasable), *if* you'd be younger (like 5.14) I'd suggest to replace 'use
> utf8' with 'use feature qw/ unicode_strings /' insted (but it might be
> not an option).
>
> Anyway, I suggest, (unless you absolutely need 'use utf8' for something)
> drop 'use utf8' and add 'use Encode qw/ decode_utf8 /'. What you need
> is *decoding* strings that come out of CGI.pm. Apparently, CGI.pm
> doesn't decode whatever comes from network, turns out that's you who has
> to do it (decoding). Better yet, 'use Encode qw/ decode /', figure out
> what encoding was with the request that CGI.pm dealt with and then
> decode properly (there are more encodings outside than just UTF-8).
>

Thanks for the above comments. I think they also highlight compatibility
issues I have with some other applications where I've so-far resorted to
using HTML entities as a cumbersome workaround to CGI generated HTML output.

Tuxedo

Grant Taylor

Sep 17, 2021, 11:44:58 AM
On 9/15/21 2:19 PM, Tuxedo wrote:
> I would like all contents, including mailheaders (Subject, Reply-to
> and From headers to be UTF-8 compatible. So far, I only managed
> to print a form's input to the browser but not encode it correctly
> through the mail procedure.

I'm late to the party, but I wanted to add the following comment:

Email headers use different (and I believe incompatible) encoding than
the MIME body of the email.

I'd have to go back and (re)read the pertinent RFCs for how to correctly
encode non-ASCII characters in email headers. But I'm quite certain
that traditional MIME encoding methods will /not/ work.



--
Grant. . . .
unix || die

Tuxedo

Sep 17, 2021, 1:49:14 PM
Grant Taylor wrote:

> On 9/15/21 2:19 PM, Tuxedo wrote:
>> I would like all contents, including mailheaders (Subject, Reply-to
>> and From headers to be UTF-8 compatible. So far, I only managed
>> to print a form's input to the browser but not encode it correctly
>> through the mail procedure.
>
> I'm late to the party, but I wanted to add the following comment:
>
> Email headers use different (and I believe incompatible) encoding than
> the MIME body of the email.

I just managed to submit data through CGI and send it through MIME::Lite
with UTF-8 intact, but so far not including the subject line of the mail
message. Thank you for pointing this out.
For other parts, I realise my previous mistake was simply forgetting to
declare utf-8 above the HTML form:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

I did not show the full code of comments.html in my original post so no one
could have known.

>
> I'd have to go back and (re)read the pertinent RFCs for how to correctly
> encode non-ASCII characters in email headers. But I'm quite certain
> that traditional MIME encoding methods will /not/ work.

I will test with Email::MIME::RFC2047::Encoder

Tuxedo

Eli the Bearded

Sep 17, 2021, 3:00:57 PM
In comp.lang.perl.misc, Grant Taylor <gta...@tnetconsulting.net> wrote:
> I'm late to the party, but I wanted to add the following comment:
>
> Email headers use different (and I believe incompatible) encoding than
> the MIME body of the email.

RFC2047 MIME "encoded words". It's very close to regular MIME encoding,
but not 100% the same. Each word looks like:

=?${charset}?${encoding}?${encoded_bit}?=

$charset = 'utf-8'; # this and next are case insensitive
$encoding = 'B'; # base64 (easier) or 'Q' for quasi-quoted printable
$encoded_bit = base64(encode($charset, $string));
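
The pseudo-code above can be made runnable with two core modules. This is only a minimal sketch: `encoded_word` is a hypothetical helper name, it emits a single 'B' word, and it does no splitting of long input (RFC 2047 limits each encoded word to 75 characters), which is one reason a dedicated module is usually the better choice.

```perl
use strict;
use warnings;
use Encode qw(encode);
use MIME::Base64 qw(encode_base64);

# Hypothetical helper: wrap one short character string in a single
# RFC 2047 'B' (base64) encoded word.
sub encoded_word {
    my ($chars, $charset) = @_;
    my $octets = encode($charset, $chars);     # characters -> bytes
    my $b64    = encode_base64($octets, '');   # '' suppresses the trailing newline
    return "=?$charset?B?$b64?=";
}

# "ις", the last two letters of "λέξεις", written with escapes so the
# script does not depend on 'use utf8'.
my $word = encoded_word("\x{3b9}\x{3c2}", 'UTF-8');
print "$word\n";    # =?UTF-8?B?zrnPgg==?=
```

The result is the same as the second encoded word that Thunderbird produces for the Greek test subject elsewhere in this thread.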

> I'd have to go back and (re)read the pertinent RFCs for how to correctly
> encode non-ASCII characters in email headers. But I'm quite certain
> that traditional MIME encoding methods will /not/ work.

My top of the head recollection is encoded words add special rules for
whitespace in quoted printable and have some added rules about what
whitespace between each word means. It's enough of a nuisance that I
prefer to use a module rather than roll-my-own from the lower MIME
encoding parts.

Actual in-the-wild Subject: nasty example I have saved in comments:

=?utf-8?b?QXR0ZW50aW9uISBJbXBvcnRhbnQgUGFyZW50IEFubm91bmNlbWVudHMgZm9yIHRoZSB3ZWVrIG9mIE5vdmVtYmVyIDE0LTE5dGguIApQbGVhc2UgUmVhZCBDYXJlZnVsbHkh?=

You might think, oh, it's nasty because it's plain ASCII that's been
base64 encoded to make it unreadable to non-MIME aware readers (eg,
grep with a procmail.log file) or because it is so long instead of
being broken up in to several shorter MIME encoded words. No, those
are reasons it's ugly, not nasty. The nasty bit is there's a new line in
there, so if you decode it as is, you can break message header parsing
that assumes properly formatted continued lines.
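
To see the embedded newline, the payload of that Subject can be decoded with the core MIME::Base64 module. The sketch below assumes the payload is exactly the text between `=?utf-8?b?` and `?=` above; because the content is plain ASCII, no further charset decoding is needed here. Collapsing whitespace afterwards is one possible way to normalize it, not necessarily the one the poster uses.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);

# Payload copied from the encoded word quoted above.
my $payload = 'QXR0ZW50aW9uISBJbXBvcnRhbnQgUGFyZW50IEFubm91bmNlbWVudHMgZm9yIHRoZSB3ZWVrIG9mIE5vdmVtYmVyIDE0LTE5dGguIApQbGVhc2UgUmVhZCBDYXJlZnVsbHkh';

my $text = decode_base64($payload);

# The decoded text contains a raw newline in the middle of the subject.
print "contains newline\n" if $text =~ /\n/;

# Collapse every run of whitespace (including the newline) to one space.
(my $clean = $text) =~ s/\s+/ /g;
print "$clean\n";
```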

Elijah
------
now normalizes whitespace

Eric Pozharski

Sep 18, 2021, 1:33:25 PM
with <si2876$1on$1...@solani.org> Tuxedo wrote:
> Eric Pozharski wrote:

vvvvvv
>>> $msg->attach (Type =>'text/plain; carset=utf-8', Data => $body);
^^^^^^
>> That looks like copy-paste, but "carset"?
> I'm not sure where I got that from but yes, it's likely copy-paste :-)

What about "carset" then?

Tuxedo

Sep 19, 2021, 4:48:23 AM
Eric Pozharski wrote:

> with <si2876$1on$1...@solani.org> Tuxedo wrote:
>> Eric Pozharski wrote:
>
> vvvvvv
>>>> $msg->attach (Type =>'text/plain; carset=utf-8', Data => $body);
> ^^^^^^
>>> That looks like copy-paste, but "carset"?
>> I'm not sure where I got that from but yes, it's likely copy-paste :-)
>
> What about "carset" then?

I'm not sure what you mean?

> *CUT*

Meanwhile, I tested a different sending procedure instead of MIME::Lite, namely
Mail::Sender, but I have the same difficulty with UTF-8 in email transmission
for data going through CGI.

I can however transmit a string intact via mail if it's hard-coded in the
perl script:

use Mail::Sender;
use utf8;

$subject = "μερικές ελληνικές λέξεις";

my $sender = new Mail::Sender;

from => $from_email,
to => $to_email,
subject => $subject,
charset => 'utf-8',
});

$sender->Close();

But if passed through a CGI form, like this:

use CGI;
use utf8;
use Email::MIME::RFC2047::Encoder;

$subject = $query->param('subject');

my $utf8_subject_encoder = Email::MIME::RFC2047::Encoder->new;
my $utf8_encoded_subject = $utf8_subject_encoder->encode_text($subject);

from => $from_email,
to => $to_email,
subject => $utf8_encoded_subject,
charset => 'utf-8',
});

$sender->Close();

... the subject will show something like follows in a resulting email
subject line:

Î¼ÎµÏ Î¹ÎºÎ­Ï‚ ελληνικές λέξεις


The form collecting the "μερικές ελληνικές λέξεις" string uses
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

And the proper "μερικές ελληνικές λέξεις" prints fine on the CGI-generated
HTML result page after being passed through a form.

The output page has <meta http-equiv="Content-Type" content="text/html;
charset=utf-8">

It just won't mail for some mysterious reason, maybe relating to CGI.

"use Email::MIME::RFC2047::Encoder;" is meant to encode for email headers, as
far as I understand.

Yet I can pass "μερικές ελληνικές λέξεις" into the subject line of an
email without the Encoder procedure, as long as I declare 'use utf8;' at the
top of the script. As said, it only emails intact if the string is coded
literally into the Perl script and not passed as a variable through CGI.

The correct UTF-8 characters will display fine on a CGI result page whether
hard-coded in the script or passed through a form.

The result was the same with MIME-Lite, so it's not the mailer that's the
issue. I'm not sure exactly what is.

Tuxedo

Tuxedo

Sep 19, 2021, 11:20:01 AM
My issue can be reduced to a difference in the submitted form data compared
with the fixed typed-in string in my perl code, although both flavors of
UTF-8 characters appear identical in a browser window through Perl.

One works to email and the other does not. For example, I test with a simple
HTML form submit:

<!DOCTYPE html>
<html><head>
<title></title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>

<body>

<form ENCTYPE="multipart/form-data" method="post" action="compare.pl">

<input type="text" name="subject" size="30" value="μερικές ελληνικές λέξεις">

<input type="submit" value="Submit">

</form>
</body>
</html>

And it's submitted to the following compare.pl script:

#!/usr/bin/perl -w

use CGI;
use utf8;
use Email::MIME::RFC2047::Encoder;

my $fixed_subject;


# Only the following passes through an email
# subject line intact:

$fixed_subject = "μερικές ελληνικές λέξεις";

my $query = new CGI;

# This value will display correctly in a web browser
# but not after having been sent in a subject
# line of an email via Mime-Lite or other:

my $submitted_subject = $query->param('subject');


# The following $utf8_encoded_submitted_subject will not display correctly
# in a browser or email subject line:

my $utf8_submitted_subject_encoder = Email::MIME::RFC2047::Encoder->new;
my $utf8_encoded_submitted_subject = $utf8_submitted_subject_encoder->encode_text($submitted_subject);

print "Content-type: text/html\n\n";
print "<!DOCTYPE html>\n";
print "<html><head>\n";
print "<title>Compare</title>\n";
print "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">\n";
print "</head>\n";
print "<body>\n";
print "\$fixed_subject: $fixed_subject\n";
print "<hr>";
print "\$submitted_subject: $submitted_subject\n";
print "<hr>";
print "\$utf8_encoded_submitted_subject: $utf8_encoded_submitted_subject\n";

print "</body></html>\n";


I leave out the email code here but as said the $fixed_subject typed
directly into the perl code works in a subject line of a mail transmission
through Mime::Lite or Mail::Sender while the $submitted_subject that was
corrected as a form value through CGI does not.

What exactly happens to $submitted_subject in the process and how can it
be made identical to the $fixed_subject string?

In a browser, the $fixed_subject prints as:
μερικές ελληνικές λέξεις

And the $submitted_subject prints the same:
μερικές ελληνικές λέξεις

The $utf8_encoded_submitted_subject prints as:

=?utf-8?Q?=c3=8e=c2=bc=c3=8e=c2=b5=c3=8f=c2=81=c3=8e=c2=b9=c3=8e=c2=ba?=
=?utf-8?Q?=c3=8e=c2=ad=c3=8f=c2=82_=c3=8e=c2=b5=c3=8e=c2=bb=c3=8e=c2=bb?=
=?utf-8?Q?=c3=8e=c2=b7=c3=8e=c2=bd=c3=8e=c2=b9=c3=8e=c2=ba=c3=8e=c2=ad?=
=?utf-8?Q?=c3=8f=c2=82_=c3=8e=c2=bb=c3=8e=c2=ad=c3=8e=c2=be=c3=8e=c2=b5?=
=?utf-8?Q?=c3=8e=c2=b9=c3=8f=c2=82?=

If I send the "μερικές ελληνικές λέξεις" characters in the subject of an
email using Thunderbird, they display fine in the email program.
Thunderbird's specific subject line source code appears as follows:

=?UTF-8?B?zrzOtc+BzrnOus6tz4IgzrXOu867zrfOvc65zrrOrc+CIM67zq3Ovs61?=
=?UTF-8?B?zrnPgg==?=

The source of the $fixed_subject line of the perl generated mail looks as
follows:

=?utf-8?Q?=ce=bc=ce=b5=cf=81=ce=b9=ce=ba=ce=ad=cf=82_=ce=b5=ce=bb=ce=bb?=
=?utf-8?Q?=ce=b7=ce=bd=ce=b9=ce=ba=ce=ad=cf=82_=ce=bb=ce=ad=ce=be=ce=b5?=
=?utf-8?Q?=ce=b9=cf=82?=

So the $fixed_subject displays fine. How can the $submitted_subject string
be made or preserved identical? After all, it's the same set of
characters but with somewhat different encoding or copying in perl I guess.
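
The two Q-encoded dumps above differ in a telling way: the working one starts with the bytes ce bc, the broken one with c3 8e c2 bc. That pattern is what UTF-8 encoding produces when it is applied to bytes that are already UTF-8. A minimal sketch of the effect, assuming the form value arrives as undecoded bytes and using only the core Encode module:

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# The two UTF-8 bytes of "μ", standing in for an undecoded form value.
my $from_form = "\xCE\xBC";

# Encoding the raw bytes treats each byte as a Latin-1 character
# (U+00CE, U+00BC) and encodes those: the result is double-encoded.
my $double = encode_utf8($from_form);

# Decoding first turns the bytes into one character (U+03BC);
# encoding that gives back the original two bytes.
my $single = encode_utf8(decode_utf8($from_form));

printf "double: %s\n", join ' ', map { sprintf '%02x', ord } split //, $double;
printf "single: %s\n", join ' ', map { sprintf '%02x', ord } split //, $single;
# double: c3 8e c2 bc
# single: ce bc
```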

Thanks in advance for any suggestions.

Tuxedo


Tuxedo

Sep 19, 2021, 12:45:45 PM
I meant ... the $submitted_subject that was *submitted* as a form value
through CGI does not.

> What exactly has happens to $submitted_subject in the process and how can
> it be made identical to the $fixed_subject string?

As Eric Pozharski pointed earlier I think it's necessary to decode what
comes through CGI and re-encode correctly for use in email headers and
perhaps somewhat differently for the email body.

How can this be done with the single form submit field CGI example and with
a fairly old Perl version (5.10.1) and its relatively old set of modules?
Maybe with 'use Encode qw / decode_utf8 /'.

Thanks again for any suggestions and example code bits if possible.

Tuxedo

Eric Pozharski

Sep 19, 2021, 1:33:11 PM
with <si6tgg$cq6$1...@solani.org> Tuxedo wrote:
> Eric Pozharski wrote:
>> with <si2876$1on$1...@solani.org> Tuxedo wrote:
>>> Eric Pozharski wrote:

>> vvvvvv
>>>>> $msg->attach (Type =>'text/plain; carset=utf-8', Data => $body);
>> ^^^^^^
>>>> That looks like copy-paste, but "carset"?
>>> I'm not sure where I got that from but yes, it's likely copy-paste
>>> :-)
>> What about "carset" then?
> I'm not sure what you mean?

OK, follow me on this. If it's copy-paste then it's (likely) your
running code. In your running code you have this

Type =>'text/plain; carset=utf-8'

This 'carset' can't be right.

*SKIP*
> It just won't mail for some mysterious reason, maybe relating to CGI.
*SKIP*
> The result was the same with MIME-Lite, so it's not the mailer that's
> the issue. I'm not sure exactly what is.

Told you so. Now you're back to $square{zero}

[[ out of order ]]
> $subject = $query->param('subject');

Just pouring modules and/or pragmas in your code is mad science.
Please, try this way:

# use utf8;
use Encode qw/ decode_utf8 /;
# ... boilerplate ...
$subject = decode_utf8( $query->param( 'subject' ));
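
A slightly fuller sketch of the same idea, with the CGI call replaced by a literal byte string so that it runs on its own. The `$octets` value is an assumed stand-in for what `$query->param('subject')` returns; the 'MIME-Header' encoding is the one from the original script and ships with the core Encode distribution.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode);

# Stand-in for $query->param('subject'): the raw UTF-8 bytes of "μερικές".
my $octets = "\xCE\xBC\xCE\xB5\xCF\x81\xCE\xB9\xCE\xBA\xCE\xAD\xCF\x82";

my $subject = decode_utf8($octets);              # bytes -> characters
my $header  = encode('MIME-Header', $subject);   # characters -> RFC 2047 encoded word(s)

printf "%d bytes in, %d characters decoded\n", length $octets, length $subject;
print "Subject: $header\n";
```

The decoded `$subject` is then the same kind of string as a literal written under 'use utf8', which is why the hard-coded subject worked and the submitted one did not.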

Tuxedo

Sep 19, 2021, 11:31:15 PM
Eric Pozharski wrote:

...

> That looks like copy-paste, but "carset"?
>
> Anyway, as you see for yourself: if you pass non-latin1 contents
> properly stored in Perl's internal encoding (due 'use utf8') to
> MIME-Lite (which is Perl's internal encoding aware, apparently) you are
> fine. I don't remember 5.10 now (and digging through Changes isn't
> feasable), *if* you'd be younger (like 5.14) I'd suggest to replace 'use
> utf8' with 'use feature qw/ unicode_strings /' insted (but it might be
> not an option).
>
> Anyway, I suggest, (unless you absolutely need 'use utf8' for something)
> drop 'use utf8' and add 'use Encode qw/ decode_utf8 /'.

If I drop 'use utf8;' and replace it with:

use Encode qw/ decode_utf8 /;

The fixed characters that were typed in directly in the perl script
("μερικές ελληνικές λέξεις") become "Î¼ÎµÏ Î¹ÎºÎ­Ï‚ ελληνικές λέ
ξεις" when passed through the email procedure in a subject line while
the source code of the subject line in the resulting email appears as
follows:

Subject:
=?utf-8?Q?=C3=8E=C2=BC=C3=8E=C2=B5=C3=8F=C2=81=C3=8E=C2=B9=C3=8E=C2=BA=C3=8E=C2=AD=C3=8F=C2=82=20?==?utf-8?Q?=C3=8E=C2=B5=C3=8E=C2=BB=C3=8E=C2=BB=C3=8E=C2=B7=C3=8E=C2=BD=C3=8E=C2=B9=C3=8E=C2=BA=C3=8E=C2=AD=C3=8F=C2=82=20?==?utf-8?Q?=C3=8E=C2=BB=C3=8E=C2=AD=C3=8E=C2=BE=C3=8E=C2=B5=C3=8E=C2=B9=C3=8F=C2=82?=

As for the form-submitted "μερικές ελληνικές λέξεις", it appears
the same (""Î¼ÎµÏ Î¹ÎºÎ­Ï‚ ελληνικές λέξεις") when 'use
Encode qw/ decode_utf8 /;' is in place, and as above in the source.

What can be done to properly decode/encode user-submitted UTF-8 data in a
way that the data can be the same as if typed directly in the perl code so
it can pass through email?

The:
use feature qw/ unicode_strings /;
.. caused an error on the perl version I have.

> What you need
> is *decoding* strings that come out of CGI.pm. Apparently, CGI.pm
> doesn't decode whatever comes from network, turns out that's you who has
> to do it (decoding). Better yet, 'use Encode qw/ decode /', figure out
> what encoding was with the request that CGI.pm dealt with and then
> decode properly (there are more encodings outside than just UTF-8).

How exactly can 'use Encode qw/ decode /' figure in perl what encoding was
used when it's user-submitted via CGI.pm? It can be any set of UTF-8
characters. On the HTML form I define:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Thanks,
Tuxedo


Tuxedo

Sep 20, 2021, 7:50:24 AM
Tuxedo wrote:

...

>
>> What you need
>> is *decoding* strings that come out of CGI.pm. Apparently, CGI.pm
>> doesn't decode whatever comes from network, turns out that's you who has
>> to do it (decoding). Better yet, 'use Encode qw/ decode /', figure out
>> what encoding was with the request that CGI.pm dealt with and then
>> decode properly (there are more encodings outside than just UTF-8).
>
> How exacly can 'use Encode qw/ decode /' figure in perl what encoding was
> used when it's user-submitted via CGI.pm? It can be any set of UTF-8
> characters. On the HTML form I define:
> <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
>
> Thanks,
> Tuxedo

It was indeed CGI that changed things.

I finally got it working with sending input by email in UTF-8 intact format
by adding '-utf8' to the CGI call.

use CGI '-utf8';

This made the Greek words work in both mail (in headers and body) and when
generated onto a results page on a web browser.

But for some reason, that also turned French accented characters into little
black symbols with question marks in a web browser (although the email
result worked, both in headers and body).

Maybe it doesn't display the same for everyone here but it's basically just
the "ç", as with any caractères accentués :

Quelques mots fran�ais

But if within the perl script I then declare:

binmode(STDOUT, ':utf8');

Now the Greek words work, the French accented words work, the €-sign and
hopefully more or less everything else.
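
A likely explanation for the French-only breakage: once the parameters are character strings, a character below 256 such as "ç" is written as a single Latin-1 byte when the output handle has no encoding layer, and that byte is not valid UTF-8. Greek characters cannot be written as one byte, so Perl falls back to UTF-8 for them (with a "Wide character" warning). The sketch below shows the difference using in-memory filehandles as a stand-in for STDOUT; that substitution is an assumption made to keep the example self-contained.

```perl
use strict;
use warnings;

my $text = "fran\x{e7}ais";   # a character string containing U+00E7 ("ç")

# No encoding layer: "ç" is written as the single byte 0xE7.
my $raw = '';
open my $plain, '>', \$raw or die "open: $!";
print {$plain} $text;
close $plain;

# With a UTF-8 layer, as binmode(STDOUT, ':utf8') sets up for STDOUT:
# "ç" is written as the two bytes 0xC3 0xA7.
my $encoded = '';
open my $layered, '>:utf8', \$encoded or die "open: $!";
print {$layered} $text;
close $layered;

printf "without layer: %d bytes\n", length $raw;       # 8
printf "with layer:    %d bytes\n", length $encoded;   # 9
```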

Thanks for everyone's comments, which all together led me in the right
direction.

Tuxedo


Eric Pozharski

Sep 20, 2021, 1:33:21 PM
with <si8v9s$lka$1...@solani.org> Tuxedo wrote:
> Eric Pozharski wrote:

*SKIP*
>> Anyway, as you see for yourself: if you pass non-latin1 contents
>> properly stored in Perl's internal encoding (due 'use utf8') to
>> MIME-Lite (which is Perl's internal encoding aware, apparently) you
>> are fine. I don't remember 5.10 now (and digging through Changes
>> isn't feasable), *if* you'd be younger (like 5.14) I'd suggest to
>> replace 'use utf8' with 'use feature qw/ unicode_strings /' insted
>> (but it might be not an option).
>>
>> Anyway, I suggest, (unless you absolutely need 'use utf8' for
>> something) drop 'use utf8' and add 'use Encode qw/ decode_utf8 /'.
>
> It I drop 'use utf8;' and replace it with:
> use Encode qw/ decode_utf8 /;
*SKIP*

I have disturbing feeling that you don't realise important distinction.
'utf8.pm' (per 'use utf8;') is a *pragma* (so are 'strict.pm',
'feature.pm', 'bytes.pm' and so on). Pragmas alter behaviour of perl
when compiling *your* script. As you have observed by yourself (no
'use utf8' and now your fancy strings (in *your* script!) result in
garbage).

'Encode.pm' is a *module* (for purists, yes, calling 'Encode.pm' a
"module" is a stretch and huge one). A module is just an addition to
your toolbox -- no more, no less. A hammer (or drill, or 3D-printer, or
nuclear reactor) without application will patiently sit where you've put
it (until it rots). *Without* application.

(I expect it to be rush, but whatever) Throwing random shit on your
screen is not a way to go through life.

> What can be done to properly decode/encode user-submitted UTF-8 data
> in a way that the data can be the same as if typed directly in the
> perl code so it can pass through email?

'decode_utf8' *must* be applied to whatever strings of *bytes* (or
'octets' might be more relevant) are taken out of 'CGI.pm' to make them
strings of *characters*. Bytes are not characters, characters are not
bytes -- it's a Perl thing.
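
The distinction is easy to observe with `length`, which counts bytes before decoding and characters after. A small sketch, with a literal byte string standing in for a value taken from 'CGI.pm':

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# "λέξεις" as twelve UTF-8 bytes, the way an undecoded form value arrives.
my $octets = "\xCE\xBB\xCE\xAD\xCE\xBE\xCE\xB5\xCE\xB9\xCF\x82";

# After decoding, the same data is six characters.
my $chars = decode_utf8($octets);

printf "octets: %d\n", length $octets;   # 12
printf "chars:  %d\n", length $chars;    # 6
```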

> The: use feature qw/ unicode_strings /; .. caused and error on the
> perl version I have.

So my memories aren't faulty (in regard of 'unicode_strings').

>> What you need is *decoding* strings that come out of CGI.pm.
>> Apparently, CGI.pm doesn't decode whatever comes from network, turns
>> out that's you who has to do it (decoding). Better yet, 'use Encode
>> qw/ decode /', figure out what encoding was with the request that
>> CGI.pm dealt with and then decode properly (there are more encodings
>> outside than just UTF-8).
>
> How exacly can 'use Encode qw/ decode /' figure in perl what encoding
> was used when it's user-submitted via CGI.pm? It can be any set of
> UTF-8 characters. On the HTML form I define:
><meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Per "$string = decode(ENCODING, $octets [, CHECK])", 'decode' can't, it
must be told what encoding "$octets" are in. 'decode_utf8' is already
told the encoding is UTF-8, and returns the decoded string.

As of "http-equiv", it might be an improvement, I guess. Like,
permissive application takes whatever remote (through 'CGI.pm') has sent,
asks 'CGI.pm' what remote suggests encoding is, and decodes
appropriately. Repressive application tells remote to send encoded in
utf-8, decodes, and if it (decoding) fails throws input away (yup,
decoding might fail).
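
The "repressive" variant might look like the sketch below. `decode_strict` is a hypothetical helper name; it relies on the CHECK argument of Encode's `decode`, where `Encode::FB_CROAK` makes malformed input die instead of being silently patched up. The second input is "français" sent as Latin-1 bytes, which is not valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return the decoded characters, or undef when the
# input is not valid UTF-8.
sub decode_strict {
    my ($octets) = @_;
    my $copy = $octets;   # decode() with a CHECK value may modify its argument
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
}

my $good = decode_strict("fran\xC3\xA7ais");   # valid UTF-8
my $bad  = decode_strict("fran\xE7ais");       # Latin-1 bytes, malformed as UTF-8

print defined $good ? "good input accepted\n" : "good input rejected\n";
print defined $bad  ? "bad input accepted\n"  : "bad input rejected\n";
```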

Still, 'CGI.pm' won't decode your inputs automagically.

p.s. Also, I'm no way 'CGI.pm' expert, but my perldoc-fu is good
enough,.. But I'd rather execute restraint ;)