
LWP and Unicode


Dale

Oct 2, 2006, 4:59:35 AM
I have a couple of questions/problems concerning LWP and
Unicode. Here's an ultra-simple program that goes to a web page,
downloads its contents and prints them out in a semi-readable form:

----------------------------------

#!/.../perl-5.8.8/bin/perl -CSDA

use utf8;
use LWP;
use Encode;
use URI::Escape;

my $browser = LWP::UserAgent->new;
$browser->parse_head(0);

my $url =
'http://bg.wiktionary.org/wiki/Уикиречник:Български/Типове_думи/Глаголи';
my $response = $browser->get(encode("utf8", $url));

my $content = decode("utf8", uri_unescape($response->content));

print "$content\n";

----------------------------------

Question 1: Why do I need the line that says

$browser->parse_head(0);


Question 2: Why do I need to explicitly say:

decode("utf8", ...)

Isn't there a way to tell LWP that the content is utf8? Or more
precisely, that it is utf8 with some URI percent escapes.


Question 3: If you change the pragma "use utf8" to "use encoding
'utf8'" then you don't need the call to "decode("utf8", ...)". Why
should this be? What's the difference between "use utf8" and "use
encoding 'utf8'"? The perldoc:perlunicode is no help here.


Question 4: In the original program, replace the line

my $content = decode("utf8", uri_unescape($response->content));

with

my $content = $response->content;
utf8::upgrade($content);

The perldoc:perlunicode page says you should do this when, for some
reason, Unicode does not happen. But this does nothing for me. I still
end up with bytes.

Dale

Oct 2, 2006, 9:32:28 AM
One more question in a similar vein. Using HTML::LinkExtor on a page
using Unicode, I can't seem to process the page without at least one
warning of the form:

Parsing of undecoded UTF-8 will give garbage when decoding entities
at ./verb_extor line 32.

The code I used was pretty straightforwardly modified from the
Cookbook:

--------------------------------

#!.../perl-5.8.8/bin/perl -w -CSDA

use utf8;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;


use Encode;
use URI::Escape;

my $url =
'http://bg.wiktionary.org/wiki/Уикиречник:Български/Типове_думи/Глаголи';
my $encoded_url = encode("utf8", $url);

$ua = LWP::UserAgent->new;
$ua->parse_head(0); #### without this line, you get the error twice

# Set up a callback that collects links
my @links = ();
sub callback {
my($tag, %attr) = @_;
my ($link) = values(%attr);
$link = url($link, $encoded_url)->abs;
$link = decode("utf8", uri_unescape($link));
push(@links, $link);
}

$p = HTML::LinkExtor->new(\&callback);

# Request document and parse it as it arrives
$ua->request(HTTP::Request->new(GET => encode("utf8", $url)),
sub {$p->parse($_[0])});


# Print them out
print join("\n", @links), "\n";

Ben Morrow

Oct 2, 2006, 5:40:23 PM

Quoth "Dale" <dale.ge...@googlemail.com>:

> I have a couple of questions/problems concerning LWP and
> Unicode. Here's an ultra-simple program that goes to a web page,
> downloads it's contents and prints them out in a semi-readable form:
>
> ----------------------------------
>
> #!/.../perl-5.8.8/bin/perl -CSDA

I presume this isn't your real #! line...

Do you know what -CSDA does? In this case it is useless, unless it
interferes with LWP's filehandle encodings. It is probably best avoided
until you understand Perl's (slightly odd) Unicode handling better.

> use utf8;
> use LWP;
> use Encode;
> use URI::Escape;
>
> my $browser = LWP::UserAgent->new;
> $browser->parse_head(0);
>

> my $url = 'http://bg.wiktionary.org/wiki/LotsaCyrillic';

Please don't post 8-bit data (including UTF8) to Usenet unless the
group's charter explicitly permits it.

> my $response = $browser->get(encode("utf8", $url));
>
> my $content = decode("utf8", uri_unescape($response->content));
>
> print "$content\n";
>
> ----------------------------------
>
> Question 1: Why do I need the line that says
>
> $browser->parse_head(0);

You don't. The docs for this are (surprisingly) in perldoc
LWP::UserAgent.

> Question 2: Why do I need to explicitly say:
>
> decode("utf8", ...)
>
> Isn't there a way to tell LWP that the content is utf8? Or more
> precisely, that it is utf8 with some URI percent escapes.

Not AFAIK. You probably ought to decode the data before you uri_unescape
it; one of the virtues of UTF-8 is that this doesn't matter, but it
would for other encodings.
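
A minimal sketch of the unescape-then-decode order used in the program above, run on a made-up sample: a percent-escaped Cyrillic letter, an "a", and the same letter as raw UTF-8 bytes. The percent_unescape helper is hypothetical; it only mimics what uri_unescape does for plain %XX escapes, so the sketch needs nothing beyond core modules.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for URI::Escape::uri_unescape: turn each %XX
# escape into the single byte it names.
sub percent_unescape {
    my ($bytes) = @_;
    $bytes =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $bytes;
}

# Sample input (an assumption): "%D1%86 a " followed by the raw UTF-8
# bytes D1 86, which encode U+0446.
my $raw = "%D1%86 a \xD1\x86";

# Step 1: undo the percent escapes, still working on bytes.
# Step 2: decode the whole byte string as UTF-8 into characters.
my $text = decode('UTF-8', percent_unescape($raw));

printf "%d characters, first is U+%04X\n", length($text), ord($text);
```

Done in this order, everything is still bytes when the decode happens. Decoding first would leave the later-unescaped %XX bytes as separate characters that still need a decode of their own.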

> Question 3: If you change the pragma "use utf8" to "use encoding
> 'utf8'" then you don't need the call to "decode("utf8", ...)". Why
> should this be? What's the difference between "use utf8" and "use
> encoding 'utf8'"? The perldoc:perlunicode is no help here.

The differences are

1. encoding supports many encodings.
2. encoding is probably negligibly slower.
3. encoding gives decent error recovery (as opposed to crashing
perl).
4. encoding sets a default PerlIO layer on STDIN and STDOUT, unless
you've already done so with the -C switch.

I can see no reason why the two should give different results in this
case; but perhaps your -CSDA is interfering.

> Question 4: In the original program, replace the line
>
> my $content = decode("utf8", uri_unescape($response->content));
>
> with
>
> my $content = $response->content;
> utf8::upgrade($content);
>
> The perldoc:perlunicode page says you should do this when, for some
> reason, Unicode does not happen. But this does nothing for me. I still
> end up with bytes.

IMHO perlunicode is wrong in this regard :). The utf8::* functions are
part of the internal implementation of utf8-handling; users should never
have cause to use them.

As of 5.8, Perl strings have an internal flag that marks them as being
stored in utf8. What utf8::upgrade does is

1. If the string already has the UTF8 flag on, quit.
2. For every top-bit-set byte in the string:
   3. Look up the appropriate character in ISO8859-1, and
   4. Replace the byte with that character's 2-byte encoding in
      utf8.
5. Set the UTF8 flag on the string, so that Perl now sees those
   2-byte sequences as one character each.

The net result, from the Perl level, is that *absolutely nothing has
changed*. The *only* Perl-visible change is that utf8::is_utf8 now
returns true, even if it returned false before; but you *shouldn't be
concerned with that*.

The correct function for 'this bunch of bytes happens to be a piece of
UTF8-encoded text; decode it and give me a string containing those
characters' is Encode::decode, as you have established.
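
The difference can be sketched on two bytes. This is only an illustration; the sample bytes (the UTF-8 encoding of U+0446) are my choice, not taken from the page being fetched.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xD1\x86";          # two bytes: the UTF-8 encoding of U+0446

# utf8::upgrade changes only the internal storage of the string.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

# Encode::decode interprets the bytes as UTF-8 and returns characters.
my $decoded = decode('UTF-8', $bytes);

printf "upgraded: length %d, same as original: %s\n",
    length($upgraded), ($upgraded eq $bytes ? 'yes' : 'no');
printf "decoded:  length %d, character U+%04X\n",
    length($decoded), ord($decoded);
```

The upgraded string still holds the two characters U+00D1 and U+0086 and compares equal to the original; only the decoded one holds the single Cyrillic letter.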

Ben

--
For far more marvellous is the truth than any artists of the past imagined!
Why do the poets of the present not speak of it? What men are poets who can
speak of Jupiter if he were like a man, but if he is an immense spinning
sphere of methane and ammonia must be silent?~Feynmann~benm...@tiscali.co.uk

Dale

Oct 4, 2006, 4:00:33 AM
Thanks Ben for the thorough answer. But there is still a difference
between "encoding 'utf8'" and "use utf8" that you are somehow missing.


Ben Morrow wrote:
> Quoth "Dale" <dale.ge...@googlemail.com>:
> > ...


> > Question 3: If you change the pragma "use utf8" to "use encoding
> > 'utf8'" then you don't need the call to "decode("utf8", ...)". Why
> > should this be? What's the difference between "use utf8" and "use
> > encoding 'utf8'"? The perldoc:perlunicode is no help here.
>
> The differences are
>
> 1. encoding supports many encodings.
> 2. encoding is probably negligbly slower.
> 3. encoding gives decent error recovery (as opposed to crashing
> perl).
> 4. encoding sets a default PerlIO layer on STDIN and STDOUT, unless
> you've already done so with the -C switch.
>
> I can see no reason why the two should give different results in this
> case; but perhaps your -CSDA is interfering.

I've eliminated the -CSDA and still get a major difference. Try this:

#!.../perl-5.8.8/bin/perl -w

# uncomment one of the following:
# use encoding 'utf8';
# use utf8;


use LWP;
use Encode;
use URI::Escape;

my $browser = LWP::UserAgent->new;
$browser->parse_head(0);

my $url =
'http://bg.wiktionary.org/wiki/%D0%A3%D0%B8%D0%BA%D0%B8%D1%80%D0%B5%D1%87%D0%BD%D0%B8%D0%BA:%D0%91%D1%8A%D0%BB%D0%B3%D0%B0%D1%80%D1%81%D0%BA%D0%B8/%D0%A2%D0%B8%D0%BF%D0%BE%D0%B2%D0%B5_%D0%B4%D1%83%D0%BC%D0%B8/%D0%93%D0%BB%D0%B0%D0%B3%D0%BE%D0%BB%D0%B8';

my $response = $browser->get($url);

my $content = uri_unescape($response->content);

print "$content\n";

---------------

The results are (for me) better with "use utf8". In this case $content
is a character sequence with human-readable characters.

It appears that "use encoding 'utf8'" decodes the UTF-8 before the
call to uri_unescape and "use utf8" decodes after the
uri_unescape. But this is just my uneducated guess.

Dale

Oct 4, 2006, 4:02:52 AM
Parsing of undecoded UTF-8 will give garbage ...

I (Dale Gerdemann) wrote:
> Question 1: Why do I need the line that says
>
> $browser->parse_head(0);

And Ben Morrow answered:


> You don't. The docs for this are (surprisingly) in perldoc
> LWP::UserAgent.

And the perldoc says:
> $ua->parse_head
> $ua->parse_head( $boolean )
> Get/set a value indicating whether we should initialize response head-
> ers from the <head> section of HTML documents. The default is TRUE.
> Do not turn this off, unless you know what you are doing.

Okay, I admit that I don't know what I'm doing. But I do know that


without the line, you get a warning that says:

Parsing of undecoded UTF-8 will give garbage when decoding entities

at /afs/sfs/lehre/dg/myperl/lib/LWP/Protocol.pm line 114.

I'm just trying to make Perl happy.

Dale Gerdemann

Dale

Oct 4, 2006, 6:32:51 AM
Sorry to respond multiple times to my own question, but I keep testing
things and am getting results I can't explain.


Ben Morrow wrote (concerning "use utf8" compared to "use encoding
'utf8'"):


>
> 1. encoding supports many encodings.
> 2. encoding is probably negligbly slower.
> 3. encoding gives decent error recovery (as opposed to crashing
> perl).
> 4. encoding sets a default PerlIO layer on STDIN and STDOUT, unless
> you've already done so with the -C switch.

From this, I suppose that the -CIO switch and "use encoding 'utf8'"
should be interchangeable, as long as there is no Unicode in the
program. The only point that is relevant here is number 4 from the
above list.

perldoc encoding says:
The encoding pragma also modifies the filehandle layers of STDIN

and STDOUT to the specified encoding. Therefore,

perldoc perlrun says:
The "-C" flag controls some of the Perl Unicode
features.

As of 5.8.1, the "-C" can be followed either by a number or
a list of option letters. The letters, their numeric
values, and effects are as follows; listing the letters is
equal to summing the numbers.

I 1 STDIN is assumed to be in UTF-8
O 2 STDOUT will be in UTF-8


BUT: Surprisingly, the two don't give the same results:

Here's my test program:
#!/afs/sfs/lehre/dg/perl-5.8.8/bin/perl # -CIO

# use encoding 'utf8';
use LWP;
use URI::Escape;

my $browser = LWP::UserAgent->new;
$browser->parse_head(0);

my $url =
'http://bg.wiktionary.org/wiki/%D0%A3%D0%B8%D0%BA%D0%B8%D1%80%D0%B5%D1%87%D0%BD%D0%B8%D0%BA:%D0%91%D1%8A%D0%BB%D0%B3%D0%B0%D1%80%D1%81%D0%BA%D0%B8/%D0%A2%D0%B8%D0%BF%D0%BE%D0%B2%D0%B5_%D0%B4%D1%83%D0%BC%D0%B8/%D0%93%D0%BB%D0%B0%D0%B3%D0%BE%D0%BB%D0%B8';


my $response = $browser->get($url);

my $content = uri_unescape($response->content);

print "$content\n";

-------------------

The -CIO switch and the encoding pragma are both commented out. There
are four possibilities to uncomment 0, 1 or 2 of these.

On the web page to be downloaded there is both:

1. utf8 encoded Unicode, and
2. escaped (percent encoded) utf8 encoded Unicode

So again there are 4 possibilities with one, the other, both or neither
of these Unicodes being correctly decoded.

And the results:

1. Using just the switch -CIO is a horrible failure. None of the
Unicode is decoded.
2. Using "use encoding 'utf8'" is better. The non-escaped Unicode is
decoded
3. Using both the switch -CIO and "use encoding 'utf8'" is the same as
just using the encoding pragma.
4. Using nothing at all gives the best result. All the Unicode is
correctly decoded.
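
One possible reading of results 1 and 4, sketched with in-memory filehandles instead of a real STDOUT. The assumptions here are mine: that the content is still undecoded UTF-8 bytes when it is printed, and that ':encoding(UTF-8)' behaves like the layer -CO installs for this input. The helper names are hypothetical.

```perl
use strict;
use warnings;

# Write $string through the given PerlIO layer into an in-memory buffer
# and return the bytes that would have reached the terminal.
sub bytes_written {
    my ($layer, $string) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer) or die "cannot open in-memory file: $!";
    print {$fh} $string;
    close($fh) or die "close failed: $!";
    return $buffer;
}

# Render a byte string as space-separated hex pairs.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

my $undecoded = "\xD1\x86";   # UTF-8 bytes for U+0446, never decoded

my $no_layer   = bytes_written(':raw', $undecoded);
my $utf8_layer = bytes_written(':encoding(UTF-8)', $undecoded);

print 'no layer:   ', hex_dump($no_layer), "\n";     # D1 86
print 'utf8 layer: ', hex_dump($utf8_layer), "\n";   # C3 91 C2 86
```

Without a layer the original bytes pass through untouched, which a UTF-8 terminal displays correctly. With a UTF-8 output layer each byte is treated as a Latin-1 character and encoded again, which would look like "none of the Unicode is decoded".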

Case 1 (the horrible failure) is in some ways better than cases 2 and
3. If none of the Unicode is decoded, then you can explicitly decode:

my $content = decode("utf8", uri_unescape($response->content));

In cases 2 and 3, this results in a failure:
Cannot decode string with wide characters at
/afs/sfs/lehre/dg/perl-5.8.8/lib/5.8.8/i686-linux/Encode.pm line 166.
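
The error itself is easy to reproduce in isolation. A small sketch, assuming the string handed to decode has already been decoded once somewhere earlier (what did that first decode in cases 2 and 3 is not established here):

```perl
use strict;
use warnings;
use Encode qw(decode);

# First decode: two UTF-8 bytes become the single character U+0446.
my $characters = decode('UTF-8', "\xD1\x86");

# Second decode: the input is no longer a byte string, because it
# contains a character above 255, so Encode refuses to treat it as bytes.
my $error = '';
eval { decode('UTF-8', $characters); 1 } or $error = $@;

print "second decode failed: $error" if $error;
```

So the message means "this is already characters, not bytes", which fits the content having been decoded before the explicit decode call.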

Can anyone explain this behavior?

Dale Gerdemann

Dale

Oct 5, 2006, 11:10:14 AM
How to read a web page containing partly utf8 and partly
percent-encoded utf8.

Assumption: We want, for whatever reason, to have "use encoding
'utf8'". It's not clear that this pragma helps, but this is the 21st
century and we just want to use Unicode.

Problem: The problem occurs in an unexpected place. URI::Escape's
uri_unescape does more than just decode the percent encoding. This
doesn't seem to be documented.

Alternatives: The obvious alternative is to parse more carefully and
apply the appropriate encoding/unencoding to the appropriate parts of
the document.

Questions: I still don't think I understand the difference between the
runtime switch -CIO and "use encoding 'utf8'". Okay, I know that
'encoding' allows you to put Unicode into your program. But beyond
that, there seems to be some difference in what happens with IO.


#!/afs/sfs/lehre/dg/perl-5.8.8/bin/perl

use encoding 'utf8';
use LWP;
use Encode qw(encode decode is_utf8);
use URI::Escape qw(uri_unescape);

my $browser = LWP::UserAgent->new;

## Tiny test web-page. Included just the line "h a h", where the h's
## are upside down (actually Cyrillic).
my $url =
'http://www.sfs.uni-tuebingen.de/iscl/Kursmaterialien/Gerdemann/foo.html';

my $response = $browser->get(encode("utf8", $url));

# raw_content is a byte sequence, containing some utf8 encoded bits
# and some percent encoded utf8 encoded bits
my $raw_content = $response->content;

# $encoded_content adds a double utf8 encoding to the already utf8 bits
# of raw_content
my $encoded_content = encode("utf8", $raw_content);

# $unescaped_content is different from $encoded_content in two
# respects.
# 1. The percent-encoded parts of $encoded_content are decoded into
#    utf8
# 2. The doubly utf-encoded bits of $encoded_content lose a layer of
#    utf encoding. WARNING: This only happens if there are actually some
#    percent encoded bits that get decoded. Is this documented somewhere???
my $unescaped_content = uri_unescape($encoded_content);

# After the previous step, everything was utf8, so now finally we turn
# it into Unicode.
my $decoded_content = decode("utf8", $unescaped_content);

print "$decoded_content\n";

# Test: change $decoded_content to the level you want to see.
# while ($decoded_content =~ m/(.)/g) {
# print ord $1, "\n";
# }
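
The "parse more carefully" alternative could be sketched as follows: decode the page once, then treat each run of percent escapes separately. This is an illustration under assumptions, not a tested replacement for the program above. The names decode_mixed and unescape_run are hypothetical, the inner substitution is a stand-in for uri_unescape, and every escape run is assumed to encode UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Turn one run of %XX escapes into bytes, then decode those bytes as UTF-8.
sub unescape_run {
    my ($run) = @_;
    $run =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return decode('UTF-8', $run);
}

# Decode a byte string that mixes raw UTF-8 with percent-escaped UTF-8.
sub decode_mixed {
    my ($raw) = @_;
    my $text = decode('UTF-8', $raw);                 # raw UTF-8 parts
    $text =~ s/((?:%[0-9A-Fa-f]{2})+)/unescape_run($1)/ge;   # escaped parts
    return $text;
}

# Same shape as the tiny test page: escaped letter, "a", raw letter.
my $page   = "%D1%86 a \xD1\x86";
my $result = decode_mixed($page);

printf "%d characters: %s\n", length($result),
    join ' ', map { sprintf('U+%04X', ord($_)) } split //, $result;
```

Each piece is decoded exactly once, so there is no need for the extra encode step, and nothing depends on how uri_unescape treats strings that are already characters.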

Mumia W. (reading news)

Oct 5, 2006, 12:37:58 PM
On 10/05/2006 10:10 AM, Dale wrote:
> How to read a web page containing partly utf8 and partly
> percent-encoded utf8.
>
> [...]
> 'http://www.sfs.uni-tuebingen.de/iscl/Kursmaterialien/Gerdemann/foo.html';
> [...]

403 Forbidden


--
paduille.4...@earthlink.net
Posting Guidelines for comp.lang.perl.misc:
http://www.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html

Dale

Oct 6, 2006, 4:55:21 AM
Mumia W. (reading news) wrote:
> 403 Forbidden

Whoops! I'm sure you managed to recreate the website yourself. But just
in case:

http://www.sfs.uni-tuebingen.de/~dg/fooo.html

The contents are:

%D1%86 a ц

Sorry again for violating the newsgroup charter by using Unicode here.
But sometime there ought to be a discussion of why the newsgroup has
such a charter. Perl allows programs to be written with Unicode, but
such programs cannot be discussed here. Does this make sense?

Mumia W. (reading news)

Oct 6, 2006, 1:57:30 PM
On 10/06/2006 03:55 AM, Dale wrote:
> Mumia W. (reading news) wrote:
>> 403 Forbidden
>
> Whoops! I'm sure you managed to recreate the website yourself. But just
> in case:
>
> http://www.sfs.uni-tuebingen.de/~dg/fooo.html
>
> The contents are:
>
> %D1%86 a ц
> [...]

I used GNU Wget to download that file (while using -s to also save the
HTTP headers):


> HTTP/1.1 200 OK
> Date: Fri, 06 Oct 2006 16:20:56 GMT
> Server: Apache/1.3.33 (Debian GNU/Linux)
> Last-Modified: Fri, 06 Oct 2006 08:40:50 GMT
> ETag: "1d27db-b-45261692"
> Accept-Ranges: bytes
> Content-Length: 11
> Keep-Alive: timeout=15, max=100
> Connection: Keep-Alive
> Content-Type: text/html; charset=iso-8859-1
>
> %D1%86 a ц

Your data seems to be UTF8, but you advertise it as iso-8859-1. Don't
you think that will confuse user agents such as LWP::UserAgent?
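
What the mismatch means for any client that trusts the header can be sketched with the two bytes at the end of that body. This only illustrates the two interpretations; it makes no claim about what LWP::UserAgent itself does with the header.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $body_tail = "\xD1\x86";   # the last two bytes of the body shown above

# Interpreted as the header declares it: each byte becomes one character.
my $as_declared = decode('iso-8859-1', $body_tail);

# Interpreted as what the data really is: one Cyrillic letter.
my $as_utf8 = decode('UTF-8', $body_tail);

printf "as iso-8859-1: %d characters (U+%04X U+%04X)\n",
    length($as_declared), map { ord($_) } split //, $as_declared;
printf "as UTF-8:      %d character  (U+%04X)\n",
    length($as_utf8), ord($as_utf8);
```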

Ben Morrow

Oct 6, 2006, 12:37:55 PM

Quoth "Dale" <dale.ge...@googlemail.com>:

It's not a question of this group's charter, it applies generally on
Usenet. There is no header in a Usenet article that specifies a charset,
so no way to use anything other than the default ASCII.

I agree in principle: some form of charset header should be added, or
the charset should simply be specified to be UTF8. But until it is,
please refrain from using it.

Ben

--
#!/bin/sh
quine="echo 'eval \$quine' >> \$0; echo quined"
eval $quine
# [benm...@tiscali.co.uk]

Dr.Ruud

Oct 7, 2006, 9:36:02 AM
Ben Morrow schreef:

> It's not a question of this group's charter, it applies generally on
> Usenet. There is no header in a Usenet article that specifies a
> charset, so no way to use anything other than the default ASCII.
>
> I agree in principle: some form of charset header should be added, or
> the charset should simply be specified to be UTF8. But until it is,
> please refrain from using it.

In practice there is no problem with headers like
"Content-Type: text/plain; charset=ISO-8859-1"
because most readers deal with them as expected.

Henry Spencer once (1994) created the Son-of-RFC-1036:
http://www.chemie.fu-berlin.de/outerspace/netnews/son-of-1036.html
to document the state of that moment, and stated MIME as relevant for
news articles.

See also USEFOR, the Grandson-of-RFC-1036:
http://www.ietf.org/html.charters/usefor-charter.html
("an urgent need has been identified to formalize and document
many of the current and proposed extensions to the Usenet
Article format")

--
Affijn, Ruud

"Gewoon is een tijger."


Alan J. Flavell

Oct 7, 2006, 10:31:11 AM
On Sat, 7 Oct 2006, Dr.Ruud wrote:

> In practice there is no problem with headers like
> "Content-Type: text/plain; charset=ISO-8859-1"
> because most readers deal with them as expected.

That's not the whole story: such postings should also have valid
MIME headers, or else the client is required to treat it as pre-MIME
format, which probably isn't what was intended. And IIRC some news
clients do indeed apply that rule.

Of course this is all de-facto, but the valid RFC for usenet (1036) is
now hopelessly out of date, so we have to live with some de-facto
rules. Which can be very well based on a best common factor between the
discussions for a grandson of RFC-1036, and the observed common
practice. (Common practice alone isn't good enough, since some
widely-used clients by default will violate what the rules are
expected to become).

Personal view: I wouldn't recommend using charset=utf-8 yet, except
perhaps on groups where its use is already widespread. iso-8859-1 is
very widely supported, and windows-1252, although proprietary and
therefore to be deprecated, is pretty widely supported; iso-8859-15 is
somewhat less supported, I'm disinclined to recommend it, but I note
that some folks use it for their usenet postings.

[ Totally OT: use of iso-8859-15 for HTML is utterly pointless. ]

IMHO and YMMV. But successful communication depends on a certain
conservatism in what one sends - not relying on the generosity of the
recipient to interpret it liberally.

Bart Van der Donck

Oct 7, 2006, 11:10:37 AM
Dr.Ruud wrote:

> [...]


> In practice there is no problem with headers like
> "Content-Type: text/plain; charset=ISO-8859-1"
> because most readers deal with them as expected.
>
> Henry Spencer once (1994) created the Son-of-RFC-1036:
> http://www.chemie.fu-berlin.de/outerspace/netnews/son-of-1036.html
> to document the state of that moment, and stated MIME as relevant for
> news articles.
>
> See also USEFOR, the Grandson-of-RFC-1036:
> http://www.ietf.org/html.charters/usefor-charter.html
> ("an urgent need has been identified to formalize and document
> many of the current and proposed extensions to the Usenet
> Article format")

I think the current situation is about like this:

One can safely use ASCII on Usenet with or without a charset header.
One can safely use ISO-8859-1, but only when specifying it in the
header.
Other charsets work when supported, but should be used carefully
depending on the circumstances. For example, on a Russian discussion
group it's reasonable to use KOI8-R. But one should obviously always
specify the charset in such cases.

I think one should not rely on any Unicode charset on Usenet (yet).

Google Groups deals with this issue as follows:

(1) Default to ISO-8859-1 when possible, yes even with plain ASCII.
(2) Use custom charset if the offered characters can unambiguously be
represented in that charset and if ISO-8859-1 is too narrow; perhaps
also considering browser settings/preferences.
(3) Use UTF-8 if the above fails; I suppose mostly in charset
combinations, 'tricky' replies or really exotic stuff.

A good policy, IMO.

--
Bart

Dale

Oct 9, 2006, 3:03:29 AM

Mumia W. (reading news) wrote:
> On 10/06/2006 03:55 AM, Dale wrote:
> > Mumia W. (reading news) wrote:

> Your data seems to be UTF8, but you advertise it as iso-8859-1. Don't
> you think that will confuse user agents such as LWP::UserAgent?
>
>

Yes, I know. It's not configured properly for serving UTF8. That's why
I at first put it at a different URL where UTF8 is handled correctly.
But I forgot that this site is only local.

Dale

Oct 9, 2006, 4:07:46 AM
Bart Van der Donck wrote:
> (1) Default to ISO-8859-1 when possible, yes even with plain ASCII.
> (2) Use custom charset if the offered characters can unambiguously be
> represented in that charset and if ISO-8859-1 is too narrow; perhaps
> also considering browser settings/preferences.
> (3) Use UTF-8 if the above fails; I suppose mostly in charset
> combinations, 'tricky' replies or really exotic stuff.

and Alan J Flavell wrote:
> ... successful communication depends on a certain


> conservatism in what one sends - not relying on the generosity of the
> recipient to interpret it liberally.

Isn't UTF8 the most conservative choice nowadays? Look at Wikipedia or
Wiktionary. Massive international websites all in UTF8. And look at the
Russian Wikipedia, for example. It doesn't use a "custom charset" at
all.

The idea that UTF8 should be reserved for "really exotic stuff" seems
very weird. Look at any Wikipedia page dealing with mathematics, and
you're bound to find UTF8 used for quite normal things. Here, for
example, is the rule for the associativity of function composition:

f ∘ (g ∘ h) = (f ∘ g) ∘ h

Try to say that in ASCII or ISO-8859-1!

Dale

Dr.Ruud

Oct 9, 2006, 5:48:33 AM
Dale schreef:

> Isn't UTF8 the most consertive choice nowadays? Look at Wikipedia or
> Wiktionary. Massive international websites all in UTF8. And look at
> the Russian Wikipedia. for example. It doesn't use a "custom charset"
> at all.

See Subject, this is about "Usenet and charsets", not about HTML.

Your newsclient doesn't remove the
/[[:blank:]]+[(]was: Re: .*[)]$/
part from the Subject header field,
so you need to do it by hand.
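
For illustration, the same pattern applied by hand in Perl. The subject string is a made-up example, not the actual header of any article in this thread.

```perl
use strict;
use warnings;

# Hypothetical Subject line of a follow-up in a renamed thread.
my $subject = '[META] Usenet and charsets (was: Re: LWP and Unicode)';

# Strip the trailing "(was: Re: ...)" part and the blanks before it.
(my $cleaned = $subject) =~ s/[[:blank:]]+[(]was: Re: .*[)]$//;

print "$cleaned\n";   # [META] Usenet and charsets
```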

Your broken newsclient does remove the [anything] prefix
from the Subject header field, which is real bad.

My broken newsclient (OE6) does a lot of real bad things too, but used
together with OE-QuoteFix and Hamster it is almost OK.

Peter J. Holzer

Oct 9, 2006, 9:01:19 AM
On 2006-10-09 09:48, Dr.Ruud <rvtol...@isolution.nl> wrote:
> Dale schreef:
>> Isn't UTF8 the most consertive choice nowadays? Look at Wikipedia or
>> Wiktionary. Massive international websites all in UTF8. And look at
>> the Russian Wikipedia. for example. It doesn't use a "custom charset"
>> at all.
>
> See Subject, this is about "Usenet and charsets", not about HTML.

Yup. Usenet is more conservative than the WWW. UTF-8 is only about 14
years old, so you can't expect all newsreaders to support it. Still, I
think that properly declared UTF-8 should be acceptable in international
newsgroups, and since nobody has complained about my postings yet, I
take it as evidence that my newsreader's inability to use ISO-8859-1
where sufficient is only a minor bug.


> Your newsclient doesn't remove the
> /[[:blank:]]+[(]was: Re: .*[)]$/
> part from the Subject header field,
> so you need to do it by hand.
>
> Your broken newsclient does remove the [anything] prefix
> from the Subject header field, which is real bad.

Weird. Dale seems to be using Mozilla 1.7.8 from Debian. I just
installed that (although a slightly newer version), and can't reproduce
this: [META] is preserved and (was: ...) is automatically removed.

hp


--
_ | Peter J. Holzer | > Wieso sollte man etwas erfinden was nicht
|_|_) | Sysadmin WSR | > ist?
| | | h...@hjp.at | Was sonst wäre der Sinn des Erfindens?
__/ | http://www.hjp.at/ | -- P. Einstein u. V. Gringmuth in desd

Dr.Ruud

Oct 9, 2006, 9:21:31 AM
Peter J. Holzer schreef:
> Dr.Ruud:

>> [to Dale]


>> Your newsclient doesn't remove the
>> /[[:blank:]]+[(]was: Re: .*[)]$/
>> part from the Subject header field,
>> so you need to do it by hand.
>>
>> Your broken newsclient does remove the [anything] prefix
>> from the Subject header field, which is real bad.
>
> Weird. Dale seems to be using Mozilla 1.7.8 from Debian. I just
> installed that (although a slightly newer version), and can't
> reproduce this: [META] is preserved and (was: ...) is automatically
> removed.

I assumed that Dale had used the googlegroups-interface for a
newsclient. The [META] was already removed with Bart's reply.

Peter J. Holzer

Oct 9, 2006, 10:37:37 AM
On 2006-10-09 13:21, Dr.Ruud <rvtol...@isolution.nl> wrote:
> Peter J. Holzer schreef:
>> Dr.Ruud:
>>> Your broken newsclient does
[weird things to the subject]

>> Weird. Dale seems to be using Mozilla 1.7.8 from Debian. I just
>> installed that (although a slightly newer version), and can't
>> reproduce this: [META] is preserved and (was: ...) is automatically
>> removed.
>
> I assumed that Dale had used the googlegroups-interface for a
> newsclient.

You are right. I saw Mozilla 1.7.8 in what looked like a useragent header, and
didn't notice that it was really an 'X-HTTP-Useragent' header. Sorry for
the confusion.

Dale

Oct 10, 2006, 5:18:01 AM
Dr.Ruud wrote:

> I assumed that Dale had used the googlegroups-interface for a
> newsclient. The [META] was already removed with Bart's reply.

Yes. I'm using googlegroups. I'm not sure I understand in what way it's
broken, but I assume you guys are right

--- except of course when it comes to the virtue of using Unicode in
Usenet ☺

Dale

Ted Zlatanov

Oct 10, 2006, 1:33:17 PM
On 6 Oct 2006, benm...@tiscali.co.uk wrote:

> It's not a question of this group's charter, it applies generally on
> Usenet. There is no header in a Usenet article that specifies a charset,
> so no way to use anything other than the default ASCII.
>
> I agree in principle: some form of charset header should be added, or
> the charset should simply be specified to be UTF8. But until it is,
> please refrain from using it.

So there are really two questions:

1) are there any newsreaders that would cause problems when they found
UTF-8 encoded text? Anything from a crash to an unusable session
would qualify (if Ctrl-L fixes it, I wouldn't consider it a
significant problem).

2) why not do a vote to change the charter to make UTF-8 the charset
for c.l.p.m? I personally see no problem with this, since UTF-8
works well in practice and is conservative in byte usage.

Ted

Jürgen Exner

Oct 10, 2006, 2:15:28 PM
Ted Zlatanov wrote:
> On 6 Oct 2006, benm...@tiscali.co.uk wrote:
>
>> It's not a question of this group's charter, it applies generally on
>> Usenet. There is no header in a Usenet article that specifies a
>> charset, so no way to use anything other than the default ASCII.
>>
>> I agree in principle: some form of charset header should be added, or
>> the charset should simply be specified to be UTF8. But until it is,
>> please refrain from using it.
>
> So there are really two questions:
>
> 1) are there any newsreaders that would cause problems when they found
> UTF-8 encoded text? Anything from a crash to an unusable session
> would qualify (if Ctrl-L fixes it, I wouldn't consider it a
> significant problem).

It is not only a question of newsreaders supporting UTF-8. Because Usenet
was designed for ASCII (and that was never changed), there is no guarantee
that all gateways and switches are 8-bit clean and won't corrupt messages
that use 8 bits.

> 2) why not do a vote to change the charter to make UTF-8 the charset
> for c.l.p.m?

Well, that's like a small town deciding that within its boundaries everyone
should drive on the left side of the road.

jue


Peter J. Holzer

Oct 10, 2006, 3:38:15 PM
On 2006-10-10 18:15, Jürgen Exner <jurg...@hotmail.com> wrote:
> Ted Zlatanov wrote:
>> On 6 Oct 2006, benm...@tiscali.co.uk wrote:
>>
>>> It's not a question of this group's charter, it applies generally on
>>> Usenet. There is no header in a Usenet article that specifies a
>>> charset, so no way to use anything other than the default ASCII.

Content-Type: text/plain; charset=...

Introduced in RFC 1341 (June 1992). It is true that RFC 1036 was never
updated, but MIME is current practice on usenet.


>>> I agree in principle: some form of charset header should be added, or
>>> the charset should simply be specified to be UTF8. But until it is,
>>> please refrain from using it.
>>
>> So there are really two questions:
>>
>> 1) are there any newsreaders that would cause problems when they found
>> UTF-8 encoded text?

There are some valid UTF-8 sequences which cause problems when they are
sent verbatim to a VT100-descended terminal. Of course there are also
ASCII sequences which will do that, so any newsreader which doesn't
filter unsafe characters is broken anyway.


> It is not only a question of newsreaders supporting UTF-8 but because Usenet
> was designed for ASCII (and that was never changed) there is no guarantee
> the all gateways and switches are 8-bit clean and won't corrupt messages
> that use 8 bits.

There is no guarantee, but in practice 7-bit-ASCII based newsservers
stopped being a problem in the early nineties - even before MIME was invented
(this may even have slowed down adoption of MIME in usenet - if you could
use any 8-bit-charset by convention (like ISO-8859-1 in de.* and fr.*),
why implement a baroque encoding scheme? In E-Mail MIME was necessary
because of sendmail's, er, interesting interpretation of the robustness
principle). EBCDIC hosts were a problem a bit longer, but they weren't
nice to pure ASCII postings, either.

>> 2) why not do a vote to change the charter to make UTF-8 the charset
>> for c.l.p.m?
>
> Well, that's like a small town deciding that within its boundaries everyone
> should drive on the left side of the road.

In the rest of the country everybody drives wherever they please :-).

Seriously: As much as I liked the USEFOR proposal to make UTF-8 the
default charset (instead of ASCII) on usenet, and as much as I dislike
MIME, I don't think declaring UTF-8 to be the default charset for a
single group would be a good idea. Charsets should be properly declared
in a MIME Content-Type header. As long as the charset is correctly
encoded, I think any reasonably widespread charset (and that includes
UTF-8) should be acceptable.

Ben Morrow

Oct 10, 2006, 11:03:19 PM

Quoth "Peter J. Holzer" <hjp-u...@hjp.at>:

> On 2006-10-10 18:15, Jürgen Exner <jurg...@hotmail.com> wrote:
> > Ted Zlatanov wrote:
> >> On 6 Oct 2006, benm...@tiscali.co.uk wrote:
> >>
> >>> It's not a question of this group's charter, it applies generally on
> >>> Usenet. There is no header in a Usenet article that specifies a
> >>> charset, so no way to use anything other than the default ASCII.
>
> Content-Type: text/plain; charset=...
>
> Introduced in RFC 1341 (June 1992). It is true that RFC 1036 was never
> updated, but MIME is current practice on usenet.

Really??? I was under the impression it was considered rude.

> >>> I agree in principle: some form of charset header should be added, or
> >>> the charset should simply be specified to be UTF8. But until it is,
> >>> please refrain from using it.
> >>
> >> So there are really two questions:
> >>
> >> 1) are there any newsreaders that would cause problems when they found
> >> UTF-8 encoded text?

I can't read it: I get either lots of squares or sometimes (not sure
why) lots of question marks. I don't know if this is what you mean by
'cause problems', but I am much less likely to reply to people who post
requiring understanding of utf8.

> There are some valid UTF-8 sequences which cause problems when they are
> sent verbatim to a VT100-descended terminal. Of course there are also
> ASCII sequences which will do that, so any newsreader which doesn't
> filter unsafe characters is broken anyway.

Yes, of course.

Ben

--
The cosmos, at best, is like a rubbish heap scattered at random.
Heraclitus
benm...@tiscali.co.uk

Dr.Ruud

Oct 11, 2006, 2:17:48 AM
Dale wrote:
> Dr.Ruud:

> Usenet :)

Most have no problems with Unicode being used on Usenet, it's just that
it still lacks some formal backing.

I do have problems with googlegroups being used as a news client,
because it corrupts the message. One thing is that it strips out a
[tag], as your followup demonstrates. You need to use "Show
Options"/"Show original" etc. to find out what the real value of the
Subject header field is.
http://www.google.co.uk/search?q=googlegroups+tags+subject
http://forumz.tomshardware.com/games/Google-Groups-bug-response-Google-ftopict64830.html
(Aug 2005)
http://www.gatago.com/misc/kids/moderated/2373480.html

Dr.Ruud

Oct 11, 2006, 3:37:53 AM
Ben Morrow wrote:
> Peter J. Holzer:

>> MIME is current practice on usenet.
>
> Really??? I was under the impression it was considered rude.

I consider MIME very much accepted. (And your "?"-key broken. :)

No doubt MIME is considered rude by some. So are non-ASCII articles
that lack a (MIME) Content-Type header field declaring the charset.
Nobody likes illegible or damaged messages.


> [UTF-8]


> I can't read it: I get either lots of squares or sometimes (not
> sure why) lots of question marks.

The squares point to missing characters in the current font.
Question marks are used for character codes that don't exist in the
locale of the text display.


> I don't know if this is what you mean by
> 'cause problems', but I am much less likely to reply to people
> who post requiring understanding of utf8.

Perl 6 has Latin1-alternatives for (most of?) its Unicode dependent
syntax.

Peter J. Holzer

Oct 11, 2006, 12:39:29 PM
On 2006-10-11 03:03, Ben Morrow <benm...@tiscali.co.uk> wrote:
> Quoth "Peter J. Holzer" <hjp-u...@hjp.at>:
>> On 2006-10-10 18:15, JÃŒrgen Exner <jurg...@hotmail.com> wrote:
>> > Ted Zlatanov wrote:
>> >> On 6 Oct 2006, benm...@tiscali.co.uk wrote:
>> >>> It's not a question of this group's charter, it applies generally on
>> >>> Usenet. There is no header in a Usenet article that specifies a
>> >>> charset, so no way to use anything other than the default ASCII.
>>
>> Content-Type: text/plain; charset=...
>>
>> Introduced in RFC 1341 (June 1992). It is true that RFC 1036 was never
>> updated, but MIME is current practice on usenet.
>
> Really??? I was under the impression it was considered rude.

If it is, then most posters in this group are rude (well, some people
think this is the case even without MIME :-)).

I have currently 2805 messages from this group in my spool, 1954 of
which (70%) are MIME messages. Of these, about 1400 use ISO-8859-1,
about 400 use US-ASCII, about 70 UTF-8, and about 30 ISO-8859-15.

MIME is really a rather generic framework for encapsulating (almost)
arbitrary content into RFC-822 messages. Among other things, it provides
hierarchically structured multipart messages in which each part can
have a different type (commonly called "attachments"), declaration of
charsets used in the body and some headers, and transfer encodings to
cope with non-8bit-clean transports.

Of these, only charset declarations are (AFAIK) widely used on usenet:

Multipart messages are not needed (except maybe in binary groups, but
even there uuencode and yenc are preferred), because usenet messages
usually contain only plain text - other text formats (like HTML) never
caught on.

Transfer encodings aren't needed because NNTP is in practice 8bit clean.

The charset parameter is needed whenever a message contains non-ASCII
characters. This is of course frequently the case when the language
isn't English, but it also happens in English - for example in names (I
notice your newsreader butchered Jürgen's name in the attribution line)
or because the topic of the conversation is about something which isn't
easy to express in US-ASCII (for example, unicode problems are a
frequent topic in this newsgroup). So a newsreader may not need to
support all 5 MIME RFCs, but if it doesn't at least support text/plain
with the most frequent charsets and RFC 2047, it's at least a bit of a
pain.
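
To make the garbling concrete, here is a minimal Perl sketch. It assumes
the attribution line was sent as UTF-8 and then read as ISO-8859-15; that
pairing is a guess based on how the name shows up in the quoted text, not
something anyone in the thread has confirmed. The variable names are
illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The name as a string of characters (U+00FC is u with diaeresis).
my $name = "J\x{FC}rgen";

# A client that posts in UTF-8 sends U+00FC as the two octets C3 BC.
my $bytes = encode('UTF-8', $name);

# Assumption: the reader has no charset declaration to go by and
# falls back to ISO-8859-15, so it sees two characters instead of one.
my $misread = decode('ISO-8859-15', $bytes);

# A reader that honours "charset=UTF-8" recovers the original string.
my $correct = decode('UTF-8', $bytes);

printf "octets sent: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $bytes;
printf "misread as:  %s\n",
    join ' ', map { sprintf 'U+%04X', ord } split //, $misread;
printf "declared:    %s\n", $correct eq $name ? 'round trip ok' : 'broken';
```

Each hop that guesses wrong and re-encodes adds another layer of the same
damage, which is why a declared charset matters even for a single
non-ASCII character.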

>> >>> I agree in principle: some form of charset header should be added, or
>> >>> the charset should simply be specified to be UTF8. But until it is,
>> >>> please refrain from using it.
>> >>
>> >> So there are really two questions:
>> >>
>> >> 1) are there any newsreaders that would cause problems when they found
>> >> UTF-8 encoded text?
>
> I can't read it: I get either lots of squares or sometimes (not sure
> why) lots of question marks. I don't know if this is what you mean by
> 'cause problems',

As I understood Ted, he was talking about more serious problems than an
occasional unreadable character: crashes of the newsreader and/or the
terminal (or - more likely these days - the graphics subsystem of the
OS). I would also consider it a serious problem if the first appearance
of a UTF-8 character garbles the rest of the line or even the article.
Garbling the subject line is also bad (and depressingly frequent), but
not much of a problem in this group, because the subject is usually a
short English phrase and therefore unlikely to contain non-ascii
characters.

Ted Zlatanov

Oct 11, 2006, 2:08:10 PM
On 10 Oct 2006, hjp-u...@hjp.at wrote:

On 2006-10-10 18:15, Jürgen Exner <jurg...@hotmail.com> wrote:
> Ted Zlatanov wrote:
>>> 2) why not do a vote to change the charter to make UTF-8 the charset
>>> for c.l.p.m?
>>
>> Well, that's like a small town deciding that within its boundaries everyone
>> should drive on the left side of the road.
>

> Seriously: As much as I liked the USEFOR proposal to make UTF-8 the
> default charset (instead of ASCII) on usenet, and as much as I dislike
> MIME, I don't think declaring UTF-8 to be the default charset for a
> single group would be a good idea. Charsets should be properly declared
> in a MIME Content-Type header. As long as the charset is correctly
> encoded, I think any reasonably widespread charset (and that includes
> UTF-8) should be acceptable.

Thanks, Peter. I agree with all you said, except I think UTF-8 is not
a charset, contrary to what MIME claims, right? UCS is the charset,
UTF-8 is an encoding. Is UCS the real charset when Content-Type
specifies "charset=utf-8"?

This layers bizarrely on top of the MIME Content-Transfer-Encoding, of
course. Will UCS data be encoded twice in the end?

Ted

Peter J. Holzer

Oct 11, 2006, 5:04:40 PM
On 2006-10-11 18:08, Ted Zlatanov <t...@lifelogs.com> wrote:
> On 10 Oct 2006, hjp-u...@hjp.at wrote:
>> On 2006-10-10 18:15, Jürgen Exner <jurg...@hotmail.com> wrote:
>>> Ted Zlatanov wrote:
>>>> 2) why not do a vote to change the charter to make UTF-8 the charset
>>>> for c.l.p.m?
[...]

>> Seriously: As much as I liked the USEFOR proposal to make UTF-8 the
>> default charset (instead of ASCII) on usenet, and as much as I dislike
>> MIME, I don't think declaring UTF-8 to be the default charset for a
>> single group would be a good idea. Charsets should be properly declared
>> in a MIME Content-Type header. As long as the charset is correctly
>> encoded, I think any reasonably widespread charset (and that includes
>> UTF-8) should be acceptable.
>
> Thanks, Peter. I agree with all you said, except I think UTF-8 is not
> a charset, contrary to what MIME claims, right?

Right. The terminology is a mess. What MIME calls a "charset" is more
commonly known as a "character encoding". (In fact, I thought one of
the MIME RFCs mentions that, but I can't find it right now.)

When I try to explain that stuff I distinguish between

* character set - a set of characters in the mathematical sense, i.e.
unordered.

* coded character set - as above, but each character is associated with
a numerical code.

* character encoding - a particular mapping of a coded character set
onto sequences of octets (or bits). MIME calls this a "charset", the
Unicode standard calls it a "transformation format".

I use the term "charset" only when I talk about MIME, otherwise I talk
of "(coded) character sets" and don't abbreviate them to "charset".
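
The three terms above can be sketched in Perl roughly like this. It is
only an illustration: the choice of U+00FC and of the three encodings is
arbitrary, and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character: LATIN SMALL LETTER U WITH DIAERESIS.
my $char = "\x{FC}";

# Coded character set: the character is associated with a number.
# (Unicode and ISO-8859-1 happen to agree on this one.)
my $code_point = ord $char;

# Character encoding: how that number is mapped onto octets.
# Same character, same code, different octet sequences.
my %octets = map { $_ => encode($_, $char) } qw(ISO-8859-1 UTF-8 UTF-16BE);

printf "code point: U+%04X\n", $code_point;
for my $enc (sort keys %octets) {
    printf "%-10s %s\n", $enc,
        join ' ', map { sprintf '%02X', ord } split //, $octets{$enc};
}
```

What MIME's "charset=" parameter names is the last of these steps, which
is why "charset=utf-8" is legal even though UTF-8 is not a character set
in the first sense.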


> UCS is the charset, UTF-8 is an encoding. Is UCS the real charset
> when Content-Type specifies "charset=utf-8"?

Yes.

> This layers bizarrely on top of the MIME Content-Transfer-Encoding, of
> course. Will UCS data be encoded twice in the end?

This can happen, yes. If a message with UTF-8 content is to be
transmitted over a channel which isn't 8bit clean, a
Content-Transfer-Encoding of quoted-printable or base64 must be applied.
Think of UTF-8 as a mapping from a sequence of 16-bit (or 32-bit)
quantities onto a sequence of 8-bit quantities, and quoted-printable or
base64 as a mapping from a sequence of 8-bit quantities onto a sequence
of 7-bit quantities. (The remaining Content-Transfer-Encodings 7bit,
8bit and binary are transparent)
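
The two layers can be sketched in Perl with the core Encode,
MIME::QuotedPrint and MIME::Base64 modules. This is an illustration under
the assumption that the whole body is one short string; a real newsreader
would also have to handle headers and line lengths.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use MIME::QuotedPrint qw(encode_qp decode_qp);
use MIME::Base64 qw(encode_base64 decode_base64);

my $text = "J\x{FC}rgen";

# Layer 1 (charset=UTF-8): characters -> octets.
my $octets = encode('UTF-8', $text);

# Layer 2 (Content-Transfer-Encoding): octets -> 7-bit-safe ASCII.
# The empty second argument asks for no line breaks to be added.
my $qp  = encode_qp($octets, '');
my $b64 = encode_base64($octets, '');

print "quoted-printable: $qp\n";
print "base64:           $b64\n";

# The receiver undoes the layers in the reverse order.
my $from_qp  = decode('UTF-8', decode_qp($qp));
my $from_b64 = decode('UTF-8', decode_base64($b64));
print "round trip: ",
    ($from_qp eq $text && $from_b64 eq $text ? 'ok' : 'broken'), "\n";
```

With a transfer encoding of 8bit the second layer is simply skipped and
the UTF-8 octets go on the wire as they are.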

(Of course it doesn't stop there: SMTP and NNTP do a trivial bit of
extra encoding ("dot-stuffing"), TCP and IP only paste their headers
before chunks of data, but PPP for example is a bit more complicated,
and I don't really want to know what a DSL modem does to my precious
bits :-)).
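
For the curious, dot-stuffing really is trivial. A rough Perl sketch,
assuming the body uses CRLF line endings and ends with one; the function
names dot_stuff and dot_unstuff are made up for this example:

```perl
use strict;
use warnings;

# Sender side: double any dot at the start of a line, then terminate
# the article with a line consisting of a single dot.
sub dot_stuff {
    my ($body) = @_;              # lines separated by CRLF
    $body =~ s/^\./../mg;
    return $body . ".\r\n";
}

# Receiver side: drop the terminating line, then remove the extra dots.
sub dot_unstuff {
    my ($wire) = @_;
    $wire =~ s/^\.\r\n\z//m;      # the lone dot that ends the article
    $wire =~ s/^\.\././mg;        # undo the doubling
    return $wire;
}

my $body = "first line\r\n.signature-like line\r\n";
my $wire = dot_stuff($body);
print dot_unstuff($wire) eq $body ? "round trip ok\n" : "broken\n";
```

The point is only that a line starting with a dot can never be mistaken
for the end of the article; the content itself is otherwise untouched.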

Ben Morrow

Oct 11, 2006, 7:03:50 PM

Quoth "Peter J. Holzer" <hjp-u...@hjp.at>:
> On 2006-10-11 03:03, Ben Morrow <benm...@tiscali.co.uk> wrote:
> > Quoth "Peter J. Holzer" <hjp-u...@hjp.at>:
> >> On 2006-10-10 18:15, JÃŒrgen Exner <jurg...@hotmail.com> wrote:
> >> > Ted Zlatanov wrote:
> >> >> On 6 Oct 2006, benm...@tiscali.co.uk wrote:
> >> >>> It's not a question of this group's charter, it applies generally on
> >> >>> Usenet. There is no header in a Usenet article that specifies a
> >> >>> charset, so no way to use anything other than the default ASCII.
> >>
> >> Content-Type: text/plain; charset=...
> >>
> >> Introduced in RFC 1341 (June 1992). It is true that RFC 1036 was never
> >> updated, but MIME is current practice on usenet.
> >
> > Really??? I was under the impression it was considered rude.
>
> If it is, then most posters in this group are rude (well, some people
> think this is the case even without MIME :-)).
>
> I have currently 2805 messages from this group in my spool, 1954 of
> which (70%) are MIME messages. Of these, about 1400 use ISO-8859-1,
> about 400 use US-ASCII, about 70 UTF-8, and about 30 ISO-8859-15.

OK, maybe I wasn't clear. What I meant was that actually making any
*use* of MIME (as opposed to simply being passively conformant with it)
such as multipart messages, signed messages; in fact any messages that
are not 'text/plain; charset=us-ascii' appears to me to be considered
rude. It really doesn't matter which of ISO-8859-* or UTF-8 people mark
their message as, as long as they stick to US-ASCII for anything they
want anyone else to be able to read.

> Of these, only charset declarations are (AFAIK) widely used on usenet:

...and those are, in practice, redundant.

> Multipart messages are not needed (except maybe in binary groups, but
> even there uuencode and yenc are preferred), because usenet messages
> usually contain only plain text - other text formats (like HTML) never
> caught on.

Multipart messages and HTML messages are rude.

> Transfer encodings aren't needed because NNTP is in practice 8bit clean.

Transfer encodings aren't needed because Usenet articles are in practice
in US-ASCII, which doesn't need encoding; also because newsreaders don't
decode them.

> The charset parameter is needed whenever a message contains non-ASCII
> characters. This is of course frequently the case when the language
> isn't English, but it also happens in English - for example in names (I
> notice your newsreader butchered Jürgen's name in the attribution line)
> or because the topic of the conversation is about something which isn't
> easy to express in US-ASCII (for example, unicode problems are a
> frequent topic in this newsgroup). So a newsreader may not need to
> support all 5 MIME RFCs, but if it doesn't at least support text/plain
> with the most frequent charsets and RFC 2047, it's at least a bit of a
> pain.

I completely agree with the sentiment here: the ability to write in
other charsets would be very useful. However, my newsreader, although it
makes some attempt to handle MIME, doesn't appear to handle them at all;
while you could say 'that's your problem' I would respectfully disagree
until 1036 is updated.

Ben

--
Outside of a dog, a book is a man's best friend.
Inside of a dog, it's too dark to read.
benm...@tiscali.co.uk Groucho Marx

Jürgen Exner

Oct 11, 2006, 10:42:49 PM
Ben Morrow wrote:
> Transfer encodings aren't needed because Usenet articles are in
> practice in US-ASCII, [...]

Oh, really? I wonder how you write messages in, let's say, French, German,
Norwegian, Greek, Russian, Chinese, Japanese, Hebrew, Arabic and most other
languages with US-ASCII only.

jue


Peter J. Holzer

Oct 12, 2006, 4:01:40 AM
On 2006-10-11 23:03, Ben Morrow <benm...@tiscali.co.uk> wrote:
>
> Quoth "Peter J. Holzer" <hjp-u...@hjp.at>:
>> On 2006-10-11 03:03, Ben Morrow <benm...@tiscali.co.uk> wrote:
>> > Quoth "Peter J. Holzer" <hjp-u...@hjp.at>:
>> >> On 2006-10-10 18:15, JÃŒrgen Exner <jurg...@hotmail.com> wrote:
>> >> > Ted Zlatanov wrote:
>> >> >> On 6 Oct 2006, benm...@tiscali.co.uk wrote:
>> >> >>> It's not a question of this group's charter, it applies generally on
>> >> >>> Usenet. There is no header in a Usenet article that specifies a
>> >> >>> charset, so no way to use anything other than the default ASCII.
>> >>
>> >> Content-Type: text/plain; charset=...
>> >>
>> >> Introduced in RFC 1341 (June 1992). It is true that RFC 1036 was never
>> >> updated, but MIME is current practice on usenet.
>> >
>> > Really??? I was under the impression it was considered rude.
>>
>> If it is, then most posters in this group are rude (well, some people
>> think this is the case even without MIME :-)).
>>
>> I have currently 2805 messages from this group in my spool, 1954 of
>> which (70%) are MIME messages. Of these, about 1400 use ISO-8859-1,
>> about 400 use US-ASCII, about 70 UTF-8, and about 30 ISO-8859-15.
>
> OK, maybe I wasn't clear. What I meant was that actually making any
> *use* of MIME (as opposed to simply being passively conformant with it)
> such as multipart messages, signed messages;

Properly declaring the used charset *is* a use of MIME. There is no way
you can do that without using MIME.

(There are probably quite a few of those 1500 messages labelled as
non-ASCII which don't actually contain any non-ASCII characters. OTOH,
there are probably also quite a few (for example yours in this thread)
which do contain non-ASCII characters and are missing a proper
declaration)

> in fact any messages that are not 'text/plain; charset=us-ascii'
> appears to me to be considered rude.

Given that about 50% of the messages in this group fit that description
and nobody except you is complaining, your perception appears to me to
be slightly off.

> It really doesn't matter which of ISO-8859-* or UTF-8 people mark
> their message as, as long as they stick to US-ASCII for anything they
> want anyone else to be able to read.

This is a different matter. The language of this newsgroup is English,
so people should stick to English if they want to be understood. If they
do, their text will also consist mostly of ASCII characters (unless they
use fancy dashes and quotes or something like that).

>> Of these, only charset declarations are (AFAIK) widely used on usenet:
>
> ...and those are, in practice, redundant.

No, they aren't. They are necessary for the correct interpretation of
any non-ASCII characters in the message. This is nicely illustrated by
Jürgen's name in one of the attribution lines which is garbled some
more with each exchange between us. Now in this case the message can
still be understood with the garbled name, but if we are discussing, for
example, some problem with Perl unicode handling, it becomes extremely
irritating and tedious when you can't be sure whether the characters you see on
your screen are the same as the ones the poster typed (or pasted).


>> Transfer encodings aren't needed because NNTP is in practice 8bit clean.
>
> Transfer encodings aren't needed because Usenet articles are in practice
> in US-ASCII, which doesn't need encoding;

It may have escaped your attention, but English is not the only language
used on Usenet, but it is the only language for which US-ASCII is
sufficient. I don't have any statistics at hand, but don't think I'm far
off when I estimate that about half of the (text) traffic on Usenet is
not in English, and therefore also not in US-ASCII.

>> The charset parameter is needed whenever a message contains non-ASCII
>> characters.

[...]


>> So a newsreader may not need to support all 5 MIME RFCs, but if it
>> doesn't at least support text/plain with the most frequent charsets
>> and RFC 2047, it's at least a bit of a pain.
>
> I completely agree with the sentiment here: the ability to write in
> other charsets would be very useful.

Not "would" - it *is* very useful. Indeed, for those of us who write in
other languages than English it is pretty much indispensable.

> However, my newsreader, although it makes some attempt to handle MIME,
> doesn't appear to handle them at all; while you could say 'that's your
> problem'

I do. You are using a newsreader which hasn't been maintained for 5
years.

> I would respectfully disagree until 1036 is updated.

I've almost given up hope on that. I don't recall when USEFOR was
established, but it must have been in the late 90's. So far they haven't
managed to produce a single RFC.

Dr.Ruud

Oct 12, 2006, 6:37:46 AM
Ben Morrow wrote:

> Transfer encodings aren't needed because Usenet articles are in
> practice in US-ASCII, which doesn't need encoding; also because
> newsreaders don't decode them.

In practice many articles include non-ASCII characters, more and more.
If an article is made of only ASCII, then the news client should refrain
from including MIME-headers, and the good news clients do it that way.


> the ability to write in
> other charsets would be very useful. However, my newsreader, although
> it makes some attempt to handle MIME, doesn't appear to handle them
> at all; while you could say 'that's your problem' I would
> respectfully disagree until 1036 is updated.

In practice it is updated by 'Son-of-RFC-1036' and even more by how the
majority of articles are structured. In practice, RFCs can't keep up,
but news clients can.

Ben Morrow

Oct 12, 2006, 7:29:24 PM
[attributions tidied slightly]

Quoth "Peter J. Holzer" <hjp-u...@hjp.at>:
> On 2006-10-11 23:03, Ben Morrow <benm...@tiscali.co.uk> wrote:
> > Quoth "Peter J. Holzer" <hjp-u...@hjp.at>:
> >> On 2006-10-11 03:03, Ben Morrow <benm...@tiscali.co.uk> wrote:
> >> > Quoth "Peter J. Holzer" <hjp-u...@hjp.at>:

> >> >> >> On 6 Oct 2006, benm...@tiscali.co.uk wrote:
> >> >> >>>
> >> >> >>> It's not a question of this group's charter, it applies
> >> >> >>> generally on Usenet. There is no header in a Usenet article
> >> >> >>> that specifies a charset, so no way to use anything other
> >> >> >>> than the default ASCII.
> >> >>
> >> >> Content-Type: text/plain; charset=...
> >> >>
> >> >> Introduced in RFC 1341 (June 1992). It is true that RFC 1036 was never
> >> >> updated, but MIME is current practice on usenet.
> >> >
> >> > Really??? I was under the impression it was considered rude.
> >>
> >> If it is, then most posters in this group are rude (well, some people
> >> think this is the case even without MIME :-)).
> >>
> >> I have currently 2805 messages from this group in my spool, 1954 of
> >> which (70%) are MIME messages. Of these, about 1400 use ISO-8859-1,
> >> about 400 use US-ASCII, about 70 UTF-8, and about 30 ISO-8859-15.
> >
> > OK, maybe I wasn't clear. What I meant was that actually making any
> > *use* of MIME (as opposed to simply being passively conformant with it)
> > such as multipart messages, signed messages;
>
> Properly declaring the used charset *is* a use of MIME. There is no way
> you can do that without using MIME.

Of course. My position *was* that since MIME is not used on Usenet,
there is no way to declare the charset; and hence one should not use
charsets other than that generally prevailing in the group (which, in
this group, is US-ASCII); *however*...

> > in fact any messages that are not 'text/plain; charset=us-ascii'
> > appears to me to be considered rude.
>
> Given that about 50% of the messages in this group fit that description
> and nobody except you is complaining, your perception appears to me to
> be slightly off.

...given this...

> > However, my newsreader, although it makes some attempt to handle MIME,
> > doesn't appear to handle them at all; while you could say 'that's your
> > problem'
>
> I do. You are using a newsreader which hasn't been maintained for 5
> years.
>
> > I would respectfully disagree until 1036 is updated.
>
> I've almost given up hope on that. I don't recall when USEFOR was
> established, but it must have been in the late 90's. So far they haven't
> managed to produce a single RFC.

...and these, it appears I am in the wrong. I apologise. It looks like I
may have to start looking for a new newsreader: damn :). I'd just about
weaned myself off rn and onto trn, and now it seems that isn't adequate
either. Ah well, such is life.

[quotes slightly reordered for clarity]

> It may have escaped your attention, but English is not the only language
> used on Usenet, but it is the only language for which US-ASCII is
> sufficient. I don't have any statistics at hand, but don't think I'm far
> off when I estimate that about half of the (text) traffic on Usenet is
> not in English, and therefore also not in US-ASCII.

I realise it may not have been entirely clear from my posts in this
thread, but I am very well aware that much of Usenet is not conducted in
English :). However, since I know nothing about how groups that don't
speak English manage charsets and the like (I was under the impression
that there was some kind of implicit agreement to use, e.g. ISO8859-1 or
KOI8-R or whatever within one particular group or hierarchy, since I
wasn't aware newsreaders routinely 'did' MIME; again, it seems I was
wrong and I can only plead ignorance and too much faith in the RFCs in
that regard) I did not consider myself in a position to comment on that
situation.

If anyone got the impression that I think English should be somehow a
privileged language on Usenet (except by the accidents of its origin,
which ought by now to have been erased) then I can only apologise
(again) for any offence caused and state for the record that that idea
is of course complete nonsense.

(Hmm, I seem to have dug myself into something of a hole here.
Whoops...)

Ben

--
You poor take courage, you rich take care:
The Earth was made a common treasury for everyone to share
All things in common, all people one. [benm...@tiscali.co.uk]
'We come in peace'---the order came to cut them down.

Ted Zlatanov

Oct 16, 2006, 12:18:38 PM

Transliterate, my boy, transliterate...

:)

Ted

Ted Zlatanov

Oct 16, 2006, 12:23:20 PM
So is there a consensus that MIME with charset=utf8 and a suitable
8-bit-safe content-transfer-encoding should be acceptable in
comp.lang.perl.misc? It's unlikely to do violent things to anyone's
news reader, and it would make it much easier to post Perl code that
uses UCS characters in source.

Ted

Dr.Ruud

Oct 16, 2006, 4:17:41 PM
Ted Zlatanov wrote:

Yes, let's just agree on that.

See also
news://nntp.perl.org/200610082351...@x12.develooper.com
and many others like that one. (No CTE though, 8-bit implied somehow.)

Ian Wilson

Oct 17, 2006, 6:31:19 AM
Ted Zlatanov wrote:
> So is there a consensus

How could you tell?

> that MIME with charset=utf8 and a suitable
> 8-bit-safe content-transfer-encoding should be acceptable in
> comp.lang.perl.misc?

It gets my vote. Now you just have to find out what the other 37,292
readers of this newsgroup think :-)

Bart Van der Donck

Oct 17, 2006, 7:19:05 AM
Ian Wilson wrote:

My vote too. I think it's open-minded towards the future. 37291 to go.

--
Bart

Ted Zlatanov

Oct 17, 2006, 12:49:14 PM
On 17 Oct 2006, scob...@infotop.co.uk wrote:

Ted Zlatanov wrote:
>> So is there a consensus
>
> How could you tell?

By the lack of dissent. Seriously, how else can you get 37K+
passive-aggressive people to agree? :)

>> that MIME with charset=utf8 and a suitable
>> 8-bit-safe content-transfer-encoding should be acceptable in
>> comp.lang.perl.misc?
>
> It gets my vote. Now you just have to find out what the other 37,292
> readers of this newsgroup think :-)

I think it's 2006, not 1996; let's just start a new thread to put it
in the posting guidelines and if people complain, we'll argue
viciously about it and then there will be a shootout in front of the
saloon.

Ted

Mumia W. (reading news)

Oct 17, 2006, 3:42:13 PM

Allowing UTF8 with the proper content-transfer encoding gets my vote too.

In fact, MIME on usenet makes sense--despite the fact that the word
"Mail" is in MIME :-)


--
paduille.4...@earthlink.net

Randal L. Schwartz

Oct 17, 2006, 5:00:34 PM
>>>>> "Mumia" == Mumia W (reading news) <paduille.4...@earthlink.net> writes:

Mumia> Allowing UTF8 with the proper content-transfer encoding gets my vote too.

An individual group cannot vote on this. This is a news.admin-level problem,
and the news-admins must be involved.

Otherwise, it's like a group of inmates "voting" to be allowed to run
outside the walls for 30 minutes a day. It's pointless.

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<mer...@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!

--
Posted via a free Usenet account from http://www.teranews.com

Ted Zlatanov

Oct 18, 2006, 11:09:04 AM
On 17 Oct 2006, mer...@stonehenge.com wrote:

>>>>>> "Mumia" == Mumia W (reading news) <paduille.4...@earthlink.net>
>>>>>> writes:
>
> Mumia> Allowing UTF8 with the proper content-transfer encoding gets my vote
> too.
>
> An individual group cannot vote on this. This is a news.admin-level problem,
> and the news-admins must be involved.

The posting guidelines are not maintained by news admins, as far as I
know. MIME is already allowed in this newsgroup (the postings that
come through every day are evidence of that). The request, as stated
in the subject, is to modify the posting guidelines, not anything
about the newsgroup at the server level.

What do the news admins have to do?

> Otherwise, it's like a group of inmates "voting" to be allowed to run
> outside the walls for 30 minutes a day. It's pointless.

Usenet is not a prison. comp.lang.perl.misc is not even moderated.
Can you explain your analogy a little bit?

Ted

Ben Morrow

Oct 18, 2006, 11:51:59 AM

Quoth Ted Zlatanov <t...@lifelogs.com>:

For my part, if I get a vote :), I'd say UTF8 with no CTE, since Usenet
is in practice 8-bit safe, Base64 is terribly wasteful and QP is
horribly messy to read if you don't grok MIME.

Ben

--
Although few may originate a policy, we are all able to judge it.
Pericles of Athens, c.430 B.C.
benm...@tiscali.co.uk

Ted Zlatanov

Oct 18, 2006, 3:29:44 PM
On 18 Oct 2006, benm...@tiscali.co.uk wrote:

> Quoth Ted Zlatanov <t...@lifelogs.com>:
>> So is there a consensus that MIME with charset=utf8 and a suitable
>> 8-bit-safe content-transfer-encoding should be acceptable in
>> comp.lang.perl.misc? It's unlikely to do violent things to anyone's
>> news reader, and it would make it much easier to post Perl code that
>> uses UCS characters in source.
>
> For my part, if I get a vote :), I'd say UTF8 with no CTE, since Usenet
> is in practice 8-bit safe, Base64 is terribly wasteful and QP is
> horribly messy to read if you don't grok MIME.

Agreed on the above in theory, but in practice it's much easier to get
a consensus with my parameters than yours.

Ted

Dr.Ruud

Oct 18, 2006, 3:27:56 PM
Ben Morrow wrote:

> I'd say UTF8 with no CTE

ITYM: UTF8, with a default CTE of 8-bit if absent.

But some news clients might assume 7-bit, if the CTE is absent.

Randal L. Schwartz

Oct 18, 2006, 3:50:54 PM
>>>>> "Ted" == Ted Zlatanov <t...@lifelogs.com> writes:

Ted> What do the news admins have to do?

Agree.

Ted> Usenet is not a prison. comp.lang.perl.misc is not even moderated.
Ted> Can you explain your analogy a little bit?

The analogy is that *this* is not *yours* and not *mine*. *This* belongs
to the people who run Usenet. Unless you're also in charge of a spool
somewhere, you are *not* an admin and have *no* say about the format
of the messages or even the charter. Even if you *are* in charge of
a spool, you don't get interchange rights unless you play along well
with the *collective* *anarchy* known as "Usenet".

So, that's how it applies. You can vote all you want, but it's ultimately up
to the people running the show, and that's NOT US in THIS GROUP. It's the
people in news.admin. As far as I remember last time I looked, MIME is *not*
acceptable, here or anywhere on Usenet except in groups that match
"*.binaries.*".

Just another Usenet old-timer (since 1980 when it first began),

Dr.Ruud

Oct 18, 2006, 5:20:50 PM
Randal L. Schwartz wrote:

> You can vote all you want, but it's
> ultimately up to the people running the show, and that's NOT US in
> THIS GROUP. It's the people in news.admin. As far as I remember
> last time I looked, MIME is *not* acceptable, here or anywhere on
> Usenet except in groups that match "*.binaries.*".

MIME has been widely used in text groups on Usenet for ages now, and
AFAICS it doesn't cause real problems. I have never noticed propagation
problems for such articles.

Without many complaints, it is safe to assume that using MIME (and
UTF-8) in this group is feasible.

The voting I parsed as little more than a channel for any complaints on
using (or even the promotion of using) MIME (and UTF-8) in clpm. So I
assume it is OK to promote it for textual data that contains non-ASCII
characters.

Ian Wilson

Oct 19, 2006, 6:16:13 AM
Randal L. Schwartz wrote:
> As far as I remember last time I looked, MIME is *not*
> acceptable,

In case you fall under a bus and are not available for resolving such
issues at some point in the future, could you tell us where you looked?


I've used Google to search for relevant stuff but come up blank apart
from some seemingly unrelated discussion of the marking of certain
groups as binary. AFAIK Base64 encoded content wouldn't be considered
binary.

I've skimmed the relevant RFCs and didn't notice anything that would
preclude MIME.

RFC 977 says "the article should be presented in the format specified by
RFC850" and "No attempt shall be made by the server to filter
characters, fold or limit lines, or otherwise process incoming text."

Which suggests that the message content can be in any format so long as
it meets the rule for message termination (a line containing only a
period, i.e. CRLF . CRLF).
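
As a rough illustration of that termination rule: RFC 977 sends text as CRLF-terminated lines, ends it with a line holding a single period, and doubles a period at the start of any text line so it cannot be mistaken for the terminator. The sketch below shows that framing. The helper name `nntp_wire_format` and the sample article are hypothetical, not part of any library.

```perl
use strict;
use warnings;

# Hypothetical helper: frame article text the way RFC 977 describes.
sub nntp_wire_format {
    my ($article) = @_;

    # Split into lines, keeping empty lines (limit -1).
    my @lines = split /\r?\n/, $article, -1;

    # A trailing newline leaves one empty field at the end; drop it.
    pop @lines if @lines && $lines[-1] eq '';

    # Dot-stuffing: a line that begins with "." gets a second "." prepended.
    s/^\./../ for @lines;

    # Every line ends in CRLF; a lone "." line terminates the text.
    return join('', map { "$_\r\n" } @lines) . ".\r\n";
}

my $wire = nntp_wire_format("Subject: test\n\nfirst line\n.hidden dot\n");

# Make the line endings visible when printing.
(my $visible = $wire) =~ s/\r\n/<CRLF>\n/g;
print $visible;
```

Nothing in this framing looks inside the lines themselves, which fits the reading that the server does not care whether the body is MIME.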

RFC 850 says "all USENET news articles must be formatted as valid
ARPANET mail messages, according to the ARPANET standard RFC 822."

Which suggests to me that the intent was that rules for NNTP message
content should follow the rules for SMTP message content. Current SMTP
standards for content (such as MIME) ought to be equally acceptable, in
principle, in NNTP.

Wikipedia says "With the header extensions and the Base64 and
Quoted-Printable MIME encodings, there was a new generation of binary
transport. In practice, MIME has seen increased adoption in text messages"

There is no mention of propagation problems with MIME text messages.

I can't find anything relevant in recent news.admin.{technical|misc} and
the only FAQ I can find relates to net.abuse (in which I didn't see any
mention of MIME).

Ted Zlatanov

Oct 19, 2006, 11:52:05 AM
On 18 Oct 2006, mer...@stonehenge.com wrote:

>>>>>> "Ted" == Ted Zlatanov <t...@lifelogs.com> writes:
>
> Ted> Usenet is not a prison. comp.lang.perl.misc is not even moderated.
> Ted> Can you explain your analogy a little bit?
>
> The analogy is that *this* is not *yours* and not *mine*. *This* belongs
> to the people who run Usenet. Unless you're also in charge of a spool
> somewhere, you are *not* an admin and have *no* say about the format
> of the messages or even the charter. Even if you *are* in charge of
> a spool, you don't get interchange rights unless you play along well
> with the *collective* *anarchy* known as "Usenet".

I understand all that, but we are not "inmates." Neither meaning of the
word in this context (lunatics or prisoners) sits well with the
"collective anarchy" image you mention, and I dislike the implications
of those meanings. Usenet is built on cooperation, not threats and
force.

> So, that's how it applies. You can vote all you want, but it's ultimately up
> to the people running the show, and that's NOT US in THIS GROUP. It's the
> people in news.admin. As far as I remember last time I looked, MIME is *not*
> acceptable, here or anywhere on Usenet except in groups that match
> "*.binaries.*".

MIME is de facto acceptable in comp.lang.perl.misc. MIME posts go
through every day, and I've never heard of a single one lost. Your
post, the one I'm quoting, is a MIME post:

MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
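
A minimal sketch of how one might check an article for those headers programmatically. The helper name `mime_info` and the sample article are made up for illustration, and the parsing is deliberately simplistic (it only unfolds continuation lines and looks for a charset parameter).

```perl
use strict;
use warnings;

# Hypothetical helper: report whether an article declares itself as MIME.
sub mime_info {
    my ($article) = @_;

    # Headers end at the first blank line.
    my ($head) = split /\r?\n\r?\n/, $article, 2;
    $head = '' unless defined $head;

    # Unfold continuation lines (a newline followed by whitespace).
    $head =~ s/\r?\n[ \t]+/ /g;

    # Collect headers by lower-cased name.
    my %h;
    for my $line (split /\r?\n/, $head) {
        my ($name, $value) = $line =~ /^([^:\s]+):\s*(.*)$/ or next;
        $h{lc $name} = $value;
    }

    # Pull the charset parameter out of Content-Type, if there is one.
    my $ctype = defined $h{'content-type'} ? $h{'content-type'} : '';
    my ($charset) = $ctype =~ /charset="?([^";\s]+)/i;

    return {
        is_mime => exists $h{'mime-version'} ? 1 : 0,
        charset => $charset,
    };
}

my $article = "Subject: example\n"
            . "MIME-Version: 1.0\n"
            . "Content-Type: text/plain; charset=us-ascii\n"
            . "\n"
            . "body text\n";

my $info = mime_info($article);
printf "MIME: %s, charset: %s\n",
    $info->{is_mime} ? 'yes' : 'no',
    defined $info->{charset} ? $info->{charset} : '(none)';
```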

If you have proof of your statement that MIME is not acceptable
here--Usenet Cabal meeting minutes, dropped posts, news admin
opinions, whatever, please cite them. I would really appreciate it.
Also note Ian Wilson's note--I did a similar search and came up blank.

Ted

Peter J. Holzer

Oct 19, 2006, 3:48:17 PM
On 2006-10-18 19:50, Randal L. Schwartz <mer...@stonehenge.com> wrote:
>>>>>> "Ted" == Ted Zlatanov <t...@lifelogs.com> writes:
>Ted> Usenet is not a prison. comp.lang.perl.misc is not even moderated.
>Ted> Can you explain your analogy a little bit?
>
> The analogy is that *this* is not *yours* and not *mine*. *This* belongs
> to the people who run Usenet. Unless you're also in charge of a spool
> somewhere, you are *not* an admin and have *no* say about the format
> of the messages or even the charter.

I am an admin, but I don't have any more say about the format of messages
or the charter than anybody else.

From my point of view as an admin, details of the message content are
absolutely immaterial. Except for a very few headers (Newsgroups, Date,
Message-Id, Path, Control, ...) a message is just a blob of octets
which the server will receive, pass on to its neighbours and send to its
clients until it is expired.

I do care about the total traffic - a group with too many or too large
messages may get deleted locally because it eats valuable resources
which could be used by other groups.

As a reader and poster in this group I do care about the readability of
articles. That's where MIME comes in - and for me it's roughly
comparable to line lengths, proper English grammar, proper quoting,
using "strict" and not flaming - write your articles in such a
fashion that they are acceptable to the readers.


> Even if you *are* in charge of a spool, you don't get interchange
> rights unless you play along well with the *collective* *anarchy*
> known as "Usenet".

I'm pretty sure that any news-admin who blocked MIME and/or 8bit content
would be considered as *not* playing along well on this side of the
Atlantic.


> So, that's how it applies. You can vote all you want, but it's ultimately up
> to the people running the show, and that's NOT US in THIS GROUP. It's the
> people in news.admin. As far as I remember last time I looked, MIME is *not*
> acceptable, here or anywhere on Usenet except in groups that match
> "*.binaries.*".

You probably looked last sometime around 1990. But MIME wasn't even
invented then.

To add a bit of historical perspective, even Henry Spencer's
"son-of-1036", which was last updated in 1994, strongly recommends the
use of MIME.

It forbids the use of 8bit and binary encodings except in "cooperating
subnets", but even at that time the whole usenet (except for a few
EBCDIC hosts) could be considered a cooperating subnet in that aspect:
B-News was dead and C-News and INN didn't have any problems with 8bit
characters; 8bit characters were common in non-English hierarchies
before MIME had reasonably widespread support.

Son-of-1036 never became an RFC. But among news admins it was long
considered to be the de-facto standard and preferred over 1036. By now
son-of-1036 also seems to be considered obsolete by many: Usenet has
evolved, but no formal successor to RFC 1036 was ever published, so by
now "what other netizens accept" is more important than what the RFC
says.

> Just another Usenet old-timer (since 1980 when it first began),

I can't compete with that. I only started using usenet in 1988.
