Interpreting MS-ASCII - anyone have a filter?

Kenny McCormack

Jun 20, 1999, 3:00:00 AM

In a recent religious war^W^Wthread about AWK vs. Perl, the subject of
the dreaded MS ASCII came up. Now, since a religious war would not be
complete w/o some MS bashing (probably the one thing upon which
AWKers and Perlies can agree), I thought I'd continue the thread here.

Seriously, we all know that newbies frequently post text documents to
the net which have been prepared with MS-Word and have funny "High ASCII"
characters in them. I have, over the years, developed a filter (in GAWK)
to deal with this, by translating the two characters I know about - the
MS ASCII characters for the single and double quote marks. Then, I
simply delete all the other high ASCII chars.

I was wondering if anyone had actually done a full-implementation of
this idea - that is, one that is aware of all the MS funny chars, and
deals with each appropriately. I'm open to either an AWK or PERL (or,
lex, for that matter) solution...

Abigail

Jun 20, 1999, 3:00:00 AM

Kenny McCormack (gaz...@yin.interaccess.com) wrote on MMCXIX September
MCMXCIII in <URL:news:7kj81j$83q$1...@yin.interaccess.com>:
,,
,, I was wondering if anyone had actually done a full-implementation of
,, this idea - that is, one that is aware of all the MS funny chars, and
,, deals with each appropriately. I'm open to either an AWK or PERL (or,
,, lex, for that matter) solution...

That would be Perl (see perlfaq1) and Awk (that's how Kernighan
seems to spell it, and since he's the k in Awk...).

In Perl, I would do:

#!/opt/perl/bin/perl -pw

BEGIN {
    # A hash which contains the letters to be translated.
    # I'm just making up some values, as I'm not familiar with
    # the MS fonts.
    %t = ("\x91" => "`",
          "\x96" => "'",
          "\x9F" => "[TM]");
}

s/([\x80-\x9F])/$t{$1} || ""/eg;  # Replace all characters in the range
                                  # 0x80 - 0x9F with their translation,
                                  # squish it if there's no translation.

__END__

Note that this wouldn't work if there's a char you would want to
replace with 0.

Abigail
--
sub f{sprintf'%c%s',$_[0],$_[1]}print f(74,f(117,f(115,f(116,f(32,f(97,
f(110,f(111,f(116,f(104,f(0x65,f(114,f(32,f(80,f(101,f(114,f(0x6c,f(32,
f(0x48,f(97,f(99,f(107,f(101,f(114,f(10,q ff)))))))))))))))))))))))))


Larry Rosler

Jun 20, 1999, 3:00:00 AM

In article <slrn7mqgou....@alexandra.delanet.com> on 20 Jun 1999
14:38:11 -0500, Abigail <abi...@delanet.com> says...
...

> s/([\x80-\x9F])/$t{$1} || ""/eg; # Replace all characters in the range
> # 0x80 - 0x9F with their translation,
> # squish it if there's no translation.

...

> Note that this wouldn't work if there's a char you would want to
> replace with 0.

s/([\x80-\x9F])/defined $t{$1} && $t{$1}/eg;

Maybe someone should name this idiom after me. No one else seems to use
it! :-)

--
(Just Another Larry) Rosler
Hewlett-Packard Company
http://www.hpl.hp.com/personal/Larry_Rosler/
l...@hpl.hp.com

Uri Guttman

Jun 20, 1999, 3:00:00 AM

>>>>> "LR" == Larry Rosler <l...@hpl.hp.com> writes:

LR> In article <slrn7mqgou....@alexandra.delanet.com> on 20 Jun 1999
LR> 14:38:11 -0500, Abigail <abi...@delanet.com> says...
LR> ...

>> s/([\x80-\x9F])/$t{$1} || ""/eg; # Replace all characters in the range
>> # 0x80 - 0x9F with their translation,
>> # squish it if there's no translation.

LR> ...

>> Note that this wouldn't work if there's a char you would want to
>> replace with 0.

LR> s/([\x80-\x9F])/defined $t{$1} && $t{$1}/eg;

i think s/defined/exists/ would look better. if someone mapped a hex
code to undef yours would fail but that is a stupid thing to do.

LR> Maybe someone should name this idiom after me. No one else seems
LR> to use it! :-)

i dub this the rosler substitution!

(but only if it uses exists)

uri

--
Uri Guttman ----------------- SYStems ARCHitecture and Software Engineering
u...@sysarch.com --------------------------- Perl, Internet, UNIX Consulting
Have Perl, Will Travel ----------------------------- http://www.sysarch.com
The Best Search Engine on the Net ------------- http://www.northernlight.com

Larry Rosler

Jun 20, 1999, 3:00:00 AM

In article <x7ogia8...@home.sysarch.com> on 20 Jun 1999 22:41:44 -0400,
Uri Guttman <u...@sysarch.com> says...

> >>>>> "LR" == Larry Rosler <l...@hpl.hp.com> writes:

...

> LR> s/([\x80-\x9F])/defined $t{$1} && $t{$1}/eg;
>
> i think s/defined/exists/ would look better. if someone mapped a hex
> code to undef yours would fail but that is a stupid thing to do.
>
> LR> Maybe someone should name this idiom after me. No one else seems
> LR> to use it! :-)
>
> i dub this the rosler substitution!

That's a lot better than the RoslerIAN Substitution :-|

> (but only if it uses exists)

Sure. 'exists' is one character shorter than 'defined'.

On a slightly more serious note:

That construction is analogous in some way to this one:

my $x = $y || $z; # Use TRUE value or default value.

But it is harder to describe:

my $x = $y && $z; # Replace TRUE value by specified value.

So I prefer the Rosler Replacement -- which Rocks!

Abigail

Jun 20, 1999, 3:00:00 AM

Larry Rosler (l...@hpl.hp.com) wrote on MMCXX September MCMXCIII in
<URL:news:MPG.11d73890a...@nntp.hpl.hp.com>:

"" In article <slrn7mqgou....@alexandra.delanet.com> on 20 Jun 1999

"" 14:38:11 -0500, Abigail <abi...@delanet.com> says...

"" ...
"" > s/([\x80-\x9F])/$t{$1} || ""/eg; # Replace all characters in the range
"" > # 0x80 - 0x9F with their translation,
"" > # squish it if there's no translation.

"" ...
"" > Note that this wouldn't work if there's a char you would want to
"" > replace with 0.
""

"" s/([\x80-\x9F])/defined $t{$1} && $t{$1}/eg;
""

"" Maybe someone should name this idiom after me. No one else seems to use
"" it! :-)

s/([\x80-\x9F])/exits $t{$1} && $t{$1}/eg;

would be more efficient.

Abigail
--
%0=map{reverse+chop,$_}ABC,ACB,BAC,BCA,CAB,CBA;$_=shift().AC;1while+s/(\d+)((.)
(.))/($0=$1-1)?"$0$3$0{$2}1$2$0$0{$2}$4":"$3 => $4\n"/xeg;print#Towers of Hanoi

Jim Monty

Jun 21, 1999, 3:00:00 AM

Uri Guttman <u...@sysarch.com> wrote:
> Larry Rosler <l...@hpl.hp.com> wrote:

> > Abigail <abi...@delanet.com> wrote:
> > > s/([\x80-\x9F])/$t{$1} || ""/eg; # Replace all characters in the range
> > > # 0x80 - 0x9F with their translation,
> > > # squish it if there's no translation.

> > > Note that this wouldn't work if there's a char you would want to
> > > replace with 0.
> >
> > s/([\x80-\x9F])/defined $t{$1} && $t{$1}/eg;
>

> i think s/defined/exists/ would look better. if someone mapped a hex
> code to undef yours would fail but that is a stupid thing to do.
>

> > Maybe someone should name this idiom after me. No one else seems
> > to use it! :-)
>

> i dub this the rosler substitution!
>

> (but only if it uses exists)

I continue to struggle to learn Perl and its countless popular
"idioms." What, pray tell, is wrong with this?

s/([\x80-\x9F])/exists $t{$1} ? $t{$1} : ''/eg;

I _do_, at least, know what's wrong with this--thanks to MRE:

s/[\x80-\x9F]/exists $t{$&} ? $t{$&} : ''/eg;

--
Jim Monty
mo...@primenet.com
Tempe, Arizona USA

Larry Rosler

Jun 21, 1999, 3:00:00 AM

In article <7kkk61$1cv$1...@nnrp02.primenet.com> on 21 Jun 1999 05:56:17
GMT, Jim Monty <mo...@primenet.com> says...
...

> I continue to struggle to learn Perl and its countless popular
> "idioms." What, pray tell, is wrong with this?
>
> s/([\x80-\x9F])/exists $t{$1} ? $t{$1} : ''/eg;

s/([\x80-\x9F])/exists $t{$1} && $t{$1}/eg;

is four characters shorter. Other than that, they are functionally
identical. Which (for me) makes the choice between them easy. :-)

As a small matter of style, I prefer "" to '' because the four hen-
scratches can't be mistaken for one double-quote, as the two hen-
scratches might be. No one would be confused by the secondary
implication of interpolation between double-quotes.
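
For anyone comparing the variants traded in this subthread, here is a minimal sketch that runs all four against the same input. The table %t is hypothetical (its values were picked only so that one replacement is the string "0"), and the sub names are made up for illustration; none of this comes from the posters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical table: values chosen to expose the "0" edge case,
# not to reflect the real Windows character assignments.
my %t = (
    "\x91" => "`",
    "\x92" => "0",
);

# One sub per variant from the thread; each works on a copy of its argument.
sub with_or {
    my $s = shift;
    $s =~ s/([\x80-\x9F])/$t{$1} || ""/eg;            # false values are lost
    return $s;
}

sub with_defined {
    my $s = shift;
    $s =~ s/([\x80-\x9F])/defined $t{$1} && $t{$1}/eg;
    return $s;
}

sub with_exists {
    my $s = shift;
    $s =~ s/([\x80-\x9F])/exists $t{$1} && $t{$1}/eg;
    return $s;
}

sub with_ternary {
    my $s = shift;
    $s =~ s/([\x80-\x9F])/exists $t{$1} ? $t{$1} : ''/eg;
    return $s;
}

# \x91 and \x92 are in the table, \x94 is not.
my $in = "a\x91b\x92c\x94d";
printf "%-8s %s\n", 'or',      with_or($in);       # a`bcd
printf "%-8s %s\n", 'defined', with_defined($in);  # a`b0cd
printf "%-8s %s\n", 'exists',  with_exists($in);   # a`b0cd
printf "%-8s %s\n", 'ternary', with_ternary($in);  # a`b0cd
```

Only the || form drops the "0". The other three agree on this input because a failed defined or exists test yields Perl's false value, which is the empty string in string context.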

Matt Sergeant

Jun 21, 1999, 3:00:00 AM

Abigail wrote:
>
> s/([\x80-\x9F])/exits $t{$1} && $t{$1}/eg;
>
> would be more efficient.

Especially if accompanied by:

sub exits {
    exit(0);
}

;-)

Matt.

Kenny McCormack

Jun 21, 1999, 3:00:00 AM

In article <MPG.11d73890a...@nntp.hpl.hp.com>,

Larry Rosler <l...@hpl.hp.com> wrote:
>In article <slrn7mqgou....@alexandra.delanet.com> on 20 Jun 1999
>14:38:11 -0500, Abigail <abi...@delanet.com> says...
>...

>> s/([\x80-\x9F])/$t{$1} || ""/eg; # Replace all characters in the range
>> # 0x80 - 0x9F with their translation,
>> # squish it if there's no translation.

Maybe I missed a post or two (I am reading this in comp.lang.awk -
maybe Abigail posted only to comp.lang.perl.misc?), but the point is
not "How to write a program to do the translations". Anybody can
write such a program; it is trivial.

The real point of my post is: What *are* the translations?
There is probably a document at microsoft.com that explains their
weird creation and tells what each funny character means. I was
hoping that someone somewhere had already done that work.

Tad McClellan

Jun 21, 1999, 3:00:00 AM

Larry Rosler (l...@hpl.hp.com) wrote:
: In article <slrn7mqgou....@alexandra.delanet.com> on 20 Jun 1999
: 14:38:11 -0500, Abigail <abi...@delanet.com> says...

: ....
: > s/([\x80-\x9F])/$t{$1} || ""/eg; # Replace all characters in the range

: > # 0x80 - 0x9F with their translation,
: > # squish it if there's no translation.

: ....
: > Note that this wouldn't work if there's a char you would want to
: > replace with 0.

: s/([\x80-\x9F])/defined $t{$1} && $t{$1}/eg;

: Maybe someone should name this idiom after me.

You're right.

It is hereby dubbed Just Another Transform.

:-)

--
Tad McClellan SGML Consulting
ta...@metronet.com Perl programming
Fort Worth, Texas

Alan J. Flavell

Jun 21, 1999, 3:00:00 AM

On 21 Jun 1999, Kenny McCormack wrote:

> The real point of my post is: What *are* the translations?

Is _that_ all you wanted? Then visit
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/
in particular
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/
for the basic data that you need. The files that you find there are
machine parseable, so you can pop them into your code without risking
typos. No doubt you'll already realise that the Western coding is in
CP1252.TXT

Or use one of the excellent Perl modules that you'll find at CPAN
(no, I haven't checked specifically, but I'm confident you'll find
something good. Just check that the "euro" has been included, as it's
a relatively recent update.)

I suppose I should register a routine protest at your use of the term
"MS-ASCII". ASCII is a 7-bit code (what was called "us-ascii" back in
the days when people used national variant 7-bit codes). MS have a
tendency to refer to their 8-bit codings as "ANSI", but I've never found
anything from the ANSI that justifies this usage, either. Safest to
call them "MS Windows" codes, IMHO. (Standard _DOS_ 8-bit codings are
different, OK? CP850 for Western European locale).
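
As a rough illustration of "machine parseable", the sketch below reads lines in the layout those mapping files are assumed to use here: a hex byte, a hex Unicode code point, then a "#" comment, separated by whitespace. The sub name parse_mapping and the three sample lines are made up for the example (typed in by hand, not copied from CP1252.TXT), so check the layout against the real file before relying on it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn mapping lines into a hash of byte value => Unicode code point.
# Assumed layout per line: "0xNN<whitespace>0xNNNN<whitespace>#NAME".
sub parse_mapping {
    my @lines = @_;
    my %map;
    for my $line (@lines) {
        next if $line =~ /^\s*(?:#|$)/;     # comment-only or blank line
        # Lines with no code point in the second column do not match
        # and are skipped, so undefined bytes never enter the hash.
        next unless $line =~ /^0x([0-9A-Fa-f]{2})\s+0x([0-9A-Fa-f]{4})\b/;
        $map{ hex $1 } = hex $2;
    }
    return %map;
}

# Hand-typed sample lines, for illustration only.
my %cp1252 = parse_mapping(
    "# sample data",
    "0x80\t0x20AC\t#EURO SIGN",
    "0x81\t      \t#UNDEFINED",
    "0x92\t0x2019\t#RIGHT SINGLE QUOTATION MARK",
);

printf "0x%02X -> U+%04X\n", $_, $cp1252{$_}
    for sort { $a <=> $b } keys %cp1252;
```

In real use the lines would come from the downloaded file, for example by reading it with open and passing the lines to the same sub.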

Kenny McCormack

Jun 21, 1999, 3:00:00 AM

In article <Pine.HPP.3.95a.99062...@hpplus03.cern.ch>,

Alan J. Flavell <fla...@mail.cern.ch> wrote:
>On 21 Jun 1999, Kenny McCormack wrote:
>
>> The real point of my post is: What *are* the translations?
>
>Is _that_ all you wanted? Then visit

Yes, thank you very much for the URLs.

I know it may be hard to believe, but I'm actually one of those old
line programmers who can whip up code w/no problems, but still finds
Web searching and Yahoo'ing a bit confusing.

A pointer to the specific CPAN node where the program can be found
would also be helpful.

Bart Lateur

Jun 21, 1999, 3:00:00 AM

Kenny McCormack wrote:

> I have, over the years, developed a filter (in GAWK)
>to deal with this, by translating the two characters I know about - the
>MS ASCII characters for the single and double quote marks. Then, I
>simply delete all the other high ASCII chars.

You never received any mail in French or German, did you?

Only characters that are in use on Windows, but don't mean anything in
Iso-Latin-1, need conversion. That is the range 128 to 159. I'm not sure
about 255. Let's see...

code  description               replacement
130   lower single quote        ,
131   italic "f"                f
132   lower double quote        "
133   3 dots ("ellipsis")       ...
134   "dagger" (cross)          +
135   "double dagger"           er.. anybody ever used this?
136   roof shaped accent        ^
137   promille                  ????
138   large "S" with accent     S
154   small "s" with accent     s
139   left angular bracket      <
155   right angular bracket     >
140   large "OE" ligature       OE
156   small "oe" ligature       oe
145   single quote left         '
146   single quote right        '
147   double quote left         "
148   double quote right        "
149   big centered dot bullet   *
150   narrow hyphen, n-dash     -
151   wider hyphen, m-dash      --
152   tilde                     ~
153   TM                        TM
159   large double dotted "Y"   Y
255   small double dotted "y"   y

Can you build a filter using that? I'd think so, except for the
"promille" sign. That needs a better replacement.

Bart.
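
A filter built from the table above might look like the following sketch. The hash simply transcribes the rows that have a replacement. Codes 135 and 137 are passed through untouched because the table offers none, and everything from 160 upward (including 255) is left alone, in line with the point about French and German mail. The names %repl and to_latin1 are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decimal Windows code => ASCII stand-in, transcribed from the table.
my %repl = (
    130 => ',',  131 => 'f',  132 => '"',  133 => '...',
    134 => '+',  136 => '^',  138 => 'S',  154 => 's',
    139 => '<',  155 => '>',  140 => 'OE', 156 => 'oe',
    145 => "'",  146 => "'",  147 => '"',  148 => '"',
    149 => '*',  150 => '-',  151 => '--', 152 => '~',
    153 => 'TM', 159 => 'Y',
);

# Replace only bytes 128..159; leave a byte as it is when the table
# has no entry for it, rather than deleting it silently.
sub to_latin1 {
    my $s = shift;
    $s =~ s/([\x80-\x9F])/exists $repl{ord $1} ? $repl{ord $1} : $1/eg;
    return $s;
}

# 147/148 are the curly double quotes, 133 the ellipsis.
print to_latin1("\x93quoted\x94 text\x85"), "\n";   # "quoted" text...
```

To use it as a stream filter, one option is to call to_latin1 on each line read from standard input and print the result.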

Abigail

Jun 21, 1999, 3:00:00 AM

Kenny McCormack (gaz...@yin.interaccess.com) wrote on MMCXX September
MCMXCIII in <URL:news:7klekd$i1e$1...@yin.interaccess.com>:
**
** Maybe I missed a post or two (I am reading this in comp.lang.awk -
** maybe Abigail posted only to comp.lang.perl.misc?), but the point is
** not "How to write a program to do the translations". Anybody can
** write such a program; it is trivial.
**
** The real point of my post is: What *are* the translations?

That would be off-topic for both comp.lang.perl.misc and comp.lang.awk.

Perhaps you want a microsoft.* group.

Abigail
--
package Just_another_Perl_Hacker; sub print {($_=$_[0])=~ s/_/ /g;
print } sub __PACKAGE__ { &
print ( __PACKAGE__)} &
__PACKAGE__
( )

Bart Lateur

Jun 21, 1999, 3:00:00 AM

Alan J. Flavell wrote:

>MS have a
>tendency to refer to their 8-bit codings as "ANSI", but I've never found
>anything from the ANSI that justifies this usage, either.

But yes. Windows uses a superset of ISO-Latin-1, AKA "Ansi", which is
the same as most Unices use. That is in contrast with plain DOS, which
uses a different character mapping for the upper character code half
altogether (AKA "OEM", Original Equipment Manufacturer).

Bart.

David Cassell

Jun 21, 1999, 3:00:00 AM

Bart Lateur wrote:
> [snip]

> 134 "dagger" (cross) +
> 135 "double dagger" er.. anybody ever used this?

Footnotes and endnotes. For those people who can't understand
[1] and [2].

David

[1] Ezra Pound, "The Cantos", pp. 1078-1079.
[2] Ibid.
--
David Cassell, OAO cas...@mail.cor.epa.gov
Senior computing specialist
mathematical statistician

Alan J. Flavell

Jun 21, 1999, 3:00:00 AM

On Mon, 21 Jun 1999, Bart Lateur wrote:

> >MS have a
> >tendency to refer to their 8-bit codings as "ANSI", but I've never found
> >anything from the ANSI that justifies this usage, either.
>
> But yes. Windows uses a superset of ISO-Latin-1, AKA "Ansi",

Sure, we all know that...

> which is the same as most Unices use.

I don't agree. The unix variants that I've ever used were based on
either DEC Multinational or iso-8859-1. None of them had displayable
characters in the 128-159 range.

But this isn't the point...

Which published standard have you found from the American National
Standards Institute which lays down these codes, please? Without one
of those, I don't see any authority for applying their name. I hunted
around the ANSI's web site but failed to find anything relevant. It
looks as if they backed-out of publishing USA national standards for
character codings after US-ASCII (7-bit code) had finally been settled;
one might surmise they decided that international data transfers were
too important to have a specifically USA standard. If someone can
throw more light on the history I'd be interested, but a Perl language
issue it ain't, I must confess.

all the best

Jonathan Stowe

Jun 21, 1999, 3:00:00 AM

In comp.lang.perl.misc Alan J. Flavell <fla...@mail.cern.ch> wrote:
> Just check that the "euro" has been included,

I'm not sure if that's such a worry - I still haven't figured out how to bring
forth the character from the newfangled euro enhanced keyboard at work -
but hey it's an opportunity for a new special variable in Perl ;-}

/J\
--
Jonathan Stowe <j...@gellyfish.com>
Some of your questions answered:
<URL:http://www.btinternet.com/~gellyfish/resources/wwwfaq.htm>
Hastings: <URL:http://www.newhoo.com/Regional/UK/England/East_Sussex/Hastings>

Bart Lateur

Jun 21, 1999, 3:00:00 AM

Alan J. Flavell wrote:

>Which published standard have you found from the American National
>Standards Institute which lays down these codes, please? Without one
>of those, I don't see any authority for applying their name.

I see. Actually, it's an ISO standard (International Organization for
Standardization), of which Ansi (American National Standards Institute)
is a member. Americans may find this offensive, but that means that
such an ISO standard carries even more weight than "just" an Ansi
standard. (<http://www.iso.ch/infoe/intro.htm>)

Bart.

Alan J. Flavell

Jun 21, 1999, 3:00:00 AM

On Mon, 21 Jun 1999, Bart Lateur wrote:

> >Which published standard have you found from the American National
> >Standards Institute which lays down these codes, please? Without one
> >of those, I don't see any authority for applying their name.
>
> I see. Actually, it's an ISO standard (International Organization for
> Standardization),

What is? MS-Windows Western coding, CP1252? It's the first I ever
heard it suggested that it was an ISO standard. You sure you're not
still confusing it with iso-8859-1?

Abigail

Jun 21, 1999, 3:00:00 AM

Bart Lateur (bart....@skynet.be) wrote on MMCXX September MCMXCIII in
<URL:news:376f8376...@news.skynet.be>:
() Alan J. Flavell wrote:
()
() >MS have a
() >tendency to refer to their 8-bit codings as "ANSI", but I've never found
() >anything from the ANSI that justifies this usage, either.
()
() But yes. Windows uses a superset of ISO-Latin-1, AKA "Ansi", which is
() the same as most Unices use.

Which Unix uses the same set as Windows? All Unices I know use ISO-Latin-x
(or Unicode) for some x.

Abigail
--
srand 123456;$-=rand$_--=>@[[$-,$_]=@[[$_,$-]for(reverse+1..(@[=split
//=>"IGrACVGQ\x02GJCWVhP\x02PL\x02jNMP"));print+(map{$_^q^"^}@[),"\n"

Bart Lateur

Jun 22, 1999, 3:00:00 AM

Abigail wrote:

>() But yes. Windows uses a superset of ISO-Latin-1, AKA "Ansi", which is
>() the same as most Unices use.
>
>Which Unix uses the same set as Windows? All Unices I know use ISO-Latin-x
>(or Unicode) for some x.

I didn't say that. I said that Windows uses a superset of the standard
that many Unices use. And yes, it was chosen for compatibility. Hence the
break with the DOS backward compatibility. If you ignore the extra
defined characters in SOME FONTS (not all!) in Windows, it IS
ISO-Latin-1.

Bart.

Alan J. Flavell

Jun 22, 1999, 3:00:00 AM

On Tue, 22 Jun 1999, Bart Lateur wrote:

> I didn't say that. I said that Windows uses a superset of the standard
> that many Unices use.

This thread now seems to have gone totally adrift.

My original statement was:

MS have a tendency to refer to their 8-bit codings as "ANSI", but I've
never found anything from the ANSI that justifies this usage,

You gave an impression of wanting to argue with that, but now that
you've clarified several misinterpretations of what you were claiming,
it appears that you haven't addressed that issue at all, but have merely
reiterated what everyone knows, that MS-Windows code is similar to
iso-8859-1 but differs in as much as it assigns printable characters in
the range 128-159 decimal. And that some people wave the term "Ansi"
or "ANSI" around loosely to refer to some 8-bit code or other,
regardless of the fact that nobody can point to an ANSI specification
that justifies this usage.

If we look at the IANA character set registrations, e.g at
http://www.isi.edu/in-notes/iana/assignments/character-sets , the term
"ANSI" appears just three times: ANSI_X3.4-1968 and ANSI_X3.4-1986,
which are the 7-bit code generally known as US-ASCII (ISO-IR-6), and
ANSI_X3.110 1983, which is an 8-bit videotext standard code also known
as ISO-IR-99, and documented for example here:
ftp://dkuug.dk/i18n/WG15-collection/charmaps/ANSI_X3.110-1983 , which
definitely isn't iso-8859-1.

Instead, it seems you are now claiming that the term "Ansi" is an alias
of iso-8859-1, which is an entirely new claim to me, and one that I
don't want to get involved in.

So we see the importance of precision when talking about character
codings. Any other way lies madness.

'bye

T.E.Dickey

Jun 22, 1999, 3:00:00 AM

In comp.lang.awk Bart Lateur <bart....@skynet.be> wrote:
> Alan J. Flavell wrote:

>>MS have a
>>tendency to refer to their 8-bit codings as "ANSI", but I've never found

>>anything from the ANSI that justifies this usage, either.

> But yes. Windows uses a superset of ISO-Latin-1, AKA "Ansi", which is

> the same as most Unices use. That is in contrast with plain DOS, which
> uses a different character mapping for the upper character code half
> altogether (AKA "OEM", Original Equipment Manufacturer).

It's not a superset (because it alters the meanings of some defined codes
in the C1 range - ISO 6429), but a "normally" benign mutation.

--
Thomas E. Dickey
dic...@clark.net
http://www.clark.net/pub/dickey

Tom Christiansen

Jun 22, 1999, 3:00:00 AM

[courtesy cc of this posting mailed to cited author]

In comp.lang.perl.misc,
gaz...@interaccess.com writes:
:I was wondering if anyone had actually done a full-implementation of
:this idea - that is, one that is aware of all the MS funny chars, and
:deals with each appropriately. I'm open to either an AWK or PERL (or,
:lex, for that matter) solution...

http://language.perl.com/misc/demoroniser.html

#!/bin/perl -0777pw

# De-moron-ise Illegal Text and HTML from Microsoft Applications
#
# by John Walker -- January 1998
# http://www.fourmilab.ch/
# revised by Larry Rosler -- May 1999
# http://www.hpl.hp.com/personal/Larry_Rosler/
# sed-ified by Tom Christiansen -- May 1999
# tch...@perl.com
# See also
# http://language.perl.com/misc/ms-ascii.html
#
# This program is in the public domain.

use strict;
my %tr;

# Eliminate idiot MS-DOS carriage returns from line terminator.
# See also http://language.perl.com/ppt/ for other cpm2linux
# conversion tools.
s/[\015\012]+/\n/g;

# Supply missing semicolon at end of numeric entity if
# Billy's bozos left it out.
# Map characters in entities.
s/&#(\d+);?/
$tr{chr $1} || ($1 < 0x80 || $1 > 0x9F ? "&#$1;" : chr $1)/ge;

# Now check for any remaining untranslated characters.
my $iline = 1;
$1 eq "\n" ? ++$iline : printf STDERR
"$0: warning -- untranslated character 0x%.2X in line $iline of `%s'\n",
ord $1, ($ARGV eq '-' ? 'standard input' : $ARGV)
while /([\x00-\x08\n\x10-\x1F\x80-\x9F])/g;

# Fix unquoted non-alphanumeric characters in table tags.
s@(<T(?:ABLE|D|H)\s.*?WIDTH\s*=\s*)(\d+%)@$1"$2"@gis;

# Correct PowerPoint mis-nesting of tags.
s@(<FONT\s[^>]*>\s*.*?) () (\s*) ()
@$1$4$2$3@gisx;

# Translate bonehead PowerPoint misuse of <UL> to achieve
# paragraph breaks.
s@<(?:P|/UL)>\s*<UL>@@gi;
s@</UL>\s*@@gi;

# Repair PowerPoint depredations in "text-only slides"
s@@@gi;
s@ (<TD HEIGHT=100)@ <TR>$1@gi;
s@<LI>(?=<H2>)@@gi;

# Repair idiotic double breaks
s@ (?:\s* )+@@gi;

BEGIN {

    %tr = (
        # Fix dimbulb obscure numeric rendering of & < >.
        chr 38 => '&amp;',
        chr 60 => '&lt;',
        chr 62 => '&gt;',

        # Map strategically incompatible non-ISO characters in the
        # range 0x82 -- 0x9F into plausible substitutes where
        # possible.
        "\x82" => ',',
        "\x83" => 'f',
        "\x84" => ',,',
        "\x85" => '...',

        "\x88" => '^',
        "\x89" => '°/°°',

        "\x8B" => '<',
        "\x8C" => 'Oe',

        "\x91" => '`',
        "\x92" => "'",
        "\x93" => '"',
        "\x94" => '"',
        "\x95" => '*',
        "\x96" => '-',
        "\x97" => '--',
        "\x98" => '~',
        "\x99" => 'TM',

        "\x9B" => '>',
        "\x9C" => 'oe',
    );

}

--
What I tell you three times is true.

Philip 'Yes, that's my address' Newton

Jun 22, 1999, 3:00:00 AM

On Mon, 21 Jun 1999 22:06:57 +0200, "Alan J. Flavell"
<fla...@mail.cern.ch> wrote:

>On Mon, 21 Jun 1999, Bart Lateur wrote:
>
>> But yes. Windows uses a superset of ISO-Latin-1, AKA "Ansi",
>

>Sure, we all know that...
>

>> which is the same as most Unices use.
>

>I don't agree. The unix variants that I've ever used were based on
>either DEC Multinational or iso-8859-1. None of them had displayable
>characters in the 128-159 range.

Erm, I think Bart meant "Windows uses a superset of [ISO-Latin-1,
which is the same as most Unices use]" rather than "Windows uses [a
superset of ISO-Latin1, which is the same as most Unices use]". This
sounds sensible to me. (NB ISO-Latin-1 == ISO-8859-1)

Cheers,
Philip
--
Philip Newton <nospam...@gmx.net>

Philip 'Yes, that's my address' Newton

Jun 22, 1999, 3:00:00 AM

On Mon, 21 Jun 1999 16:37:28 +0200, "Alan J. Flavell"
<fla...@mail.cern.ch> wrote:

>CP850 for Western European locale).

Though I much prefer CP437 as it has all the extra letters I need for
German *and* the line-drawing characters used extensively by
(American) DOS programs; all the single-line-meets-double-line
characters look awful in CP850 as they turn into accented letters.

Larry Rosler

Jun 22, 1999, 3:00:00 AM

[Posted and a courtesy copy sent.]

In article <376f...@cs.colorado.edu> on 22 Jun 1999 08:14:16 -0700, Tom
Christiansen <tch...@mox.perl.com> says...
...

> # De-moron-ise Illegal Text and HTML from Microsoft Applications
> #
> # by John Walker -- January 1998
> # http://www.fourmilab.ch/
> # revised by Larry Rosler -- May 1999
> # http://www.hpl.hp.com/personal/Larry_Rosler/

...

> # Map strategically incompatible non-ISO characters in the
> # range 0x82 -- 0x9F into plausible substitutes where
> # possible.

...

> "\x91" => '`',
> "\x92" => "'",

Having just used this program on some Redmondware output, I would now
change that one to '´', to match the backtick "\x91" better. Some
of the others might use some more thought too. I didn't look into the
substitutions when I massaged the program.

...

> --
> What I tell you three times is true.

How come no attribution in your quotations file???

``Just the place for a Snark! I have said it twice:
That alone should encourage the crew.
Just the place for a Snark! I have said it thrice:
What I tell you three times is true.''

Lewis Carroll
The Hunting of the Snark

--
(Just Another Larry) Rosler

Hewlett-Packard Laboratories
http://www.hpl.hp.com/personal/Larry_Rosler/
l...@hpl.hp.com

Uri Guttman

Jun 22, 1999, 3:00:00 AM

>>>>> "LR" == Larry Rosler <l...@hpl.hp.com> writes:

>> "\x91" => '`',
>> "\x92" => "'",

LR> Having just used this program on some Redmondware output, I would now
LR> change that one to '´', to match the backtick "\x91" better. Some
LR> of the others might use some more thought too. I didn't look into the
LR> substitutions when I massaged the program.

this is a tricky call. you fixed our paper that way and the ' (single
quote) char after translation to ´ is leaning way too far right
for my taste. maybe just converting it to ' is ok.

uri

--
Uri Guttman ----------------- SYStems ARCHitecture and Software Engineering
u...@sysarch.com --------------------------- Perl, Internet, UNIX Consulting
Have Perl, Will Travel ----------------------------- http://www.sysarch.com
The Best Search Engine on the Net ------------- http://www.northernlight.com

Alan J. Flavell

Jun 22, 1999, 3:00:00 AM

On Tue, 22 Jun 1999, Philip 'Yes, that's my address' Newton wrote:

> (NB ISO-Latin-1 == ISO-8859-1)

NB, pedantically, ISO-Latin-1 is a repertoire of characters, without
reference to their coding. The ISO-specified character coding for the
Latin-1 repertoire is indeed ISO-8859-1, but the same repertoire of
characters is also included in CP850 and in one of the EBCDIC code pages
(it was called CECP1047 when I was involved in that stuff), as well as
in Windows-1252.

all the best
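
The repertoire-versus-coding distinction can be shown with one character and several codings. This sketch assumes the Encode module, which ships with later Perl releases and was not available when this thread was written. The encoding names are the ones Encode uses, and the choice of e-acute is arbitrary.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# One character from the Latin-1 repertoire: e with acute accent.
my $e_acute = "\x{E9}";

# Same character, different byte depending on the coding.
for my $coding (qw(iso-8859-1 cp1252 cp850)) {
    my $bytes = encode($coding, $e_acute);
    printf "%-10s 0x%02X\n", $coding, ord $bytes;
}

# The reverse direction shows where cp1252 and iso-8859-1 part ways:
# byte 0x92 is a printable quote in one and a C1 control in the other.
printf "cp1252     0x92 -> U+%04X\n", ord decode('cp1252',     "\x92");
printf "iso-8859-1 0x92 -> U+%04X\n", ord decode('iso-8859-1', "\x92");
```

The first loop should report 0xE9 for iso-8859-1 and cp1252 and 0x82 for cp850. The last two lines should report U+2019 and U+0092 respectively.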

Kenny McCormack

Jun 22, 1999, 3:00:00 AM

In article <MPG.11d94cadc...@nntp.hpl.hp.com>,

Larry Rosler <l...@hpl.hp.com> wrote:
>[Posted and a courtesy copy sent.]
>
>In article <376f...@cs.colorado.edu> on 22 Jun 1999 08:14:16 -0700, Tom
>Christiansen <tch...@mox.perl.com> says...
>...
>> # De-moron-ise Illegal Text and HTML from Microsoft Applications
>> #
>> # by John Walker -- January 1998
>> # http://www.fourmilab.ch/
>> # revised by Larry Rosler -- May 1999
>> # http://www.hpl.hp.com/personal/Larry_Rosler/

Well, unfortunately, this program *almost* works for me.

When I run it, I get:

demoroniser /etc/group > /dev/null
String found where operator expected at /mydir/bin/demoroniser line 33, near ""$0: warning -- untranslated character 0x%.2X in line $iline of `%s'\n""
(Missing operator before "$0: warning -- untranslated character 0x%.2X in line $iline of `%s'\n"?)

Also, when I run it on a file that contains the 0x92 character (which
is the most common offending character in Redmondware generated files),
I get:

/mydir/bin/demoroniser: warning -- untranslated character 0x92 in line 20 of `MyFile'
/mydir/bin/demoroniser: warning -- untranslated character 0x92 in line 28 of `MyFile'
/mydir/bin/demoroniser: warning -- untranslated character 0x92 in line 35 of `MyFile'
/mydir/bin/demoroniser: warning -- untranslated character 0x92 in line 36 of `MyFile'

And it leaves those chars (0x92) intact in the output. Translating
them was the whole point of the exercise.
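
One possible reading of the untranslated 0x92, offered as a guess from the script as it appears in this thread and not as a diagnosis of the original file: the only substitution that consults %tr matches numeric entities of the form &#NNN;, so a raw 0x92 byte never reaches the table and is merely reported by the warning loop. The sketch below reproduces that entity pass with a four-entry subset of the posted table, then adds a hypothetical raw_pass (not part of the posted program) that looks up the bytes directly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Subset of the table from the posted script.
my %tr = (
    "\x91" => '`',
    "\x92" => "'",
    "\x93" => '"',
    "\x94" => '"',
);

# The entity substitution as posted: only "&#NNN;" forms consult %tr.
sub entity_pass {
    my $s = shift;
    $s =~ s/&#(\d+);?/
        $tr{chr $1} || ($1 < 0x80 || $1 > 0x9F ? "&#$1;" : chr $1)/ge;
    return $s;
}

# Hypothetical extra pass: translate raw bytes in 0x80..0x9F that the
# table knows about, and leave the rest for the warning loop to report.
sub raw_pass {
    my $s = shift;
    $s =~ s/([\x80-\x9F])/exists $tr{$1} ? $tr{$1} : $1/ge;
    return $s;
}

print entity_pass("don&#146;t"), "\n";   # don't
print raw_pass("don\x92t"), "\n";        # don't
```

With this table, entity_pass("don\x92t") returns its input unchanged, which matches the behaviour reported above.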

Larry Rosler

Jun 22, 1999, 3:00:00 AM

[Posted and a courtesy copy mailed.]

In article <x7aets8...@home.sysarch.com> on 22 Jun 1999 12:46:39 -0400,
Uri Guttman <u...@sysarch.com> says...

> >>>>> "LR" == Larry Rosler <l...@hpl.hp.com> writes:
>
> >> "\x91" => '`',
> >> "\x92" => "'",
>
> LR> Having just used this program on some Redmondware output, I would now
> LR> change that one to '´', to match the backtick "\x91" better. Some
> LR> of the others might use some more thought too. I didn't look into the
> LR> substitutions when I massaged the program.
>
> this is a tricky call. you fixed our paper that way and the ' (single
> quote) char after translation to ´ is leaning way too far right
> for my taste. maybe just converting it to ' is ok.

I just rechecked the appearance of ´ using Netscape Navigator 4
and M$IE 5, and they look symmetric with the backtick, i.e., just fine.
Are there any other browsers that anyone cares about? <Ducking. :->

Philip 'Yes, that's my address' Newton

Jun 23, 1999, 3:00:00 AM

Ah. I thought the two were equivalent (and seem to remember seeing
ISO-8859-1 being called "Latin alphabet no. 1").

I used to think Latin-n == ISO-8859-n for all n, but saw a listing
which shows this breaks down because of Cyrillic, Greek, Arabic, and
Hebrew in the middle; Latin-5 == ISO-8859-9 IIRC.

Philip 'Yes, that's my address' Newton

Jun 23, 1999, 3:00:00 AM

On Tue, 22 Jun 1999 14:15:14 -0700, l...@hpl.hp.com (Larry Rosler)
wrote:

>I just rechecked the appearance of ´ using Netscape Navigator 4
>and M$IE 5, and they look symmetric with the backtick, i.e., just fine.

Font?

(i.e. "I'm sure there are fonts where the two characters look
symmetric, and others where they look awful".)

Henry Churchyard

Jul 6, 1999, 3:00:00 AM

In article <376f...@cs.colorado.edu>,
Tom Christiansen <tch...@mox.perl.com> wrote:

> http://language.perl.com/misc/demoroniser.html

> #!/bin/perl -0777pw

> # De-moron-ise Illegal Text and HTML from Microsoft Applications

> # Eliminate idiot MS-DOS carriage returns from line terminator.

> s/[\015\012]+/\n/g;

What does that mean? In having \r\n as line-ending sequence, MS-DOS
is arguably more closely following the intentions of the original
ASCII specification, and in fact Internet standards themselves seem to
show that \r\n is the more platform-independent way of doing things --
since in most cases when plain-text messages or commands are passed
around by various protocols (SMTP, NNTP, etc.) it's specified that
text lines will end with \r\n when being sent from one system to
another.
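
Whatever one thinks of the comment, the quoted substitution also does more than strip carriage returns: because it matches a run of CR and LF bytes, it collapses blank lines too. The sketch below contrasts it with a narrower alternative. Both sub names are invented here, and it assumes a platform where "\n" is LF.

```perl
use strict;
use warnings;

# The quoted substitution: any run of CR/LF bytes becomes one newline,
# so consecutive line endings (blank lines) are squeezed into one.
sub squeeze_eol {
    my ($text) = @_;
    $text =~ s/[\015\012]+/\n/g;
    return $text;
}

# Narrower alternative: each CRLF pair (or lone CR) becomes one newline,
# leaving the number of line endings unchanged.
sub normalise_eol {
    my ($text) = @_;
    $text =~ s/\015\012?/\n/g;
    return $text;
}

my $dos = "para one\015\012\015\012para two\015\012";
print length(squeeze_eol($dos)), " vs ", length(normalise_eol($dos)), "\n";
```

Which behaviour is wanted depends on the input: squeezing is harmless for HTML, where blank lines carry no meaning, but it changes plain-text paragraphs.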

--
Henry Churchyard http://www.crossmyt.com/hc/ || "Is it possible? Can anyone
be so blind to the sordid side of human nature and picnics?"-Charles Willis

Harlan Grove

Jul 7, 1999, 3:00:00 AM

In article <7ltq06$3...@piglet.cc.utexas.edu>, chur...@ccwf.cc.utexas.edu
(Henry Churchyard) writes:

>In article <376f...@cs.colorado.edu>,
>Tom Christiansen <tch...@mox.perl.com> wrote:
>
>> http://language.perl.com/misc/demoroniser.html
>
>> #!/bin/perl -0777pw
>> # De-moron-ise Illegal Text and HTML from Microsoft Applications
>
>> # Eliminate idiot MS-DOS carriage returns from line terminator.
>> s/[\015\012]+/\n/g;
>
>What does that mean? In having \r\n as line-ending sequence, MS-DOS
>is arguably more closely following the intentions of the original
>ASCII specification, and in fact Internet standards themselves seem to
>show that \r\n is the more platform-independent way of doing things --
>since in most cases when plain-text messages or commands are passed
>around by various protocols (SMTP, NNTP, etc.) it's specified that
>text lines will end with \r\n when being sent from one system to
>another.

Aw, c'mon. Christiansen's being a unix bigot on purpose. Let him enjoy his
snide comments. Besides, damn little in MS-DOS/Windows is original. CR-LF comes
from CP/M (at least). Whether or not it was a good idea for Microsoft to ape
CP/M back in the early 1980's, they've wisely chosen to stick with \r\n in
order not to break old code and documents.

As for ASCII's original intentions, it was developed back in the wonderful days
of punch cards. Punch cards didn't need \n, \r or \r\n. Paper tape - now that's
a different issue. Seriously, \r\n is the sequence needed to control really
old 'n dumb printers, though it's arguable that \n\r would have been a more
logical sequence. Line ends are just conventions. I'd argue that \0 would make
even more sense.

Eric Bohlman

Jul 7, 1999, 3:00:00 AM

Henry Churchyard (chur...@ccwf.cc.utexas.edu) wrote:
: In article <376f...@cs.colorado.edu>,
: Tom Christiansen <tch...@mox.perl.com> wrote:
:
: > http://language.perl.com/misc/demoroniser.html
:
: > #!/bin/perl -0777pw
: > # De-moron-ise Illegal Text and HTML from Microsoft Applications
:
: > # Eliminate idiot MS-DOS carriage returns from line terminator.
: > s/[\015\012]+/\n/g;
:
: What does that mean? In having \r\n as line-ending sequence, MS-DOS
: is arguably more closely following the intentions of the original
: ASCII specification, and in fact Internet standards themselves seem to

I have to agree with you here. CRLF as a line separator is valid ASCII,
and valid in HTML. It's just not the Unix Way. That's an entirely
different case from the tendency of Microsoft tools to use "pretty quotes"
defined only in their proprietary encoding schemes as replacements for
ASCII single and double quotes; the latter is indeed a case of
"moroni[zs]ation."

Henry Churchyard

Jul 7, 1999, 3:00:00 AM

In article <19990706201640...@ngol03.aol.com>,
Harlan Grove <hrl...@aol.comzzzzzz> wrote:
>In article <7ltq06$3...@piglet.cc.utexas.edu>, chur...@ccwf.cc.utexas.edu
>(Henry Churchyard) writes:

>>In article <376f...@cs.colorado.edu>,
>>Tom Christiansen <tch...@mox.perl.com> wrote:

>>> # Eliminate idiot MS-DOS carriage returns from line terminator.
>>> s/[\015\012]+/\n/g;

>> What does that mean? In having \r\n as line-ending sequence,
>> MS-DOS is arguably more closely following the intentions of the
>> original ASCII specification, and in fact Internet standards
>> themselves seem to show that \r\n is the more platform-independent
>> way of doing things -- since in most cases when plain-text messages
>> or commands are passed around by various protocols (SMTP, NNTP,
>> etc.) it's specified that text lines will end with \r\n when being
>> sent from one system to another.

> Christiansen's being a unix bigot on purpose. Let him enjoy his
> snide comments. Besides, CR-LF comes from CP/M (at least). Whether
> or not this was a good idea for Microsoft to ape CP/M back in the
> early 1980's, they've wisely chosen to stick with \r\n in order not
> to break old code and documents. As for ASCII's original
> intentions, it was developed back in the wonderful days of punch
> cards. Punch cards didn't need \n, \r or \r\n. Seriously, \r\n is
> the sequence needed to control really old 'n dumb printers, though
> it's arguable that \n\r would have been a more logical sequence.

Classic IBM 80-column punch cards didn't use ASCII -- in fact, they
didn't use any encoding that straightforwardly fits into an 8-bit
byte. And in the case of those printers you mention (which aren't as
archaic as all that, considering they were in widespread use well into
the 1980's), CR-LF was more natural than LF-CR, because with the
former you could issue multiple CR's (to allow you to produce your
Snoopy calendar illustration, or to produce underlining by
overstriking with the "_" character) before sending the line-ending
CR-LF sequence...
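
To make the overstriking trick concrete, here is a small sketch that builds one such printer line. The sub name is made up, and the effect only shows on a device that really overstrikes; a terminal will simply overwrite the text with the underscores.

```perl
use strict;
use warnings;

# Build one printer line that underlines $text by overstriking:
# print the text, return the carriage with a bare CR, print underscores
# over the same columns, then end the line with CR LF.
sub underline_by_overstrike {
    my ($text) = @_;
    return $text . "\015" . ('_' x length($text)) . "\015\012";
}

print underline_by_overstrike("Snoopy");
```

With LF-CR as the line ending, the bare CR used for the second pass would still work, but the convention of "CR returns, LF advances" reads more naturally when the advance comes last.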

--
Henry Churchyard http://www.crossmyt.com/hc/ || "Is it possible? Can anyone
be so blind to the sordid side of human nature and picnics?"--Charles Willis

Harlan Grove

Jul 7, 1999, 3:00:00 AM

In article <7m0jip$r...@piglet.cc.utexas.edu>, chur...@ccwf.cc.utexas.edu
(Henry Churchyard) writes:

<snip>
>> . . . As for ASCII's original
>> intentions, it was developed back in the wonderful days of punch
>> cards. Punch cards didn't need \n, \r or \r\n. Seriously, \r\n is
>> the sequence needed to control really old 'n dumb printers, though
>> it's arguable that \n\r would have been a more logical sequence.
>
>Classic IBM 80-column punch cards didn't use ASCII -- in fact, they
>didn't use any encoding that straightforwardly fits into an 8-bit
>byte. And in the case of those printers you mention (which aren't as
>archaic as all that, considering they were in widespread use well into
>the 1980's), CR-LF was more natural than LF-CR, because with the
>former you could issue multiple CR's (to allow you to produce your
>Snoopy calendar illustration, or to produce underlining by
>overstriking with the "_" character) before sending the line-ending
>CR-LF sequence...

OK, punch cards - bad example with regard to character set encoding. Somewhat
relevant in that fixed-length records don't need any explicit separators.
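
The fixed-length point can be sketched in a few lines of Perl: with a known record width, a buffer splits into records without any separator bytes. The sub name and the default width of 80 (a nod to the punch card) are assumptions for illustration, and the grouped unpack template needs Perl 5.8 or later.

```perl
use strict;
use warnings;

# Split a buffer into fixed-width records; no separator bytes are needed.
# A short final record is returned as-is.
sub fixed_records {
    my ($buffer, $width) = @_;
    $width ||= 80;
    return unpack "(a$width)*", $buffer;
}

my @cards = fixed_records("AAAABBBBCCCC", 4);
print scalar(@cards), " records\n";
```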

Paper tape may be more apt - NULL (no holes) would be the visually obvious way
to separate things. As for old printers, some TTYs were bi-directional, so line
1 would print left to right, followed by a line feed, then line 2 printing
right to left, etc. No CR's at all. Maybe that's why unix uses only LF. Who
knows (who cares)?