
Warning: apostrophe VIRUS.


no.to...@gmail.com

Sep 10, 2015, 9:37:00 AM
I don't know where it started.
Perhaps with something like a new [aka. pay again] MS.word:
the single-quote/apostrophe is not rendered as
ASCII-HexChar(27), as it should be.

It's fetched [by cursed http] as 2 bytes:
HexChar(C2) and HexChar(80) !!

If you're just reading it, your brain will handle the mess,
but if you're 'computerising' this WRONG text, expect problems.

E.g. my one problem is with Festival (text-to-speech), which I heavily
rely on: it says "sea aye enn tea" instead of "can't"
when it reads the VIRUS: "Can#%t",
where "#%" are the BAD bytes received for ASCII-HexChar(27).

I was tempted to say that *nix users should more strongly distinguish
themselves from 'Windows users', but I know of a competent *nix user
whose Usenet posts always send this VIRUS.

Eef Hartman

Sep 10, 2015, 10:12:15 AM
In alt.os.linux.slackware no.to...@gmail.com wrote:
> It's fetched [by cursed http] as 2 bytes:
> HexChar(C2) and HexChar(80) !!

It is not a bug, it's a feature:
it means the sending site is not using ASCII or ISO-8859 encoding on
that page but UTF-8, a multi-byte (up to 4 bytes per character) encoding.
See for instance this page:
http://www.utf8-chartable.de/unicode-utf8-table.pl
for tables of UTF-8 vs Unicode characters.

Essentially there are THREE different "single quote"s;
the Ascii one (straight down)
opening apostrophe (slanted - at the top - to the left) and
closing apostrophe (slanted to the right)

If the page creator used one of the latter two, you'll get it as UTF
_or_ as a Unicode one (if the page has been created IN M$-Windows).
Your browser should be able to render all of these correctly:
user_pref("intl.charsetmenu.browser.cache", "UTF-8, windows-1252, ISO-8859-1, ISO-8859-2, ISO-8859-15");
(from my firefox "prefs.js" settings file).
BTW: ISO-8859-15 is my default charset, but as you see Firefox will
handle UTF-8 and windows-1252 too.
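
For illustration, here is a minimal Python sketch of how the three "single quotes" described above come out as bytes. It assumes the slanted variants are U+2018 and U+2019 (the post does not name the code points), and the helper name encoded_hex is made up for this example.

```python
# The three "single quotes", assuming the slanted ones are U+2018 and U+2019.
QUOTES = {
    "ASCII apostrophe": "\u0027",
    "left single quotation mark": "\u2018",
    "right single quotation mark": "\u2019",
}


def encoded_hex(ch: str, encoding: str) -> str:
    """Return the bytes of ch in the given encoding as space-separated hex."""
    return ch.encode(encoding).hex(" ")


for name, ch in QUOTES.items():
    print(
        f"U+{ord(ch):04X} {name}: "
        f"utf-8={encoded_hex(ch, 'utf-8')} "
        f"windows-1252={encoded_hex(ch, 'cp1252')}"
    )
```

The ASCII apostrophe is the single byte 27 in both encodings; the slanted quotes take three bytes each in UTF-8 and one byte each (91, 92) in windows-1252.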

Michael Baeuerle

Sep 10, 2015, 11:13:28 AM
Eef Hartman wrote:
> In alt.os.linux.slackware no.to...@gmail.com wrote:
> >
> > It's fetched [by cursed http] as 2 bytes:
> > HexChar(C2) and HexChar(80) !!
>
> It is not a bug, it's a feature:
> it means the sending site is not using ASCII or ISO-8859 encoding on
> that page but UTF-8, a multi-byte (up to 4 bytes per character) encoding.
> See for instance this page:
> http://www.utf8-chartable.de/unicode-utf8-table.pl
> for tables of UTF-8 vs Unicode characters.
>
> Essentially there are THREE different "single quote"s;
> the Ascii one (straight down)
> opening apostrophe (slanted - at the top - to the left) and
> closing apostrophe (slanted to the right)

Wikipedia lists some more Unicode apostrophe style characters:
<https://en.wikipedia.org/wiki/Apostrophe#Unicode>

For the word "can't", maybe this variant would be "correct" from the
typographic point of view:
<http://www.fileformat.info/info/unicode/char/2019/index.htm>

"can’t"

> If the page creator used one of the latter two, you'll get it as UTF
> _or_ as a Unicode one (if the page has been created IN M$-Windows).
> Your browser should be able to render all of these correctly:
> user_pref("intl.charsetmenu.browser.cache", "UTF-8, windows-1252,
> ISO-8859-1, ISO-8859-2, ISO-8859-15");
> (from my firefox "prefs.js" settings file).
> BTW: ISO-8859-15 is my default charset, but as you see Firefox will
> handle UTF-8 and windows-1252 too.

But 0xC2 0x80 looks wrong in any case. It is the (non printable) C1
control character "PAD" in Unicode (if the two bytes are interpreted as
UTF-8), not an apostrophe.
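
This can be checked with a few lines of Python. The sketch below decodes the two reported bytes as UTF-8 and, for comparison, prints the UTF-8 bytes of U+2019; the variable names are only illustrative.

```python
import unicodedata

# The two bytes reported in the original post, interpreted as UTF-8.
raw = bytes([0xC2, 0x80])
ch = raw.decode("utf-8")
code_point = f"U+{ord(ch):04X}"
category = unicodedata.category(ch)  # "Cc" means a control character
print(code_point, category)

# For comparison: the typographic apostrophe U+2019 encoded as UTF-8.
apostrophe_bytes = "\u2019".encode("utf-8")
print(apostrophe_bytes.hex(" "))
```

The first line printed is "U+0080 Cc" and the second is "e2 80 99", so a UTF-8 apostrophe would be three bytes, none of them C2.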


[Xpost and Fup2 set to comp.os.linux.misc]

Chick Tower

Sep 10, 2015, 6:54:35 PM
On 2015-09-10, no.to...@gmail.com wrote:
> I don't know where it started.
> Perhaps with something like a new [aka. pay again] MS.word:
> the single-quote/apostrophe is not rendered as
> ASCII-HexChar(27), as it should be.
>
> It's fetched [by cursed http] as 2 bytes:
> HexChar(C2) and HexChar(80) !!
>
> If you're just reading it, your brain will handle the mess,
> but if you're 'computerising' this WRONG text, expect problems.

Perhaps you can process your texts with tr or sed to replace them with
the ASCII apostrophe before feeding them to festival.
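
A rough Python equivalent of that tr/sed step might look like the sketch below. The mapping table is an assumption (it covers only the common "smart" quotes, not every apostrophe-like character), and asciify_quotes is a made-up name.

```python
# Typographic quotes mapped to ASCII stand-ins (illustrative, not exhaustive).
ASCII_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / typographic apostrophe
    "\u201C": '"',   # left double quotation mark
    "\u201D": '"',   # right double quotation mark
})


def asciify_quotes(text: str) -> str:
    """Replace typographic quotes in already-decoded text with ASCII ones."""
    return text.translate(ASCII_MAP)


print(asciify_quotes("Can\u2019t"))
```

This works on decoded text, so the input has to be decoded from UTF-8 first; everything not in the table passes through unchanged.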
--
Chick Tower

For e-mail: aols2 DOT sent DOT towerboy AT xoxy DOT net

Steve Hayes

Sep 11, 2015, 4:29:00 AM
On Thu, 10 Sep 2015 13:35:07 +0000 (UTC), no.to...@gmail.com wrote:

>I don't know where it started.
>Perhaps with something like a new [aka. pay again] MS.word:
>the single-quote/apostrophe is not rendered as
>ASCII-HexChar(27), as it should be.
>
>It's fetched [by cursed http] as 2 bytes:
>HexChar(C2) and HexChar(80) !!

I keep getting e-mails with A trademark Euro.

I usually delete them unread.




--
Steve Hayes from Tshwane, South Africa
Web: http://www.khanya.org.za/stevesig.htm
Blog: http://khanya.wordpress.com
E-mail - see web page, or parse: shayes at dunelm full stop org full stop uk


Unknown

Sep 11, 2015, 10:36:41 PM
On Thu, 10 Sep 2015 14:12:13 +0000, Eef Hartman wrote:

> In alt.os.linux.slackware no.to...@gmail.com wrote:
>> It's fetched [by cursed http] as 2 bytes:
>> HexChar(C2) and HexChar(80) !!
>
> It is not a bug, it's a feature:
>
No it's crap; that's why I labeled it a VIRUS.
Don't mess minimalist ASCII by inserting Unicode.
Soon they'll add Chinese pictograms too.

Chick Tower Wrote:
] Perhaps you can process your texts with tr or sed to replace them with
] the ASCII apostrophe before feeding them to festival.

Yes, but I'm not looking for extra work; a continual battle against
invading viruses.

BTW I've already got a chain of `sed` to prepare the text for festival:
eg. to remove all: "["<numericalString>"]" since I fetch text by `links`.
Which confirms that the <concatenative> programming style [aka *nix
piping] is a winner, for coping with the 'nothing is permanent, except
continual change' situation.

Keith Keller

Sep 11, 2015, 10:50:07 PM
["Followup-To:" header set to alt.os.linux.slackware.]

On 2015-09-12, Unknown <d...@gmail.com> wrote:
>
> No it's crap; that's why I labeled it a VIRUS.

Can you please label all of your future posts as a VIRUS?

--keith



--
kkeller...@wombat.san-francisco.ca.us
(try just my userid to email me)
AOLSFAQ=http://www.therockgarden.ca/aolsfaq.txt
see X- headers for PGP signature information

Eef Hartman

Sep 12, 2015, 3:13:56 AM
In alt.os.linux.slackware Unknown <d...@gmail.com> wrote:
> Don't mess minimalist ASCII by inserting Unicode.

ASCII is dead; no Linux uses it pure anymore (although it is a
common subset of the modern charsets ISO-8859 and UTF-8).

I.e. your own post had _this_ in its headers:
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
so you're using UTF-8 (Unicode Transformation Format, 8-bit) in your
posts. The first 128 codes of that (0 through 127) _are_ ASCII, but - as
mentioned before - UTF-8 is a multi-byte encoding with many more chars
than the only 95 printable ones of ASCII.
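
A small Python sketch of that relationship, using only built-in codecs; the sample characters and variable names are arbitrary choices for this example.

```python
# Every ASCII code point (0..127) is the same single byte in UTF-8.
ascii_is_subset = all(
    chr(cp).encode("utf-8") == bytes([cp]) for cp in range(128)
)

# Count the printable ASCII characters (space through tilde).
printable_count = sum(1 for cp in range(128) if chr(cp).isprintable())

# Characters beyond ASCII take two to four bytes in UTF-8.
samples = ["\u00e9", "\u2019", "\U0001F44D"]
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]

print(ascii_is_subset, printable_count, byte_lengths)
```

This prints "True 95 [2, 3, 4]": pure ASCII text is already valid UTF-8, and only characters outside that range grow to multiple bytes.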

Rich

Sep 12, 2015, 10:55:37 AM
Which likely simply means that our resident crank/troll has forgotten
to tell festival to use UTF-8 as its input character set.

Actually a tiny bit of googling shows that festival may not yet be
capable of 'reading' utf-8 inputs. One post mentioned using iconv to
convert the utf-8 strings into iso-8859 before feeding them to
festival. This is probably the cleanest method until festival becomes
utf-8 aware.
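
As a sketch of what such a conversion involves (in Python rather than iconv, with a made-up function name), note that ISO-8859-1 has no slot for the typographic apostrophe U+2019, so a plain conversion must either fail or substitute something:

```python
def utf8_to_latin1(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as ISO-8859-1, using '?' where no mapping exists."""
    text = data.decode("utf-8")
    return text.encode("iso-8859-1", errors="replace")


source = "caf\u00e9 can\u2019t".encode("utf-8")
converted = utf8_to_latin1(source)
print(converted)
```

Here the e-acute survives as the single byte E9, but the apostrophe becomes "?". To get "can't" out the other end, the typographic apostrophe has to be mapped to the ASCII one before or during conversion; GNU iconv's //TRANSLIT suffix is one option for that, though its output should be checked on real input.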

RS Wood

Sep 22, 2015, 4:51:49 AM
Seems the old demoroniser perl script might be the way to go, here:
https://www.fourmilab.ch/webtools/demoroniser/

Funny clip from the website:
//--clip
A little detective work revealed that, as is usually the case when you
encounter something shoddy in the vicinity of a computer, Microsoft
incompetence and gratuitous incompatibility were to blame. Western
language HTML documents are written in the ISO 8859-1 Latin-1 character
set, with a specified set of escapes for special characters. Blithely
ignoring this prescription, as usual, Microsoft use their own
"extension" to Latin-1, in which a variety of characters which do not
appear in Latin-1 are inserted in the range 0x82 through 0x95--this
having the merit of being incompatible with both Latin-1 and Unicode,
which reserve this region for additional control characters.
These characters include open and close single and double quotes, em
and en dashes, an ellipsis and a variety of other things you've been
dying for, such as a capital Y umlaut and a florin symbol. Well, okay,
you say, if Microsoft want to have their own little incompatible
character set, why not? Because it doesn't stop there--in their
inimitable fashion (who would want to?)--they aggressively pollute the
Web pages of unknowing and innocent victims worldwide with these
characters, with the result that the owners of these pages look like
semi-literate morons when their pages are viewed on non-Microsoft
platforms (or on Microsoft platforms, for that matter, if the user has
selected as the browser's font one of the many TrueType fonts which do
not include the incompatible Microsoft characters).
You see, "state of the art" Microsoft Office applications sport a nifty
feature called "smart quotes." (Rule of thumb--every time Microsoft use
the word "smart," be on the lookout for something dumb).
//--clip

Peter Chant

Sep 22, 2015, 2:33:06 PM
On 09/12/2015 03:42 AM, Keith Keller wrote:

>> No it's crap; that's why I labeled it a VIRUS.
>
> Can you please label all of your future posts as a VIRUS?

Looking up unicode for 'Like'...

👍
UTF-8

