From: Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info>
Subject: Re: Least-lossy string.encode to us-ascii?
Date: 14 Sep 2012 04:05:32 GMT
On Thu, 13 Sep 2012 21:34:52 -0500, Tim Chase wrote:
> On 09/13/12 21:09, Mark Tolonen wrote:
>> On Thursday, September 13, 2012 4:53:13 PM UTC-7, Tim Chase wrote:
>>> Vlastimil's solution kept the characters but stripped them of their
>>> accents/tildes/cedillas/etc, doing just what I wanted, all using the
>>> stdlib. Hard to do better than that :-)
>> How about using UTF-7 for transmission and decode on the other end?
>> This keeps the transmission all 7-bit, and no loss.
>> >>> s=u"serviço móvil".encode('utf-7')
>> >>> print s
>> servi+AOc-o m+APM-vil
>> >>> print s.decode('utf-7')
>> serviço móvil
> Nice if I control both ends of the pipe. Unfortunately, I only control
> what goes in, and I want it to be as un-screw-uppable as possible when
> it comes out the other end (may be web, CSV files, PDFs, FTP'ed file
> dumps, spreadsheets, word-processing documents, etc), and us-ascii is
> the lowest-common-denominator of unscrewuppableness while requiring
> nothing of the other end. :-)
Wrong. It requires support for US-ASCII. What if the other end is an IBM
mainframe using EBCDIC?
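To make the EBCDIC point concrete, here is a minimal sketch in Python 3 (not the Python 2 used in the session quoted above). It assumes `cp037`, one of the EBCDIC code pages shipped with Python's standard codecs, as a stand-in for whatever the mainframe actually uses; the variable names are only illustrative.

```python
# Sketch: the same pure-ASCII text is a different byte sequence in EBCDIC.
# Assumption: cp037 (EBCDIC US/Canada) stands in for the mainframe's code page.
text = "hello"

ascii_bytes = text.encode("ascii")
ebcdic_bytes = text.encode("cp037")

print(ascii_bytes.hex())   # bytes an ASCII-based system would send
print(ebcdic_bytes.hex())  # bytes an EBCDIC-based system expects

# Reading the ASCII bytes as EBCDIC does not raise an error;
# it silently yields different characters.
misread = ascii_bytes.decode("cp037")
print(repr(misread))
```

So even "plain" US-ASCII only survives if the receiving end agrees to interpret the bytes as US-ASCII.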
Frankly, I am appalled that you are intentionally perpetuating the
ignorance of US-ASCII-only applications, not because you have no choice
about inter-operating with some ancient, brain-dead application, but
because you artificially choose to follow an obsolete *and incorrect*
standard.
It is *incorrect* because you can change the meaning of text by stripping
accents and deleting characters. Consequences can include murder and suicide:
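As a rough illustration of how stripping accents can change or destroy text, here is a sketch in Python 3 of the usual stdlib approach (NFKD normalization followed by an ASCII encode that ignores errors). The helper name `strip_accents` is hypothetical, and this is not necessarily the exact code posted earlier in the thread.

```python
import unicodedata

def strip_accents(text):
    """Hypothetical helper: drop everything that cannot be expressed in ASCII."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # errors="ignore" silently discards the combining marks and any
    # character that has no ASCII decomposition at all.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_accents("serviço móvil"))  # servico movil
print(strip_accents("papá"))           # papa
print(strip_accents("Straße"))         # Strae
```

In Spanish, "papá" is "dad" while "papa" is "potato" (or "Pope"), so the second result is a different word. In the third, "ß" has no decomposition and is simply deleted.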
At least tell me that "ASCII only" is merely an *option* for your
application, not the only choice, and that it defaults to UTF-8 which is
the right standard to use for text.
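A minimal sketch of what "ASCII only as an option" could look like in Python 3. The function name `encode_output` and the `ascii_only` flag are hypothetical, not taken from the application under discussion, and the choice of `errors="replace"` is just one possible lossy fallback.

```python
def encode_output(text, ascii_only=False):
    """Hypothetical sketch: UTF-8 by default, lossy ASCII only on request."""
    if ascii_only:
        # Explicit opt-in: unencodable characters become "?".
        return text.encode("ascii", errors="replace")
    # Default: UTF-8 can represent every character losslessly.
    return text.encode("utf-8")

data = encode_output("serviço móvil")
assert data.decode("utf-8") == "serviço móvil"   # round-trips intact

print(encode_output("serviço móvil", ascii_only=True))
```

The point of the design is that the lossy path has to be requested explicitly, so nobody loses data by accident.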