Message from discussion Least-lossy string.encode to us-ascii?

From: Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info>
Subject: Re: Least-lossy string.encode to us-ascii?
Newsgroups: comp.lang.python
References: <50524F6F.6070604@tim.thechases.com>
	<k2tqne$gie$1@ger.gmane.org>
	<mailman.654.1347580392.27098.python-list@python.org>
	<8a35c480-7594-4202-afe8-f03db9418301@googlegroups.com>
	<mailman.660.1347590022.27098.python-list@python.org>
MIME-Version: 1.0
Date: 14 Sep 2012 04:05:32 GMT
Lines: 46
Message-ID: <5052ad0c$0$29981$c3e8da3$5496439d@news.astraweb.com>

On Thu, 13 Sep 2012 21:34:52 -0500, Tim Chase wrote:

> On 09/13/12 21:09, Mark Tolonen wrote:
>> On Thursday, September 13, 2012 4:53:13 PM UTC-7, Tim Chase wrote:
>>> Vlastimil's solution kept the characters but stripped them of their
>>> accents/tildes/cedillas/etc, doing just what I wanted, all using the
>>> stdlib.  Hard to do better than that :-)
>> 
>> How about using UTF-7 for transmission and decode on the other end? 
>> This keeps the transmission all 7-bit, and no loss.
>> 
>>     >>> s=u"serviço móvil".encode('utf-7')
>>     >>> print s
>>     servi+AOc-o m+APM-vil
>>     >>> print s.decode('utf-7')
>>     serviço móvil
> 
> Nice if I control both ends of the pipe.  Unfortunately, I only control
> what goes in, and I want it to be as un-screw-uppable as possible when
> it comes out the other end (may be web, CSV files, PDFs, FTP'ed file
> dumps, spreadsheets, word-processing documents, etc), and us-ascii is
> the lowest-common-denominator of unscrewuppableness while requiring
> nothing of the other end. :-)

Wrong. It requires support for US-ASCII. What if the other end is an IBM 
mainframe using EBCDIC?
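
To make that concrete, here is a minimal Python 3 sketch using the stdlib cp037 codec. cp037 is only one EBCDIC code page, picked for illustration; which code page a given mainframe really uses is an assumption here. The same text yields different bytes under each encoding, so a receiver expecting EBCDIC misreads ASCII input:

```python
# Sketch: ASCII is itself an encoding that the receiver must support.
# cp037 is one EBCDIC code page shipped with Python; it is chosen
# here purely for illustration.
text = "service"

ascii_bytes = text.encode("ascii")
ebcdic_bytes = text.encode("cp037")

# Same text, different bytes on the wire.
print(ascii_bytes.hex())
print(ebcdic_bytes.hex())

# A receiver that assumes EBCDIC misreads the ASCII bytes.
misread = ascii_bytes.decode("cp037")
print(misread == text)  # False
```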

Frankly, I am appalled that you are intentionally perpetuating the 
ignorance of US-ASCII-only applications, not because you have no choice 
about inter-operating with some ancient, brain-dead application, but 
because you artificially choose to follow an obsolete *and incorrect* 
standard.

It is *incorrect* because you can change the meaning of text by stripping 
accents and deleting characters. Consequences can include murder and suicide:

http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two-people-puts-three-more-in-jail
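
A rough Python 3 sketch of how that happens. It assumes the common stdlib approach of NFKD normalization followed by an ASCII encode that ignores errors, which may or may not match what was posted earlier in the thread. The helper name strip_to_ascii is made up for this example:

```python
import unicodedata

def strip_to_ascii(text):
    """Lossy: decompose characters, then drop every non-ASCII code point."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# German "schön" (beautiful) collapses into "schon" (already).
print(strip_to_ascii("schön"))       # schon

# Turkish dotless ı (U+0131) has no decomposition, so it is deleted outright.
print(strip_to_ascii("Diyarbakır"))  # Diyarbakr
```

The first case silently turns one word into another; the second drops a letter entirely.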

At least tell me that "ASCII only" is merely an *option* for your 
application, not the only choice, and that it defaults to UTF-8 which is 
the right standard to use for text.
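
Something along these lines, as a sketch only. The function name export_text and its ascii_only flag are hypothetical and not taken from any real application:

```python
def export_text(text, ascii_only=False):
    """Hypothetical export helper: UTF-8 by default, ASCII only on request."""
    if ascii_only:
        # Opt-in and visibly lossy: unencodable characters become '?'.
        return text.encode("ascii", "replace")
    return text.encode("utf-8")

data = export_text("serviço móvil")
print(data.decode("utf-8"))                           # serviço móvil
print(export_text("serviço móvil", ascii_only=True))  # b'servi?o m?vil'
```

Using "replace" instead of silently stripping at least makes the loss visible to whoever reads the output.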



-- 
Steven