
python 3.3 repr


Robin Becker

Nov 15, 2013, 6:28:15 AM
to pytho...@python.org
I'm trying to understand what's going on with this simple program

if __name__=='__main__':
    print("repr=%s" % repr(u'\xc1'))
    print("%%r=%r" % u'\xc1')

On my windows XP box this fails miserably if run directly at a terminal

C:\tmp> \Python33\python.exe bang.py
Traceback (most recent call last):
  File "bang.py", line 2, in <module>
    print("repr=%s" % repr(u'\xc1'))
  File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xc1' in position 6: character maps to <undefined>

If I run the program redirected into a file then no error occurs and the
result looks like this

C:\tmp>cat fff
repr='┴'
%r='┴'

and if I run it into a pipe it works as though into a file.

It seems that repr thinks it can render u'\xc1' directly, which is a problem,
since print then seems to want to convert that to cp437 if directed into a terminal.

I find the idea that print knows what it's printing to a bit dangerous, but it's
the repr behaviour that strikes me as bad.

What is responsible for defining the repr function's 'printable', so that repr
would give me, say, an ASCII rendering?
-confused-ly yrs-
Robin Becker

Ned Batchelder

Nov 15, 2013, 6:38:12 AM
to
In Python 3, repr() will return a Unicode string, and will preserve existing Unicode characters in its arguments. This has been controversial. To get the Python 2 behavior of a pure-ascii representation, there is the new builtin ascii(), and a corresponding %a format string.
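A minimal sketch of the difference (Python 3.3; REPL output shown as I'd expect it):

>>> repr(u'\xc1')     # preserves the non-ascii character
"'Á'"
>>> ascii(u'\xc1')    # escapes it, like the Python 2 repr
"'\\xc1'"
>>> '%a' % u'\xc1'
"'\\xc1'"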

--Ned.

Robin Becker

Nov 15, 2013, 7:16:52 AM
to pytho...@python.org
On 15/11/2013 11:38, Ned Batchelder wrote:
..........
>
> In Python3, repr() will return a Unicode string, and will preserve existing Unicode characters in its arguments. This has been controversial. To get the Python 2 behavior of a pure-ascii representation, there is the new builtin ascii(), and a corresponding %a format string.
>
> --Ned.
>

thanks for this, it doesn't make the split across python 2 - 3 any easier.
--
Robin Becker

Ned Batchelder

Nov 15, 2013, 8:54:08 AM
to
No, but I've found that significant programs that run on both 2 and 3 need to have some shims to make the code work anyway. You could do this:

try:
    repr = ascii
except NameError:
    pass

and then use repr throughout.
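With the shim in place, something like this should stay ascii-safe on both (a sketch, not tested on every build):

print("x: %s" % repr(u'\xc1'))   # Python 2: u'\xc1'; Python 3 with repr=ascii: '\xc1'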

--Ned.

Roy Smith

Nov 15, 2013, 9:25:48 AM
to pytho...@python.org

Ned Batchelder <n...@nedbatchelder.com> wrote:

In Python3, repr() will return a Unicode string, and will preserve existing
Unicode characters in its arguments.  This has been controversial.  To get
the Python 2 behavior of a pure-ascii representation, there is the new
builtin ascii(), and a corresponding %a format string.

I'm still stuck on Python 2, and while I can understand the controversy ("It breaks my Python 2 code!"), this seems like the right thing to have done.  In Python 2, unicode is an add-on.  One of the big design drivers in Python 3 was to make unicode the standard.

The idea behind repr() is to provide a "just plain text" representation of an object.  In P2, "just plain text" means ascii, so escaping non-ascii characters makes sense.  In P3, "just plain text" means unicode, so escaping non-ascii characters no longer makes sense.

Some of us have been doing this long enough to remember when "just plain text" meant only a single case of the alphabet (and a subset of ascii punctuation).  On an ASR-33, your C program would print like:

MAIN() \(
PRINTF("HELLO, ASCII WORLD");
\)

because ASR-33's didn't have curly braces (or lower case).

Having P3's repr() escape non-ascii characters today makes about as much sense as expecting P2's repr() to escape curly braces (and vertical bars, and a few others) because not every terminal can print those.

--
Roy Smith

Robin Becker

Nov 15, 2013, 9:29:17 AM
to pytho...@python.org
On 15/11/2013 13:54, Ned Batchelder wrote:
.........
>
> No, but I've found that significant programs that run on both 2 and 3 need to have some shims to make the code work anyway. You could do this:
>
> try:
>     repr = ascii
> except NameError:
>     pass
....
Yes, I tried that, but it doesn't affect %r, which is inlined in unicodeobject.c;
for me it seems easier to fix windows to use something like a standard encoding,
i.e. utf8 (cp65001), but that's quite hard to do globally. It seems sitecustomize
is too late to set os.environ['PYTHONIOENCODING']; perhaps I can stuff that into
one of the global environment vars and have it work for all python invocations.
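Something like this is what I have in mind (a sketch; the console still has to cope with whatever bytes come out):

C:\tmp> set PYTHONIOENCODING=utf-8
C:\tmp> \Python33\python.exe bang.py

That should trade the UnicodeEncodeError for utf-8 bytes on stdout, though a cp437 console will then render u'\xc1' as two junk glyphs rather than raising.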
--
Robin Becker

Serhiy Storchaka

Nov 15, 2013, 9:40:36 AM
to pytho...@python.org
On 15.11.13 15:54, Ned Batchelder wrote:
> No, but I've found that significant programs that run on both 2 and 3 need to have some shims to make the code work anyway. You could do this:
>
> try:
>     repr = ascii
> except NameError:
>     pass
>
> and then use repr throughout.

Or rather

try:
    ascii
except NameError:
    ascii = repr

and then use ascii throughout.


Robin Becker

Nov 15, 2013, 9:43:17 AM
to pytho...@python.org
..........
> I'm still stuck on Python 2, and while I can understand the controversy ("It breaks my Python 2 code!"), this seems like the right thing to have done. In Python 2, unicode is an add-on. One of the big design drivers in Python 3 was to make unicode the standard.
>
> The idea behind repr() is to provide a "just plain text" representation of an object. In P2, "just plain text" means ascii, so escaping non-ascii characters makes sense. In P3, "just plain text" means unicode, so escaping non-ascii characters no longer makes sense.
>

unfortunately the word 'printable' got into the definition of repr; it's clear
that printability is not the same as unicode, at least as far as the print
function is concerned. In my opinion it would have been better to keep the old
behaviour, as that would have eased compatibility.

The python gods don't count that sort of thing as important enough, so we get
the mess that is the python2/3 split. ReportLab has to do both, so it's a real
issue; in addition, swapping the str/unicode pair to bytes/str doesn't help
one's mental models either :(

Things went wrong when utf8 was not adopted as the standard encoding, thus
requiring two string types; it would have been easier to have a len function to
count bytes as before, and a glyphlen to count glyphs. Now, as I understand it,
we have a complicated mess under the hood for unicode objects, so they have a
variable representation to approximate an 8 bit representation when suitable,
etc etc etc.

> Some of us have been doing this long enough to remember when "just plain text" meant only a single case of the alphabet (and a subset of ascii punctuation). On an ASR-33, your C program would print like:
>
> MAIN() \(
> PRINTF("HELLO, ASCII WORLD");
> \)
>
> because ASR-33's didn't have curly braces (or lower case).
>
> Having P3's repr() escape non-ascii characters today makes about as much sense as expecting P2's repr() to escape curly braces (and vertical bars, and a few others) because not every terminal can print those.
>
.....
I can certainly remember those days, how we cried and laughed when 8 bits became
popular.
--
Robin Becker

Joel Goldstick

Nov 15, 2013, 9:50:24 AM
to Robin Becker, pytho...@python.org, Roy Smith
>> Some of us have been doing this long enough to remember when "just plain
>> text" meant only a single case of the alphabet (and a subset of ascii
>> punctuation). On an ASR-33, your C program would print like:
>>
>> MAIN() \(
>> PRINTF("HELLO, ASCII WORLD");
>> \)
>>
>> because ASR-33's didn't have curly braces (or lower case).
>>
>> Having P3's repr() escape non-ascii characters today makes about as much
>> sense as expecting P2's repr() to escape curly braces (and vertical bars,
>> and a few others) because not every terminal can print those.
>>
> .....
> I can certainly remember those days, how we cried and laughed when 8 bits
> became popular.
>
Really? You cried and laughed over 7 vs. 8 bits? That's lovely (?) ;).
That eighth bit sure was less confusing than codepoint translations.


> --
> Robin Becker



--
Joel Goldstick
http://joelgoldstick.com

Robin Becker

Nov 15, 2013, 9:52:02 AM
to pytho...@python.org, Serhiy Storchaka
On 15/11/2013 14:40, Serhiy Storchaka wrote:
......


>> and then use repr throughout.
>
> Or rather
>
> try:
>     ascii
> except NameError:
>     ascii = repr
>
> and then use ascii throughout.
>
>

apparently you can import ascii from future_builtins and the print() function is
available as

from __future__ import print_function

nothing fixes all those %r formats to be %a though :(
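i.e. the version-straddling import would look something like this (a sketch):

try:
    from future_builtins import ascii   # Python 2.6+
except ImportError:
    pass                                # Python 3: ascii is already a builtin

print("%s" % ascii(u'\xc1'))            # ascii-safe on both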
--
Robin Becker

Robin Becker

Nov 15, 2013, 10:03:55 AM
to pytho...@python.org, Roy Smith
...........
>> became popular.
>>
> Really? you cried and laughed over 7 vs. 8 bits? That's lovely (?).
> ;). That eighth bit sure was less confusing than codepoint
> translations


No, we had 6 bits in 60 bit words as I recall; extracting the nth character
involved division by 6; smart people did tricks with inverted multiplications
etc etc :(
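For the record, the per-word extraction would look something like this in today's notation (a hypothetical sketch, ten 6-bit codes to a 60-bit word):

def nth_char(word, n):
    # n = 0 selects the leftmost 6-bit field of the 60-bit word
    return (word >> (6 * (9 - n))) & 0o77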
--
Robin Becker

Joel Goldstick

Nov 15, 2013, 10:07:49 AM
to Robin Becker, pytho...@python.org, Roy Smith
Cool, someone here is older than me! I came in with the 8080, and I
remember split octal, but sixes are something I missed out on.
> Robin Becker

Ned Batchelder

Nov 15, 2013, 10:08:23 AM
to
On Friday, November 15, 2013 9:43:17 AM UTC-5, Robin Becker wrote:
> Things went wrong when utf8 was not adopted as the standard encoding thus
> requiring two string types, it would have been easier to have a len function to
> count bytes as before and a glyphlen to count glyphs. Now as I understand it we
> have a complicated mess under the hood for unicode objects so they have a
> variable representation to approximate an 8 bit representation when suitable etc
> etc etc.
>

Dealing with bytes and Unicode is complicated, and the 2->3 transition is not easy, but let's please not spread the misunderstanding that somehow the Flexible String Representation is at fault. However you store Unicode code points, they are different than bytes, and it is complex having to deal with both. You can't somehow make the dichotomy go away, you can only choose where you want to think about it.

--Ned.

> --
> Robin Becker

Chris Angelico

Nov 15, 2013, 10:08:28 AM
to pytho...@python.org
On Sat, Nov 16, 2013 at 1:43 AM, Robin Becker <ro...@reportlab.com> wrote:
> ..........
>
>> I'm still stuck on Python 2, and while I can understand the controversy
>> ("It breaks my Python 2 code!"), this seems like the right thing to have
>> done. In Python 2, unicode is an add-on. One of the big design drivers in
>> Python 3 was to make unicode the standard.
>>
>> The idea behind repr() is to provide a "just plain text" representation of
>> an object. In P2, "just plain text" means ascii, so escaping non-ascii
>> characters makes sense. In P3, "just plain text" means unicode, so escaping
>> non-ascii characters no longer makes sense.
>>
>
> unfortunately the word 'printable' got into the definition of repr; it's
> clear that printability is not the same as unicode at least as far as the
> print function is concerned. In my opinion it would have been better to
> leave the old behaviour as that would have eased the compatibility.

"Printable" means many different things in different contexts. In some
contexts, the sequence \x66\x75\x63\x6b is considered unprintable, yet
each of those characters is perfectly displayable in its natural form.
Under IDLE, non-BMP characters can't be displayed (or at least, that's
how it has been; I haven't checked current status on that one). On
Windows, the console runs in codepage 437 by default (again, I may be
wrong here), so anything not representable in that has to be escaped.
My Linux box has its console set to full Unicode, everything working
perfectly, so any non-control character can be printed. As far as
Python's concerned, all of that is outside - something is "printable"
if it's printable within Unicode, and the other hassles are matters of
encoding. (Except the first one. I don't think there's an encoding
"g-rated".)

> The python gods don't count that sort of thing as important enough so we get
> the mess that is the python2/3 split. ReportLab has to do both so it's a
> real issue; in addition swapping the str - unicode pair to bytes str doesn't
> help one's mental models either :(

That's fixing, in effect, a long-standing bug - of a sort. The name
"str" needs to be applied to the most normal string type. As of Python
3, that's a Unicode string, which is as it should be. In Python 2, it
was the ASCII/bytes string, which still fit the description of "most
normal string type", but that means that Python 2 programs are
Unicode-unaware by default, which is a flaw. Hence the Py3 fix.

> Things went wrong when utf8 was not adopted as the standard encoding thus
> requiring two string types, it would have been easier to have a len function
> to count bytes as before and a glyphlen to count glyphs. Now as I understand
> it we have a complicated mess under the hood for unicode objects so they
> have a variable representation to approximate an 8 bit representation when
> suitable etc etc etc.

http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/

There are languages that do what you describe. It's very VERY easy to
break stuff. What happens when you slice a string?

>>> foo = "asdf"
>>> foo[:2],foo[2:]
('as', 'df')

>>> foo = "q\u1234zy"
>>> foo[:2],foo[2:]
('qሴ', 'zy')

Looks good to me. I split a four-character string, I get two
one-character strings. If that had been done in UTF-8, either I would
need to know "don't split at that boundary, that's between bytes in a
character", or else the indexing and slicing would have to be done by
counting characters from the beginning of the string - an O(n)
operation, rather than an O(1) pointer arithmetic, not to mention that
it'll blow your CPU cache (touching every part of a potentially-long
string) just to find the position.
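To make the boundary problem concrete with the same string (Python 3):

>>> raw = "q\u1234zy".encode('utf-8')
>>> len("q\u1234zy"), len(raw)
(4, 6)
>>> raw[:2].decode('utf-8')   # byte 2 lands mid-character -> UnicodeDecodeError

Four characters, six bytes, and the obvious byte slice isn't even valid UTF-8.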

The only reliable way to manage things is to work with true Unicode.
You can completely ignore the internal CPython representation; what
matters is that Python (any implementation, as long as it conforms
with version 3.3 or later) lets you index Unicode codepoints out of a
Unicode string, without differentiating between those that happen to
be ASCII, those that fit in a single byte, those that fit in two
bytes, and those that are flagged RTL, because none of those
considerations makes any difference to you.

It takes some getting your head around, but it's worth it - same as
using git instead of a Windows shared drive. (I'm still trying to push
my family to think git.)

ChrisA

Robin Becker

Nov 15, 2013, 10:18:02 AM
to pytho...@python.org, Roy Smith
On 15/11/2013 15:07, Joel Goldstick wrote:
........



>
> Cool, someone here is older than me! I came in with the 8080, and I
> remember split octal, but sixes are something I missed out on.

The pdp 10/15 had 18 bit words and could be organized as 3*6 or 2*9; pdp 8s had
12 bits I think; then came the IBM 7094, which had 36 bits, and finally the
CDC6000 & 7600 machines with 60 bits. Someone must have liked 6's
-mumbling-ly yrs-
Robin Becker

Roy Smith

Nov 15, 2013, 10:32:54 AM
to Robin Becker, pytho...@python.org
On Nov 15, 2013, at 10:18 AM, Robin Becker wrote:

The pdp 10/15 had 18 bit words and could be organized as 3*6 or 2*9

I don't know about the 15, but the 10 had 36 bit words (18-bit halfwords).  One common character packing was 5 7-bit characters per 36 bit word (with the sign bit left over).

Anybody remember RAD-50?  It let you represent a 6-character filename (plus a 3-character extension) in a 16 bit word.  RT-11 used it, not sure if it showed up anywhere else.

---
Roy Smith

Robin Becker

Nov 15, 2013, 10:39:04 AM
to pytho...@python.org
.........
>
> Dealing with bytes and Unicode is complicated, and the 2->3 transition is not easy, but let's please not spread the misunderstanding that somehow the Flexible String Representation is at fault. However you store Unicode code points, they are different than bytes, and it is complex having to deal with both. You can't somehow make the dichotomy go away, you can only choose where you want to think about it.
>
> --Ned.
.......
I don't think that's what I said; the flexible representation is just an added
complexity that has come about because of the wish to store strings in a
compact way. What made such complexity necessary is the unicode type itself
(especially its storage requirements), which called for some remedial action.

There's no point in fighting the change to using unicode. The type wasn't
required for any technical reason as other languages didn't go this route and
are reasonably ok, but there's no doubt the change made things more difficult.
--
Robin Becker

Antoon Pardon

Nov 15, 2013, 10:49:26 AM
to pytho...@python.org
On 15-11-13 16:39, Robin Becker wrote:
> .........
>>
>> Dealing with bytes and Unicode is complicated, and the 2->3 transition
>> is not easy, but let's please not spread the misunderstanding that
>> somehow the Flexible String Representation is at fault. However you
>> store Unicode code points, they are different than bytes, and it is
>> complex having to deal with both. You can't somehow make the
>> dichotomy go away, you can only choose where you want to think about it.
>>
>> --Ned.
> .......
> I don't think that's what I said; the flexible representation is just an
> added complexity ...

No it is not, at least not for python programmers. (It of course is for
the python implementors.) The python programmer doesn't have to care
about the flexible representation, just as the python programmer doesn't
have to care about the internal representation of (long) integers. It
is an implementation detail that is mostly ignorable.
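If you do want to peek at the detail, it only ever shows up in memory use, never in semantics. A small sketch (exact sizes vary by build, so treat the numbers as illustrative):

import sys

# same length, different storage width under PEP 393 (CPython 3.3+)
for s in ('a' * 10, '\xe5' * 10, '\u1234' * 10, '\U00012345' * 10):
    print(len(s), sys.getsizeof(s))   # len() is 10 every time; only the size grows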

--
Antoon Pardon

Chris Angelico

Nov 15, 2013, 11:01:37 AM
to pytho...@python.org
On Sat, Nov 16, 2013 at 2:39 AM, Robin Becker <ro...@reportlab.com> wrote:
>> Dealing with bytes and Unicode is complicated, and the 2->3 transition is
>> not easy, but let's please not spread the misunderstanding that somehow the
>> Flexible String Representation is at fault. However you store Unicode code
>> points, they are different than bytes, and it is complex having to deal with
>> both. You can't somehow make the dichotomy go away, you can only choose
>> where you want to think about it.
>>
>> --Ned.
>
> .......
> I don't think that's what I said; the flexible representation is just an
> added complexity that has come about because of the wish to store strings in
> a compact way. The requirement for such complexity is the unicode type
> itself (especially the storage requirements) which necessitated some
> remedial action.
>
> There's no point in fighting the change to using unicode. The type wasn't
> required for any technical reason as other languages didn't go this route
> and are reasonably ok, but there's no doubt the change made things more
> difficult.

There's no perceptible difference between a 3.2 wide build and the 3.3
flexible representation. (Differences with narrow builds are bugs, and
have now been fixed.) As far as your script's concerned, Python 3.3
always stores strings in UTF-32, four bytes per character. It just
happens to be way more efficient on memory, most of the time.

Other languages _have_ gone for at least some sort of Unicode support.
Unfortunately quite a few have done a half-way job and use UTF-16 as
their internal representation. That means there's no difference
between U+0012, U+0123, and U+1234, but U+12345 suddenly gets handled
differently. ECMAScript actually specifies the perverse behaviour of
treating codepoints >U+FFFF as two elements in a string, because it's
just too costly to change.
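Python 3.3 gives you the codepoint view regardless of storage; compare what a UTF-16-based language has to count (a quick sketch):

>>> s = '\U00012345'
>>> len(s)                            # one codepoint
1
>>> len(s.encode('utf-16-le')) // 2   # but two UTF-16 code units
2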

There are a small number of languages that guarantee correct Unicode
handling. I believe bash scripts get this right (though I haven't
tested; string manipulation in bash isn't nearly as rich as a proper
text parsing language, so I don't dig into it much); Pike is a very
Python-like language, and PEP 393 made Python even more Pike-like,
because Pike's string has been variable width for as long as I've
known it. A handful of other languages also guarantee UTF-32
semantics. All of them are really easy to work with; instead of
writing your code and then going "Oh, I wonder what'll happen if I
give this thing weird characters?", you just write your code, safe in
the knowledge that there is no such thing as a "weird character"
(except for a few in the ASCII set... you may find that code breaks if
given a newline in the middle of something, or maybe the slash
confuses you).

Definitely don't fight the change to Unicode, because it's not a
change at all... it's just fixing what was buggy. You already had a
difference between bytes and characters, you just thought you could
ignore it.

ChrisA

William Ray Wing

Nov 15, 2013, 11:30:32 AM
to pytho...@python.org, William Ray Wing

Yes, the PDP-8s, LINC-8s, and PDP-12s were all 12-bit computers. However the LINC-8 operated with word-pairs (instruction in one location followed by address to be operated on in the next) so it was effectively a 24-bit computer and the PDP-12 was able to execute BOTH PDP-8 and LINC-8 instructions (it added one extra instruction to each set that flipped the mode).

First assembly language program I ever wrote was on a PDP-12. (If there is an emoticon for a face with a gray beard, I don't know it.)

-Bill

Gene Heskett

Nov 15, 2013, 11:36:37 AM
to pytho...@python.org
On Friday 15 November 2013 11:28:19 Joel Goldstick did opine:

> On Fri, Nov 15, 2013 at 10:03 AM, Robin Becker <ro...@reportlab.com>
wrote:
> > ...........
> >
> >>> became popular.
> >>
> >> Really? you cried and laughed over 7 vs. 8 bits? That's lovely (?).
> >> ;). That eighth bit sure was less confusing than codepoint
> >> translations
> >
> > no we had 6 bits in 60 bit words as I recall; extracting the nth
> > character involved division by 6; smart people did tricks with
> > inverted multiplications etc etc :(
> > --
>
> Cool, someone here is older than me! I came in with the 8080, and I
> remember split octal, but sixes are something I missed out on.

Ok, if you are feeling old & decrepit, how's this for a birthday: 10/04/34,
I came into micro computers about RCA 1802 time. Wrote a program for the
1802 without an assembler, for tape editing in '78 at KRCR-TV in Redding
CA, that was still in use in '94, but never really wrote assembly code
until the 6809 was out in the Radio Shack Color Computers. os9 on the
coco's was the best teacher about the unix way of doing things there ever
was. So I tell folks these days that I am 39, with 40 years experience at
being 39. ;-)

> > Robin Becker


Cheers, Gene
--
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)

Counting in binary is just like counting in decimal -- if you are all
thumbs.
-- Glaser and Way
A pen in the hand of this president is far more
dangerous than 200 million guns in the hands of
law-abiding citizens.

Zero Piraeus

Nov 15, 2013, 12:06:25 PM
to pytho...@python.org
:

On Fri, Nov 15, 2013 at 10:32:54AM -0500, Roy Smith wrote:
> Anybody remember RAD-50? It let you represent a 6-character filename
> (plus a 3-character extension) in a 16 bit word. RT-11 used it, not
> sure if it showed up anywhere else.

Presumably 16 is a typo, but I just had a moderate amount of fun
envisaging how that might work: if the characters were restricted to
vowels, then 5**6 < 2**14, giving a couple of bits left over for a
choice of four preset "three-character" extensions.

I can't say that AEIOUA.EX1 looks particularly appealing, though ...

-[]z.

--
Zero Piraeus: pollice verso
http://etiol.net/pubkey.asc

Steven D'Aprano

Nov 15, 2013, 12:10:45 PM
to
On Fri, 15 Nov 2013 14:43:17 +0000, Robin Becker wrote:

> Things went wrong when utf8 was not adopted as the standard encoding
> thus requiring two string types, it would have been easier to have a len
> function to count bytes as before and a glyphlen to count glyphs. Now as
> I understand it we have a complicated mess under the hood for unicode
> objects so they have a variable representation to approximate an 8 bit
> representation when suitable etc etc etc.

No no no! Glyphs are *pictures*, you know the little blocks of pixels
that you see on your monitor or printed on a page. Before you can count
glyphs in a string, you need to know which typeface ("font") is being
used, since fonts generally lack glyphs for some code points.

[Aside: there's another complication. Some fonts define alternate glyphs
for the same code point, so that the design of (say) the letter "a" may
vary within the one string according to whatever typographical rules the
font supports and the application calls for. So the question is, when you
"count glyphs", should you count "a" and "alternate a" as a single glyph
or two?]

You don't actually mean count glyphs, you mean counting code points
(think characters, only with some complications that aren't important for
the purposes of this discussion).
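Even without leaving Python you can see the code point/glyph gap with combining characters (a sketch):

py> import unicodedata
py> s = 'e\u0301'   # 'e' plus COMBINING ACUTE ACCENT: one glyph on screen
py> len(s)          # but two code points
2
py> len(unicodedata.normalize('NFC', s))   # composed into a single 'é'
1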

UTF-8 is utterly unsuited for in-memory storage of text strings, I don't
care how many languages (Go, Haskell?) make that mistake. When you're
dealing with text strings, the fundamental unit is the character, not the
byte. Why do you care how many bytes a text string has? If you really
need to know how much memory an object is using, that's where you use
sys.getsizeof(), not len().

We don't say len({42: None}) to discover that the dict requires 136
bytes, why would you use len("heåvy") to learn that it uses 23 bytes?

UTF-8 is variable width encoding, which means it's *rubbish* for the in-
memory representation of strings. Counting characters is slow. Slicing is
slow. If you have mutable strings, deleting or inserting characters is
slow. Every operation has to effectively start at the beginning of the
string and count forward, lest it split bytes in the middle of a UTF
unit. Or worse, the language doesn't give you any protection from this at
all, so rather than slow string routines you have unsafe string routines,
and it's your responsibility to detect UTF boundaries yourself.

In case you aren't familiar with what I'm talking about, here's an
example using Python 3.2, starting with a Unicode string and treating it
as UTF-8 bytes:

py> u = "heåvy"
py> s = u.encode('utf-8')
py> for c in s:
...     print(chr(c))
...
h
e
Ã
¥
v
y


"Ã¥"? It didn't take long to get moji-bake in our output, and all I did
was print the (byte) string one "character" at a time. It gets worse: we
can easily end up with invalid UTF-8:

py> a, b = s[:len(s)//2], s[len(s)//2:]  # split the string in half
py> a.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 2: unexpected end of data
py> b.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte


No, UTF-8 is okay for writing to files, but it's not suitable for text
strings. The in-memory representation of text strings should be constant
width, based on characters not bytes, and should prevent the caller from
accidentally ending up with mojibake or invalid strings.


--
Steven

Chris Angelico

Nov 15, 2013, 12:11:47 PM
to pytho...@python.org
On Sat, Nov 16, 2013 at 4:06 AM, Zero Piraeus <z...@etiol.net> wrote:
> :
>
> On Fri, Nov 15, 2013 at 10:32:54AM -0500, Roy Smith wrote:
>> Anybody remember RAD-50? It let you represent a 6-character filename
>> (plus a 3-character extension) in a 16 bit word. RT-11 used it, not
>> sure if it showed up anywhere else.
>
> Presumably 16 is a typo, but I just had a moderate amount of fun
> envisaging how that might work: if the characters were restricted to
> vowels, then 5**6 < 2**14, giving a couple of bits left over for a
> choice of four preset "three-character" extensions.
>
> I can't say that AEIOUA.EX1 looks particularly appealing, though ...

Looks like it might be this scheme:

https://en.wikipedia.org/wiki/DEC_Radix-50

36-bit word for a 6-char filename, but there was also a 16-bit
variant. I do like that filename scheme you describe, though it would
tend to produce names that would suit virulent diseases.

ChrisA

Chris Angelico

Nov 15, 2013, 12:29:11 PM
to pytho...@python.org
On Sat, Nov 16, 2013 at 4:10 AM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> No, UTF-8 is okay for writing to files, but it's not suitable for text
> strings.

Correction: It's _great_ for writing to files (and other fundamentally
byte-oriented streams, like network connections). Does a superb job as
the default encoding for all sorts of situations. But, as you say, it
sucks if you want to find the Nth character.

ChrisA

Serhiy Storchaka

Nov 15, 2013, 12:37:02 PM
to pytho...@python.org
On 15.11.13 17:32, Roy Smith wrote:
> Anybody remember RAD-50? It let you represent a 6-character filename
> (plus a 3-character extension) in a 16 bit word. RT-11 used it, not
> sure if it showed up anywhere else.

In three 16-bit words.
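A hypothetical sketch of the 16-bit packing (three characters from a 40-symbol set per word, per the Radix-50 page Chris linked; the exact charset ordering here is from memory, so treat it accordingly):

CHARSET = ' ABCDEFGHIJKLMNOPQRSTUVWXYZ$.%0123456789'   # 40 symbols

def rad50(three):
    # pack exactly three characters into one 16-bit word (max 39*1600+39*40+39 = 63999)
    a, b, c = (CHARSET.index(ch) for ch in three.upper().ljust(3))
    return (a * 40 + b) * 40 + c

words = [rad50('SWA'), rad50('P  '), rad50('SYS')]   # a 6.3 filename needs three words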


Cousin Stanley

Nov 15, 2013, 12:45:54 PM
to

> ....
> We don't say len({42: None}) to discover
> that the dict requires 136 bytes,
> why would you use len("heåvy")
> to learn that it uses 23 bytes ?
> ....

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
illustrate the difference in length of python objects
and the size of their system storage
"""

import sys

s = "heåvy"

d = { 42 : None }

print
print ' s : %s' % s
print ' len( s ) : %d' % len( s )
print ' sys.getsizeof( s ) : %s ' % sys.getsizeof( s )
print
print
print ' d : ' , d
print ' len( d ) : %d' % len( d )
print ' sys.getsizeof( d ) : %d ' % sys.getsizeof( d )


--
Stanley C. Kitching
Human Being
Phoenix, Arizona

Neil Cerutti

Nov 15, 2013, 12:47:01 PM
to
On 2013-11-15, Chris Angelico <ros...@gmail.com> wrote:
> Other languages _have_ gone for at least some sort of Unicode
> support. Unfortunately quite a few have done a half-way job and
> use UTF-16 as their internal representation. That means there's
> no difference between U+0012, U+0123, and U+1234, but U+12345
> suddenly gets handled differently. ECMAScript actually
> specifies the perverse behaviour of treating codepoints >U+FFFF
> as two elements in a string, because it's just too costly to
> change.

The unicode support I'm learning in Go is, "Everything is utf-8,
right? RIGHT?!?" It also has the interesting behavior that
indexing strings retrieves bytes, while iterating over them
results in a sequence of runes.

It comes with support for no encodings save utf-8 (natively) and
utf-16 (if you work at it). Is that really enough?

--
Neil Cerutti

Mark Lawrence

Nov 15, 2013, 12:58:43 PM
to pytho...@python.org
I also used the RCA 1802, but did you use the Ferranti F100L? The rationale
for the use of both: in the mid/late 70s they were the only processors of
their respective type with military approvals.

Can't remember how we coded on the F100L, but the 1802 work was done on
the Texas Instruments Silent 700, copying from one cassette tape to
another. Set the controls wrong when copying and whoops, you've just
overwritten the work you've just done. We could have had a decent
development environment but it was on a UK MOD cost plus project, so the
more inefficiently you worked, the more profit your employer made.

--
Python is the second best programming language in the world.
But the best has yet to be invented. Christian Tismer

Mark Lawrence

Gene Heskett

Nov 15, 2013, 2:23:49 PM
to pytho...@python.org
On Friday 15 November 2013 13:52:40 Mark Lawrence did opine:
BTDT but in 1959-60 era. Testing the ullage pressure regulators for the
early birds, including some that gave John Glenn his first ride or 2. I
don't recall the brand of paper tape recorders, but they used 12at7's &
12au7's by the grocery sack full. One or more got noisy & me being the
budding C.E.T. that I now am, of course ran down the bad ones and requested
new ones. But you had to turn in the old ones, which Stellardyne Labs
simply recycled back to you the next time you needed a few. Hopeless
management IMO, but that's cost plus for you.

At 10k$ a truckload for helium back then, each test lost about $3k worth of
helium because the recycle catcher tank was so thin walled. And the 6
stage cardox re-compressor was so leaky, occasionally blowing up a pipe out
of the last stage that put about 7800 lbs back in the monel tanks.

I considered that a huge waste compared to the cost of a 12au7, then about
$1.35, and raised hell, so I got fired. They simply did not care that a
perfectly good regulator was being abused to death when it took 10 or more
test runs to get one good recording for the certification. At those
operating pressures, the valve faces erode just like the seats in your
shower faucets do in 20 years. Ten such runs and you may as well bin it,
but they didn't.

I am amazed that as many of those birds worked as did. Of course if it
wasn't manned, they didn't talk about the roman candles on the launch pads.
I heard one story that they had to regrade one pad's real estate at
Vandenberg & start all over; seems some ID10T had left the cable to the
explosive bolts hanging on the cable tower. Ooops, and there's no off
switch in many of those once the umbilical has been dropped.

Cheers, Gene
--
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)

Tehee quod she, and clapte the wyndow to.
-- Geoffrey Chaucer

Steven D'Aprano

Nov 15, 2013, 8:09:28 PM
to
On Fri, 15 Nov 2013 17:47:01 +0000, Neil Cerutti wrote:

> The unicode support I'm learning in Go is, "Everything is utf-8, right?
> RIGHT?!?" It also has the interesting behavior that indexing strings
> retrieves bytes, while iterating over them results in a sequence of
> runes.
>
> It comes with support for no encodings save utf-8 (natively) and utf-16
> (if you work at it). Is that really enough?

Only if you never need to handle data created by other applications.



--
Steven