
function inclusion problem


vlya...@gmail.com

unread,
Feb 10, 2015, 6:38:12 PM2/10/15
to
I defined function Fatalln in "mydef.py" and it works fine if i call it from "mydef.py", but when i try to call it from "test.py" in the same folder:
import mydef
...
Fatalln "my test"
i have NameError: name 'Fatalln' is not defined
I also tried include('mydef.py') with the same result...
What is the right syntax?
Thanks

sohca...@gmail.com

unread,
Feb 10, 2015, 6:55:48 PM2/10/15
to
It would help us help you a lot if you copy/paste your code from both mydef.py and test.py so we can see exactly what you're doing.

Don't re-type what you entered, because people (especially new programmers) are prone to either making typos or leaving out certain things because they don't think they're important. Copy/Paste the code from the two files and then copy/paste the error you're getting.

Steven D'Aprano

unread,
Feb 10, 2015, 6:58:03 PM2/10/15
to
Preferred:

import mydef
mydef.Fatalln("my test")



Also acceptable:


from mydef import Fatalln
Fatalln("my test")





--
Steven

Michael Torrie

unread,
Feb 10, 2015, 7:00:37 PM2/10/15
to pytho...@python.org
Almost.

Try this:

mydef.Fatalln()

Unless you import the symbols from your mydef module into your program
they have to be referenced by the module name. This is a good thing and it
helps keep your code separated and clean. It is possible to import
individual symbols from a module like this:

from mydef import Fatalln

Avoid the temptation to import *all* symbols from a module into the
current program's namespace. Better to type out the extra bit.
Alternatively you can alias imports like this

import somemodule.submodule as foo

Frequently this idiom is used when working with numpy to save a bit of
time, while preserving the separate namespaces.
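For instance, the usual numpy shorthand looks like this (a minimal sketch, assuming numpy is installed; np and la are just the conventional short names):

import numpy as np            # alias the whole module to a short name
import numpy.linalg as la     # or alias just a submodule

a = np.array([[2.0, 0.0], [0.0, 3.0]])
print(np.mean(a))             # members are still reached through the (aliased) module name
print(la.det(a))              # 6.0 -- the submodule's members likewise stay namespaced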

sohca...@gmail.com

unread,
Feb 10, 2015, 7:03:06 PM2/10/15
to
On Tuesday, February 10, 2015 at 3:38:12 PM UTC-8, vlya...@gmail.com wrote:
If you only do `import mydef`, then it creates a module object called `mydef` which contains all the global members in mydef.py. When you want to call a function from that module, you need to specify that you're calling a function from that module by putting the module name followed by a period, then the function. For example:

mydef.Fatalln("my test")

If you wanted to be able to call Fatalln without using the module name, you could import just the Fatalln function:

from mydef import Fatalln
Fatalln("my test")

If you had a lot of functions in mydef.py and wanted to be able to access them all without that pesky module name, you could also do:

from mydef import *

However, this can often be considered a bad practice as you're polluting your global name space, though can be acceptable in specific scenarios.

For more information, check https://docs.python.org/3/tutorial/modules.html
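For the original question, mydef.py just needs the function defined at module level. A minimal sketch (the body of Fatalln is an assumption here, since the original code was never posted):

# mydef.py -- assumed contents for illustration
import sys

def Fatalln(msg):
    # print the message and stop; assumed behaviour, the real Fatalln wasn't posted
    print("FATAL:", msg)
    sys.exit(1)

test.py in the same folder can then call it either as mydef.Fatalln("my test") after `import mydef`, or directly as Fatalln("my test") after `from mydef import Fatalln`.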

Ian Kelly

unread,
Feb 10, 2015, 7:03:36 PM2/10/15
to Python
import mydef
mydef.Fatalln("my test")

or

from mydef import Fatalln
Fatalln("my test")

Laura Creighton

unread,
Feb 10, 2015, 7:06:55 PM2/10/15
to vlya...@gmail.com, pytho...@python.org, l...@openend.se
>--
>https://mail.python.org/mailman/listinfo/python-list

from mydef import Fatalln

Laura Creighton

unread,
Feb 10, 2015, 7:17:00 PM2/10/15
to Laura Creighton, pytho...@python.org, vlya...@gmail.com, l...@openend.se
In a message of Wed, 11 Feb 2015 01:06:00 +0100, Laura Creighton writes:
>In a message of Tue, 10 Feb 2015 15:38:02 -0800, vlya...@gmail.com writes:
>>--
>>https://mail.python.org/mailman/listinfo/python-list
>
>from mydef import Fatalln
>

Also, please be warned. If you use a unix system, or a linux
system, there are lots of problems you can get into if you
expect something named 'test' to run your code, because they
already have one in their shell, and that one wins, and so ...
well, test.py is safe. But if you drop the extension and call
the script just test ...

Bad and unexpected things happen.

Name it 'testme' or something like that. Never have that problem again.
:)

Been there, done that!
Laura

Message has been deleted

Victor L

unread,
Feb 11, 2015, 8:28:24 AM2/11/15
to Laura Creighton, pytho...@python.org
Laura, thanks for the answer - it works. Is there some equivalent of "include" to expose every function in that script?
Thanks again,
-V

On Tue, Feb 10, 2015 at 7:16 PM, Laura Creighton <l...@openend.se> wrote:
In a message of Wed, 11 Feb 2015 01:06:00 +0100, Laura Creighton writes:
>In a message of Tue, 10 Feb 2015 15:38:02 -0800, vlya...@gmail.com writes:

Dave Angel

unread,
Feb 11, 2015, 10:07:59 AM2/11/15
to pytho...@python.org
On 02/11/2015 08:27 AM, Victor L wrote:
> Laura, thanks for the answer - it works. Is there some equivalent of
> "include" to expose every function in that script?
> Thanks again,
> -V
>
Please don't top-post, and please use text email, not html. Thank you.

yes, as sohca...@gmail.com pointed out, you can do

from mydef import *

But this is nearly always bad practice. If there are only a few
functions you need access to, you should do

from mydef import Fatalln, func2, func42

and if there are tons of them, you do NOT want to pollute your local
namespace with them, and should do:

import mydef

x = mydef.func2() # or whatever

The assumption is that when the code is in an importable module, it'll
be maintained somewhat independently of the calling script/module. For
example, if you're using a third party library, it could be updated
without your having to rewrite your own calling code.

So what happens if the 3rd party adds a new function, and you happen to
have one by the same name? If you used the import* semantics, you could
suddenly have broken code, with the complaint "But I didn't change a thing."

Similarly, if you import from more than one module, and use the import*
form, they could conflict with each other. And the order of importing
will (usually) determine which names override which ones.
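A tiny, concrete illustration of that ordering effect, using two standard-library modules that happen to export the same name:

from math import *    # provides a sqrt() that works on floats
from cmath import *   # also provides sqrt(); being imported later, it silently wins

print(sqrt(-1))       # 1j -- cmath's version; math's sqrt would have raised ValueError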

The time that it's reasonable to use import* is when the third-party
library already recommends it. They should only do so if they have
written their library to only expose a careful subset of the names
declared, and documented all of them. And when they make new releases,
they're careful to hide any new symbols unless carefully documented in
the release notes, so you can manually check for interference.

Incidentally, this is also true of the standard library. There are
symbols that are importable from multiple places, and sometimes they
have the same meanings, sometimes they don't. An example (in Python
2.7) of the latter is os.walk and os.path.walk

When I want to use one of those functions, I spell it out:
for dirpath, dirnames, filenames in os.walk(topname):

That way, there's no doubt in the reader's mind which one I intended.

--
DaveA

Tim Chase

unread,
Feb 11, 2015, 10:36:54 AM2/11/15
to pytho...@python.org
On 2015-02-11 10:07, Dave Angel wrote:
> if there are tons of them, you do NOT want to pollute your local
> namespace with them, and should do:
>
> import mydef
>
> x = mydef.func2() # or whatever

or, if that's verbose, you can give a shorter alias:

import Tkinter as tk
root = tk.Tk()
root.mainloop()

-tkc




Chris Angelico

unread,
Feb 11, 2015, 10:37:31 AM2/11/15
to pytho...@python.org
On Thu, Feb 12, 2015 at 2:07 AM, Dave Angel <d...@davea.name> wrote:
> Similarly, if you import from more than one module, and use the import*
> form, they could conflict with each other. And the order of importing will
> (usually) determine which names override which ones.

Never mind about conflicts and order of importing... just try figuring
out code like this:

from os import *
from sys import *
from math import *

# Calculate the total size of all files in a directory
tot = 0
for path, dirs, files in walk(argv[1]):
    # We don't need to sum the directories separately
    for f in files:
        # getsizeof() returns a value in bytes
        tot += getsizeof(f)/1024.0/1024.0

print("Total directory size:", floor(tot), "MB")

Now, I'm sure some of the experienced Python programmers here can see
exactly what's wrong. But can everyone? I doubt it. Even if you run
it, I doubt you'd get any better clue. But if you could see which
module everything was imported from, it'd be pretty obvious.
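For comparison, here is a sketch of the same logic with explicit imports, which makes the bug visible: the wildcard version silently picks up sys.getsizeof (the in-memory size of the filename string) where something like os.path.getsize was presumably intended.

import os
import sys
import math

# Calculate the total size of all files in a directory
tot = 0
for path, dirs, files in os.walk(sys.argv[1]):
    for f in files:
        # os.path.getsize() returns the file's on-disk size in bytes
        tot += os.path.getsize(os.path.join(path, f)) / 1024.0 / 1024.0

print("Total directory size:", math.floor(tot), "MB")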

ChrisA

Laura Creighton

unread,
Feb 24, 2015, 2:58:21 PM2/24/15
to Laura Creighton, pytho...@python.org, l...@openend.se
Dave Angel
are you another Native English speaker living in a world where ASCII
is enough?

Laura

Dave Angel

unread,
Feb 24, 2015, 3:42:09 PM2/24/15
to pytho...@python.org
I'm a native English speaker, and 7 bits is not nearly enough. Even if
I didn't currently care, I have some history:

No. CDC display code is enough. Who needs lowercase?

No. Baudot code is enough.

No, EBCDIC is good enough. Who cares about other companies.

No, the "golf-ball" only holds this many characters. If we need more,
we can just get the operator to switch balls in the middle of printing.

No. 2 digit years is enough. This world won't last till the millennium
anyway.

No. 2k is all the EPROM you can have. Your code HAS to fit in it, and
only 1.5k RAM.

No. 640k is more than anyone could need.

No, you cannot use a punch card made on a model 26 keypunch in the same
deck as one made on a model 29. Too bad, many of the codes are
different. (This one cost me travel back and forth between two
different locations with different model keypunches)

No. 8 bits is as much as we could ever use for characters. Who could
possibly need names or locations outside of this region? Or from
multiple places within it?

35 years ago I helped design a serial terminal that "spoke" Chinese,
using a two-byte encoding. But a single worldwide standard didn't come
until much later, and I cheered Unicode when it was finally unveiled.

I've worked with many printers that could only print 70 or 80 unique
characters. The laser printer, and even the matrix printer are
relatively recent inventions.

Getting back on topic:

According to:
http://support.esri.com/cn/knowledgebase/techarticles/detail/27345

"""ArcGIS Desktop applications, such as ArcMap, are Unicode based, so
they support Unicode to a certain level. The level of Unicode support
depends on the data format."""

That page was written about 2004, so there was concern even then.

And according to another, """In the header of each shapefile (.DBF), a
reference to a code page is included."""

--
DaveA

Steven D'Aprano

unread,
Feb 24, 2015, 8:19:58 PM2/24/15
to
ASCII was never enough. Not even for Americans, who couldn't write things
like "I bought a comic book for 10¢ yesterday", let alone interesting
things from maths and science.

I missed the whole 7-bit ASCII period, my first computer (Mac 128K) already
had an extended character set beyond ASCII. But even that never covered the
full range of characters I wanted to write, and then there was the horrible
mess that you got whenever you copied text files from a Mac to a DOS or
Windows PC or visa versa. Yes, even in 1984 we were transferring files and
running into encoding issues.



--
Steven

Marcos Almeida Azevedo

unread,
Feb 24, 2015, 11:54:50 PM2/24/15
to Steven D'Aprano, pytho...@python.org
On Wed, Feb 25, 2015 at 9:19 AM, Steven D'Aprano <steve+comp....@pearwood.info> wrote:
Laura Creighton wrote:

> Dave Angel
> are you another Native English speaker living in a world where ASCII
> is enough?

ASCII was never enough. Not even for Americans, who couldn't write things
like "I bought a comic book for 10¢ yesterday", let alone interesting
things from maths and science.


ASCII was a necessity back then because RAM and storage were too small.
 
I missed the whole 7-bit ASCII period, my first computer (Mac 128K) already
had an extended character set beyond ASCII. But even that never covered the

I miss the days when I was coding with my XT computer (640kb RAM) too.  Things were so simple back then.
 
full range of characters I wanted to write, and then there was the horrible
mess that you got whenever you copied text files from a Mac to a DOS or
Windows PC or visa versa. Yes, even in 1984 we were transferring files and
running into encoding issues.



--
Marcos | I love PHP, Linux, and Java

Rustom Mody

unread,
Feb 26, 2015, 7:40:25 AM2/26/15
to
Wrote something up on why we should stop using ASCII:
http://blog.languager.org/2015/02/universal-unicode.html

(Yeah the world is a bit larger than a small bunch of islands off a half-continent.
But this is not that discussion!)

Rustom Mody

unread,
Feb 26, 2015, 8:15:56 AM2/26/15
to
Dave's list above of instances of 'poverty is a good idea' turning out stupid and narrow-minded in hindsight is neat. Thought I'd ack that explicitly.

Chris Angelico

unread,
Feb 26, 2015, 8:24:34 AM2/26/15
to pytho...@python.org
On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody <rusto...@gmail.com> wrote:
> Wrote something up on why we should stop using ASCII:
> http://blog.languager.org/2015/02/universal-unicode.html

From that post:

"""
5.1 Gibberish

When going from the original 2-byte unicode (around version 3?) to the
one having supplemental planes, the unicode consortium added blocks
such as

* Egyptian hieroglyphs
* Cuneiform
* Shavian
* Deseret
* Mahjong
* Klingon

To me (a layman) it looks unprofessional – as though they are playing
games – that billions of computing devices, each having billions of
storage words should have their storage wasted on blocks such as
these.
"""

The shift from Unicode as a 16-bit code to having multiple planes came
in with Unicode 2.0, but the various blocks were assigned separately:
* Egyptian hieroglyphs: Unicode 5.2
* Cuneiform: Unicode 5.0
* Shavian: Unicode 4.0
* Deseret: Unicode 3.1
* Mahjong Tiles: Unicode 5.1
* Klingon: Not part of any current standard

However, I don't think historians will appreciate you calling all of
these "gibberish". To adequately describe and discuss old texts
without these Unicode blocks, we'd have to either do everything with
images, or craft some kind of reversible transliteration system and
have dedicated software to render the texts on screen. Instead, what
we have is a well-known and standardized system for transliterating
all of these into numbers (code points), and rendering them becomes a
simple matter of installing an appropriate font.

Also, how does assigning meanings to codepoints "waste storage"? As
soon as Unicode 2.0 hit and 16-bit code units stopped being
sufficient, everyone needed to allocate storage - either 32 bits per
character, or some other system - and the fact that some codepoints
were unassigned had absolutely no impact on that. This is decidedly
NOT unprofessional, and it's not wasteful either.

ChrisA

Sam Raker

unread,
Feb 26, 2015, 11:46:11 AM2/26/15
to
I'm 100% in favor of expanding Unicode until the sun goes dark. Doing so helps solve the problems affecting speakers of "underserved" languages--access and language preservation. Speakers of Mongolian, Cherokee, Georgian, etc. all deserve to be able to interact with technology in their native languages as much as we speakers of ASCII-friendly languages do. Unicode support also makes writing papers on, dictionaries of, and new texts in such languages much easier, which helps the fight against language extinction, which is a sadly pressing issue.

Also, like, computers are big. Get an external drive for your high-resolution PDF collection of Medieval manuscripts if you feel like you're running out of space. A few extra codepoints aren't going to be the straw that breaks the camel's back.


On Thursday, February 26, 2015 at 8:24:34 AM UTC-5, Chris Angelico wrote:
> On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody <rusto...@gmail.com> wrote:
> > Wrote something up on why we should stop using ASCII:
> > http://blog.languager.org/2015/02/universal-unicode.html
>
> From that post:
>
> """
> 5.1 Gibberish
>
> When going from the original 2-byte unicode (around version 3?) to the
> one having supplemental planes, the unicode consortium added blocks
> such as
>
> * Egyptian hieroglyphs
> * Cuneiform
> * Shavian
> * Deseret
> * Mahjong
> * Klingon
>
> To me (a layman) it looks unprofessional - as though they are playing
> games - that billions of computing devices, each having billions of

Terry Reedy

unread,
Feb 26, 2015, 12:03:44 PM2/26/15
to pytho...@python.org
On 2/26/2015 8:24 AM, Chris Angelico wrote:
> On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody <rusto...@gmail.com> wrote:
>> Wrote something up on why we should stop using ASCII:
>> http://blog.languager.org/2015/02/universal-unicode.html

I think that the main point of the post, that many Unicode chars are
truly planetary rather than just national/regional, is excellent.

> From that post:
>
> """
> 5.1 Gibberish
>
> When going from the original 2-byte unicode (around version 3?) to the
> one having supplemental planes, the unicode consortium added blocks
> such as
>
> * Egyptian hieroglyphs
> * Cuneiform
> * Shavian
> * Deseret
> * Mahjong
> * Klingon
>
> To me (a layman) it looks unprofessional – as though they are playing
> games – that billions of computing devices, each having billions of
> storage words should have their storage wasted on blocks such as
> these.
> """
>
> The shift from Unicode as a 16-bit code to having multiple planes came
> in with Unicode 2.0, but the various blocks were assigned separately:
> * Egyptian hieroglyphs: Unicode 5.2
> * Cuneiform: Unicode 5.0
> * Shavian: Unicode 4.0
> * Deseret: Unicode 3.1
> * Mahjong Tiles: Unicode 5.1
> * Klingon: Not part of any current standard

You should add emoticons, but not call them or the above 'gibberish'.
I think that this part of your post is more 'unprofessional' than the
character blocks. It is very jarring and seems contrary to your main point.

> However, I don't think historians will appreciate you calling all of
> these "gibberish". To adequately describe and discuss old texts
> without these Unicode blocks, we'd have to either do everything with
> images, or craft some kind of reversible transliteration system and
> have dedicated software to render the texts on screen. Instead, what
> we have is a well-known and standardized system for transliterating
> all of these into numbers (code points), and rendering them becomes a
> simple matter of installing an appropriate font.
>
> Also, how does assigning meanings to codepoints "waste storage"? As
> soon as Unicode 2.0 hit and 16-bit code units stopped being
> sufficient, everyone needed to allocate storage - either 32 bits per
> character, or some other system - and the fact that some codepoints
> were unassigned had absolutely no impact on that. This is decidedly
> NOT unprofessional, and it's not wasteful either.

I agree.

--
Terry Jan Reedy


Rustom Mody

unread,
Feb 26, 2015, 12:08:37 PM2/26/15
to
On Thursday, February 26, 2015 at 10:16:11 PM UTC+5:30, Sam Raker wrote:
> I'm 100% in favor of expanding Unicode until the sun goes dark. Doing so helps solve the problems affecting speakers of "underserved" languages--access and language preservation. Speakers of Mongolian, Cherokee, Georgian, etc. all deserve to be able to interact with technology in their native languages as much as we speakers of ASCII-friendly languages do. Unicode support also makes writing papers on, dictionaries of, and new texts in such languages much easier, which helps the fight against language extinction, which is a sadly pressing issue.


Agreed -- Correcting the inequities caused by ASCII-bias is a good thing.

In fact the whole point of my post was to say just that by carving out and
focussing on a 'universal' subset of unicode that is considerably larger than
ASCII but smaller than unicode, we stand to reduce ASCII-bias.

As also other posts like
http://blog.languager.org/2014/04/unicoded-python.html
http://blog.languager.org/2014/05/unicode-in-haskell-source.html

However my example listed

> > * Egyptian hieroglyphs
> > * Cuneiform
> > * Shavian
> > * Deseret
> > * Mahjong
> > * Klingon

Ok Chris has corrected me re. Klingon-in-unicode. So let's drop that.
Of the others, which do you think is in the 'underserved' category?

More generally which of http://en.wikipedia.org/wiki/Plane_%28Unicode%29#Supplementary_Multilingual_Plane
are underserved?

Chris Angelico

unread,
Feb 26, 2015, 12:29:43 PM2/26/15
to pytho...@python.org
On Fri, Feb 27, 2015 at 4:02 AM, Terry Reedy <tjr...@udel.edu> wrote:
> On 2/26/2015 8:24 AM, Chris Angelico wrote:
>>
>> On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody <rusto...@gmail.com>
>> wrote:
>>>
>>> Wrote something up on why we should stop using ASCII:
>>> http://blog.languager.org/2015/02/universal-unicode.html
>
>
> I think that the main point of the post, that many Unicode chars are truly
> planetary rather than just national/regional, is excellent.

Agreed. Like you, though, I take exception at the "Gibberish" section.

Unicode offers us a number of types of character needed by linguists:

1) Letters[1] common to many languages, such as the unadorned Latin
and Cyrillic letters
2) Letters specific to one or very few languages, such as the Turkish dotless i
3) Diacritical marks, ready to be combined with various letters
4) Precomposed forms of various common "letter with diacritical" combinations
5) Other precomposed forms, eg ligatures and Hangul syllables
6) Symbols, punctuation, and various other marks
7) Spacing of various widths and attributes

Apart from #4 and #5, which could be avoided by using the decomposed
forms everywhere, each of these character types is vital. You can't
typeset a document without being able to adequately represent every
part of it. Then there are additional characters that aren't strictly
necessary, but are extremely convenient, such as the emoticon
sections. You can talk in text and still put in a nice little picture
of a globe, or the monkey-no-evil set, etc.

Most of these characters - in fact, all except #2 and maybe a few of
the diacritical marks - are used in multiple places/languages. Unicode
isn't about taking everyone's separate character sets and numbering
them all so we can reference characters from anywhere; if you wanted
that, you'd be much better off with something that lets you specify a
code page in 16 bits and a character in 8, which is roughly the same
size as Unicode anyway. What we have is, instead, a system that brings
them all together - LATIN SMALL LETTER A is U+0061 no matter whether
it's being used to write English, French, Malaysian, Turkish,
Croatian, Vietnamese, or Icelandic text. Unicode is truly planetary.

ChrisA

[1] I use the word "letter" loosely here; Chinese and Japanese don't
have a concept of letters as such, but their glyphs are still
represented.

Rustom Mody

unread,
Feb 26, 2015, 12:59:24 PM2/26/15
to
On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:
> On 2/26/2015 8:24 AM, Chris Angelico wrote:
Emoticons (or is it emoji) seems to have some (regional?) takeup?? Dunno…
In any case I'd like to stay clear of political(izable) questions


> I think that this part of your post is more 'unprofessional' than the
> character blocks. It is very jarring and seems contrary to your main point.

Ok I need a word for
1. I have no need for this
2. 99.9% of the (living) on this planet also have no need for this

>
> > However, I don't think historians will appreciate you calling all of
> > these "gibberish". To adequately describe and discuss old texts
> > without these Unicode blocks, we'd have to either do everything with
> > images, or craft some kind of reversible transliteration system and
> > have dedicated software to render the texts on screen. Instead, what
> > we have is a well-known and standardized system for transliterating
> > all of these into numbers (code points), and rendering them becomes a
> > simple matter of installing an appropriate font.
> >
> > Also, how does assigning meanings to codepoints "waste storage"? As
> > soon as Unicode 2.0 hit and 16-bit code units stopped being
> > sufficient, everyone needed to allocate storage - either 32 bits per
> > character, or some other system - and the fact that some codepoints
> > were unassigned had absolutely no impact on that. This is decidedly
> > NOT unprofessional, and it's not wasteful either.
>
> I agree.

I clearly am more enthusiastic than knowledgeable about unicode.
But I know my basic CS well enough (as I am sure you and Chris also do)

So I dont get how 4 bytes is not more expensive than 2.
Yeah I know you can squeeze a unicode char into 3 bytes or even 21 bits
You could use a clever representation like UTF-8 or FSR.
But I dont see how you can get out of this that full-unicode costs more than
exclusive BMP.

eg consider the case of 32 vs 64 bit executables.
The 64 bit executable is generally larger than the 32 bit one
Now consider the case of a machine that has say 2GB RAM and a 64-bit processor.
You could -- I think -- make a reasonable case that all those all-zero hi-address-words are 'waste'.

And you've got the general sense best so far:
> I think that the main point of the post, that many Unicode chars are
> truly planetary rather than just national/regional,

And if the general tone/tenor of what I have written is not getting
across because of some words (like 'gibberish'?), I'll try and reword.

However let me try and clarify that the whole of section 5 is 'iffy', with 5.1 being only more extreme. I've not written these in because the point of that
post is not to criticise unicode but to highlight the universal(isable) parts.

Still if I were to expand on the criticisms here are some examples:

Math-Greek: Consider the math-alpha block
http://en.wikipedia.org/wiki/Mathematical_operators_and_symbols_in_Unicode#Mathematical_Alphanumeric_Symbols_block

Now imagine a beginning student not getting the difference between font, glyph,
character. To me this block represents this same error cast into concrete and
dignified by the (supposed) authority of the unicode consortium.

There are probably dozens of other such stupidities like distinguishing kelvin K from latin K as if that is the business of the unicode consortium
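For what it's worth, the distinction is easy to poke at from Python: the math-alphanumeric letters are separate code points, yet compatibility normalization folds them straight back onto the plain letters:

import unicodedata as ud

bold_a = "\U0001D400"                       # MATHEMATICAL BOLD CAPITAL A
print(ud.name(bold_a))
print(ud.normalize("NFKC", bold_a) == "A")  # True -- compatibility-equivalent to a plain A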

My real reservations about unicode come from their work in areas that I happen to know something about

Music: To put music simply as a few mostly-meaningless 'dingbats' like ♩ ♪ ♫ is perhaps ok
However all this stuff http://xahlee.info/comp/unicode_music_symbols.html
makes no sense (to me) given that music (ie standard western music written in staff notation) is inherently 2 dimensional -- multi-voiced, multi-staff, chordal

Sanskrit/Devanagari:
Consists of bogus letters that dont exist in devanagari
The letter ऄ (0904) is found here http://unicode.org/charts/PDF/U0900.pdf
But not here http://en.wikipedia.org/wiki/Devanagari#Vowels
So I call it bogus-devanagari

Contrariwise an important letter in vedic pronunciation the double-udatta is missing
http://list.indology.info/pipermail/indology_list.indology.info/2000-April/021070.html

All of which adds up to the impression that the unicode consortium occasionally fails to do due diligence

In any case all of the above is contrary to /irrelevant to my post which is about
identifying the more universal parts of unicode

Rustom Mody

unread,
Feb 26, 2015, 2:59:38 PM2/26/15
to
On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:
> You should add emoticons, but not call them or the above 'gibberish'.

Done -- and of course not under gibberish.
I don't really know much about how emoji are used but I understand they are.
JFTR I consider it necessary to be respectful to all (living) people.
For that matter even dead people(s) - no need to be disrespectful to the egyptians who created the hieroglyphs or the sumerians who wrote cuneiform.

I only find it crosses a line when the creations of the two-millennia dead are made to
take the space of the living.

Chris wrote:
> * Klingon: Not part of any current standard

Thanks. Removed.

wxjm...@gmail.com

unread,
Feb 26, 2015, 3:20:55 PM2/26/15
to
On Thursday, February 26, 2015 at 6:59:24 PM UTC+1, Rustom Mody wrote:
>
> ...To me this block represents this same error cast into concrete and
> dignified by the (supposed) authority of the unicode consortium.
>

Unicode does not prescribe, it registers.

Eg. The inclusion of
U+1E9E, 'LATIN CAPITAL LETTER SHARP S'
has been officially proposed by the "German
Federal Government".
(I have a pdf copy somewhere).

Chris Angelico

unread,
Feb 26, 2015, 5:14:18 PM2/26/15
to pytho...@python.org
On Fri, Feb 27, 2015 at 4:59 AM, Rustom Mody <rusto...@gmail.com> wrote:
> On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:
>> I think that this part of your post is more 'unprofessional' than the
>> character blocks. It is very jarring and seems contrary to your main point.
>
> Ok I need a word for
> 1. I have no need for this
> 2. 99.9% of the (living) on this planet also have no need for this

So what, seven million people need it? Sounds pretty useful to me. And
your figure is an exaggeration; a lot more people than that use
emoji/emoticons.

>> > Also, how does assigning meanings to codepoints "waste storage"? As
>> > soon as Unicode 2.0 hit and 16-bit code units stopped being
>> > sufficient, everyone needed to allocate storage - either 32 bits per
>> > character, or some other system - and the fact that some codepoints
>> > were unassigned had absolutely no impact on that. This is decidedly
>> > NOT unprofessional, and it's not wasteful either.
>>
>> I agree.
>
> I clearly am more enthusiastic than knowledgeable about unicode.
> But I know my basic CS well enough (as I am sure you and Chris also do)
>
> So I dont get how 4 bytes is not more expensive than 2.
> Yeah I know you can squeeze a unicode char into 3 bytes or even 21 bits
> You could use a clever representation like UTF-8 or FSR.
> But I dont see how you can get out of this that full-unicode costs more than
> exclusive BMP.

Sure, UCS-2 is cheaper than the current Unicode spec. But Unicode 2.0
was when that changed, and the change was because 65536 characters
clearly wouldn't be enough - and that was due to the number of
characters needed for other things than those you're complaining
about. Every spec since then has not changed anything that affects
storage. There are still, today, quite a lot of unallocated blocks of
characters (we're really using only about two planes' worth so far,
maybe three), but even if Unicode specified just two planes of 64K
characters each, you wouldn't be able to save much on transmission
(UTF-8 is already flexible and uses only what you need; if a future
Unicode spec allows 64K planes, UTF-8 transmission will cost exactly
the same for all existing characters), and on an eight-bit-byte
system, the very best you'll be able to do is three bytes - which you
can do today, too; you already know 21 bits will do. So since the BMP
was proven insufficient (back in 1996), no subsequent changes have had
any costs in storage.
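A quick way to see that storage behaviour from Python (the sample characters are arbitrary, one from each UTF-8 length class):

for ch in ("A", "\u00e9", "\u20ac", "\U00010437"):   # A, e-acute, euro sign, and a Deseret letter
    print("U+%04X" % ord(ch), len(ch.encode("utf-8")), "byte(s) in UTF-8")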

> Still if I were to expand on the criticisms here are some examples:
>
> Math-Greek: Consider the math-alpha block
> http://en.wikipedia.org/wiki/Mathematical_operators_and_symbols_in_Unicode#Mathematical_Alphanumeric_Symbols_block
>
> Now imagine a beginning student not getting the difference between font, glyph,
> character. To me this block represents this same error cast into concrete and
> dignified by the (supposed) authority of the unicode consortium.
>
> There are probably dozens of other such stupidities like distinguishing kelvin K from latin K as if that is the business of the unicode consortium

A lot of these kinds of characters come from a need to unambiguously
transliterate text stored in other encodings. I don't personally
profess to understand the reasoning behind the various
indistinguishable characters, but I'm aware that there are a lot of
tricky questions to be decided; and if once the Consortium decides to
allocate a character, that character must remain forever allocated.

> My real reservations about unicode come from their work in areas that I happen to know something about
>
> Music: To put music simply as a few mostly-meaningless 'dingbats' like ♩ ♪ ♫ is perhaps ok
> However all this stuff http://xahlee.info/comp/unicode_music_symbols.html
> makes no sense (to me) given that music (ie standard western music written in staff notation) is inherently 2 dimensional -- multi-voiced, multi-staff, chordal

The placement on the page is up to the display library. You can
produce a PDF that places the note symbols at their correct positions,
and requires no images to render sheet music.

> Sanskrit/Devanagari:
> Consists of bogus letters that dont exist in devanagari
> The letter ऄ (0904) is found here http://unicode.org/charts/PDF/U0900.pdf
> But not here http://en.wikipedia.org/wiki/Devanagari#Vowels
> So I call it bogus-devanagari
>
> Contrariwise an important letter in vedic pronunciation the double-udatta is missing
> http://list.indology.info/pipermail/indology_list.indology.info/2000-April/021070.html
>
> All of which adds up to the impression that the unicode consortium occasionally fails to do due diligence

Which proves that they're not perfect. Don't forget, they can always
add more characters later.

ChrisA

Steven D'Aprano

unread,
Feb 26, 2015, 6:09:55 PM2/26/15
to
Chris Angelico wrote:

> Unicode
> isn't about taking everyone's separate character sets and numbering
> them all so we can reference characters from anywhere; if you wanted
> that, you'd be much better off with something that lets you specify a
> code page in 16 bits and a character in 8, which is roughly the same
> size as Unicode anyway.

Well, except for the approximately 25% of people in the world whose native
language has more than 256 characters.

It sounds like you are referring to some sort of "shift code" system. Some
legacy East Asian encodings use a similar scheme, and depending on how they
are implemented they have great disadvantages. For example, Shift-JIS
suffers from a number of weaknesses including that a single byte corrupted
in transmission can cause large swaths of the following text to be
corrupted. With Unicode, a single corrupted byte can only corrupt a single
code point.
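You can watch that containment in Python: clobbering one byte of a UTF-8 stream damages only the code point it belongs to, and the decoder resynchronizes right afterwards (a small sketch):

data = "abc\u20acdef".encode("utf-8")        # 'abc€def'; the euro sign takes three bytes
corrupted = data[:4] + b"\xff" + data[5:]    # flip one byte in the middle of that sequence
print(corrupted.decode("utf-8", errors="replace"))
# only the damaged sequence comes out as U+FFFD replacement characters; 'def' survives intact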


--
Steven

Chris Angelico

unread,
Feb 26, 2015, 6:23:45 PM2/26/15
to pytho...@python.org
On Fri, Feb 27, 2015 at 10:09 AM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> Chris Angelico wrote:
>
>> Unicode
>> isn't about taking everyone's separate character sets and numbering
>> them all so we can reference characters from anywhere; if you wanted
>> that, you'd be much better off with something that lets you specify a
>> code page in 16 bits and a character in 8, which is roughly the same
>> size as Unicode anyway.
>
> Well, except for the approximately 25% of people in the world whose native
> language has more than 256 characters.

You could always allocate multiple code pages to one language. But
since I'm not advocating this system, I'm only guessing at solutions
to its problems.

> It sounds like you are referring to some sort of "shift code" system. Some
> legacy East Asian encodings use a similar scheme, and depending on how they
> are implemented they have great disadvantages. For example, Shift-JIS
> suffers from a number of weaknesses including that a single byte corrupted
> in transmission can cause large swaths of the following text to be
> corrupted. With Unicode, a single corrupted byte can only corrupt a single
> code point.

That's exactly what I was hinting at. There are plenty of systems like
that, and they are badly flawed compared to a simple universal system
for a number of reasons. One is the corruption issue you mention;
another is that a simple memory-based text search becomes utterly
useless (to locate text in a document, you'd need to do a whole lot of
stateful parsing - not to mention the difficulties of doing
"similar-to" searches across languages); concatenation of text also
becomes a stateful operation, and so do all sorts of other simple
manipulations. Unicode may demand a bit more storage in certain
circumstances (where an eight-bit encoding might have handled your
entire document), but it's so much easier for the general case.

ChrisA

Steven D'Aprano

unread,
Feb 26, 2015, 8:05:38 PM2/26/15
to
Rustom Mody wrote:

> Emoticons (or is it emoji) seems to have some (regional?) takeup?? Dunno…
> In any case I'd like to stay clear of political(izable) questions

Emoji is the term used in Japan, gradually spreading to the rest of the
world. Emoticons, I believe, should be restricted to the practice of using
ASCII-only digraphs and trigraphs such as :-) (colon, hyphen, right-parens)
to indicate "smileys".

I believe that emoji will eventually lead to Unicode's victory. People will
want smileys and piles of poo on their mobile phones, and from there it
will gradually spread to everywhere. All they need to do to make victory
inevitable is add cartoon genitals...


>> I think that this part of your post is more 'unprofessional' than the
>> character blocks. It is very jarring and seems contrary to your main
>> point.
>
> Ok I need a word for
> 1. I have no need for this
> 2. 99.9% of the (living) on this planet also have no need for this

0.1% of the living is seven million people. I'll tell you what, you tell me
which seven million people should be relegated to second-class status, and
I'll tell them where you live.

:-)


[...]
> I clearly am more enthusiastic than knowledgeable about unicode.
> But I know my basic CS well enough (as I am sure you and Chris also do)
>
> So I dont get how 4 bytes is not more expensive than 2.

Obviously it is. But it's only twice as expensive, and in computer science
terms that counts as "close enough". It's quite common for data structures
to "waste" space by using "no more than twice as much space as needed",
e.g. Python dicts and lists.

The whole Unicode range U+0000 to U+10FFFF needs only 21 bits, which fits
into three bytes. Nevertheless, there's no three-byte UTF encoding, because
on modern hardware it is more efficient to "waste" an entire extra byte per
code point and deal with an even multiple of bytes.
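The numbers are easy to check from Python:

print((0x10FFFF).bit_length())                # 21 -- any code point fits in 21 bits
print(len("\U0010FFFF".encode("utf-32-le")))  # 4  -- UTF-32 rounds that up to a whole word
print(len("\U0010FFFF".encode("utf-8")))      # 4  -- UTF-8 also needs four bytes out here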


> Yeah I know you can squeeze a unicode char into 3 bytes or even 21 bits
> You could use a clever representation like UTF-8 or FSR.
> But I dont see how you can get out of this that full-unicode costs more
> than exclusive BMP.

Are you missing a word there? Costs "no more" perhaps?


> eg consider the case of 32 vs 64 bit executables.
> The 64 bit executable is generally larger than the 32 bit one
> Now consider the case of a machine that has say 2GB RAM and a 64-bit
> processor. You could -- I think -- make a reasonable case that all those
> all-zero hi-address-words are 'waste'.

Sure. The whole point of 64-bit processors is to enable the use of more than
2GB of RAM. One might as well say that using 32-bit processors is wasteful
if you only have 64K of memory. Yes it is, but the only things which use
16-bit or 8-bit processors these days are embedded devices.


[...]
> Math-Greek: Consider the math-alpha block
>
http://en.wikipedia.org/wiki/Mathematical_operators_and_symbols_in_Unicode#Mathematical_Alphanumeric_Symbols_block
>
> Now imagine a beginning student not getting the difference between font,
> glyph,
> character. To me this block represents this same error cast into concrete
> and dignified by the (supposed) authority of the unicode consortium.

Not being privy to the internal deliberations of the Consortium, it is
sometimes difficult to tell why two symbols are sometimes declared to be
mere different glyphs for the same character, and other times declared to
be worthy of being separate characters.

E.g. I think we should all agree that the English "A" and the French "A"
shouldn't count as separate characters, although the Greek "Α" and
Russian "А" do.

In the case of the maths symbols, it isn't obvious to me what the deciding
factors were. I know that one of the considerations they use is to consider
whether or not users of the symbols have a tradition of treating the
symbols as mere different glyphs, i.e. stylistic variations. In this case,
I'm pretty sure that mathematicians would *not* consider:

U+2115 DOUBLE-STRUCK CAPITAL N "ℕ"
U+004E LATIN CAPITAL LETTER N "N"

as mere stylistic variations. If you defined a matrix called ℕ, you would
probably be told off for using the wrong symbol, not for using the wrong
formatting.

On the other hand, I'm not so sure about

U+210E PLANCK CONSTANT "ℎ"

versus a mere lowercase h (possibly in italic).


> There are probably dozens of other such stupidities like distinguishing
> kelvin K from latin K as if that is the business of the unicode consortium

But it *is* the business of the Unicode consortium. They have at least two
important aims:

- to be able to represent every possible human-language character;

- to allow lossless round-trip conversion to all existing legacy encodings
(for the subset of Unicode handled by that encoding).


The second reason is why Unicode includes code points for degree-Celsius and
degree-Fahrenheit, rather than just using °C and °F like sane people.
Because some idiot^W code-page designer back in the 1980s or 90s decided to
add single character ℃ and ℉. So now Unicode has to be able to round-trip
(say) "°C℃" without loss.

I imagine that the same applies to U+212A KELVIN SIGN K.
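Python's unicodedata module makes that round-trip rationale easy to poke at; the Kelvin sign is a distinct code point, but compatibility normalization folds it back onto a plain K (a small sketch):

import unicodedata as ud

print(ud.name("\u212a"))                      # KELVIN SIGN
print(ud.normalize("NFKC", "\u212a") == "K")  # True -- kept only so legacy text round-trips
print(ud.name("\u2115"))                      # DOUBLE-STRUCK CAPITAL N, likewise its own character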


> My real reservations about unicode come from their work in areas that I
> happen to know something about
>
> Music: To put music simply as a few mostly-meaningless 'dingbats' like ♩ ♪
> ♫ is perhaps ok However all this stuff
> http://xahlee.info/comp/unicode_music_symbols.html
> makes no sense (to me) given that music (ie standard western music written
> in staff notation) is inherently 2 dimensional -- multi-voiced,
> multi-staff, chordal

(1) Text can also be two dimensional.
(2) Where you put the symbol on the page is a separate question from whether
or not the symbol exists.


> Consists of bogus letters that dont exist in devanagari
> The letter ऄ (0904) is found here http://unicode.org/charts/PDF/U0900.pdf
> But not here http://en.wikipedia.org/wiki/Devanagari#Vowels
> So I call it bogus-devanagari

Hmm, well I love Wikipedia as much as the next guy, but I think that even
Jimmy Wales would suggest that Wikipedia is not a primary source for what
counts as Devanagari vowels. What makes you think that Wikipedia is right
and Unicode is wrong?

That's not to say that Unicode hasn't made some mistakes. There are a few
deprecated code points, or code points that have been given the wrong name.
Oops. Mistakes happen.


> Contrariwise an important letter in vedic pronunciation the double-udatta
> is missing
>
http://list.indology.info/pipermail/indology_list.indology.info/2000-April/021070.html

I quote:


I do not see any need for a "double udaatta". Perhaps "double
ANudaatta" is meant here?


I don't know Sanskrit, but if somebody suggested that Unicode doesn't
support English because the important letter "double-oh" (as
in "moon", "spoon", "croon" etc.) was missing, I wouldn't be terribly
impressed. We have a "double-u" letter, why not "double-oh"?


Another quote:

I should strongly recommend not to hurry with a standardization
proposal until the text collection of Vedic texts has been finished


In other words, even the experts in Vedic texts don't yet know all the
characters which they may or may not need.



--
Steven

Dave Angel

unread,
Feb 26, 2015, 8:58:13 PM2/26/15
to pytho...@python.org
On 02/26/2015 08:05 PM, Steven D'Aprano wrote:
> Rustom Mody wrote:
>

>
>> eg consider the case of 32 vs 64 bit executables.
>> The 64 bit executable is generally larger than the 32 bit one
>> Now consider the case of a machine that has say 2GB RAM and a 64-bit
>> processor. You could -- I think -- make a reasonable case that all those
>> all-zero hi-address-words are 'waste'.
>
> Sure. The whole point of 64-bit processors is to enable the use of more than
> 2GB of RAM. One might as well say that using 32-bit processors is wasteful
> if you only have 64K of memory. Yes it is, but the only things which use
> 16-bit or 8-bit processors these days are embedded devices.

But the 2gig means electrical address lines out of the CPU are wasted,
not address space. A 64 bit processor and 64bit OS means you can have
more than 4gig in a process space, even if over half of it has to be in
the swap file. Linear versus physical makes a big difference.

(Although I believe Seymour Cray was quoted as saying that virtual
memory is a crock, because "you can't fake what you ain't got.")




--
DaveA

Steven D'Aprano

unread,
Feb 27, 2015, 12:58:56 AM2/27/15
to
Dave Angel wrote:

> (Although I believe Seymour Cray was quoted as saying that virtual
> memory is a crock, because "you can't fake what you ain't got.")

If I recall correctly, disk access is about 10000 times slower than RAM, so
virtual memory is *at least* that much slower than real memory.



--
Steven

Dave Angel

unread,
Feb 27, 2015, 2:31:10 AM2/27/15
to pytho...@python.org
It's so much more complicated than that, that I hardly know where to
start. I'll describe a generic processor/OS/memory/disk architecture;
there will be huge differences between processor models even from a
single manufacturer.

First, as soon as you add swapping logic to your
processor/memory-system, you theoretically slow it down. And in the
days of that quote, Cray's memory was maybe 50 times as fast as the
memory used by us mortals. So adding swapping logic would have slowed
it down quite substantially, even when it was not swapping. But that
logic is inside the CPU chip these days, and presumably thoroughly
optimized.

Next, statistically, a program uses a small subset of its total program
& data space in its working set, and the working set should reside in
real memory. But when the program greatly increases that working set,
and it approaches the amount of physical memory, then swapping becomes
more frenzied, and we say the program is thrashing. Simple example, try
sorting an array that's about the size of available physical memory.

Next, even physical memory is divided into a few levels of caching, some
on-chip and some off. And the caching is done in what I call strips,
where accessing just one byte causes the whole strip to be loaded from
non-cached memory. I forget the current size for that, but it's maybe
64 to 256 bytes or so.

If there are multiple processors (not multicore, but actual separate
processors), then each one has such internal caches, and any writes on
one processor may have to trigger flushes of all the other processors
that happen to have the same strip loaded.

The processor not only prefetches the next few instructions, but decodes
and tentatively executes them, subject to being discarded if a
conditional branch doesn't go the way the processor predicted. So some
instructions execute in zero time, some of the time.

Every address of instruction fetch, or of data fetch or store, goes
through a couple of layers of translation. Segment register plus offset
gives linear address. Lookup those in tables to get physical address,
and if table happens not to be in on-chip cache, swap it in. If
physical address isn't valid, a processor exception causes the OS to
potentially swap something out, and something else in.

Once we're paging from the swapfile, the size of the read is perhaps 4k.
And that read is regardless of whether we're only going to use one
byte or all of it.

The ratio between an access which was in the L1 cache and one which
required a page to be swapped in from disk? Much bigger than your
10,000 figure. But hopefully it doesn't happen a big percentage of the
time.

Many, many other variables, like the fact that RAM chips are not
directly addressable by bytes, but instead count on rows and columns.
So if you access many bytes in the same row, it can be much quicker than
random access. So simple access time specifications don't mean as much
as it would seem; the controller has to balance the RAM spec with the
various cache requirements.
--
DaveA

wxjm...@gmail.com

unread,
Feb 27, 2015, 4:07:20 AM2/27/15
to
On Friday, February 27, 2015 at 2:05:38 AM UTC+1, Steven D'Aprano wrote:
>
> E.g. I think we should all agree that the English "A" and the French "A"
> shouldn't count as separate characters, although the Greek "Α" and
> Russian "А" do.
>


Yes. Simple and logical explanation.
Unicode does not handle languages per se, it encodes scripts
for languages.

Steven D'Aprano

unread,
Feb 27, 2015, 6:55:10 AM2/27/15
to
Dave Angel wrote:

> On 02/27/2015 12:58 AM, Steven D'Aprano wrote:
>> Dave Angel wrote:
>>
>>> (Although I believe Seymour Cray was quoted as saying that virtual
>>> memory is a crock, because "you can't fake what you ain't got.")
>>
>> If I recall correctly, disk access is about 10000 times slower than RAM,
>> so virtual memory is *at least* that much slower than real memory.
>>
>
> It's so much more complicated than that, that I hardly know where to
> start.

[snip technical details]

As interesting as they were, none of those details will make swap faster,
hence my comment that virtual memory is *at least* 10000 times slower than
RAM.




--
Steven

Dave Angel

unread,
Feb 27, 2015, 9:03:21 AM2/27/15
to pytho...@python.org
The term "virtual memory" is used for many aspects of the modern memory
architecture. But I presume you're using it in the sense of "running in
a swapfile" as opposed to running in physical RAM.

Yes, a page fault takes on the order of 10,000 times as long as an
access to a location in L1 cache. I suspect it's a lot smaller though
if the swapfile is on an SSD drive. The first byte is that slow.

But once the fault is resolved, the nearby bytes are in physical memory,
and some of them are in L3, L2, and L1. So you're not running in the
swapfile any more. And even when you run off the end of the page,
fetching the sequentially adjacent page from a hard disk is much faster.
And if the disk has well designed buffering, faster yet. The OS tries
pretty hard to keep the swapfile unfragmented.

The trick is to minimize the number of page faults, especially to random
locations. If you're getting lots of them, it's called thrashing.

There are tools to help with that. To minimize page faults on code,
linking with a good working-set-tuner can help, though I don't hear of
people bothering these days. To minimize page faults on data, choosing
one's algorithm carefully can help. For example, in scanning through a
typical matrix, row order might be adjacent locations, while column
order might be scattered.

Not really much different than reading a text file. If you can arrange
to process it a line at a time, rather than reading the whole file into
memory, you generally minimize your round-trips to disk. And if you
need to randomly access it, it's quite likely more efficient to memory
map it, in which case it temporarily becomes part of the swapfile system.
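In Python those two approaches look roughly like this (a sketch; the file names are made up):

import mmap

# Line at a time: only a small window of the file is ever in memory.
with open("big_log.txt", "r") as f:
    total = sum(len(line) for line in f)

# Random access: memory-map the file and let the OS page pieces in on demand.
with open("big_data.bin", "rb") as f:
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    chunk = m[1000000:1000016]   # touches only the pages that back this slice
    m.close()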

--
DaveA

Chris Angelico

unread,
Feb 27, 2015, 9:22:42 AM2/27/15
to pytho...@python.org
On Sat, Feb 28, 2015 at 1:02 AM, Dave Angel <da...@davea.name> wrote:
> The term "virtual memory" is used for many aspects of the modern memory
> architecture. But I presume you're using it in the sense of "running in a
> swapfile" as opposed to running in physical RAM.

Given that this started with a quote about "you can't fake what you
ain't got", I would say that, yes, this refers to using hard disk to
provide more RAM.

If you're trying to use the pagefile/swapfile as if it's more memory
("I have 256MB of memory, but 10GB of swap space, so that's 10GB of
memory!"), then yes, these performance considerations are huge. But
suppose you need to run a program that's larger than your available
RAM. On MS-DOS, sometimes you'd need to work with program overlays (a
concept borrowed from older systems, but ones that I never worked on,
so I'm going back no further than DOS here). You get a *massive*
complexity hit the instant you start using them, whether your program
would have been able to fit into memory on some systems or not. Just
making it possible to have only part of your code in memory places
demands on your code that you, the programmer, have to think about.
With virtual memory, though, you just write your code as if it's all
in memory, and some of it may, at some times, be on disk. Less code to
debug = less time spent debugging. The performance question is largely
immaterial (you'll be using the disk either way), but the savings on
complexity are tremendous. And then when you do find yourself running
on a system with enough RAM? No code changes needed, and full
performance. That's where virtual memory shines.

It's funny how the world changes, though. Back in the 90s, virtual
memory was the key. No home computer ever had enough RAM. Today? A
home-grade PC could easily have 16GB... and chances are you don't need
all of that. So we go for the opposite optimization: disk caching.
Apart from when I rebuild my "Audio-Only Frozen" project [1] and the
caches get completely blasted through, heaps and heaps of my work can
be done inside the disk cache. Hey, Sikorsky, got any files anywhere
on the hard disk matching *Pastel*.iso case insensitively? *chug chug
chug* Nope. Okay. Sikorsky, got any files matching *Pas5*.iso case
insensitively? *zip* Yeah, here it is. I didn't tell the first search
to hold all that file system data in memory; the hard drive controller
managed it all for me, and I got the performance benefit. Same as the
above: the main benefit is that this sort of thing requires zero
application code complexity. It's all done in a perfectly generic way
at a lower level.

ChrisA

Dave Angel

unread,
Feb 27, 2015, 10:25:39 AM2/27/15
to pytho...@python.org
In 1973, I did manual swapping to an external 8k ramdisk. It was a box
that sat on the floor and contained 8k of core memory (not
semiconductor). The memory was non-volatile, so it contained the
working copy of my code. Then I built a small swapper that would bring
in the set of routines currently needed. My onboard RAM (semiconductor)
was 1.5k, which had to hold the swapper, the code, and the data. I was
writing a GPS system for shipboard use, and the final version of the
code had to fit entirely in EPROM, 2k of it. But debugging EPROM code
is a pain, since every small change took half an hour to make new chips.

Later, I built my first PC with 512k of RAM, and usually used much of it
as a ramdisk, since programs didn't use nearly that amount.


--
DaveA

alister

unread,
Feb 27, 2015, 11:01:04 AM2/27/15
to
On Sat, 28 Feb 2015 01:22:15 +1100, Chris Angelico wrote:

>
> If you're trying to use the pagefile/swapfile as if it's more memory ("I
> have 256MB of memory, but 10GB of swap space, so that's 10GB of
> memory!"), then yes, these performance considerations are huge. But
> suppose you need to run a program that's larger than your available RAM.
> On MS-DOS, sometimes you'd need to work with program overlays (a concept
> borrowed from older systems, but ones that I never worked on, so I'm
> going back no further than DOS here). You get a *massive* complexity hit
> the instant you start using them, whether your program would have been
> able to fit into memory on some systems or not. Just making it possible
> to have only part of your code in memory places demands on your code
> that you, the programmer, have to think about. With virtual memory,
> though, you just write your code as if it's all in memory, and some of
> it may, at some times, be on disk. Less code to debug = less time spent
> debugging. The performance question is largely immaterial (you'll be
> using the disk either way), but the savings on complexity are
> tremendous. And then when you do find yourself running on a system with
> enough RAM? No code changes needed, and full performance. That's where
> virtual memory shines.
> ChrisA

I think there is a case for bringing back the overlay file, or at least
loading larger programs in sections.
Only loading the routines as they are required could speed up the start
time of many large applications.
For example libre office: I rarely need the mail merge function, the word
count and many other features; they could be added into the running
application on demand rather than all at once.

obviously with large memory & virtual mem there is no need to un-install
them once loaded.



--
Ralph's Observation:
It is a mistake to let any mechanical object realise that you
are in a hurry.

Chris Angelico

unread,
Feb 27, 2015, 11:12:57 AM2/27/15
to pytho...@python.org
On Sat, Feb 28, 2015 at 3:00 AM, alister
<alister.n...@ntlworld.com> wrote:
> I think there is a case for bringing back the overlay file, or at least
> loading larger programs in sections
> only loading the routines as they are required could speed up the start
> time of many large applications.
> examples libre office, I rarely need the mail merge function, the word
> count and may other features that could be added into the running
> application on demand rather than all at once.

Downside of that is twofold: firstly the complexity that I already
mentioned, and secondly you pay the startup cost on first usage. So
you might get into the program a bit faster, but as soon as you go to
any feature you didn't already hit this session, the program pauses
for a bit and loads it. Sometimes startup cost is the best time to do
this sort of thing.

Of course, there is an easy way to implement exactly what you're
asking for: use separate programs for everything, instead of expecting
a megantic office suite[1] to do everything for you. Just get yourself
a nice simple text editor, then invoke other programs - maybe from a
terminal, or maybe from within the editor - to do the rest of the
work. A simple disk cache will mean that previously-used programs
start up quickly.

ChrisA

[1] It's slightly less bloated than the gigantic office suite sold by
a top-end software company.

alister

unread,
Feb 27, 2015, 11:46:01 AM2/27/15
to
On Sat, 28 Feb 2015 03:12:16 +1100, Chris Angelico wrote:

> On Sat, Feb 28, 2015 at 3:00 AM, alister
> <alister.n...@ntlworld.com> wrote:
>> I think there is a case for bringing back the overlay file, or at least
>> loading larger programs in sections only loading the routines as they
>> are required could speed up the start time of many large applications.
>> examples libre office, I rarely need the mail merge function, the word
>> count and may other features that could be added into the running
>> application on demand rather than all at once.
>
> Downside of that is twofold: firstly the complexity that I already
> mentioned, and secondly you pay the startup cost on first usage. So you
> might get into the program a bit faster, but as soon as you go to any
> feature you didn't already hit this session, the program pauses for a
> bit and loads it. Sometimes startup cost is the best time to do this
> sort of thing.
>
If the modules are small enough this may not be noticeable but yes I do
accept there may be delays on first usage.

As to the complexity, it has been my observation that as the memory
footprint available to programmers has increased they have become less &
less skilled at writing code.

of course my time as a professional programmer was over 20 years ago on 8
bit micro controllers with 8k of ROM (eventually; originally I only had 2k
to play with) & 128 Bytes (yes bytes!) of RAM, so I am very out of date.

I now play with python because it is so much less demanding of me, which
probably makes me just as guilty :-)

> Of course, there is an easy way to implement exactly what you're asking
> for: use separate programs for everything, instead of expecting a
> megantic office suite[1] to do everything for you. Just get yourself a
> nice simple text editor, then invoke other programs - maybe from a
> terminal, or maybe from within the editor - to do the rest of the work.
> A simple disk cache will mean that previously-used programs start up
> quickly.
LibreOffice was cited as just one example.
Video editing suites are another that could be used as an example
(perhaps more so: does the rendering engine need to be loaded before you
start generating the output? a small delay here would be insignificant)
>
> ChrisA
>
> [1] It's slightly less bloated than the gigantic office suite sold by a
> top-end software company.





--
You don't sew with a fork, so I see no reason to eat with knitting
needles.
-- Miss Piggy, on eating Chinese Food

Chris Angelico

unread,
Feb 27, 2015, 12:46:20 PM2/27/15
to pytho...@python.org
On Sat, Feb 28, 2015 at 3:45 AM, alister
<alister.n...@ntlworld.com> wrote:
> On Sat, 28 Feb 2015 03:12:16 +1100, Chris Angelico wrote:
>
>> On Sat, Feb 28, 2015 at 3:00 AM, alister
>> <alister.n...@ntlworld.com> wrote:
>>> I think there is a case for bringing back the overlay file, or at least
>>> loading larger programs in sections only loading the routines as they
>>> are required could speed up the start time of many large applications.
>>> examples libre office, I rarely need the mail merge function, the word
>>> count and may other features that could be added into the running
>>> application on demand rather than all at once.
>>
>> Downside of that is twofold: firstly the complexity that I already
>> mentioned, and secondly you pay the startup cost on first usage. So you
>> might get into the program a bit faster, but as soon as you go to any
>> feature you didn't already hit this session, the program pauses for a
>> bit and loads it. Sometimes startup cost is the best time to do this
>> sort of thing.
>>
> If the modules are small enough this may not be noticeable but yes I do
> accept there may be delays on first usage.
>
> As to the complexity it has been my observation that as the memory
> footprint available to programmers has increase they have become less &
> less skilled at writing code.

Perhaps, but on the other hand, the skill of squeezing code into less
memory is being replaced by other skills. We can write code that takes
the simple/dumb approach, let it use an entire megabyte of memory, and
not care about the cost... and we can write that in an hour, instead
of spending a week fiddling with it. Reducing the development cycle
time means we can add all sorts of cool features to a program, all
while the original end user is still excited about it. (Of course, a
comparison between today's World Wide Web and that of the 1990s
suggests that these cool features aren't necessarily beneficial, but
still, we have the option of foregoing austerity.)

> Video editing suites are another that could be used as an example
> (perhaps more so, does the rendering engine need to be loaded until you
> start generating the output? a small delay here would be insignificant)

Hmm, I'm not sure that's actually a big deal, because your *data* will
dwarf the code. I can fire up sox and avconv, both fairly large
programs, and their code will all sit comfortably in memory; but then
they get to work on my data, and suddenly my hard disk is chewing
through 91GB of content. Breaking up avconv into a dozen pieces
wouldn't make a dent in 91GB.

ChrisA

Grant Edwards

unread,
Feb 27, 2015, 12:46:23 PM2/27/15
to
Nonsense. On all of my machines, virtual memory _is_ RAM almost all
of the time. I don't do the type of things that force the usage of
swap.

--
Grant Edwards               grant.b.edwards at gmail.com
Yow! ... I want FORTY-TWO TRYNEL FLOATATION SYSTEMS installed
within SIX AND A HALF HOURS!!!

Grant Edwards

unread,
Feb 27, 2015, 12:47:22 PM2/27/15
to
On 2015-02-27, Grant Edwards <inv...@invalid.invalid> wrote:
> On 2015-02-27, Steven D'Aprano <steve+comp....@pearwood.info> wrote:
>> Dave Angel wrote:
>>> On 02/27/2015 12:58 AM, Steven D'Aprano wrote:
>>>> Dave Angel wrote:
>>>>
>>>>> (Although I believe Seymour Cray was quoted as saying that virtual
>>>>> memory is a crock, because "you can't fake what you ain't got.")
>>>>
>>>> If I recall correctly, disk access is about 10000 times slower than RAM,
>>>> so virtual memory is *at least* that much slower than real memory.
>>>>
>>> It's so much more complicated than that, that I hardly know where to
>>> start.
>>
>> [snip technical details]
>>
>> As interesting as they were, none of those details will make swap faster,
>> hence my comment that virtual memory is *at least* 10000 times slower than
>> RAM.
>
> Nonsense. On all of my machines, virtual memory _is_ RAM almost all
> of the time. I don't do the type of things that force the usage of
> swap.

And on some of the embedded systems I work on, _all_ virtual memory is
RAM 100.000% of the time.

--
Grant Edwards               grant.b.edwards at gmail.com
Yow! Don't SANFORIZE me!!

MRAB

unread,
Feb 27, 2015, 2:14:54 PM2/27/15
to pytho...@python.org
On 2015-02-27 16:45, alister wrote:
> On Sat, 28 Feb 2015 03:12:16 +1100, Chris Angelico wrote:
>
>> On Sat, Feb 28, 2015 at 3:00 AM, alister
>> <alister.n...@ntlworld.com> wrote:
>>> I think there is a case for bringing back the overlay file, or at least
>>> loading larger programs in sections only loading the routines as they
>>> are required could speed up the start time of many large applications.
>>> examples libre office, I rarely need the mail merge function, the word
>>> count and may other features that could be added into the running
>>> application on demand rather than all at once.
>>
>> Downside of that is twofold: firstly the complexity that I already
>> mentioned, and secondly you pay the startup cost on first usage. So you
>> might get into the program a bit faster, but as soon as you go to any
>> feature you didn't already hit this session, the program pauses for a
>> bit and loads it. Sometimes startup cost is the best time to do this
>> sort of thing.
>>
> If the modules are small enough this may not be noticeable but yes I do
> accept there may be delays on first usage.
>
I suppose you could load the basic parts first so that the user can
start working, and then load the additional features in the background.
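
A rough sketch of that idea in Python; "mailmerge" is a made-up stand-in
for whatever heavy feature module you want ready by the time it's asked for:

import importlib
import threading

def preload(module_name):
    # Import a module on a background thread so its one-off cost is paid
    # while the user is busy with the basics; importlib caches the result
    # in sys.modules, so the later "real" import is effectively free.
    threading.Thread(target=importlib.import_module,
                     args=(module_name,), daemon=True).start()

preload("mailmerge")   # hypothetical heavy feature module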

blue

unread,
Feb 27, 2015, 3:13:33 PM2/27/15
to
On Wednesday, February 11, 2015 at 1:38:12 AM UTC+2, vlya...@gmail.com wrote:
> I defined function Fatalln in "mydef.py" and it works fine if i call it from "mydef.py", but when i try to call it from "test.py" in the same folder:
> import mydef
> ...
> Fatalln "my test"
> i have NameError: name 'Fatalln' is not defined
> I also tried include('mydef.py') with the same result...
> What is the right syntax?
> Thanks

...try setting your Python source encoding to UTF-8,
and read the FAQ or the Python manual

Dave Angel

unread,
Feb 27, 2015, 3:52:45 PM2/27/15
to pytho...@python.org
I can't say how Linux handles it (I'd like to know, but haven't needed
to yet), but in Windows (NT, XP, etc), a DLL is not "loaded", but rather
mapped. And it's not copied into the swapfile, it's mapped directly
from the DLL. The mapping mode is "copy-on-write" which means that
read-only portions are swapped directly from the DLL, on first usage,
while read-write portions (eg. static/global variables, relocation
modifications) are copied on first use to the swap file. I presume
EXE's are done the same way, but never had a need to know.
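
For what it's worth, you can poke at the same copy-on-write behaviour
from Python with the mmap module; this is just a sketch, using the
script's own file as a stand-in for a DLL:

import mmap

with open(__file__, "rb") as f:             # any existing file will do
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)
    head = m[:16]           # reads are served straight from the file on disk
    m[:16] = b"#" * 16      # a write forces a private copy of that page only;
                            # nothing is ever written back to the file
    m.close()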

If that's the case on the architectures you're talking about, then the
problem of slow loading is not triggered by the memory usage, but by
lots of initialization code. THAT's what should be deferred for
seldom-used portions of code.

The main point of a working-set-tuner is to group sections of code
together that are likely to be used together. To take an extreme case,
all the fatal exception handlers should be positioned adjacent to each
other in linear memory, as it's unlikely that any of them will be
needed, and the code takes up no time or space in physical memory.

Also (in Windows), a DLL can be pre-relocated, so that it has a
preferred address to be loaded into memory. If that memory is available
when it gets loaded (actually mapped), then no relocation needs to
happen, which saves time and swap space.

In the X86 architecture, most code is self-relocating, everything is
relative. But references to other DLL's and jump tables were absolute,
so they needed to be relocated at load time, when final locations were
nailed down.

Perhaps the authors of bloated applications have forgotten how to do
these, as the linker's defaults put all DLLs at the same
location, meaning all but the first will need relocating. But system
DLL's are (were) each given unique addresses.

On one large project, I added the build step of assigning these base
addresses. Each DLL had to start on a 64k boundary, and I reserved some
fractional extra space between them in case one would grow. Then every
few months, we double-checked that they didn't overlap, and if necessary
adjusted the start addresses. We didn't just automatically assign
closest addresses, because frequently some of the DLL's would be updated
independently of the others.
--
DaveA

Chris Angelico

unread,
Feb 27, 2015, 4:05:12 PM2/27/15
to pytho...@python.org
On Sat, Feb 28, 2015 at 7:52 AM, Dave Angel <da...@davea.name> wrote:
> If that's the case on the architectures you're talking about, then the
> problem of slow loading is not triggered by the memory usage, but by lots of
> initialization code. THAT's what should be deferred for seldom-used
> portions of code.

s/should/can/

It's still not a clear case of "should", as it's all a big pile of
trade-offs. A few weeks ago I made a very deliberate change to a
process to force some code to get loaded and initialized earlier, to
prevent an unexpected (and thus surprising) slowdown on first use. (It
was, in fact, a Python 'import' statement, so all I had to do was add
a dummy import in the main module - with, of course, a comment making
it clear that this was necessary, even though the name wasn't used.)

But yes, seldom-used code can definitely have its initialization
deferred if you need to speed up startup.
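
In code it was nothing more exotic than this; "reports" here is a made-up
stand-in for the slow-to-initialise module:

# main.py -- pay the one-off import cost at startup, on purpose.
import reports   # deliberately unused here; imported only so the first
                 # real use elsewhere doesn't hit a surprising pause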

ChrisA

alister

unread,
Feb 27, 2015, 5:09:56 PM2/27/15
to
On Fri, 27 Feb 2015 19:14:00 +0000, MRAB wrote:

>>
> I suppose you could load the basic parts first so that the user can
> start working, and then load the additional features in the background.
>
quite possible
my opinion on this is very fluid
it may work for some applications, it probably wouldn't for others.

with python it is generally considered good practice to import all
modules at the start of a program, but there are valid cases for only
importing a module if it is actually needed.
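
A small sketch of the "only if actually needed" style; re is tiny, it just
stands in here for a genuinely heavy module:

def word_count(text):
    # Deferred import: the module is only loaded the first time this
    # feature is actually used, not at program start-up.
    import re
    return len(re.findall(r"\w+", text))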




--
Some people have parts that are so private they themselves have no
knowledge of them.

alister

unread,
Feb 27, 2015, 5:13:23 PM2/27/15
to
On Sat, 28 Feb 2015 04:45:04 +1100, Chris Angelico wrote:
> Perhaps, but on the other hand, the skill of squeezing code into less
> memory is being replaced by other skills. We can write code that takes
> the simple/dumb approach, let it use an entire megabyte of memory, and
> not care about the cost... and we can write that in an hour, instead of
> spending a week fiddling with it. Reducing the development cycle time
> means we can add all sorts of cool features to a program, all while the
> original end user is still excited about it. (Of course, a comparison
> between today's World Wide Web and that of the 1990s suggests that these
> cool features aren't necessarily beneficial, but still, we have the
> option of foregoing austerity.)
>
>
> ChrisA

again I am fluid on this. 'Clever' programming is often counter-productive
& unmaintainable, but then again lazy programming can be just as bad
for this. there is no "one size fits all" solution, but the modern
environment does make lazy programming very easy.



--
After all, all he did was string together a lot of old, well-known
quotations.
-- H.L. Mencken, on Shakespeare

Rustom Mody

unread,
Mar 3, 2015, 1:04:06 PM3/3/15
to
On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:
> On 2/26/2015 8:24 AM, Chris Angelico wrote:
> > On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote:
> >> Wrote something up on why we should stop using ASCII:
> >> http://blog.languager.org/2015/02/universal-unicode.html
>
> I think that the main point of the post, that many Unicode chars are
> truly planetary rather than just national/regional, is excellent.

<snipped>

> You should add emoticons, but not call them or the above 'gibberish'.
> I think that this part of your post is more 'unprofessional' than the
> character blocks. It is very jarring and seems contrary to your main point.

Ok Done

References to gibberish removed from
http://blog.languager.org/2015/02/universal-unicode.html

What I was trying to say expanded here
http://blog.languager.org/2015/03/whimsical-unicode.html
[Hope the word 'whimsical' is less jarring and more accurate than 'gibberish']

wxjm...@gmail.com

unread,
Mar 3, 2015, 1:37:06 PM3/3/15
to
========

Emoji and Dingbats are now part of Unicode.
They should be treated just like a "1" or an "a"
or a "mathematical alpha".
So, there is nothing special to say about them.

jmf

Chris Angelico

unread,
Mar 3, 2015, 1:44:11 PM3/3/15
to pytho...@python.org
On Wed, Mar 4, 2015 at 5:03 AM, Rustom Mody <rusto...@gmail.com> wrote:
> What I was trying to say expanded here
> http://blog.languager.org/2015/03/whimsical-unicode.html
> [Hope the word 'whimsical' is less jarring and more accurate than 'gibberish']

Re footnote #4: ½ is a single character for compatibility reasons.
⅟₁₀₀ doesn't need to be a single character, because there are
countably infinite vulgar fractions and only 0x110000 Unicode
characters.
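
For the curious, the character database shipped with Python confirms this:

import unicodedata

print(unicodedata.name("½"))       # VULGAR FRACTION ONE HALF
print(unicodedata.numeric("½"))    # 0.5
print(unicodedata.name("⅟"))       # FRACTION NUMERATOR ONE
# 1/100 has no single code point; you compose it, e.g. ⅟ plus subscript digits.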

ChrisA

Terry Reedy

unread,
Mar 3, 2015, 6:30:54 PM3/3/15
to pytho...@python.org
I agree with both.

--
Terry Jan Reedy

Rustom Mody

unread,
Mar 3, 2015, 9:53:52 PM3/3/15
to
On Wednesday, March 4, 2015 at 12:14:11 AM UTC+5:30, Chris Angelico wrote:
> On Wed, Mar 4, 2015 at 5:03 AM, Rustom Mody wrote:
> > What I was trying to say expanded here
> > http://blog.languager.org/2015/03/whimsical-unicode.html
> > [Hope the word 'whimsical' is less jarring and more accurate than 'gibberish']
>
> Re footnote #4: ½ is a single character for compatibility reasons.
> ⅟₁₀₀ ...
^^^

Neat
Thanks
[And figured out some of the quopri module along the way]

Steven D'Aprano

unread,
Mar 3, 2015, 9:54:40 PM3/3/15
to
Rustom Mody wrote:

> On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:
>> On 2/26/2015 8:24 AM, Chris Angelico wrote:
>> > On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote:
>> >> Wrote something up on why we should stop using ASCII:
>> >> http://blog.languager.org/2015/02/universal-unicode.html
>>
>> I think that the main point of the post, that many Unicode chars are
>> truly planetary rather than just national/regional, is excellent.
>
> <snipped>
>
>> You should add emoticons, but not call them or the above 'gibberish'.
>> I think that this part of your post is more 'unprofessional' than the
>> character blocks. It is very jarring and seems contrary to your main
>> point.
>
> Ok Done
>
> References to gibberish removed from
> http://blog.languager.org/2015/02/universal-unicode.html

I consider it unethical to make semantic changes to a published work in
place without acknowledgement. Fixing minor typos or spelling errors, or
dead links, is okay. But any edit that changes the meaning should be
commented on, either by an explicit note on the page itself, or by striking
out the previous content and inserting the new.

As for the content of the essay, it is currently rather unfocused. It
appears to be more of a list of "here are some Unicode characters I think
are interesting, divided into subgroups, oh and here are some I personally
don't have any use for, which makes them silly" than any sort of discussion
about the universality of Unicode. That makes it rather idiosyncratic and
parochial. Why should obscure maths symbols be given more importance than
obscure historical languages?

I think that the universality of Unicode could be explained in a single
sentence:

"It is the aim of Unicode to be the one character set anyone needs to
represent every character, ideogram or symbol (but not necessarily distinct
glyph) from any existing or historical human language."

I can expand on that, but in a nutshell that is it.


You state:

"APL and Z Notation are two notable languages APL is a programming language
and Z a specification language that did not tie themselves down to a
restricted charset ..."


but I don't think that is correct. I'm pretty sure that neither APL nor Z
allowed you to define new characters. They might not have used ASCII alone,
but they still had a restricted character set. It was merely less
restricted than ASCII.

You make a comment about Cobol's relative unpopularity, but (1) Cobol
doesn't require you to write out numbers as English words, and (2) Cobol is
still used, there are uncounted billions of lines of Cobol code being used,
and if the number of Cobol programmers is less now than it was 16 years
ago, there are still a lot of them. Academics and FOSS programmers don't
think much of Cobol, but it has to count as one of the most amazing success
stories in the field of programming languages, despite its lousy design.

You list ideographs such as Cuneiform under "Icons". They are not icons.
They are a mixture of symbols used for consonants, syllables, and
logophonetic, consonantal alphabetic and syllabic signs. That sits them
firmly in the same categories as modern languages with consonants, ideogram
languages like Chinese, and syllabary languages like Cheyenne.

Just because native readers of Cuneiform are all dead doesn't make Cuneiform
unimportant. There are probably more people who need to write Cuneiform
than people who need to write APL source code.

You make a comment:

"To me – a unicode-layman – it looks unprofessional… Billions of computing
devices world over, each having billions of storage words having their
storage wasted on blocks such as these??"

But that is nonsense, and it contradicts your earlier quoting of Dave Angel.
Why are you so worried about an (illusionary) minor optimization?

Whether code points are allocated or not doesn't affect how much space they
take up. There are millions of unused Unicode code points today. If they
are allocated tomorrow, the space your documents take up will not increase
one byte.

Allocating code points to Cuneiform has not increased the space needed by
Unicode at all. Two bytes alone is not enough for even existing human
languages (thanks China). For hardware related reasons, it is faster and
more efficient to use four bytes than three, so the obvious and "dumb" (in
the sense of the simplest thing which will work) way to store Unicode is UTF-32, which
takes a full four bytes per code point, regardless of whether there are
65537 code points or 1114112. That makes it less expensive than floating
point numbers, which take eight. Would you like to argue that floating
point doubles are "unprofessional" and wasteful?

As Dave pointed out, and you apparently agreed with him enough to quote him
TWICE (once in each of two blog posts), history of computing is full of
premature optimizations for space. (In fact, some of these may have been
justified by the technical limitations of the day.) Technically Unicode is
also limited, but it is limited to over one million code points, 1114112 to
be exact, although some of them are reserved as invalid for technical
reasons, and there is no indication that we'll ever run out of space in
Unicode.

In practice, there are three common Unicode encodings that nearly all
Unicode documents will use.

* UTF-8 will use between one and (by memory) four bytes per code
point. For Western European languages, that will be mostly one
or two bytes per character.

* UTF-16 uses a fixed two bytes per code point in the Basic Multilingual
Plane, which is enough for nearly all Western European writing and
much East Asian writing as well. For the rest, it uses a fixed four
bytes per code point.

* UTF-32 uses a fixed four bytes per code point. Hardly anyone uses
this as a storage format.


In *all three cases*, the existence of hieroglyphs and cuneiform in Unicode
doesn't change the space used. If you actually include a few hieroglyphs to
your document, the space increases only by the actual space used by those
hieroglyphs: four bytes per hieroglyph. At no time does the existence of a
single hieroglyph in your document force you to expand the non-hieroglyph
characters to use more space.
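
You can verify the arithmetic directly in Python; U+13000 (EGYPTIAN
HIEROGLYPH A001) serves as the example hieroglyph:

s = "a\u00e9\u4e2d\U00013000"        # ASCII, Latin-1, BMP CJK, SMP hieroglyph
print(len(s.encode("utf-8")))        # 10 = 1 + 2 + 3 + 4 bytes
print(len(s.encode("utf-16-le")))    # 10 = 2 + 2 + 2 + 4 bytes
print(len(s.encode("utf-32-le")))    # 16 = 4 * 4 bytes
# The hieroglyph costs only its own bytes; it doesn't make the other
# characters in the string any bigger.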


> What I was trying to say expanded here
> http://blog.languager.org/2015/03/whimsical-unicode.html

You have at least two broken links, referring to a non-existent page:

http://blog.languager.org/2015/03/unicode-universal-or-whimsical.html

This essay seems to be even more rambling and unfocused than the first. What
does the cost of semi-conductor plants have to do with whether or not
programmers support Unicode in their applications?

Your point about the UTF-8 "BOM" is valid only if you interpret it as a Byte
Order Mark. But if you interpret it as an explicit UTF-8 signature or mark,
it isn't so silly. If your text begins with the UTF-8 mark, treat it as
UTF-8. It's no more silly than any other heuristic, like HTML encoding tags
or text editor's encoding cookies.

Your discussion of "complexifiers and simplifiers" doesn't seem to be
terribly relevant, or at least if it is relevant, you don't give any reason
for it. The whole thing about Moore's Law and the cost of semi-conductor
plants seems irrelevant to Unicode except in the most over-generalised
sense of "things are bigger today than in the past, we've gone from
five-bit Baudot codes to 23 bit Unicode". Yeah, okay. So what's your point?

You agree that 16-bits are not enough, and yet you criticise Unicode for using
more than 16-bits on wasteful, whimsical gibberish like Cuneiform? That is
an inconsistent position to take.

UTF-16 is not half-arsed Unicode support. UTF-16 is full Unicode support.

The problem is when your language treats UTF-16 as a fixed-width two-byte
format instead of a variable-width, two- or four-byte format. (That's more
or less like the old, obsolete, UCS-2 standard.) There are all sorts of
good ways to solve the problem of surrogate pairs and the SMPs in UTF-16.
If some programming languages or software fails to do so, they are buggy,
not UTF-16.

After explaining that 16 bits are not enough, you then propose a 16 bit
standard. /face-palm

UTF-16 cannot break the fixed width invariant, because it has no fixed width
invariant. That's like arguing against UTF-8 because it breaks the fixed
width invariant "all characters are single byte ASCII characters".

If you cannot handle SMP characters, you are not supporting Unicode.


You suggest that Chinese users should be looking at Big5 or GB. I really,
really don't think so.

- Neither is universal. What makes you think that Chinese writers need
to use maths symbols, or include (say) Thai or Russian in their work
any less than Western writers do?

- Neither even support all of Chinese. Big5 supports Traditional
Chinese, but not Simplified Chinese. GB supports Simplified
Chinese, but not Traditional Chinese.

- Big5 likewise doesn't support placenames, many people's names, and
other less common parts of Chinese.

- Big5 is a shift-system, like Shift-JIS, and suffers from the same sort
of data corruption issues.

- There is no one single Big5 standard, but a whole lot of vendor
extensions.


You say:

"I just want to suggest that the Unicode consortium going overboard in
adding zillions of codepoints of nearly zero usefulness, is in fact
undermining unicode’s popularity and spread."

Can you demonstrate this? Can you show somebody who says "Well, I was going
to support full Unicode, but since they added a snowman, I'm going to stick
to ASCII"?

The "whimsical" characters you are complaining about were important enough
to somebody to spend significant amounts of time and money to write up a
proposal, have it go through the Unicode Consortium bureaucracy, and
eventually have it accepted. That's not easy or cheap, and people didn't
add a snowman on a whim. They did it because there are a whole lot of
people who want a shared standard for map symbols.

It is easy to mock what is not important to you. I daresay kids adding emoji
to their 10 character tweets would mock all the useless maths symbols in
Unicode too.




--
Steven

Chris Angelico

unread,
Mar 3, 2015, 10:15:13 PM3/3/15
to pytho...@python.org
On Wed, Mar 4, 2015 at 1:54 PM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> It is easy to mock what is not important to you. I daresay kids adding emoji
> to their 10 character tweets would mock all the useless maths symbols in
> Unicode too.

Definitely! Who ever sings "do you wanna build an integral sign"?

ChrisA

Rustom Mody

unread,
Mar 3, 2015, 11:05:28 PM3/3/15
to
On Wednesday, March 4, 2015 at 8:24:40 AM UTC+5:30, Steven D'Aprano wrote:
> Rustom Mody wrote:
>
> > On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:
> >> On 2/26/2015 8:24 AM, Chris Angelico wrote:
> >> > On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote:
> >> >> Wrote something up on why we should stop using ASCII:
> >> >> http://blog.languager.org/2015/02/universal-unicode.html
> >>
> >> I think that the main point of the post, that many Unicode chars are
> >> truly planetary rather than just national/regional, is excellent.
> >
> > <snipped>
> >
> >> You should add emoticons, but not call them or the above 'gibberish'.
> >> I think that this part of your post is more 'unprofessional' than the
> >> character blocks. It is very jarring and seems contrary to your main
> >> point.
> >
> > Ok Done
> >
> > References to gibberish removed from
> > http://blog.languager.org/2015/02/universal-unicode.html
>
> I consider it unethical to make semantic changes to a published work in
> place without acknowledgement. Fixing minor typos or spelling errors, or
> dead links, is okay. But any edit that changes the meaning should be
> commented on, either by an explicit note on the page itself, or by striking
> out the previous content and inserting the new.

Dunno what you are grumping about…

Anyway the attribution is made more explicit – footnote 5 in
http://blog.languager.org/2015/03/whimsical-unicode.html.

Note that Terry Reedy, who mainly objected, was already acked earlier.
I've just added one more ack¹
And JFTR the 'publication' (O how archaic!) is the whole blog not a single page just as it is for any other dead-tree publication.

>
> As for the content of the essay, it is currently rather unfocused.

True.

It
> appears to be more of a list of "here are some Unicode characters I think
> are interesting, divided into subgroups, oh and here are some I personally
> don't have any use for, which makes them silly" than any sort of discussion
> about the universality of Unicode. That makes it rather idiosyncratic and
> parochial. Why should obscure maths symbols be given more importance than
> obscure historical languages?

Idiosyncratic ≠ parochial


>
> I think that the universality of Unicode could be explained in a single
> sentence:
>
> "It is the aim of Unicode to be the one character set anyone needs to
> represent every character, ideogram or symbol (but not necessarily distinct
> glyph) from any existing or historical human language."
>
> I can expand on that, but in a nutshell that is it.
>
>
> You state:
>
> "APL and Z Notation are two notable languages APL is a programming language
> and Z a specification language that did not tie themselves down to a
> restricted charset ..."

Tsk Tsk – dishonest snipping. I wrote

| APL and Z Notation are two notable languages APL is a programming language
| and Z a specification language that did not tie themselves down to a
| restricted charset even in the day that ASCII ruled.

so it's clear that the 'restricted' applies to ASCII
>
> You list ideographs such as Cuneiform under "Icons". They are not icons.
> They are a mixture of symbols used for consonants, syllables, and
> logophonetic, consonantal alphabetic and syllabic signs. That sits them
> firmly in the same categories as modern languages with consonants, ideogram
> languages like Chinese, and syllabary languages like Cheyenne.

Ok, changed to iconic.
Obviously 2-3 millennia ago, when people spoke hieroglyphs or cuneiform, they were languages.
In 2015 when someone sees them and recognizes them, they are 'those things that
Sumerians/Egyptians wrote'. No one except a rare expert knows those languages.

>
> Just because native readers of Cuneiform are all dead doesn't make Cuneiform
> unimportant. There are probably more people who need to write Cuneiform
> than people who need to write APL source code.
>
> You make a comment:
>
> "To me – a unicode-layman – it looks unprofessional… Billions of computing
> devices world over, each having billions of storage words having their
> storage wasted on blocks such as these??"
>
> But that is nonsense, and it contradicts your earlier quoting of Dave Angel.
> Why are you so worried about an (illusionary) minor optimization?

2 < 4 as far as I am concerned.
[If you disagree one man's illusionary is another's waking]
Thanks corrected

>
> This essay seems to be even more rambling and unfocused than the first. What
> does the cost of semi-conductor plants have to do with whether or not
> programmers support Unicode in their applications?
>
> Your point about the UTF-8 "BOM" is valid only if you interpret it as a Byte
> Order Mark. But if you interpret it as an explicit UTF-8 signature or mark,
> it isn't so silly. If your text begins with the UTF-8 mark, treat it as
> UTF-8. It's no more silly than any other heuristic, like HTML encoding tags
> or text editor's encoding cookies.
>
> Your discussion of "complexifiers and simplifiers" doesn't seem to be
> terribly relevant, or at least if it is relevant, you don't give any reason
> for it. The whole thing about Moore's Law and the cost of semi-conductor
> plants seems irrelevant to Unicode except in the most over-generalised
> sense of "things are bigger today than in the past, we've gone from
> five-bit Baudot codes to 23 bit Unicode". Yeah, okay. So what's your point?

- Most people need only 16 bits.
- Many notable examples of software fail going from 16 to 23.
- If you are a software writer and you fail going 16 to 23, it's ok, but try to
give useful errors

>
> You agree that 16-bits are not enough, and yet you critice Unicode for using
> more than 16-bits on wasteful, whimsical gibberish like Cuneiform? That is
> an inconsistent position to take.

| ½-assed unicode support – BMP-only – is better than 1/100-assed⁴ support –
| ASCII. BMP-only Unicode is universal enough but within practical limits
| whereas full (7.0) Unicode is 'really' universal at a cost of performance and
| whimsicality.

Do you disagree that BMP-only = 16 bits?

>
> UTF-16 is not half-arsed Unicode support. UTF-16 is full Unicode support.
>
> The problem is when your language treats UTF-16 as a fixed-width two-byte
> format instead of a variable-width, two- or four-byte format. (That's more
> or less like the old, obsolete, UCS-2 standard.) There are all sorts of
> good ways to solve the problem of surrogate pairs and the SMPs in UTF-16.
> If some programming languages or software fails to do so, they are buggy,
> not UTF-16.
>
> After explaining that 16 bits are not enough, you then propose a 16 bit
> standard. /face-palm
>
> UTF-16 cannot break the fixed with invariant, because it has no fixed width
> invariant. That's like arguing against UTF-8 because it breaks the fixed
> width invariant "all characters are single byte ASCII characters".
>
> If you cannot handle SMP characters, you are not supporting Unicode.


7.0

>
>
> You suggest that Chinese users should be looking at Big5 or GB. I really,
> really don't think so.
>
> - Neither is universal. What makes you think that Chinese writers need
> to use maths symbols, or include (say) Thai or Russian in their work
> any less than Western writers do?
>
> - Neither even support all of Chinese. Big5 supports Traditional
> Chinese, but not Simplified Chinese. GB supports Simplified
> Chinese, but not Traditional Chinese.
>
> - Big5 likewise doesn't support placenames, many people's names, and
> other less common parts of Chinese.
>
> - Big5 is a shift-system, like Shift-JIS, and suffers from the same sort
> of data corruption issues.
>
> - There is no one single Big5 standard, but a whole lot of vendor
> extensions.
>
>
> You say:
>
> "I just want to suggest that the Unicode consortium going overboard in
> adding zillions of codepoints of nearly zero usefulness, is in fact
> undermining unicode’s popularity and spread."
>
> Can you demonstrate this? Can you show somebody who says "Well, I was going
> to support full Unicode, but since they added a snowman, I'm going to stick
> to ASCII"?

I gave a list of softwares which goof/break going BMP to 7.0 unicode
>
> The "whimsical" characters you are complaining about were important enough
> to somebody to spend significant amounts of time and money to write up a
> proposal, have it go through the Unicode Consortium bureaucracy, and
> eventually have it accepted. That's not easy or cheap, and people didn't
> add a snowman on a whim. They did it because there are a whole lot of
> people who want a shared standard for map symbols.
>
> It is easy to mock what is not important to you. I daresay kids adding emoji
> to their 10 character tweets would mock all the useless maths symbols in
> Unicode too.

Head para of section 5 has:
| However (the following) are (in the standard)! So lets use them!
Looks like mocking to you?

The only mocking is at 5.1. And even here I don't mock the users of these blocks
– now or millennia ago. I only mock the unicode consortium for putting them into
unicode.

----------------------
¹ And somewhere around here we get into Gödelian problems -- known to programmers
under the form "Write a program that prints itself". Likewise Acks.
I am going to deal with the Gödel-loop by the device:
- Address real issues/objects
- Smile at grumpiness

Rustom Mody

unread,
Mar 3, 2015, 11:16:14 PM3/3/15
to
Uh… 21
That's what makes 3 chars per 64-bit word a possibility.
A possibility that can become realistic if/when Intel decides to add 'packed-unicode' string instructions.

Rustom Mody

unread,
Mar 3, 2015, 11:45:24 PM3/3/15
to
Maybe you missed this section:
http://blog.languager.org/2015/03/whimsical-unicode.html#half-assed

It lists some examples of software that somehow break/goof going from BMP-only
unicode to 7.0 unicode.

IOW the suggestion is that the two-way classification
- ASCII
- Unicode

is less useful and accurate than the 3-way

- ASCII
- BMP
- Unicode

Personally I would be pleased if 𝛌 were used for the math-lambda and
λ left alone for Greek-speaking users' identifiers.
However one should draw a line between personal preferences and a universal(izable) standard.
As of now, λ works in blogger whereas 𝛌 breaks blogger -- gets replaced by �.
Similar breakages are current in Java, Javascript, Emacs, Mysql, Idle and Windows, various fonts etc etc. [Only one of these is remotely connected with python]

So BMP is practical, 7.0 is idealistic. You are free to pick 😏😉
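
Concretely, the difference between the two lambdas is just which plane they
live in; a quick check from Python:

print(hex(ord("λ")))                  # 0x3bb, GREEK SMALL LETTER LAMDA -- BMP
print(ord("𝛌") > 0xFFFF)              # True: the mathematical lambda is in the SMP
print(len("𝛌".encode("utf-16-le")))   # 4 bytes: a surrogate pair in UTF-16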

Chris Angelico

unread,
Mar 3, 2015, 11:55:24 PM3/3/15
to pytho...@python.org
On Wed, Mar 4, 2015 at 3:45 PM, Rustom Mody <rusto...@gmail.com> wrote:
>
> It lists some examples of software that somehow break/goof going from BMP-only
> unicode to 7.0 unicode.
>
> IOW the suggestion is that the the two-way classification
> - ASCII
> - Unicode
>
> is less useful and accurate than the 3-way
>
> - ASCII
> - BMP
> - Unicode

How is that more useful? Aside from storage optimizations (in which
the significant breaks would be Latin-1, UCS-2, and UCS-4), the BMP is
not significantly different from the rest of Unicode.

Also, the expansion from 16-bit was back in Unicode 2.0, not 7.0. Why
do you keep talking about 7.0 as if it's a recent change?

ChrisA

Rustom Mody

unread,
Mar 4, 2015, 12:05:58 AM3/4/15
to
On Wednesday, March 4, 2015 at 10:25:24 AM UTC+5:30, Chris Angelico wrote:
> On Wed, Mar 4, 2015 at 3:45 PM, Rustom Mody wrote:
> >
> > It lists some examples of software that somehow break/goof going from BMP-only
> > unicode to 7.0 unicode.
> >
> > IOW the suggestion is that the the two-way classification
> > - ASCII
> > - Unicode
> >
> > is less useful and accurate than the 3-way
> >
> > - ASCII
> > - BMP
> > - Unicode
>
> How is that more useful? Aside from storage optimizations (in which
> the significant breaks would be Latin-1, UCS-2, and UCS-4), the BMP is
> not significantly different from the rest of Unicode.

Sorry... Don't understand.
>
> Also, the expansion from 16-bit was back in Unicode 2.0, not 7.0. Why
> do you keep talking about 7.0 as if it's a recent change?

It is 2015 as of now. 7.0 is the current standard.

The need for the adjective 'current' should be pondered upon.

In practice, standards change.
However if a standard changes so frequently that users have to play catch-up
and keep asking "Which version?", they are justified in asking "Are the standard-makers
doing due diligence?"

Steven D'Aprano

unread,
Mar 4, 2015, 3:14:42 AM3/4/15
to
Rustom Mody wrote:

> On Wednesday, March 4, 2015 at 8:24:40 AM UTC+5:30, Steven D'Aprano wrote:

>> I consider it unethical to make semantic changes to a published work in
>> place without acknowledgement. Fixing minor typos or spelling errors, or
>> dead links, is okay. But any edit that changes the meaning should be
>> commented on, either by an explicit note on the page itself, or by
>> striking out the previous content and inserting the new.
>
> Dunno What you are grumping about…

You published something on a blog. And then you edited it, not to correct a
typo, but to make a potentially substantial change to semantics, without
noting that fact.

I consider that unethical. Reputable journalists also consider it unethical
to change a published work in place without comment, that is why if they
have to correct an online post or article, they put a note (usually at the
bottom of the page) stating the nature of the correction made. E.g. "an
earlier version of this story stated blah, which is incorrect and has now
been corrected."

Putting the correction in another post is not good enough, for obvious
reasons. People don't read a blog as a unified single piece, they read it as
individual posts.

In this case, I *assume* that the change only changes the tone rather than
the actual meaning of the text, since I haven't seen the before-and-after
versions. I'm making a general comment about the ethics of blogging.


> And JFTR the 'publication' (O how archaic!) is the whole blog not a single
> page just as it is for any other dead-tree publication.

"Any other dead-tree publication"? An internet blog is not a dead-tree
publication.

And there's nothing archaic about publishing work on the Internet. What a
foolish thing to say.


>> As for the content of the essay, it is currently rather unfocused.
>
> True.
>
> It
>> appears to be more of a list of "here are some Unicode characters I think
>> are interesting, divided into subgroups, oh and here are some I
>> personally don't have any use for, which makes them silly" than any sort
>> of discussion about the universality of Unicode. That makes it rather
>> idiosyncratic and parochial. Why should obscure maths symbols be given
>> more importance than obscure historical languages?
>
> Idiosyncratic ≠ parochial

I know. That's why I said "idiosyncratic and parochial" rather than just
picking one. It is both.


[...]
>> You state:
>>
>> "APL and Z Notation are two notable languages APL is a programming
>> language and Z a specification language that did not tie themselves down
>> to a restricted charset ..."
>
> Tsk Tsk – dihonest snipping. I wrote
>
> | APL and Z Notation are two notable languages APL is a programming
> | language and Z a specification language that did not tie themselves down
> | to a restricted charset even in the day that ASCII ruled.
>
> so its clear that the restricted applies to ASCII

It is not clear at all, and in fact ASCII is irrelevant.

Even in the days that "ASCII ruled", there were dozens, maybe hundreds of
restricted charsets. EBCDIC, national variants of ASCII, mutations of it
like PETSCII (used on Commodore machines), 8-bit code pages...

APL was invented in 1964, the first public draft of ASCII was 1963 just one
year earlier. In 1964, ASCII was not commonly used in computing, it was a
seven-bit teleprinter code. ASCII didn't get fully established in computing
until 1968, when the US government mandated that starting from 1969 all
computers purchased by the government had to support ASCII.

When APL was invented, ASCII wasn't even relevant.


>> You list ideographs such as Cuneiform under "Icons". They are not icons.
>> They are a mixture of symbols used for consonants, syllables, and
>> logophonetic, consonantal alphabetic and syllabic signs. That sits them
>> firmly in the same categories as modern languages with consonants,
>> ideogram languages like Chinese, and syllabary languages like Cheyenne.
>
> Ok changed to iconic.
> Obviously 2-3 millenia ago, when people spoke hieroglyphs or cuneiform

o_O

People don't speak hieroglyphs, except in Asterisk The Gaul comics. People
speak words.


> they were languages. In 2015 when someone sees them and recognizes them,
> they are 'those things that Sumerians/Egyptians wrote' No one except a
> rare expert knows those languages

True. But there are people who are not "rare experts" but still have need to
use cuneiform or hieroglyphs in their works, just like not everybody who
writes about mathematics is "a rare expert" mathematician.


>> Just because native readers of Cuneiform are all dead doesn't make
>> Cuneiform unimportant. There are probably more people who need to write
>> Cuneiform than people who need to write APL source code.
>>
>> You make a comment:
>>
>> "To me – a unicode-layman – it looks unprofessional… Billions of
>> computing devices world over, each having billions of storage words
>> having their storage wasted on blocks such as these??"
>>
>> But that is nonsense, and it contradicts your earlier quoting of Dave
>> Angel. Why are you so worried about an (illusionary) minor optimization?
>
> 2 < 4 as far as I am concerned.
> [If you disagree one man's illusionary is another's waking]

You can't have it both ways. You acknowledge that 16-bits are not sufficient
for a universal character set, then criticize Unicode for using more than
16-bits. This is inconsistent and foolish.


[...]
>> Your discussion of "complexifiers and simplifiers" doesn't seem to be
>> terribly relevant, or at least if it is relevant, you don't give any
>> reason for it. The whole thing about Moore's Law and the cost of
>> semi-conductor plants seems irrelevant to Unicode except in the most
>> over-generalised sense of "things are bigger today than in the past,
>> we've gone from five-bit Baudot codes to 23 bit Unicode". Yeah, okay. So
>> what's your point?
>
> - Most people need only 16 bits.

I don't know about "most" people, but there are over one billion Chinese
whose native language simply doesn't fit into 16 bits.


> - Many notable examples of software fail going from 16 to 23.
> - If you are a software writer, and you fail going 16 to 23 its ok but try
> to give useful errors

No it isn't okay.


>> You agree that 16-bits are not enough, and yet you critice Unicode for
>> using more than 16-bits on wasteful, whimsical gibberish like Cuneiform?
>> That is an inconsistent position to take.
>
> | ½-assed unicode support – BMP-only – is better than 1/100-assed⁴ support
> | –
> | ASCII. BMP-only Unicode is universal enough but within practical limits
> | whereas full (7.0) Unicode is 'really' universal at a cost of
> | performance and whimsicality.
>
> Do you disagree that BMP-only = 16 bits?

That point is not in question.

Unicode was extended beyond 16 bits because 16 bits *is not enough* even for
existing human languages in common use.

As for performance, you contradict yourself. You've quoted Dave TWICE about
all these artificial limits imposed which turned out to be too low, and here
you are doing exactly the same thing.

[...]
>> You say:
>>
>> "I just want to suggest that the Unicode consortium going overboard in
>> adding zillions of codepoints of nearly zero usefulness, is in fact
>> undermining unicode’s popularity and spread."
>>
>> Can you demonstrate this? Can you show somebody who says "Well, I was
>> going to support full Unicode, but since they added a snowman, I'm going
>> to stick to ASCII"?
>
> I gave a list of softwares which goof/break going BMP to 7.0 unicode

Irrelevant to my question.

You didn't say that Unicode was being undermined by buggy programming
languages, you stated it was being undermined by the addition of characters
of "nearly zero usefulness". Citation please.



>>
>> The "whimsical" characters you are complaining about were important
>> enough to somebody to spend significant amounts of time and money to
>> write up a proposal, have it go through the Unicode Consortium
>> bureaucracy, and eventually have it accepted. That's not easy or cheap,
>> and people didn't add a snowman on a whim. They did it because there are
>> a whole lot of people who want a shared standard for map symbols.
>>
>> It is easy to mock what is not important to you. I daresay kids adding
>> emoji to their 10 character tweets would mock all the useless maths
>> symbols in Unicode too.
>
> Head para of section 5 has:
> | However (the following) are (in the standard)! So lets use them!
> Looks like mocking to you

No. The part where you say they are "gibberish" or "whimsical" and make zero
effort to understand why they were added is mocking. The part where your
argument basically boils down to "I personally have no need for these
characters, therefore the Unicode Consortium is silly for adding them."


> The only mocking is at 5.1. And even here I dont mock the users of these
> blocks – now or millenia ago. I only mock the unicode consortium for
> putting them into unicode

Exactly.



--
Steve

Steven D'Aprano

unread,
Mar 4, 2015, 3:16:36 AM3/4/15
to
Chris Angelico wrote:

> Also, the expansion from 16-bit was back in Unicode 2.0, not 7.0. Why
> do you keep talking about 7.0 as if it's a recent change?

This is the Internet. Lack of knowledge about something doesn't prevent
people from having opinions about it.


--
Steven

wxjm...@gmail.com

unread,
Mar 4, 2015, 5:17:07 AM3/4/15
to
On Wednesday, March 4, 2015 at 09:14:42 UTC+1, Steven D'Aprano wrote:
>
> o_O
>
> People don't speak hieroglyphs, except in Asterisk The Gaul comics. People
> speak words.
>
>
http://www.asterix.com/asterix-de-a-a-z/les-personnages/tumeheris.html

jmf

Steven D'Aprano

unread,
Mar 5, 2015, 9:06:32 AM3/5/15
to
Rustom Mody wrote:

> On Wednesday, March 4, 2015 at 10:25:24 AM UTC+5:30, Chris Angelico wrote:
>> On Wed, Mar 4, 2015 at 3:45 PM, Rustom Mody wrote:
>> >
>> > It lists some examples of software that somehow break/goof going from
>> > BMP-only unicode to 7.0 unicode.
>> >
>> > IOW the suggestion is that the the two-way classification
>> > - ASCII
>> > - Unicode
>> >
>> > is less useful and accurate than the 3-way
>> >
>> > - ASCII
>> > - BMP
>> > - Unicode
>>
>> How is that more useful? Aside from storage optimizations (in which
>> the significant breaks would be Latin-1, UCS-2, and UCS-4), the BMP is
>> not significantly different from the rest of Unicode.
>
> Sorry... Dont understand.

Chris is suggesting that going from BMP to all of Unicode is not the hard
part. Going from ASCII to the BMP part of Unicode is the hard part. If you
can do that, you can go the rest of the way easily.

I mostly agree with Chris. Supporting *just* the BMP is non-trivial in UTF-8
and UTF-32, since that goes against the grain of the system. You would have
to program in artificial restrictions that otherwise don't exist.

UTF-16 is different, and that's probably why you think supporting all of
Unicode is hard. With UTF-16, there really is an obvious distinction
between the BMP and the SMP: that's where you jump from a single 2-byte
unit to a pair of 2-byte units. But that distinction doesn't exist in UTF-8
or UTF-32:

- In UTF-8, about 99.8% of the BMP requires multiple bytes. Whether you
support the SMP or not doesn't change the fact that you have to deal
with multi-byte characters.

- In UTF-32, everything is fixed-width whether it is in the BMP or not.

In both cases, supporting the SMPs is no harder than supporting the BMP.
It's only UTF-16 that makes the SMP seem hard.
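
Concretely, with U+1F600 (GRINNING FACE) as the SMP example:

c = "\U0001F600"                    # an SMP code point
print(len(c))                       # 1 -- one code point regardless of encoding
print(len(c.encode("utf-8")))       # 4 -- same multi-byte rules as the BMP
print(len(c.encode("utf-32-le")))   # 4 -- same fixed width as the BMP
print(len(c.encode("utf-16-le")))   # 4 -- the only place a surrogate pair appears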

Conclusion: faulty implementations of UTF-16 which incorrectly handle
surrogate pairs should be replaced by non-faulty implementations, or
changed to UTF-8 or UTF-32; incomplete Unicode implementations which assume
that Unicode is 16-bit only (e.g. UCS-2) are obsolete and should be
upgraded.

Wrong conclusion: SMPs are unnecessary and unneeded, and we need a new
standard that is just like obsolete Unicode version 1.

Unicode version 1 is obsolete for a reason. 16 bits is not enough for even
existing languages, let alone all the code points and characters that are
used in human communication.


>> Also, the expansion from 16-bit was back in Unicode 2.0, not 7.0. Why
>> do you keep talking about 7.0 as if it's a recent change?
>
> It is 2015 as of now. 7.0 is the current standard.
>
> The need for the adjective 'current' should be pondered upon.

What's your point?

The UTF encodings have not changed since they were first introduced. They
have been stable for at least twenty years: UTF-8 has existed since 1993,
and UTF-16 since 1996.

Since version 2.0 of Unicode in 1996, the standard has made "stability
guarantees" that no code points will be renamed or removed. Consequently,
there has only been one version which removed characters, version 1.1.
Since then, new versions of the standard have only added characters, never
moved, renamed or deleted them.

http://unicode.org/policies/stability_policy.html

Some highlights in Unicode history:

Unicode 1.0 (1991): initial version, defined 7161 code points.

In January 1993, Rob Pike and Ken Thompson announced the design and working
implementation of the UTF-8 encoding.

1.1 (1993): defined 34233 characters, finalised Han Unification. Removed
some characters from the 1.0 set. This is the first and only time any code
points have been removed.

2.0 (1996): First version to include code points in the Supplementary
Multilingual Planes. Defined 38950 code points. Introduced the UTF-16
encoding.

3.1 (2001): Defined 94205 code points, including 42711 additional Han
ideographs, bringing the total number of CJK code points alone to 71793,
too many to fit in 16 bits.

2006: The People's Republic Of China mandates support for the GB-18030
character set for all software products sold in the PRC. GB-18030 supports
the entire Unicode range, including the SMPs. Since this date, all software
sold in China must support the SMPs.

6.0 (2010): The first emoji or emoticons were added to Unicode.

7.0 (2014): 113021 code points defined in total.


> In practice, standards change.
> However if a standard changes so frequently that that users have to play
> catching cook and keep asking: "Which version?" they are justified in
> asking "Are the standard-makers doing due diligence?"

Since Unicode has stability guarantees, and the encodings have not changed
in twenty years and will not change in the future, this argument is bogus.
Updating to a new version of the standard means, to a first approximation,
merely allocating some new code points which had previously been undefined
but are now defined.

(Code points can be flagged deprecated, but they will never be removed.)



--
Steven

wxjm...@gmail.com

unread,
Mar 5, 2015, 9:59:33 AM3/5/15
to
===============
===============

(2012): Some idiots tried to reinvent "Unicode" and
they failed.

jmf

rand...@fastmail.us

unread,
Mar 5, 2015, 3:00:03 PM3/5/15
to pytho...@python.org
On Thu, Mar 5, 2015, at 09:06, Steven D'Aprano wrote:
> I mostly agree with Chris. Supporting *just* the BMP is non-trivial in UTF-8
> and UTF-32, since that goes against the grain of the system. You would have
> to program in artificial restrictions that otherwise don't exist.

UTF-8 is already restricted from representing values above 0x10FFFF,
whereas UTF-8 can "naturally" represent values up to 0x1FFFFF in four
bytes, up to 0x3FFFFFF in five bytes, and 0x7FFFFFFF in six bytes. If
anything, the BMP represents a natural boundary, since it coincides with
values that can be represented in three bytes. Likewise, UTF-32 can
obviously represent values up to 0xFFFFFFFF. You're programming in
artificial restrictions either way, it's just a question of what those
restrictions are.
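
Python's own codecs bake in the 0x10FFFF ceiling, for what it's worth:

import sys

print(hex(sys.maxunicode))                   # 0x10ffff -- highest legal code point
print(len(chr(0x10FFFF).encode("utf-8")))    # 4 -- longest UTF-8 sequence Python emits
try:
    chr(0x110000)
except ValueError as e:
    print(e)                                 # chr() arg not in range(0x110000)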

Steven D'Aprano

unread,
Mar 5, 2015, 5:34:01 PM3/5/15
to
Good points, but they don't greatly change my conclusion. If you are
implementing UTF-8 or UTF-32, it is no harder to deal with code points in
the SMP than those in the BMP.


--
Steven

Rustom Mody

unread,
Mar 5, 2015, 11:53:19 PM3/5/15
to
On Thursday, March 5, 2015 at 7:36:32 PM UTC+5:30, Steven D'Aprano wrote:
> Rustom Mody wrote:
>
> > On Wednesday, March 4, 2015 at 10:25:24 AM UTC+5:30, Chris Angelico wrote:
> >> On Wed, Mar 4, 2015 at 3:45 PM, Rustom Mody wrote:
> >> >
> >> > It lists some examples of software that somehow break/goof going from
> >> > BMP-only unicode to 7.0 unicode.
> >> >
> >> > IOW the suggestion is that the the two-way classification
> >> > - ASCII
> >> > - Unicode
> >> >
> >> > is less useful and accurate than the 3-way
> >> >
> >> > - ASCII
> >> > - BMP
> >> > - Unicode
> >>
> >> How is that more useful? Aside from storage optimizations (in which
> >> the significant breaks would be Latin-1, UCS-2, and UCS-4), the BMP is
> >> not significantly different from the rest of Unicode.
> >
> > Sorry... Dont understand.
>
> Chris is suggesting that going from BMP to all of Unicode is not the hard
> part. Going from ASCII to the BMP part of Unicode is the hard part. If you
> can do that, you can go the rest of the way easily.

Depends where the going is starting from.
I specifically named Java, Javascript, Windows... among others.
Here's some quotes from the supplementary chars doc of Java
http://www.oracle.com/technetwork/articles/javase/supplementary-142654.html

| Supplementary characters are characters in the Unicode standard whose code
| points are above U+FFFF, and which therefore cannot be described as single
| 16-bit entities such as the char data type in the Java programming language.
| Such characters are generally rare, but some are used, for example, as part
| of Chinese and Japanese personal names, and so support for them is commonly
| required for government applications in East Asian countries...

| The introduction of supplementary characters unfortunately makes the
| character model quite a bit more complicated.

| Unicode was originally designed as a fixed-width 16-bit character encoding.
| The primitive data type char in the Java programming language was intended to
| take advantage of this design by providing a simple data type that could hold
| any character.... Version 5.0 of the J2SE is required to support version 4.0
| of the Unicode standard, so it has to support supplementary characters.

My conclusion: Early adopters of unicode -- Windows and Java -- were punished
for their early adoption. You can blame the unicode consortium, you can
blame the babel of human languages, particularly that some use characters
and some only (the equivalent of) what we call words.

Or you can skip the blame-game and simply note the fact that large segments of
extant code-bases are currently in bug-prone or plain buggy state.

This includes not just bug-prone-system code such as Java and Windows but
seemingly working code such as python 3.
>
> I mostly agree with Chris. Supporting *just* the BMP is non-trivial in UTF-8
> and UTF-32, since that goes against the grain of the system. You would have
> to program in artificial restrictions that otherwise don't exist.

Yes UTF-8 and UTF-32 make most of the objections to unicode 7.0 irrelevant.
Large segments of the
>
> UTF-16 is different, and that's probably why you think supporting all of
> Unicode is hard. With UTF-16, there really is an obvious distinction
> between the BMP and the SMP: that's where you jump from a single 2-byte
> unit to a pair of 2-byte units. But that distinction doesn't exist in UTF-8
> or UTF-32:
>
> - In UTF-8, about 99.8% of the BMP requires multiple bytes. Whether you
> support the SMP or not doesn't change the fact that you have to deal
> with multi-byte characters.
>
> - In UTF-32, everything is fixed-width whether it is in the BMP or not.
>
> In both cases, supporting the SMPs is no harder than supporting the BMP.
> It's only UTF-16 that makes the SMP seem hard.
>
> Conclusion: faulty implementations of UTF-16 which incorrectly handle
> surrogate pairs should be replaced by non-faulty implementations, or
> changed to UTF-8 or UTF-32; incomplete Unicode implementations which assume
> that Unicode is 16-bit only (e.g. UCS-2) are obsolete and should be
> upgraded.

Imagine for a moment a thought experiment -- we are not on a python but a java
forum and please rewrite the above para.
Are you addressing the vanilla java programmer? Language implementer? Designer?
The Java-funders -- earlier Sun, now Oracle?
It's not about new code points; it's about "Fits in 2 bytes" to "Does not fit in 2 bytes"

If you call that argument bogus I call you a non computer scientist.
[Essentially this is my issue with the consortium: it seems to be working like
a bunch of linguists, not computer scientists]

Here is Roy Smith's post that first started me thinking that something may
be wrong with SMP
https://groups.google.com/d/msg/comp.lang.python/loYWMJnPtos/GHMC0cX_hfgJ

Some parts are from here, some from earlier posts and from my memory.
If any details are wrong please correct:
- 200 million records
- Containing 4 strings with SMP characters
- System made with python and mysql. SMP works with python, breaks mysql.
So whole system broke due to those 4 in 200,000,000 records

I know enough (or not enough) of unicode to be chary of statistical conclusions
from the above.
My conclusion is essentially an 'existence-proof':

SMP-chars can break systems.
The breakage is costly-fied by the combination
- layman statistical assumptions
- BMP → SMP exercises different code-paths

It is necessary but not sufficient to test print "hello world" in ASCII, BMP, SMP.
You also have to write the hello world in the database -- mysql
Read it from the webform -- javascript
etc etc

You could also choose to do with "astral crap" (Roy's words) what we all do with
crap -- throw it out as early as possible.

Chris Angelico

unread,
Mar 6, 2015, 12:20:35 AM3/6/15
to pytho...@python.org
On Fri, Mar 6, 2015 at 3:53 PM, Rustom Mody <rusto...@gmail.com> wrote:
> My conclusion: Early adopters of unicode -- Windows and Java -- were punished
> for their early adoption. You can blame the unicode consortium, you can
> blame the babel of human languages, particularly that some use characters
> and some only (the equivalent of) what we call words.
>
> Or you can skip the blame-game and simply note the fact that large segments of
> extant code-bases are currently in bug-prone or plain buggy state.

For most of the 1990s, I was writing code in REXX, on OS/2. An even
earlier adopter, REXX didn't have Unicode support _at all_, but
instead had facilities for working with DBCS strings. You can't get
everything right AND be the first to produce anything. Python didn't
make Unicode strings the default until 3.0, but that's not Unicode's
fault.

> This includes not just bug-prone-system code such as Java and Windows but
> seemingly working code such as python 3.
>
> Here is Roy's Smith post that first started me thinking that something may
> be wrong with SMP
> https://groups.google.com/d/msg/comp.lang.python/loYWMJnPtos/GHMC0cX_hfgJ
>
> Some parts are here some earlier and from my memory.
> If details wrong please correct:
> - 200 million records
> - Containing 4 strings with SMP characters
> - System made with python and mysql. SMP works with python, breaks mysql.
> So whole system broke due to those 4 in 200,000,000 records
>
> I know enough (or not enough) of unicode to be chary of statistical conclusions
> from the above.
> My conclusion is essentially an 'existence-proof':

Hang on hang on. Why are you blaming Python or SMP characters for
this? The problem here is MySQL, which doesn't adequately cope with
the full Unicode range. (Or, didn't then, or doesn't with its default
settings. I believe you can configure current versions of MySQL to
work correctly, though I haven't actually checked. PostgreSQL gets it
right, that's good enough for me.)

> SMP-chars can break systems.
> The breakage is costly-fied by the combination
> - layman statistical assumptions
> - BMP → SMP exercises different code-paths

Broken systems can be shown up by anything. Suppose you have a program
that breaks when it gets a NUL character (not unknown in C code); is
the fault with the Unicode consortium for allocating something at
codepoint 0, or the code that can't cope with a perfectly normal
character?

> You could also choose do with "astral crap" (Roy's words) what we all do with
> crap -- throw it out as early as possible.

There's only one character that fits that description, and that's
1F4A9. Everything else is just "astral characters", and you shouldn't
have any difficulties with them.

ChrisA

Rustom Mody

unread,
Mar 6, 2015, 4:03:11 AM3/6/15
to
On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:
Strawman.

Lets please stick to UTF-16 shall we?

Now tell me:
- Is it broken or not?
- Is it widely used or not?
- Should programmers be careful of it or not?
- Should programmers be warned about it or not?

Rustom Mody

unread,
Mar 6, 2015, 4:06:41 AM3/6/15
to
On Friday, March 6, 2015 at 2:33:11 PM UTC+5:30, Rustom Mody wrote:
> Lets please stick to UTF-16 shall we?
>
> Now tell me:
> - Is it broken or not?
> - Is it widely used or not?
> - Should programmers be careful of it or not?
> - Should programmers be warned about it or not?

Also:
Can a programmer who is away from UTF-16 in one part of the system (say by using python3)
assume he is safe all over?

Chris Angelico

unread,
Mar 6, 2015, 4:54:48 AM3/6/15
to pytho...@python.org
On Fri, Mar 6, 2015 at 8:02 PM, Rustom Mody <rusto...@gmail.com> wrote:
>> Broken systems can be shown up by anything. Suppose you have a program
>> that breaks when it gets a NUL character (not unknown in C code); is
>> the fault with the Unicode consortium for allocating something at
>> codepoint 0, or the code that can't cope with a perfectly normal
>> character?
>
> Strawman.

Not really, no. I know of lots of programs that can't handle embedded
NULs, and which fail in various ways when given them (the most common
is simple truncation, but it's by far not the only way). And it's
exactly the same: a program that purports to handle arbitrary Unicode
text should be able to handle arbitrary Unicode text, not "Unicode
text as long as it contains only codepoints within the range X-Y". It
doesn't matter whether the code chokes on U+0000, U+005C, U+FFFC, or
U+1F4A3 - if your code blows up, it's a failure in your code.

> Lets please stick to UTF-16 shall we?
>
> Now tell me:
> - Is it broken or not?
> - Is it widely used or not?
> - Should programmers be careful of it or not?
> - Should programmers be warned about it or not?

No, UTF-16 is not itself broken. (It would be if we expected
codepoints >0x10FFFF, and it's because of UTF-16 that that's the cap
on Unicode, but it's looking unlikely that we'll be needing any more
than that anyway.) What's broken is code that tries to treat UTF-16 as
if it's UCS-2, and then breaks on surrogate pairs.
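
A couple of lines of Python 3 make the distinction visible (the point is not
Python-specific, Python just makes it easy to show):

s = "A\u20ac\U0001f4a9"                    # BMP, BMP, SMP
print(len(s))                              # 3 code points
print(len(s.encode("utf-16-le")) // 2)     # 4 UTF-16 code units: the SMP
                                           # character needs a surrogate pair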

Yes, it's widely used. Programmers should probably be warned about it,
but only because its tradeoffs are generally poorer than UTF-8's. If
you use it correctly, there's no problem.

> Also:
> Can a programmer who is away from UTF-16 in one part of the system (say by using python3)
> assume he is safe all over?

I don't know what you mean here. Do you mean that your Python 3
program is "at risk" in some way because there might be some other
program that misuses UTF-16? Well, sure. And there might be some other
program that misuses buffer sizes, SQL queries, or shell invocations,
and makes your overall system vulnerable to buffer overruns or
injection attacks. These are significantly more likely AND more
serious than UTF-16 misuses. And you still have not proven anything
about SMP characters being a problem, but only that code can be
broken. Broken code is still broken code, no matter what your actual
brokenness.

ChrisA

Rustom Mody

unread,
Mar 6, 2015, 5:07:54 AM3/6/15
to
On Friday, March 6, 2015 at 3:24:48 PM UTC+5:30, Chris Angelico wrote:
> On Fri, Mar 6, 2015 at 8:02 PM, Rustom Mody wrote:
> >> Broken systems can be shown up by anything. Suppose you have a program
> >> that breaks when it gets a NUL character (not unknown in C code); is
> >> the fault with the Unicode consortium for allocating something at
> >> codepoint 0, or the code that can't cope with a perfectly normal
> >> character?
> >
> > Strawman.
>
> Not really, no. I know of lots of programs that can't handle embedded
> NULs, and which fail in various ways when given them (the most common
> is simple truncation, but it's by far not the only way).

Ah well if you insist on pursuing the nul-char example...
No, the unicode consortium (or ASCII equivalent) is not wrong in allocating codepoint 0.
Nor is the code that "can't cope with a perfectly normal character" wrong.

The fault lies with C for having a data structure called string with a 'hole' in it.
Yes: some other program/library/API etc. connected to the python one.

> Well, sure. And there might be some other
> program that misuses buffer sizes, SQL queries, or shell invocations,
> and makes your overall system vulnerable to buffer overruns or
> injection attacks. These are significantly more likely AND more
> serious than UTF-16 misuses. And you still have not proven anything
> about SMP characters being a problem, but only that code can be
> broken. Broken code is still broken code, no matter what your actual
> brokenness.

Roy Smith (and many other links I've cited) prove exactly that - an
SMP character broke the code.

Note: I have no objection to people supporting full unicode 7.
I'm just saying it may be significantly harder than just "Use python3 and you are done".

rand...@fastmail.us

unread,
Mar 6, 2015, 8:33:30 AM3/6/15
to pytho...@python.org
On Fri, Mar 6, 2015, at 04:06, Rustom Mody wrote:
> Also:
> Can a programmer who is away from UTF-16 in one part of the system (say
> by using python3)
> assume he is safe all over?

The most common failure of UTF-16 support, supposedly, is in programs
misusing the number of code units (for length or random access) as a
proxy for the number of characters.

However, when do you _really_ want the number of characters? You may
want to use it for, for example, the number of columns in a 'monospace'
font, which you've already screwed up because you haven't accounted for
double-wide characters or combining marks. Or you may want the position
that pressing an arrow key or backspace or forward-delete a number of
times will reach, which has its own rules in e.g. Indic languages (and
also fails on Latin with combining marks).
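
A small Python 3 sketch of that gap between code points and what the user
perceives, using only the standard unicodedata module:

import unicodedata

s = "e\u0301"                                 # 'e' + COMBINING ACUTE ACCENT
print(len(s))                                 # 2 code points, one visible glyph
print(len(unicodedata.normalize("NFC", s)))   # 1 after composing to U+00E9

print(unicodedata.east_asian_width("\uff21")) # 'F': a fullwidth letter still
                                              # occupies two terminal columns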

Chris Angelico

unread,
Mar 6, 2015, 8:39:41 AM3/6/15
to pytho...@python.org
On Sat, Mar 7, 2015 at 12:33 AM, <rand...@fastmail.us> wrote:
> However, when do you _really_ want the number of characters? You may
> want to use it for, for example, the number of columns in a 'monospace'
> font, which you've already screwed up because you haven't accounted for
> double-wide characters or combining marks. Or you may want the position
> that pressing an arrow key or backspace or forward-delete a number of
> times will reach, which has its own rules in e.g. Indic languages (and
> also fails on Latin with combining marks).

Number of code points is the most logical way to length-limit
something. If you want to allow users to set their display names but
not to make arbitrarily long ones, limiting them to X code points is
the safest way (and preferably do an NFC or NFD normalization before
counting, for consistency); this means you disallow pathological cases
where every base character has innumerable combining marks added.
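
A minimal sketch of that policy (the 30-code-point limit and the function name
are invented for the example):

import unicodedata

def name_too_long(name, limit=30):
    # Normalize first so a precomposed character and its decomposed
    # equivalent count the same, then limit by code points.
    return len(unicodedata.normalize("NFC", name)) > limit

print(name_too_long("Ada Lovelace"))          # False
print(name_too_long("x" + "\u0301" * 100))    # True: combining-mark spam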

ChrisA

rand...@fastmail.us

unread,
Mar 6, 2015, 9:03:30 AM3/6/15
to pytho...@python.org
On Fri, Mar 6, 2015, at 08:39, Chris Angelico wrote:
> Number of code points is the most logical way to length-limit
> something. If you want to allow users to set their display names but
> not to make arbitrarily long ones, limiting them to X code points is
> the safest way (and preferably do an NFC or NFD normalization before
> counting, for consistency);

Why are you length-limiting it? Storage space? Limit it in whatever
encoding they're stored in. Why are combining marks "pathological" but
surrogate characters not? Display space? Limit it by columns. If you're
going to allow a Japanese user's name to be twice as wide, you've got a
problem when you go to display it.

> this means you disallow pathological cases
> where every base character has innumerable combining marks added.

No it doesn't. If you limit it to, say, fifty, someone can still post
two base characters with twenty combining marks each. If you actually
want to disallow this, you've got to do more work. You've disallowed
some of the pathological cases, some of the time, by coincidence. And
limiting the number of UTF-8 bytes, or the number of UTF-16 code points,
will accomplish this just as well.

Now, if you intend to _silently truncate_ it to the desired length, you
certainly don't want to leave half a character in, of course. But who's
to say the base character plus first few combining marks aren't also
"half a character"? If you're _splitting_ a string, rather than merely
truncating it, you probably don't want those combining marks at the
beginning of part two.

Chris Angelico

unread,
Mar 6, 2015, 9:11:42 AM3/6/15
to pytho...@python.org
On Sat, Mar 7, 2015 at 1:03 AM, <rand...@fastmail.us> wrote:
> On Fri, Mar 6, 2015, at 08:39, Chris Angelico wrote:
>> Number of code points is the most logical way to length-limit
>> something. If you want to allow users to set their display names but
>> not to make arbitrarily long ones, limiting them to X code points is
>> the safest way (and preferably do an NFC or NFD normalization before
>> counting, for consistency);
>
> Why are you length-limiting it? Storage space? Limit it in whatever
> encoding they're stored in. Why are combining marks "pathological" but
> surrogate characters not? Display space? Limit it by columns. If you're
> going to allow a Japanese user's name to be twice as wide, you've got a
> problem when you go to display it.

To prevent people from putting three paragraphs of lipsum in and
calling it a username.

>> this means you disallow pathological cases
>> where every base character has innumerable combining marks added.
>
> No it doesn't. If you limit it to, say, fifty, someone can still post
> two base characters with twenty combining marks each. If you actually
> want to disallow this, you've got to do more work. You've disallowed
> some of the pathological cases, some of the time, by coincidence. And
> limiting the number of UTF-8 bytes, or the number of UTF-16 code points,
> will accomplish this just as well.

They can, but then they're limited to two base characters. They can't
have fifty base characters with twenty combining marks each. That's
the point.

> Now, if you intend to _silently truncate_ it to the desired length, you
> certainly don't want to leave half a character in, of course. But who's
> to say the base character plus first few combining marks aren't also
> "half a character"? If you're _splitting_ a string, rather than merely
> truncating it, you probably don't want those combining marks at the
> beginning of part two.

So you truncate to the desired length, then if the first character of
the trimmed-off section is a combining mark (based on its Unicode
character types), you keep trimming until you've removed a character
which isn't. Then, if you no longer have any content whatsoever,
reject the name. Simple.
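
Roughly, in Python 3 (a sketch of the rule just described, not production
code):

import unicodedata

def truncate(name, limit):
    kept, rest = name[:limit], name[limit:]
    # If the cut landed between a base character and its combining marks,
    # trim the kept part back past that base character as well.
    if rest and unicodedata.combining(rest[0]):
        while kept and unicodedata.combining(kept[-1]):
            kept = kept[:-1]
        kept = kept[:-1]
    return kept          # caller rejects the name if this comes back empty

print(repr(truncate("cafe\u0301s", 4)))  # 'caf': the accent would have been cut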

ChrisA

rand...@fastmail.us

unread,
Mar 6, 2015, 9:28:04 AM3/6/15
to pytho...@python.org
On Fri, Mar 6, 2015, at 09:11, Chris Angelico wrote:
> To prevent people from putting three paragraphs of lipsum in and
> calling it a username.

Limiting by UTF-8 bytes or UTF-16 units works just as well for that.

> So you truncate to the desired length, then if the first character of
> the trimmed-off section is a combining mark (based on its Unicode
> character types), you keep trimming until you've removed a character
> which isn't. Then, if you no longer have any content whatsoever,
> reject the name. Simple.

My entire point was that UTF-32 doesn't save you from that, so it cannot
be called a deficiency of UTF-16. My point is there are very few
problems to which "count of Unicode code points" is the only right
answer - problems that UTF-32 is good enough for but that are meaningfully
impacted by a naive usage of UTF-16, to the point where UTF-16 is
something you have to be "safe" from.

Steven D'Aprano

unread,
Mar 6, 2015, 9:50:22 AM3/6/15
to
Rustom Mody wrote:

> On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:

[snip example of an analogous situation with NULs]

> Strawman.

Sigh. If I had a dollar for every time somebody cried "Strawman!" when what
they really should say is "Yes, that's a good argument, I'm afraid I can't
argue against it, at least not without considerable thought", I'd be a
wealthy man...


> Lets please stick to UTF-16 shall we?
>
> Now tell me:
> - Is it broken or not?

The UTF-16 standard is not broken. It is a perfectly adequate variable-width
encoding, and considerably better than most other variable-width encodings.

However, many implementations of UTF-16 are faulty, and assume a
fixed-width. *That* is broken, not UTF-16.

(The difference between specification and implementation is critical.)
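
The non-faulty part is tiny, for what it's worth: a correct UTF-16 decoder
simply combines surrogate pairs instead of treating every 16-bit unit as a
character. A sketch of just that step:

def combine_surrogates(high, low):
    # high (lead) surrogate: U+D800..U+DBFF; low (trail): U+DC00..U+DFFF
    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print(hex(combine_surrogates(0xD83D, 0xDCA9)))   # 0x1f4a9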


> - Is it widely used or not?

It's quite widely used.


> - Should programmers be careful of it or not?

Programmers should be aware whether or not any specific language uses UTF-16
and whether the implementation is buggy. That will help them decide whether
or not to use that language.


> - Should programmers be warned about it or not?

I'm in favour of people having more knowledge rather than less. I don't
believe that ignorance is bliss, except perhaps in the case that a giant
asteroid the size of Texas is heading straight for us.

Programmers should be aware of the limitations or bugs in any UTF-16
implementation they are likely to run into. Hence my general
recommendation:

- For transmission over networks or storage on permanent media (e.g. the
content of text files), use UTF-8. It is well-implemented by nearly all
languages that support Unicode, as far as I know.

- If you are designing your own language, your implementation of Unicode
strings should use something like Python's FSR, or UTF-8 with tweaks to
make string indexing O(1) rather than O(N), or correctly-implemented
UTF-16, or even UTF-32 if you have the memory. (Choices, choices.) If, in
2015, you design your Unicode implementation as if UTF-16 is a fixed 2-byte
per code point format, you fail.

- If you are using an existing language, be aware of any bugs and
limitations in its Unicode implementation. You may or may not be able to
work around them, but at least you can decide whether or not you wish to
try.

- If you are writing your own file system layer, it's 2015 fer fecks sake,
file names should be Unicode strings, not bytes! (That's one part of the
Unix model that needs to die.) You can use UTF-8 or UTF-16 in the file
system, whichever you please, but again remember that both are
variable-width formats.



--
Steven

Chris Angelico

unread,
Mar 6, 2015, 10:27:36 AM3/6/15
to pytho...@python.org
On Sat, Mar 7, 2015 at 1:50 AM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> Rustom Mody wrote:
>
>> On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:
>
> [snip example of an analogous situation with NULs]
>
>> Strawman.
>
> Sigh. If I had a dollar for every time somebody cried "Strawman!" when what
> they really should say is "Yes, that's a good argument, I'm afraid I can't
> argue against it, at least not without considerable thought", I'd be a
> wealthy man...

If I had a dollar for every time anyone said "If I had <insert
currency unit here> for every time...", I'd go meta all day long and
profit from it... :)

> - If you are writing your own file system layer, it's 2015 fer fecks sake,
> file names should be Unicode strings, not bytes! (That's one part of the
> Unix model that needs to die.) You can use UTF-8 or UTF-16 in the file
> system, whichever you please, but again remember that both are
> variable-width formats.

I agree that that part of the Unix model needs to change, but there
are two viable ways to move forward:

1) Keep file names as bytes, but mandate that they be valid UTF-8
streams, and recommend that they be decoded UTF-8 for display to a
human
2) Change the entire protocol stack from the file system upwards so
that file names become Unicode strings.

Trouble with #2 is that file names need to be passed around somehow,
which means bytes in memory. So ultimately, #2 really means "keep file
names as bytes, and mandate an encoding all the way up the stack"...
so it's a massive documentation change that really comes down to the
same thing as #1.

This is one area where, as I understand it, Mac OS got it right. It's
time for other Unix variants to adopt the same policy. The bulk of
file names will be ASCII-only anyway, so requiring UTF-8 won't affect
them; a lot of others are already UTF-8; so all we need is a
transition scheme for the remaining ones. If there's a known FS
encoding, it ought to be possible to have a file system conversion
tool that goes through everything, decodes, re-encodes UTF-8, and then
flags the file system as UTF-8 compliant. All that'd be left would be
the file names that are broken already - ones that don't decode in the
FS encoding - and there's nothing to be done with them but wrap them
up into something probably-meaningless-but-reversible.
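
That reversible wrapping already exists, for what it's worth: PEP 383's
surrogateescape error handler, which Python 3 itself uses on Unix for
undecodable file-name bytes. A tiny sketch:

raw = b"caf\xe9.txt"                            # Latin-1 bytes on a UTF-8 system
name = raw.decode("utf-8", "surrogateescape")   # the bad byte becomes U+DCE9
print(repr(name))                               # 'caf\udce9.txt'
print(name.encode("utf-8", "surrogateescape") == raw)  # True: round-trips exactly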

When can we start doing this? ext5?

ChrisA

wxjm...@gmail.com

unread,
Mar 6, 2015, 10:38:03 AM3/6/15
to
===========

Sorry, but
it's time to learn and to understand UNICODE.
(It is not so complicated.)

jmf

Rustom Mody

unread,
Mar 6, 2015, 11:21:10 AM3/6/15
to
On Friday, March 6, 2015 at 8:20:22 PM UTC+5:30, Steven D'Aprano wrote:
> Rustom Mody wrote:
>
> > On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:
>
> [snip example of an analogous situation with NULs]
>
> > Strawman.
>
> Sigh. If I had a dollar for every time somebody cried "Strawman!" when what
> they really should say is "Yes, that's a good argument, I'm afraid I can't
> argue against it, at least not without considerable thought", I'd be a
> wealthy man...

Missed my addition? Here it is again – grammar slightly corrected.

===========
Ah well if you insist on pursuing the nul-char example...
- No, the unicode consortium (or ASCII equivalent) is not wrong in allocating codepoint 0

- No, the code that "can't cope with a perfectly normal character" is not wrong

- It is C that is wrong for designing a buggy string data structure that cannot
contain a valid char.
===========

In fact Chris' nul-char example supports my argument – the bugginess of UTF-16 –
so strongly that it is perhaps too strong even for me.

To elaborate:
Take the buggy-plane analogy I gave in
http://blog.languager.org/2015/03/whimsical-unicode.html

If a plane model crashes once in 10,000 flights compared to others that crash once in
one million flights we can call it bug-prone though not strictly buggy – it does fly
9999 times safely!
OTOH if a plane is guaranteed to crash we can call it a buggy plane.

C's string is not bug-prone, it's plain buggy, as it cannot represent strings
with nulls.

I would not go that far for UTF-16.
It is bug-inviting but it can also be implemented correctly
FSR is possible in python for very specific pythonic reasons
- dynamicness
- immutable strings

Drop either and FSR is impossible

> If, in 2015, you design your Unicode implementation as if UTF-16 is a fixed
> 2-byte per code point format, you fail.

Seems obvious enough.
So lets see...
Here's a 2-line python program -- runs well enough when run as a command.
Program:
=========
pp = "💩"
print (pp)
=========
Try opening it in idle3 and you get (at least I get):

$ idle3 ff.py
Traceback (most recent call last):
File "/usr/bin/idle3", line 5, in <module>
main()
File "/usr/lib/python3.4/idlelib/PyShell.py", line 1562, in main
if flist.open(filename) is None:
File "/usr/lib/python3.4/idlelib/FileList.py", line 36, in open
edit = self.EditorWindow(self, filename, key)
File "/usr/lib/python3.4/idlelib/PyShell.py", line 126, in __init__
EditorWindow.__init__(self, *args)
File "/usr/lib/python3.4/idlelib/EditorWindow.py", line 294, in __init__
if io.loadfile(filename):
File "/usr/lib/python3.4/idlelib/IOBinding.py", line 236, in loadfile
self.text.insert("1.0", chars)
File "/usr/lib/python3.4/idlelib/Percolator.py", line 25, in insert
self.top.insert(index, chars, tags)
File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 81, in insert
self.addcmd(InsertCommand(index, chars, tags))
File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 116, in addcmd
cmd.do(self.delegate)
File "/usr/lib/python3.4/idlelib/UndoDelegator.py", line 219, in do
text.insert(self.index1, self.chars, self.tags)
File "/usr/lib/python3.4/idlelib/ColorDelegator.py", line 82, in insert
self.delegate.insert(index, chars, tags)
File "/usr/lib/python3.4/idlelib/WidgetRedirector.py", line 148, in __call__
return self.tk_call(self.orig_and_operation + args)
_tkinter.TclError: character U+1f4a9 is above the range (U+0000-U+FFFF) allowed by Tcl

So who/what is broken?

>
> - If you are using an existing language, be aware of any bugs and
> limitations in its Unicode implementation. You may or may not be able to
> work around them, but at least you can decide whether or not you wish to
> try.
>
> - If you are writing your own file system layer, it's 2015 fer fecks sake,
> file names should be Unicode strings, not bytes! (That's one part of the
> Unix model that needs to die.) You can use UTF-8 or UTF-16 in the file
> system, whichever you please, but again remember that both are
> variable-width formats.

Correct.
Windows is broken for using UTF-16
Linux is broken for conflating UTF-8 and byte strings.

Lots of breakage out here, don't you think?
Maybe related to the equation

UTF-16 = UCS-2 + Duct-tape

??

Steven D'Aprano

unread,
Mar 6, 2015, 11:26:19 AM3/6/15
to
rand...@fastmail.us wrote:

> My point is there are very few
> problems to which "count of Unicode code points" is the only right
> answer - that UTF-32 is good enough for but that are meaningfully
> impacted by a naive usage of UTF-16, to the point where UTF-16 is
> something you have to be "safe" from.

I'm not sure why you care about the "count of Unicode code points", although
that *is* a problem. Not for end-user reasons like "how long is my
password?", but because it makes your job as a programmer harder.


[steve@ando ~]$ python2.7 -c "print (len(u'\U00004444:\U00014445'))"
4
[steve@ando ~]$ python3.3 -c "print (len(u'\U00004444:\U00014445'))"
3

It's hard to reason about your code when something as fundamental as the
length of a string is implementation-dependent. (By the way, the right
answer should be 3, not 4.)


But an even more important problem is that broken-UTF-16 lets you create
invalid, impossible Unicode strings *by accident*. Naturally you can create
broken Unicode if you assemble strings of surrogates yourself, but
broken-UTF-16 means it can happen from otherwise innocuous operations like
reversing a string:

py> s = u'\U00004444:\U00014445' # Python 2.7 narrow build
py> s[::-1]
u'\udc45\ud811:\u4444'


It's hard for me to demonstrate that the reversed string is broken because
the shell I am using does an amazingly good job of handling broken Unicode.
Even if I print it, the shell just prints missing-character glyphs instead
of crashing (fortunately for me!). But the first two code points are in
illegal order:

\udc45 is a low (trailing) surrogate, and must follow a high surrogate;
\ud811 is a high (leading) surrogate, and must precede a low surrogate;

I'm not convinced you should be allowed to create Unicode strings containing
mismatched surrogates like this deliberately, but you certainly shouldn't
be able to do so by accident.


--
Steven

Chris Angelico

unread,
Mar 6, 2015, 11:45:57 AM3/6/15
to pytho...@python.org
On Sat, Mar 7, 2015 at 3:20 AM, Rustom Mody <rusto...@gmail.com> wrote:
> C's string is not bug-prone its plain buggy as it cannot represent strings
> with nulls.
>
> I would not go that far for UTF-16.
> It is bug-inviting but it can also be implemented correctly

C's standard library string handling functions are restricted in that
they handle a 255-byte alphabet. They do not handle Unicode, they do
not handle NUL, that is simply how they are. But I never said I was
talking about the C standard library. If you type a text string into a
GUI entry field, or encode it quoted-printable and pass it to a web
server, or whatever, you shouldn't know or care about what language
the program is written in; and if that program barfs on a NUL, that's
a limitation. That limitation might be caused by its naive use of
strcpy() when it should have used memcpy(), but that's not your
problem.

It's exactly the same here: if your program chokes on an SMP
character, I don't care what your program was written in or what
library functions your program called on. All I care is that your
program - repeated for emphasis, *your* program - failed on that
input. It's up to you to choose your underlying functions
appropriately.

>> - If you are designing your own language, your implementation of Unicode
>> strings should use something like Python's FSR, or UTF-8 with tweaks to
>> make string indexing O(1) rather than O(N), or correctly-implemented
>> UTF-16, or even UTF-32 if you have the memory. (Choices, choices.)
>
> FSR is possible in python for very specific pythonic reasons
> - dynamicness
> - immutable strings
>
> Drop either and FSR is impossible

I don't know what you mean by "dynamicness". What you do need is a
Unicode string type, such that the application program isn't aware of
the underlying bytes, but simply treats this string as a sequence of
code points. The immutability isn't technically a requirement, but it
does make the FSR much more manageable; in a language with mutable
strings, it's probably more efficient to use UTF-32 for simplicity,
but it's up to the language designer to figure that out. (It might be
best to use something like the FSR, but where strings are never
narrowed after being widened, so it'd be possible for an ASCII-only
string to be stored UTF-32. That has consequences for comparisons, but
might give a reasonable hybrid of storage and mutation performance.)
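
You can watch CPython 3.3+ make that choice, incidentally (the exact sizes
include object overhead and vary by version, so only the growth rate matters):

import sys

for ch in ("a", "\u20ac", "\U0001f4a9"):
    # Storage grows at roughly 1, 2 and 4 bytes per character respectively.
    print(ascii(ch), sys.getsizeof(ch * 1000))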

> _tkinter.TclError: character U+1f4a9 is above the range (U+0000-U+FFFF) allowed by Tcl
>
> So who/what is broken?

The exception is pretty clear on that point. Tcl can't handle SMP
characters. So it's Tcl that's broken. Unless there's evidence to the
contrary, that's what I would expect to be the case.

> Correct.
> Windows is broken for using UTF-16
> Linux is broken for conflating UTF-8 and byte string.
>
> Lot of breakage out here dont you think?
> May be related to the equation
>
> UTF-16 = UCS-2 + Duct-tape

UTF-16 is an encoding that was designed to be backward-compatible with
UCS-2, just as UTF-8 was designed to be compatible with ASCII. Call it
what you will, but backward compatibility is pretty important. Look at
things like Triple DES (3DES) - if you use the same key three times, it's
compatible with single DES.

Linux isn't "broken" for conflating UTF-8 and byte strings. Linux is
flawed in that it defines file names to be byte strings, which means
that every file system could be different in what it actually uses as
the encoding. Since file names exist for the benefit of humans, they
should be treated as text, so we should work with them as text. But
for reasons of backward compatibility, Linux hasn't yet changed.

Windows isn't broken for using UTF-16. I think it's a poor trade-off,
given that so many file names are ASCII-only; and, of course, if any
program treats a Windows file name as UCS-2, then that program is
broken. But UTF-16 is not itself broken, any more than UTF-7 is. And
UTF-7 is a lot harder to work with.

ChrisA

wxjm...@gmail.com

unread,
Mar 6, 2015, 2:41:36 PM3/6/15
to
=============

1) A copy/paste of pp = ... from google group into
my Python interactive interpreter without intermediate
state.
2) Some manipulations.
3) A copy/paste from my interpreter into google group.

I hope the rendering will be correct.

Python 3.2.5 (default, May 15 2013, 23:06:03) [MSC v.1500 32 bit (Intel)] on win32
>>> eta runs etazero.py...
...etazero has been executed
>>> pp = "💩"
>>> print(pp)
💩
>>> len(pp)
2
>>> pp + pp + 'abc需' + pp
'💩💩abc需💩'
>>>
>>> # ok, nine glyphs, individually selectable.
>>>


Note:

len(pp) = 2 because of Py32. This is a deliberate
choice to keep the Py32 "behaviour" in my interpreter.

but also note:

The code point is correctly displayed with a single "glyph".
All the cut/copy/paste (eg word, pdf, ...), cursor movement,
selection, caret position, text wrapping, char typing, ... mainly
for rendering purposes, is done with my internal "artillery",
full unicode.

In my other GUI applications, everything is working fine,
including string lengths, because my "artillery" works and
also handles glyphs (including diacritical signs).
Honestly, I'm not sure about bidi; however the Hebrew I'm able
to test is working fine.

jmf



wxjm...@gmail.com

unread,
Mar 6, 2015, 2:59:04 PM3/6/15
to
======
The rest, part 2.

Re-cut/copy/paste of what I sent into my
interpreter.

>>>
>>> len('💩💩abc需💩')
12
>>>

Ok, fine.
Windows, Firefox, utf-16, ... are not so bad.

jmf

Terry Reedy

unread,
Mar 7, 2015, 1:11:53 AM3/7/15
to pytho...@python.org
The possible workaround is for Idle to translate "💩" to "\U0001f4a9"
(10 chars) before sending it to tk.
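
Something along these lines, presumably (just the shape of the idea, not what
Idle actually does):

def bmp_only(s):
    # Replace anything outside the BMP with its \UXXXXXXXX escape so that
    # a UCS-2-limited Tcl never sees an astral character.
    return "".join(ch if ord(ch) <= 0xFFFF else "\\U%08x" % ord(ch) for ch in s)

print(bmp_only('pp = "\U0001f4a9"'))   # pp = "\U0001f4a9"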

But some perspective. In the console interpreter:

>>> print("\U0001f4a9")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Programs\Python34\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f4a9' in
position 0: character maps to <undefined>

So what is broken? Windows Command Prompt.

More perspective. tk/Idle *will* print *something* for every BMP char.
Command Prompt will not. It does not even do ucs-2 correctly. So
which is more broken? Windows Command Prompt. Who has perhaps
1,000,000 times more resources, Microsoft? or the tcl/tk group? I think
we all know.

--
Terry Jan Reedy


wxjm...@gmail.com

unread,
Mar 7, 2015, 2:43:28 AM3/7/15
to
Well...

D:\jm>cd wuni

D:\jm\wuni>jmtest2
Py 3.2.5 (default, May 15 2013, 23:06:03) [MSC v.1500 32 bit (Intel)]
Quelques caractères: «abc需ßÜÆŸçñö»
Loop: empty string => quit
—>abc
Votre entrée était : abc 3 caractère(s)
—>abc需
Votre entrée était : abc需 6 caractère(s)
—>abc需\u20acz\u03b1\z\u0430z
Wahrscheinlich falsches \uxxxx, (single, invalid backslash)
—>abc需\u20acz\u03b1z\u0430z
Votre entrée était : abc需€zαzаz 12 caractère(s)
—>Москва\\Zürich\\Αθήνα
Votre entrée était : Москва\Zürich\Αθήνα 19 caractère(s)
—>
Fin

D:\jm\wuni>


Python is "more broken" than the Windows terminal.

C# works, Ruby works, julia works, go works, Python? NOT

jmf

wxjm...@gmail.com

unread,
Mar 7, 2015, 3:56:09 AM3/7/15
to
Le samedi 7 mars 2015 07:11:53 UTC+1, Terry Reedy a écrit :
> tcl
> The possible workaround is for Idle to translate "💩" to "\U0001f4a9"
> (10 chars) before sending it to tk.
>

Both are correct. It's a question of perspective.

In an interpreter, which presents the "soul" of the
language, "\U0001f4a9" has more sense than a glyph.

For a general application, for an end user, displaying
a glyph makes more sense.

See, my previous comments.

----

Windows terminal:
I do not wish to defend MS, but despite its
"unicode limitations", it is working very
well and it is certainly not buggy.
Anyway, for serious apps, one writes GUI apps.

tcl/tk? Yes, it is buggy and unusable (at least
on Windows).

jmf

wxjm...@gmail.com

unread,
Mar 7, 2015, 4:08:17 AM3/7/15
to
Le samedi 7 mars 2015 09:56:09 UTC+1, wxjm...@gmail.com a écrit :
>
> tcl/tk? Yes, it is buggy and unusable (at least
> on Windows).
>
> jmf

Important addendum.
Not because it does not handle non BMP (SMP)
chars. It's buggy with the BMP chars.

jmf

Steven D'Aprano

unread,
Mar 7, 2015, 6:09:48 AM3/7/15
to
Rustom Mody wrote:

> On Thursday, March 5, 2015 at 7:36:32 PM UTC+5:30, Steven D'Aprano wrote:
[...]
I see you are blaming everyone except the people actually to blame.

It is 2015. Unicode 2.0 introduced the SMPs in 1996, almost twenty years
ago, the same year as 1.0 release of Java. Java has had eight major new
releases since then. Oracle, and Sun before them, are/were serious, tier-1,
world-class major IT companies. Why haven't they done something about
introducing proper support for Unicode in Java? It's not hard -- if Python
can do it using nothing but volunteers, Oracle can do it. They could even
do it in a backwards-compatible way, by leaving the existing APIs in place
and adding new APIs.

As for Microsoft, as a member of the Unicode Consortium they have no excuse.
But I think you exaggerate the lack of support for SMPs in Windows. Some
parts of Windows have no SMP support, but they tend to be the oldest and
less important (to Microsoft) parts, like the command prompt.

Anyone have Powershell and like to see how well it supports SMP?

This Stackoverflow question suggests that post-Windows 2000, the Windows
file system has proper support for code points in the supplementary planes:

http://stackoverflow.com/questions/7870014/how-does-windows-wchar-t-handle-unicode-characters-outside-the-basic-multilingua

Or maybe not.


> Or you can skip the blame-game and simply note the fact that large
> segments of extant code-bases are currently in bug-prone or plain buggy
> state.
>
> This includes not just bug-prone-system code such as Java and Windows but
> seemingly working code such as python 3.

What Unicode bugs do you think Python 3.3 and above have?


>> I mostly agree with Chris. Supporting *just* the BMP is non-trivial in
>> UTF-8 and UTF-32, since that goes against the grain of the system. You
>> would have to program in artificial restrictions that otherwise don't
>> exist.
>
> Yes UTF-8 and UTF-32 make most of the objections to unicode 7.0
> irrelevant.

Glad you agree about that much at least.


[...]
>> Conclusion: faulty implementations of UTF-16 which incorrectly handle
>> surrogate pairs should be replaced by non-faulty implementations, or
>> changed to UTF-8 or UTF-32; incomplete Unicode implementations which
>> assume that Unicode is 16-bit only (e.g. UCS-2) are obsolete and should
>> be upgraded.
>
> Imagine for a moment a thought experiment -- we are not on a python but a
> java forum and please rewrite the above para.

There is no need to re-write it. If Java's only implementation of Unicode
assumes that code points are 16 bits only, then Java needs a new Unicode
implementation. (I assume that the existing one cannot be changed for
backwards-compatibility reasons.)


> Are you addressing the vanilla java programmer? Language implementer?
> Designer? The Java-funders -- earlier Sun, now Oracle?

The last three should be considered the same people.

The vanilla Java programmer is not responsible for the short-comings of
Java's implementation.


[...]
>> > In practice, standards change.
>> > However if a standard changes so frequently that users have to
>> > play catch-up and keep asking: "Which version?" they are justified
>> > in asking "Are the standard-makers doing due diligence?"
>>
>> Since Unicode has stability guarantees, and the encodings have not
>> changed in twenty years and will not change in the future, this argument
>> is bogus. Updating to a new version of the standard means, to a first
>> approximation, merely allocating some new code points which had
>> previously been undefined but are now defined.
>>
>> (Code points can be flagged deprecated, but they will never be removed.)
>
> Its not about new code points; its about "Fits in 2 bytes" to "Does not
> fit in 2 bytes"

I quote you again:

"if a standard changes so frequently..."

The move to more than 16 bits happened once. It happened almost 20 years
ago. In what way does this count as frequent changes?


> If you call that argument bogus I call you a non computer scientist.

I am not a computer scientist, and the argument remains bogus. Unicode does
not change "frequently", and changes are backward-compatible.


> [Essentially this is my issue with the consortium it seems to be working
> [like a bunch of linguists not computer scientists]

That's rather like complaining that some computer game looks like it was
designed by games players instead of theoreticians. "Why, people have FUN
playing this, almost like it was designed by professionals who think about
gaming!!!"

Unicode is a standard intended for the handling of human languages. It is
intended as a real-life working standard, not some theoretical toy for
academics to experiment with. It is designed to be used, not to have papers
written about it. The character set part of it has effectively been
designed by linguists, and that is a good thing. But the encoding side of
things has been designed by practising computer programmers such as Rob
Pike and Ken Thompson. You might have heard of them.


> Here is Roy's Smith post that first started me thinking that something may
> be wrong with SMP
> https://groups.google.com/d/msg/comp.lang.python/loYWMJnPtos/GHMC0cX_hfgJ

There are plenty of things wrong with some implementations of Unicode, those
that assume all code points are two bytes.

There may be a few things wrong with the current Unicode standard, such as
missing characters, characters given the wrong name, and so forth.

But there's nothing wrong with the design of the SMP. It allows the great
majority of text, probably 99% or more, to use two bytes (UTF-16) or no
more than three bytes (UTF-8), while only relatively specialised uses need
four bytes for some code points.


> Some parts are here some earlier and from my memory.
> If details wrong please correct:
> - 200 million records
> - Containing 4 strings with SMP characters
> - System made with python and mysql. SMP works with python, breaks mysql.
> So whole system broke due to those 4 in 200,000,000 records

No, they broke because MySQL has buggy Unicode handling.

Bugs are not unusual. I used to have a version of Apple's Hypercard which
would lock up the whole operating system if you tried to display the
string "0^0" in a message dialog. Given that classic Mac OS was not a
proper multi-tasking OS like Unix or OS-X or even Windows, this was a real
pain. My conclusion from that is that that version of Hypercard was buggy.
What is your conclusion?


> I know enough (or not enough) of unicode to be chary of statistical
> conclusions from the above.
> My conclusion is essentially an 'existence-proof':
>
> SMP-chars can break systems.

Oh come on. How about this instead?

X can break systems, for every conceivable value of X.


> The breakage is costly-fied by the combination
> - layman statistical assumptions
> - BMP → SMP exercises different code-paths
>
> It is necessary but not sufficient to test print "hello world" in ASCII,
> BMP, SMP. You also have to write the hello world in the database -- mysql
> Read it from the webform -- javascript
> etc etc

Yes. This is called "integration testing". That's what professionals do.


> You could also choose do with "astral crap" (Roy's words) what we all do
> with crap -- throw it out as early as possible.

And when Roy's customers demand that his product support emoji, or complain
that they cannot spell their own name because of his parochial and ignorant
idea of "crap", perhaps he will consider doing what he should have done
from the beginning:

Stop using MySQL, which is a joke of a database[1], and use Postgres which
does not have this problem.




[1] So I have been told.


--
Steven

Chris Angelico

unread,
Mar 7, 2015, 6:34:42 AM3/7/15
to pytho...@python.org
On Sat, Mar 7, 2015 at 10:09 PM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> Stop using MySQL, which is a joke of a database[1], and use Postgres which
> does not have this problem.

I agree with the recommendation, though to be fair to MySQL, it is now
possible to store full Unicode. Though personally, I think the whole
"UTF8MB3 vs UTF8MB4" split is an embarrassment and should be abolished
*immediately* - not "we may change the meaning of UTF8 to be an alias
for UTF8MB4 in the future", just completely abolish the distinction
right now. (And deprecate the longer words.) There should be no reason
to build any kind of "UTF-8 but limited to three bytes" encoding for
anything. Ever.

But at least you can, if you configure things correctly, store any
Unicode character in your TEXT field.

ChrisA

Marko Rauhamaa

unread,
Mar 7, 2015, 6:53:24 AM3/7/15
to
Steven D'Aprano <steve+comp....@pearwood.info>:

> Rustom Mody wrote:
>> My conclusion: Early adopters of unicode -- Windows and Java -- were
>> punished for their early adoption. You can blame the unicode
>> consortium, you can blame the babel of human languages, particularly
>> that some use characters and some only (the equivalent of) what we
>> call words.
>
> I see you are blaming everyone except the people actually to blame.

I don't think you need to blame anybody. I think the UCS-2 mistake was
both deplorable and very understandable. At the time it looked like the
magic bullet to get out of the 8-bit mess. While 16-bit wide wchar_t's
looked like a hugely expensive price, it was deemed forward-looking to
pay it anyway to resolve the character set problem once and for all.

Linux was lucky to join the fray late enough to benefit from the bad
UCS-2 experience. That said, UTF-8 does suffer badly from its not being
a bijective mapping.

(Linux didn't quite dodge the bullet with pthreads, threads being
another sad fad of the 1990's. The hippies that cooked up the fork
system call should be awarded the next Millennium Prize. That foresight
or stroke of luck has withstood the challenge of half a century.)

> But there's nothing wrong with the design of the SMP. It allows the
> great majority of text, probably 99% or more, to use two bytes
> (UTF-16) or no more than three bytes (UTF-8), while only relatively
> specialised uses need four bytes for some code points.

The main dream was a fixed-width encoding scheme. People thought 16 bits
would be enough. The dream is so precious and true to us in the West
that people don't want to give it up.

It may yet be that UTF-32 replaces all previous schemes since it has all
the benefits of ASCII and only one drawback: redundancy. Maybe one day
we'll declare the byte 32 bits wide and be done with it. In many
other aspects, 32-bit "bytes" are the de-facto reality already. Even C
coders routinely use 32 bits to express boolean values.

> And when Roy's customers demand that his product support emoji, or
> complain that they cannot spell their own name because of his
> parochial and ignorant idea of "crap", perhaps he will consider doing
> what he should have done from the beginning:

That's a recurring theme: Why didn't we do IPv6 from the get-go? Why
didn't we do multi-user from the get-go? Why didn't we do localization
from the get-go?

There comes a point when you have to release to start making money. You
then suffer the consequences until your company goes bankrupt.


Marko

Chris Angelico

unread,
Mar 7, 2015, 7:03:22 AM3/7/15
to pytho...@python.org
On Sat, Mar 7, 2015 at 10:53 PM, Marko Rauhamaa <ma...@pacujo.net> wrote:
> The main dream was a fixed-width encoding scheme. People thought 16 bits
> would be enough. The dream is so precious and true to us in the West
> that people don't want to give it up.

So... use Pike, or Python 3.3+?

ChrisA

Mark Lawrence

unread,
Mar 7, 2015, 9:07:38 AM3/7/15
to pytho...@python.org
On 07/03/2015 12:02, Chris Angelico wrote:
> On Sat, Mar 7, 2015 at 10:53 PM, Marko Rauhamaa <ma...@pacujo.net> wrote:
>> The main dream was a fixed-width encoding scheme. People thought 16 bits
>> would be enough. The dream is so precious and true to us in the West
>> that people don't want to give it up.
>
> So... use Pike, or Python 3.3+?
>
> ChrisA
>

Cue obligatory cobblers from our RUE.

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

Mark Lawrence

unread,
Mar 7, 2015, 9:14:16 AM3/7/15
to pytho...@python.org
On 07/03/2015 11:09, Steven D'Aprano wrote:
> Rustom Mody wrote:
>
>>
>> This includes not just bug-prone-system code such as Java and Windows but
>> seemingly working code such as python 3.
>
> What Unicode bugs do you think Python 3.3 and above have?
>

Methinks somebody has been drinking too much loony juice. Either that
or taking too much notice of our RUE. Not that I've done a proper
analysis, but to my knowledge there's nothing like the number of issues
on the bug tracker for Unicode bugs for Python 3 compared to Python 2.

wxjm...@gmail.com

unread,
Mar 7, 2015, 10:29:08 AM3/7/15
to
Le samedi 7 mars 2015 12:53:24 UTC+1, Marko Rauhamaa a écrit :
>
> It may yet be that UTF-32 replaces all previous schemes since it has all
> the benefits of ASCII and only one drawback: redundancy. Maybe one day
> we'll declare the byte 32 bits wide and be done with it. In some many
> other aspects, 32-bit "bytes" are the de-facto reality already. Even C
> coders routinely use 32 bits to express boolean values.
>

Like many, I'm using utf-32 every day on my win7 box with
2 GB of RAM.
I have never once met a problem.

jmf