On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
> Le samedi 18 août 2012 14:27:23 UTC+2, Steven D'Aprano a écrit :
>> [...]
>> The problem with UCS-4 is that every character requires four bytes.
>> [...]
> I'm aware of this (and all the blah blah blah you are explaining). This
> is always the same song. Memory.
Exactly. The reason it is always the same song is because it is an important song.
> Let me ask. Is Python an "American" product for US users or is it a tool
> for everybody [*]?
It is a product for everyone, which is exactly why PEP 393 is so important. PEP 393 means that users who have only a few non-BMP characters don't have to pay the cost of UCS-4 for every single string in their application, only for the ones that actually require it. PEP 393 means that using Unicode strings is now cheaper for everybody.
You seem to be arguing that the way forward is not to make Unicode cheaper for everyone, but to make ASCII strings more expensive so that everyone suffers equally. I reject that idea.
> Is there any reason why non ascii users are somehow penalized compared
> to ascii users?
Of course there is a reason.
If you want to represent 1114111 different characters in a string, as Unicode supports, you can't use a single byte per character, or even two bytes. That is a fact of basic mathematics. Supporting 1114111 characters must be more expensive than supporting 128 of them.
But why should you carry the cost of 4-bytes per character just because someday you *might* need a non-BMP character?
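The arithmetic behind that can be checked in a couple of lines. This is only a sketch of the counting argument; the variable names are mine. Unicode code points run from 0 through 0x10FFFF inclusive.

```python
# Unicode code points run from 0 through 0x10FFFF inclusive.
code_points = 0x10FFFF + 1

# Bits needed to hold the largest code point, then rounded up to whole bytes.
bits_needed = (code_points - 1).bit_length()
bytes_needed = (bits_needed + 7) // 8

print(code_points)    # 1114112
print(bits_needed)    # 21
print(bytes_needed)   # 3
```

Three bytes is the bare minimum; fixed-width schemes round that up to four, which is where UCS-4 comes from.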
> This flexible string representation is a regression (ascii users or
> not).
No it is not. It is a great step forward to more efficient Unicode.
And it means that now Python can correctly deal with non-BMP characters without the nonsense of UTF-16 surrogates:
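A quick way to see this for yourself, assuming Python 3.3 or later; the code point used here is just an arbitrary one from outside the BMP:

```python
# Any code point above U+FFFF lies outside the BMP.
s = chr(0x1F600)

# With PEP 393 the string holds one element per code point, no surrogates.
print(len(s))            # 1
print(hex(ord(s[0])))    # 0x1f600
```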
Le samedi 18 août 2012 19:28:26 UTC+2, Mark Lawrence a écrit :
> Proof that is acceptable to everybody please, not just yourself.
I can't, I'm only facing the fact that it works slower on my
Windows platform.
As I understand (I think) the underlying mechanism, I
can only say it is not a surprise that it happens.
Imagine an editor: I type an "a", internally the text is
saved as ascii; then I type an "é", and the text can only
be saved in at least latin-1. Then I enter a "€", and the text
becomes an internal ucs-4 "string". Then I remove the "€", and so
on.
Intuitively I expect there is some kind of slowdown between
all these "string" conversions.
When I tested this flexible representation, a few months
ago, at the first alpha release, this is precisely what
I tested: string manipulations which force this internal
change. I concluded the result is not brilliant. Really,
a factor of 0.n up to 10.
These are simply my conclusions.
Related question.
Does anybody know a way to get the size of the internal
"string" in bytes? In the narrow or wide build it is easy,
I can encode with the "unicode_internal" codec. In Py 3.3, I attempted to toy with sizeof and struct, but without
success.
Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> writes:
> (There is an extension to UCS-2, UTF-16, which encodes non-BMP characters
> using two code points. This is fragile and doesn't work very well,
> because string-handling methods can break the surrogate pairs apart,
> leaving you with invalid unicode string. Not good.)
...
> With PEP 393, each Python string will be stored in the most efficient
> format possible:
Can you explain the issue of "breaking surrogate pairs apart" a little
more? Switching between encodings based on the string contents seems
silly at first glance. Strings are immutable so I don't understand why
not use UTF-8 or UTF-16 for everything. UTF-8 is more efficient in
Latin-based alphabets and UTF-16 may be more efficient for some other
languages. I think even UCS-4 doesn't completely fix the surrogate pair
issue if it means the only thing I can think of.
> Le samedi 18 août 2012 19:28:26 UTC+2, Mark Lawrence a écrit :
>> Proof that is acceptable to everybody please, not just yourself.
> I can't, I'm only facing the fact that it works slower on my
> Windows platform.
> As I understand (I think) the underlying mechanism, I
> can only say it is not a surprise that it happens.
> Imagine an editor: I type an "a", internally the text is
> saved as ascii; then I type an "é", and the text can only
> be saved in at least latin-1. Then I enter a "€", and the text
> becomes an internal ucs-4 "string". Then I remove the "€", and so
> on.
[snip]
"a" will be stored as 1 byte/codepoint.
Adding "é", it will still be stored as 1 byte/codepoint.
Adding "€", it will be stored as 2 bytes/codepoint.
But then you wouldn't be adding them one at a time in Python, you'd be
building a list and then joining them together in one operation.
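Something along these lines, as a minimal sketch (the three characters are the ones from the editor example; the variable names are made up):

```python
pieces = []
for ch in ("a", "é", "€"):
    # Appending to a list does not build a new string each time.
    pieces.append(ch)

# One string is created at the end, so the storage width is decided once.
text = "".join(pieces)
print(text)    # aé€
```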
On Aug 18, 10:59 pm, Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
> On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
> > Is there any reason why non ascii users are somehow penalized compared
> > to ascii users?
> Of course there is a reason.
> If you want to represent 1114111 different characters in a string, as
> Unicode supports, you can't use a single byte per character, or even two
> bytes. That is a fact of basic mathematics. Supporting 1114111 characters
> must be more expensive than supporting 128 of them.
> But why should you carry the cost of 4-bytes per character just because
> someday you *might* need a non-BMP character?
> Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> writes:
>> (There is an extension to UCS-2, UTF-16, which encodes non-BMP characters
>> using two code points. This is fragile and doesn't work very well,
>> because string-handling methods can break the surrogate pairs apart,
>> leaving you with invalid unicode string. Not good.)
> ...
>> With PEP 393, each Python string will be stored in the most efficient
>> format possible:
> Can you explain the issue of "breaking surrogate pairs apart" a little
> more? Switching between encodings based on the string contents seems
> silly at first glance. Strings are immutable so I don't understand why
> not use UTF-8 or UTF-16 for everything. UTF-8 is more efficient in
> Latin-based alphabets and UTF-16 may be more efficient for some other
> languages. I think even UCS-4 doesn't completely fix the surrogate pair
> issue if it means the only thing I can think of.
On a narrow build, codepoints outside the BMP are stored as a surrogate
pair (2 codepoints). On a wide build, all codepoints can be represented
without the need for surrogate pairs.
The problem with strings containing surrogate pairs is that you could
inadvertently slice the string in the middle of the surrogate pair.
> On Aug 18, 10:59 pm, Steven D'Aprano <steve
> +comp.lang.pyt...@pearwood.info> wrote:
>> On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
>>> Is there any reason why non ascii users are somehow penalized compared
>>> to ascii users?
>> Of course there is a reason.
>> If you want to represent 1114111 different characters in a string, as
>> Unicode supports, you can't use a single byte per character, or even two
>> bytes. That is a fact of basic mathematics. Supporting 1114111 characters
>> must be more expensive than supporting 128 of them.
>> But why should you carry the cost of 4-bytes per character just because
>> someday you *might* need a non-BMP character?
print(timeit("c in a", "c = '…'; a = 'a'*10000"))
3.3: .05 (independent of len(a)!)
3.2: 5.8 100 times slower! Increase len(a) and the ratio can be made as high as one wants!
print(timeit("a.encode()", "a = 'a'*1000"))
3.2: 1.5
3.3: .26
Similar with encoding='utf-8' added to call.
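For anyone who wants to repeat these on their own machine, here is a self-contained version of the two timings. It is only a sketch: the repetition count of 100000 is my choice to keep the run short (the one-liners above use timeit's default of 1000000), so the absolute numbers will not match the ones quoted.

```python
from timeit import timeit

# Search for a non-ASCII character that is absent from a long ASCII string.
t_search = timeit("c in a", setup="c = '…'; a = 'a' * 10000", number=100000)

# Encode a short ASCII string to bytes (UTF-8 is the default encoding).
t_encode = timeit("a.encode()", setup="a = 'a' * 1000", number=100000)

print("search: %.4f s" % t_search)
print("encode: %.4f s" % t_encode)
```

Run the same file under each interpreter you want to compare, and expect the ratios to vary between systems.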
Jim, please stop the ranting. It does not help improve Python. utf-32 is not a panacea; it has problems of time, space, and system compatibility (Windows and others). Victor Stinner, whatever he may have once thought and said, put a *lot* of effort into making the new implementation both correct and fast.
On your replace example
>>> timeit.timeit("('ab…' * 1000).replace('…', '……')")
> 61.919225272152346
>>> timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
> 1.2918679017971044
I do not see the point of changing both length and replacement. For me, the time is about the same for either replacement. I do see about the same slowdown ratio for 3.3 versus 3.2. I also see it for pure search without replacement.
print(timeit("c in a", "c = '…'; a = 'a'*1000+c"))
# .6 in 3.2.3, 1.2 in 3.3.0
This does not make sense to me and I will ask about it.
> I think it's time to leave the discussion and to go to bed.
In plain English, duck out cos I'm losing.
> You can take the problem the way you wish, Python 3.3 is "slower"
> than Python 3.2.
I'll ask for the second time. Provide proof that is acceptable to everybody and not just yourself.
> If you see the present status as an optimisation, I'm considering
> this as a regression.
Considering does not equate to proof. Where are the figures which back up your claim?
> I'm pretty sure a pure ucs-4/utf-32 can only be, by nature,
> the correct solution.
I look forward to seeing your patch on the bug tracker. If and only if you can find something that needs patching, which from the course of this thread I think is highly unlikely.
> To be extreme, tools using pure utf-16 or utf-32 are, at least,
> considering all the citizens on this planet in the same way.
On Sun, Aug 19, 2012 at 4:26 AM, Paul Rubin <no.em...@nospam.invalid> wrote:
> Can you explain the issue of "breaking surrogate pairs apart" a little
> more? Switching between encodings based on the string contents seems
> silly at first glance. Strings are immutable so I don't understand why
> not use UTF-8 or UTF-16 for everything. UTF-8 is more efficient in
> Latin-based alphabets and UTF-16 may be more efficient for some other
> languages. I think even UCS-4 doesn't completely fix the surrogate pair
> issue if it means the only thing I can think of.
UTF-8 is highly inefficient for indexing. Given a buffer of (say) a
few thousand bytes, how do you locate the 273rd character? You have to
scan from the beginning. The same applies when surrogate pairs are
used to represent single characters, unless the representation leaks
and a surrogate is indexed as two - which is where the breaking-apart
happens.
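To make the scanning cost concrete, here is a rough sketch of what "locate the nth character" involves on a UTF-8 buffer. The function name is made up for illustration; the only fact relied on is that UTF-8 continuation bytes have the bit pattern 10xxxxxx.

```python
def utf8_offset_of(buf: bytes, n: int) -> int:
    """Return the byte offset where character number n (0-based) starts."""
    seen = 0
    for offset, byte in enumerate(buf):
        # Bytes of the form 0b10xxxxxx continue a character; all others start one.
        if byte & 0xC0 != 0x80:
            if seen == n:
                return offset
            seen += 1
    raise IndexError("character index out of range")

buf = "aé€b".encode("utf-8")     # 1 + 2 + 3 + 1 bytes
print(utf8_offset_of(buf, 3))    # 6
```

The loop has to touch every byte before the target, so the cost grows with n.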
Chris Angelico <ros...@gmail.com> writes:
> UTF-8 is highly inefficient for indexing. Given a buffer of (say) a
> few thousand bytes, how do you locate the 273rd character?
How often do you need to do that, as opposed to traversing the string by
iteration? Anyway, you could use a rope-like implementation, or an
index structure over the string.
On Sun, Aug 19, 2012 at 12:11 PM, Paul Rubin <no.em...@nospam.invalid> wrote:
> Chris Angelico <ros...@gmail.com> writes:
>> UTF-8 is highly inefficient for indexing. Given a buffer of (say) a
>> few thousand bytes, how do you locate the 273rd character?
> How often do you need to do that, as opposed to traversing the string by
> iteration? Anyway, you could use a rope-like implementation, or an
> index structure over the string.
Well, imagine if Python strings were stored in UTF-8. How would you slice it?
>>> "asdfqwer"[4:]
'qwer'
That's a not uncommon operation when parsing strings or manipulating
data. You'd need to completely rework your algorithms to maintain a
position somewhere.
Chris Angelico <ros...@gmail.com> writes:
>>>> "asdfqwer"[4:]
> 'qwer'
> That's a not uncommon operation when parsing strings or manipulating
> data. You'd need to completely rework your algorithms to maintain a
> position somewhere.
Scanning 4 characters (or a few dozen, say) to peel off a token in
parsing a UTF-8 string is no big deal. It gets more expensive if you
want to index far more deeply into the string. I'm asking how often
that is done in real code. Obviously one can concoct hypothetical
examples that would suffer.
On Sun, Aug 19, 2012 at 12:35 PM, Paul Rubin <no.em...@nospam.invalid> wrote:
> Chris Angelico <ros...@gmail.com> writes:
>>>>> "asdfqwer"[4:]
>> 'qwer'
>> That's a not uncommon operation when parsing strings or manipulating
>> data. You'd need to completely rework your algorithms to maintain a
>> position somewhere.
> Scanning 4 characters (or a few dozen, say) to peel off a token in
> parsing a UTF-8 string is no big deal. It gets more expensive if you
> want to index far more deeply into the string. I'm asking how often
> that is done in real code. Obviously one can concoct hypothetical
> examples that would suffer.
Sure, four characters isn't a big deal to step through. But it still
makes indexing and slicing operations O(N) instead of O(1), plus you'd
have to zark the whole string up to where you want to work. It'd be
workable, but you'd have to redo your algorithms significantly; I
don't have a Python example of parsing a huge string, but I've done it
in other languages, and when I can depend on indexing being a cheap
operation, I'll happily do exactly that.
Chris Angelico <ros...@gmail.com> writes:
> Sure, four characters isn't a big deal to step through. But it still
> makes indexing and slicing operations O(N) instead of O(1), plus you'd
> have to zark the whole string up to where you want to work.
I know some systems chop the strings into blocks of (say) a few
hundred chars, so you can immediately get to the correct
block, then scan into the block to get to the desired char offset.
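A minimal sketch of that idea, assuming UTF-8 storage and a fixed number of characters per block. The class name, method name and block size are all invented for illustration, not taken from any particular system.

```python
class BlockedUtf8:
    """UTF-8 bytes plus the byte offset of every `block`-th character."""

    def __init__(self, text: str, block: int = 256):
        self.data = text.encode("utf-8")
        self.block = block
        self.starts = []
        count = 0
        for offset, byte in enumerate(self.data):
            if byte & 0xC0 != 0x80:            # this byte starts a character
                if count % block == 0:
                    self.starts.append(offset)
                count += 1
        self.length = count

    def char_at(self, n: int) -> str:
        if not 0 <= n < self.length:
            raise IndexError(n)
        pos = self.starts[n // self.block]     # jump straight to the block
        for _ in range(n % self.block):        # then scan at most block-1 characters
            pos += 1
            while self.data[pos] & 0xC0 == 0x80:
                pos += 1
        end = pos + 1
        while end < len(self.data) and self.data[end] & 0xC0 == 0x80:
            end += 1
        return self.data[pos:end].decode("utf-8")
```

Lookup cost is bounded by the block size rather than the string length, at the price of the extra offset table.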
> I don't have a Python example of parsing a huge string, but I've done
> it in other languages, and when I can depend on indexing being a cheap
> operation, I'll happily do exactly that.
I'd be interested to know what the context was, where you parsed
a big unicode string in a way that required random access to
the nth character in the string.
> print(timeit("c in a", "c = '…'; a = 'a'*1000+c"))
> # .6 in 3.2.3, 1.2 in 3.3.0
> This does not make sense to me and I will ask about it.
I did ask on the pydev list, and paraphrased responses include:
1. 'My system gives opposite ratios.'
2. 'With a default of 1000000 repetitions in a loop, the reported times are microseconds per operation and thus not practically significant.'
3. 'There is a stringbench.py with a large number of such micro benchmarks.'
I believe there are also whole-application benchmarks that try to mimic real-world mixtures of operations.
People making improvements must consider performance on multiple systems and multiple benchmarks. If someone wants to work on search speed, they cannot just optimize that one operation on one system.
On Sun, Aug 19, 2012 at 1:10 PM, Paul Rubin <no.em...@nospam.invalid> wrote:
> Chris Angelico <ros...@gmail.com> writes:
>> I don't have a Python example of parsing a huge string, but I've done
>> it in other languages, and when I can depend on indexing being a cheap
>> operation, I'll happily do exactly that.
> I'd be interested to know what the context was, where you parsed
> a big unicode string in a way that required random access to
> the nth character in the string.
It's something I've done in C/C++ fairly often. Take one big fat
buffer, slice it and dice it as you get the information you want out
of it. I'll retain and/or calculate indices (when I'm not using
pointers, but that's a different kettle of fish). Generally, I'm
working with pure ASCII, but port those same algorithms to Python and
you'll easily be able to read in a file in some known encoding and
manipulate it as Unicode.
It's not so much 'random access to the nth character' as an efficient
way of jumping forward. For instance, if I know that the next thing is
a literal string of n characters (that I don't care about), I want to
skip over that and keep parsing. The Adobe Message Format is
particularly noteworthy in this, but it's a stupid format and I don't
recommend people spend too much time reading up on it (unless you like
that sensation of your brain trying to escape through your ear).
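As a rough illustration of the "jump forward by n characters" pattern, here is a parser for a made-up format of repeated `<length>:<payload>` fields. The format and function name are hypothetical and have nothing to do with the Adobe format mentioned above.

```python
def read_fields(data: str):
    """Parse repeated '<length>:<payload>' fields from a string."""
    pos = 0
    fields = []
    while pos < len(data):
        colon = data.index(":", pos)
        n = int(data[pos:colon])         # payload length in characters
        start = colon + 1
        fields.append(data[start:start + n])
        pos = start + n                  # skip the payload in one step
    return fields

print(read_fields("3:abc5:h€llo0:"))     # ['abc', 'h€llo', '']
```

With fixed-width storage the `pos = start + n` step is plain index arithmetic; with a variable-width representation it would mean scanning the payload.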
Chris Angelico <ros...@gmail.com> writes:
> Generally, I'm working with pure ASCII, but port those same algorithms
> to Python and you'll easily be able to read in a file in some known
> encoding and manipulate it as Unicode.
If it's pure ASCII, you can use the bytes or bytearray type.
> It's not so much 'random access to the nth character' as an efficient
> way of jumping forward. For instance, if I know that the next thing is
> a literal string of n characters (that I don't care about), I want to
> skip over that and keep parsing.
I don't understand how this is supposed to work. You're going to read a
large unicode text file (let's say it's UTF-8) into a single big string?
So the runtime library has to scan the encoded contents to find the
highest numbered codepoint (let's say it's mostly ascii but has a few
characters outside the BMP), expand it all (in this case) to UCS-4
giving 4x memory bloat and requiring decoding all the UTF-8 regardless,
and now we should worry about the efficiency of skipping n characters?
Since you have to decode the n characters regardless, I'd think this
skipping part should only be an issue if you have to do it a lot of
times.
This is a long post. If you don't feel like reading an essay, skip to the very bottom and read my last few paragraphs, starting with "To recap".
On Sat, 18 Aug 2012 11:26:21 -0700, Paul Rubin wrote:
> Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> writes:
>> (There is an extension to UCS-2, UTF-16, which encodes non-BMP
>> characters using two code points. This is fragile and doesn't work very
>> well, because string-handling methods can break the surrogate pairs
>> apart, leaving you with invalid unicode string. Not good.)
> ...
>> With PEP 393, each Python string will be stored in the most efficient
>> format possible:
> Can you explain the issue of "breaking surrogate pairs apart" a little
> more? Switching between encodings based on the string contents seems
> silly at first glance.
Forget encodings! We're not talking about encodings. Encodings are used for converting text to bytes for transmission over the wire or storage on disk. PEP 393 talks about the internal representation of text within Python, the C-level data structure.
In 3.2, that data structure depends on a compile-time switch. In a "narrow build", text is stored using two-bytes per character, so the string "len" (as in the name of the built-in function) will be stored as
006c 0065 006e
(or possibly 6c00 6500 6e00, depending on whether your system is LittleEndian or BigEndian), plus object-overhead, which I shall ignore.
Since most identifiers are ASCII, that's already using twice as much memory as needed. This standard data structure is called UCS-2, and it only handles characters in the Basic Multilingual Plane, the BMP (roughly the first 64000 Unicode code points). I'll come back to that.
In a "wide build", text is stored as four-bytes per character, so "len" is stored as either:
0000006c 00000065 0000006e
or
6c000000 65000000 6e000000
Now memory is cheap, but it's not *that* cheap, and no matter how much memory you have, you can always use more.
This system is called UCS-4, and it can handle the entire Unicode character set, for now and forever. (If we ever need more than four bytes' worth of characters, it won't be called Unicode.)
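The byte layouts can be printed from any Python 3.5+ interpreter. This sketch uses the UTF-16 and UTF-32 codecs as stand-ins for the in-memory formats, which is an assumption that holds here because "len" is pure ASCII; `hex_groups` is a made-up helper name.

```python
def hex_groups(raw: bytes, width: int) -> str:
    """Format raw bytes as space-separated hex groups of `width` bytes."""
    return " ".join(raw[i:i + width].hex() for i in range(0, len(raw), width))

# Two bytes per character, big-endian then little-endian.
print(hex_groups("len".encode("utf-16-be"), 2))   # 006c 0065 006e
print(hex_groups("len".encode("utf-16-le"), 2))   # 6c00 6500 6e00

# Four bytes per character, big-endian then little-endian.
print(hex_groups("len".encode("utf-32-be"), 4))   # 0000006c 00000065 0000006e
print(hex_groups("len".encode("utf-32-le"), 4))   # 6c000000 65000000 6e000000
```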
Remember I said that UCS-2 can only handle the 64K characters [technically: code points] in the Basic Multilingual Plane? There's an extension to UCS-2 called UTF-16 which extends it to the entire Unicode range. Yes, that's the same name as the UTF-16 encoding, because it's more or less the same system.
UTF-16 says "let's represent characters in the BMP by two bytes, but characters outside the BMP by four bytes." There's a neat trick to this: the BMP doesn't use the entire two-byte range, so there are some byte pairs which are illegal in UCS-2 -- they don't correspond to *any* character. UTF-16 used those byte pairs to signal "this is half a character, you need to look at the next pair for the rest of the character".
Nifty hey? These pairs-of-pseudocharacters are called "surrogate pairs".
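For the curious, the pair for a given code point can be worked out by hand. This is a sketch of the standard UTF-16 arithmetic; the function name is mine.

```python
def to_surrogates(cp: int):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = cp - 0x10000               # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low

high, low = to_surrogates(0x10000)
print(hex(high), hex(low))              # 0xd800 0xdc00
```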
Except this comes at a big cost: you can no longer tell how long a string is by counting the number of bytes, which is fast, because sometimes four bytes is two characters and sometimes it's one and you can't tell which it will be until you actually inspect all four bytes.
Copying sub-strings now becomes either slow, or buggy. Say you want to grab the 10th character in a string. The fast way using UCS-2 is to simply grab bytes 18 and 19 (remember characters are pairs of bytes and we start counting at zero) and you're done. Fast and safe if you're willing to give up the non-BMP characters.
It's also fast and safe if you use UCS-4, but then everything takes twice as much space, so you probably end up spending so much time copying null bytes that you're slower anyway. Especially when your OS starts paging memory like mad.
But in UTF-16, indexing can be fast or safe but not both. Maybe bytes 18 and 19 are half of a surrogate pair, and you've now split the pair and ended up with an invalid string. That's what Python 3.2 does; it fails to handle surrogate pairs properly:
py> s = chr(0xFFFF + 1)
py> a, b = s
py> a
'\ud800'
py> b
'\udc00'
I've just split a single valid Unicode character into two invalid characters. Python 3.2 will (probably) mindlessly process those two non-characters, and the only sign I have that I did something wrong is that my data is now junk.
Since any character can be a surrogate pair, you have to scan every pair of bytes in order to index a string, or work out its length, or copy a substring. It's not enough to just check if the last pair is a surrogate.
When you don't, you have bugs like this from Python 3.2:
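Python 3.3 and later no longer have narrow builds, so the effect can't be reproduced directly there. As a rough imitation, the string can be turned into UTF-16 code units, which is how a narrow build held it. `utf16_units` is a made-up helper name, and the sample string is only an illustration.

```python
import struct

def utf16_units(text: str):
    """Return text as a list of 16-bit code units, as a narrow build stored it."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

s = "01234" + chr(0x10000) + "6789"
units = utf16_units(s)

print(len(s), len(units))    # 10 11
print(chr(units[9]))         # 8
```

Ten characters become eleven code units, so indexing by code unit lands one position short for everything after the surrogate pair.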
So variable-width data structures like UTF-8 or UTF-16 are crap for the internal representation of strings -- they are either fast or correct but cannot be both.
But UCS-2 is sub-optimal, because it can only handle the BMP, and UCS-4 is too because ASCII-only strings like identifiers end up being four times as big as they need to be. 1-byte schemes like Latin-1 are unspeakable because they only handle 256 characters, fewer if you don't count the C0 and C1 control codes.
PEP 393 to the rescue! What if you could encode pure-ASCII strings like "len" using one byte per character, and BMP strings using two bytes per character (UCS-2), and fall back to four bytes (UCS-4) only when you really need it?
The benefits are:
* Americans and English-Canadians and Australians and other barbarians of that ilk who only use ASCII save a heap of memory;
* people who only occasionally use non-BMP characters pay the cost of four bytes per character only for the strings that actually *need* four bytes per character;
* people who use lots of non-BMP characters are no worse off.
The costs are:
* string routines need to be smarter -- they have to handle three different data structures (ASCII, UCS-2, UCS-4) instead of just one;
* there's a certain amount of overhead when creating a string -- you have to work out which in-memory format to use, and that's not necessarily trivial, but at least it's a once-off cost when you create the string;
* people who misunderstand what's going on get all upset over micro-benchmarks.
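The "work out which in-memory format to use" step amounts to finding the widest character in the string. A pure-Python sketch of that decision follows; CPython does this in C, the function name is hypothetical, and note that the one-byte tier in CPython covers the whole Latin-1 range, not only ASCII.

```python
def bytes_per_char(text: str) -> int:
    """Smallest fixed width (1, 2 or 4 bytes) that fits every character of text."""
    widest = max(map(ord, text), default=0)
    if widest < 0x100:        # ASCII and Latin-1
        return 1
    if widest < 0x10000:      # rest of the BMP
        return 2
    return 4                  # supplementary planes

for sample in ("len", "é", "€", chr(0x10000)):
    print(repr(sample), bytes_per_char(sample))
```

The scan is O(N) in the length of the string, but it happens once, at creation time.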
> Strings are immutable so I don't understand why
> not use UTF-8 or UTF-16 for everything. UTF-8 is more efficient in
> Latin-based alphabets and UTF-16 may be more efficient for some other
> languages. I think even UCS-4 doesn't completely fix the surrogate pair
> issue if it means the only thing I can think of.
To recap:
* Variable-byte formats like UTF-8 and UTF-16 mean that basic string operations are not O(1) but are O(N). That means they are slow, or buggy, pick one.
* Fixed width UCS-2 doesn't handle the full Unicode range, only the BMP. That's better than it sounds: the BMP supports most character sets, but not all. Still, there are people who need the supplementary planes, and UCS-2 lets them down.
* Fixed width UCS-4 does handle the full Unicode range, without surrogates, but at the cost of using 2-4 times more string memory for the vast majority of users.
* PEP 393 doesn't use variable-width characters, but variable-width strings. Instead of choosing between 1, 2 and 4 bytes per character, it chooses *per string*. This keeps basic string operations O(1) instead of O(N), saves memory where possible, while still supporting the full Unicode range without a compile-time option.
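The per-string width can also be observed from Python itself. This sketch assumes CPython 3.3 or later, since `sys.getsizeof` reports implementation-specific sizes. `measured_width` is a made-up name; it compares two strings of the same kind so that the fixed object overhead cancels out.

```python
import sys

def measured_width(ch: str, n: int = 1000) -> float:
    """Estimate bytes used per character for strings made only of `ch`."""
    # Both strings share the same storage kind, so the header size cancels.
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) / n

for ch in ("a", "é", "€", chr(0x10000)):
    print(repr(ch), measured_width(ch))
```

On CPython this should report 1, 1, 2 and 4 bytes respectively; other implementations may store strings differently.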
On Sat, 18 Aug 2012 11:30:19 -0700, wxjmfauth wrote:
>> > I'm aware of this (and all the blah blah blah you are explaining).
>> > This always the same song. Memory.
>> Exactly. The reason it is always the same song is because it is an
>> important song.
> No offense here. But this is an *american* answer.
I am not American.
I am not aware that computers outside of the USA, and Australia, have unlimited amounts of memory. You must be very lucky.
> The same story as the coding of text files, where "utf-8 == ascii" and
> the rest of the world doesn't count.
On Sat, 18 Aug 2012 11:05:07 -0700, wxjmfauth wrote:
> As I understand (I think) the underlying mechanism, I can only say it is
> not a surprise that it happens.
> Imagine an editor: I type an "a", internally the text is saved as ascii;
> then I type an "é", and the text can only be saved in at least latin-1. Then
> I enter a "€", and the text becomes an internal ucs-4 "string". Then I remove
> the "€", and so on.
Firstly, that is not what Python does. For starters, € is in the BMP, and so is nearly every character you're ever going to use unless you are Asian or a historian using some obscure ancient script. NONE of the examples you have shown in your emails have included 4-byte characters, they have all been ASCII or UCS-2.
You are suffering from a misunderstanding about what is going on and misinterpreting what you have seen.
In Python 3.2 (narrow build), both é and € are represented by two bytes each. In 3.3, a string whose widest character is é uses one byte per character, and one containing € uses two; neither comes anywhere near four. There is a tiny amount of fixed overhead for strings, and that overhead is slightly different between the versions, but you'll never notice the difference.
Secondly, how a text editor or word processor chooses to store the text that you type is not the same as how Python does it. A text editor is not going to be creating a new immutable string after every key press. That will be slow slow SLOW. The usual way is to keep a buffer for each paragraph, and add and subtract characters from the buffer.
> Intuitively I expect there is some kind of slowdown between all these
> "string" conversions.
Your intuition is wrong. Strings are not converted from ASCII to UCS-2 to UCS-4 on the fly; the width is chosen once, when the string is created.
The tests we ran earlier, e.g.:
('ab…' * 1000).replace('…', 'œ…')
show the *worst possible case* for the new string handling, because all we do is create new strings. First we create a string 'ab…', then we create another string 'ab…'*1000, then we create two new strings '…' and 'œ…', and finally we call replace and create yet another new string.
But in real applications, once you have created a string, you don't just immediately create a new one and throw the old one away. You likely do work with that string:
steve@runes:~$ python3.2 -m timeit "s = 'abcœ…'*1000; n = len(s); flag = s.startswith(('*', 'a'))"
100000 loops, best of 3: 2.41 usec per loop
steve@runes:~$ python3.3 -m timeit "s = 'abcœ…'*1000; n = len(s); flag = s.startswith(('*', 'a'))"
100000 loops, best of 3: 2.29 usec per loop
Once you start doing *real work* with the strings, the overhead of deciding whether they should be stored using 1, 2 or 4 bytes begins to fade into the noise.
> When I tested this flexible representation, a few months ago, at the
> first alpha release, this is precisely what I tested: string
> manipulations which force this internal change. I concluded the
> result is not brilliant. Really, a factor of 0.n up to 10.
Like I said, if you really think that there is a significant, repeatable slow-down on Windows, report it as a bug.
> Does any body know a way to get the size of the internal "string" in
> bytes?