String Identity Test

Avetis KAZARIAN

Mar 4, 2009, 2:22:07 AM

After reading the discussion about the same subject ( From: "Thomas
Moore" <jsfrank.c...@msa.hinet.net> Date: Tue, 1 Nov 2005 21:45:56
+0800 ), I ran some tests myself and got some confusing results (I'm a
beginner with Python, coming from PHP).

# 1. Short alpha-numeric String without space

a = "b747"
b = "b747"

>>> a is b
True

# 2. Long alpha-numeric String without space

a = "averylongstringbutreallyaveryveryverylongstringwithabout68characters"
b = "averylongstringbutreallyaveryveryverylongstringwithabout68characters"

>>> a is b
True

# 3. Short alpha-numeric String with space

a = "x y"
b = "x y"

>>> a is b
False

# 4. Long alpha-numeric String with space

a = "I love Python it s so much better than PHP but sometimes confusing"
b = "I love Python it s so much better than PHP but sometimes confusing"

>>> a is b
False

# 5. Empty String

a = ""
b = ""

>>> a is b
True

# 6. Whitespace String : space

a = " "
b = " "

>>> a is b
False

# 7. Whitespace String : new line

a = "\n"
b = "\n"

>>> a is b
False

# 8. Non-ASCII without space

a = "é"
b = "é"

>>> a is b
False

# 9. Non-ASCII with space

a = "é à"
b = "é à"

>>> a is b
False

It seems that any strict ASCII alpha-numeric string is instantiated as
a unique object, like a "singleton" ( a = "x" and b = "x" => a is b )
and that any non strict ASCII alpha-numeric string is instantiated as
a new object every time with a new id.

Conclusion :

How does Python manage strings as objects?

--
Avétis KAZARIAN

Gary Herron

Mar 4, 2009, 2:56:58 AM

to pytho...@python.org

However the implementors want.

That may seem a flippant answer, but it's actually accurate. The choice
of whether a new string reuses an existing string or creates a new one
is *not* a Python question, but rather a question of implementation.
It's a matter of efficiency, and as such each implementation/version of
Python may make its own choices. Writing a program that depends on the
string identity policy would be considered an erroneous program, and
should be avoided.

The question now is: Why do you care? The properties of strings do
not depend on the implementation's choice, so you shouldn't care because
of programming considerations. Perhaps it's just a matter of curiosity
on your part.

Gary Herron


Terry Reedy

Mar 4, 2009, 3:02:52 AM

to pytho...@python.org

Avetis KAZARIAN wrote:
> After reading the discussion about the same subject ( From: "Thomas
> Moore" <jsfrank.c...@msa.hinet.net> Date: Tue, 1 Nov 2005 21:45:56
> +0800 ), I tried myself some tests with some confusing results (I'm a
> beginner with Python, I'm coming from PHP)

For immutable objects, identity is essentially irrelevant. Whether an
implementation conserves space by reusing immutable objects with a given
value, and if so, how so, depends on the particular version of a
particular implementation. Unless one is interested in interpreter
implementation, I advise against paying too much attention to the issue.
It seems to generate more confusion than enlightenment.

> How does Python manage strings as objects?

Python the language does not 'manage' objects. Particular interpreters
do what they do. The CPython sources are decently readable.

tjr

Avetis KAZARIAN

Mar 4, 2009, 4:07:44 AM

Gary Herron wrote:
> The question now is: Why do you care? The properties of strings do
> not depend on the implementation's choice, so you shouldn't care because
> of programming considerations. Perhaps it's just a matter of curiosity
> on your part.
>
> Gary Herron

Well, it's not about curiosity, it's more about performance.

Here is a PHP example (a really simple one):

PHP :

Stat 1 : $aVeryLongString == $anOtherVeryLongString
Stat 2 : $aVeryLongString === $anOtherVeryLongString

Stat 2 is really faster than Stat 1 (due to the binary comparison)

As I said, I'm coming from PHP, so I was wondering if there was such a
difference in Python.

That is why I was trying to use "is" the way I use "===".

Tino Wildenhain

Mar 4, 2009, 4:32:58 AM

to Avetis KAZARIAN, pytho...@python.org

Please keep in mind that in both cases nothing comes "for free".
To have identity, you need to have the same object, which in the
case of a string means the interpreter has to look for an existing
string with exactly the same contents and reference it instead of
creating a new object in memory. This takes at least as much time
(if not more) as simply comparing both strings when you need to
(aka == ).

If you only have a few strings but compare them often, you could
profit from identity, and the overhead of setting it up would
be negligible (you can force this in Python with intern(), which
is sys.intern() in Python 3), but in this case I'd think calculating
and working with a hash instead should be preferred.
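
A minimal sketch of forcing identity through interning, assuming
Python 3, where the function is sys.intern (Python 2 had it as the
built-in intern). The strings and variable names are illustrative only.

```python
import sys

# Build two equal strings at runtime; whether they start out as separate
# objects is up to the implementation.
parts = ["I", "love", "Python"]
s1 = " ".join(parts)
s2 = " ".join(parts)

# sys.intern returns the canonical copy from the interpreter's intern table.
i1 = sys.intern(s1)
i2 = sys.intern(s2)

print(s1 == s2)   # True: same value
print(i1 is i2)   # True: both names refer to the one interned object
```

Every call to sys.intern costs a table lookup, so this can only pay off
when the same few strings are compared many times.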

Regards
Tino Wildenhain

Christian Heimes

Mar 4, 2009, 4:33:03 AM

to pytho...@python.org

Avetis KAZARIAN wrote:

Python uses some tricks to speed up string comparison. The struct of the
string type contains the length of the string and it caches the hash of
the string, too.

s1 == s2 is broken down to several steps. Here is the Python equivalent
of the C code:

# str_eq is only an illustrative wrapper name, not a CPython function
def str_eq(s1, s2):
    # for strings, identity is always equality
    if s1 is s2:
        return True

    # compare the size
    if len(s1) != len(s2):
        return False

    # special case strings with a length of one
    if len(s1) == 1 and s1[0] == s2[0]:
        return True

    # compare the hash
    if hash(s1) != hash(s2):
        return False

    # if size and hash are equal compare every char* of the str
    for i in range(len(s1)):
        if s1[i] != s2[i]:
            return False

    # it's really the same thing
    return True

Christian

Avetis KAZARIAN

Mar 4, 2009, 4:43:32 AM

Everything's clear now.

Thanks all (especially Christian and Tino) :]

Peter Otten

Mar 4, 2009, 5:08:22 AM

Avetis KAZARIAN wrote:

So you have two very long strings that may be equal. How did you get them?
If you read them from a file, that took much more time than the comparison.

If they are sufficiently likely to be unequal, just read them in smaller
chunks and compare these. If you want to compare multiple combinations, use
hashes.

If 'a is b' worked like 'a == b' for arbitrary strings, that would mean that
the Python implementation had done a lot of unnecessary 'a == b'
comparisons behind the scenes, or at least calculated a lot of hash values,
i.e. the ability to use the fast operation would in effect slow down your
program.
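
The chunked comparison and the hash suggestion might look roughly like
this. It is a sketch only: streams_equal and digest are hypothetical
helper names, the chunk size is arbitrary, and the streams are assumed
to return full chunks until end of data (true for regular files and
io.BytesIO, not necessarily for sockets or pipes).

```python
import hashlib
import io


def streams_equal(f1, f2, chunk_size=4096):
    """Compare two binary streams chunk by chunk, stopping at the first difference."""
    while True:
        c1 = f1.read(chunk_size)
        c2 = f2.read(chunk_size)
        if c1 != c2:
            return False
        if not c1:          # both streams ended at the same point
            return True


def digest(data):
    """Return a SHA-256 hex digest, usable as a compact stand-in for the data."""
    return hashlib.sha256(data).hexdigest()


a = io.BytesIO(b"x" * 10000 + b"a")
b = io.BytesIO(b"x" * 10000 + b"b")
print(streams_equal(a, b))                  # False
print(digest(b"spam") == digest(b"spam"))   # True
```

Storing the digests once and comparing those afterwards avoids
re-reading the long data for every pair.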

Peter

Steve Holden

Mar 4, 2009, 9:24:25 AM

to pytho...@python.org

Suppose you write

a = b

Thereafter, unless some further assignment is made to either a or b, you
are guaranteed that "a is b" returns True.

This is pretty much the only guarantee you have. There is no guarantee
(across all implementations) that

a = some-expression

b = some-equivalent-expression

will leave "a is b" True.
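
As a small illustration (the expressions here are arbitrary examples):

```python
b = "".join(["some", " ", "string"])
a = b            # a and b now name the same object
print(a is b)    # True: guaranteed by the language

c = "".join(["some", " ", "string"])   # an equivalent expression
print(c == b)    # True: equal values
print(c is b)    # unspecified: depends on the implementation
```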

Does PHP really keep only one copy of every string? Sounds like that
could slow string creation down a little. Essentially it's keeping all
strings in a set. Of course you could do that in Python if you wanted,
but it would certainly slow things down.
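
A rough sketch of what "keeping all strings in a set" could look like
if done by hand. StringPool is a hypothetical helper, not something
Python provides, and a dict is used here so the stored object can be
returned.

```python
class StringPool:
    """Hypothetical helper: keeps one canonical object per distinct string value."""

    def __init__(self):
        self._pool = {}

    def canonical(self, s):
        # setdefault stores s on first sight, otherwise returns the stored copy.
        return self._pool.setdefault(s, s)


pool = StringPool()
x = pool.canonical("".join(["ab", "cd"]))
y = pool.canonical("".join(["a", "bcd"]))
print(x is y)   # True: both lookups return the first stored object
```

Each call hashes the string and does a dictionary lookup, which is the
extra cost on string creation mentioned above.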

Anyway, thanks for looking at Python. I hope you continue to enjoy it!

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/

Gabriel Genellina

Mar 4, 2009, 9:30:44 AM

to pytho...@python.org

On Wed, 04 Mar 2009 07:07:44 -0200, Avetis KAZARIAN <avet...@gmail.com>
wrote:

PHP '==' has no direct correspondence in Python. '===' in PHP is more like
'==' in Python (but not exactly the same).
In PHP, $x === $y is true if both variables are of the same type *and*
both have the same value. $x == $y checks only the values, doing type
conversions as needed, even string -> number; there is no equivalent
operator in Python. PHP === is called "identity" but isn't related to the
"is" operator in Python; there is no identity test in PHP with the Python
semantics.

PHP:
1 == 1
TRUE

1 == 1.0
TRUE

1 == "1"
TRUE

1 == "1.0"
TRUE

1 === 1
TRUE

1 === 1.0
FALSE

1 === "1"
FALSE

1 === "1.0"
FALSE

array(1,2,3) == array(1,2,3)
TRUE

array(1,2,3) === array(1,2,3)
TRUE

Python:
1 == 1
True

1 == 1.0
True

1 == "1"
False

1 == "1.0"
False

[1,2,3] == [1,2,3]
True

[1,2,3] is [1,2,3]
False

So, don't try to translate concepts from one language to another. (Ok,
it's natural to try to do that if you know PHP, but it doesn't work. You
have to know the differences).
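
If a "same type and same value" test is really wanted in Python, it
has to be spelled out. The helper below is a hypothetical sketch that
only approximates PHP's === and is not a standard function:

```python
def strict_eq(x, y):
    """Hypothetical helper: equal only if both the types and the values match."""
    return type(x) is type(y) and x == y


print(1 == 1.0)                         # True: int and float compare by value
print(strict_eq(1, 1.0))                # False: int vs float
print(strict_eq(1, 1))                  # True
print(strict_eq([1, 2, 3], [1, 2, 3]))  # True, although the lists are distinct objects
print(1 == "1")                         # False: no string-to-number conversion
```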
--
Gabriel Genellina

Avetis KAZARIAN

Mar 4, 2009, 10:05:25 AM

Steve Holden wrote:
> Does PHP really keep only one copy of every string?

Not at all.

I might have said something confusing if you understood that...

> So, don't try to translate concepts from one language to another.
>

> --
> Gabriel Genellina

I'll try ;]

S Arrowsmith

Mar 4, 2009, 10:15:09 AM

Avetis KAZARIAN <avet...@gmail.com> wrote:
>It seems that any strict ASCII alpha-numeric string is instantiated as
>an unique object, like a "singleton" ( a = "x" and b = "x" => a is b )
>and that any non strict ASCII alpha-numeric string is instantiated as
>a new object every time with a new id.

What no-one appears to have mentioned so far is that the purpose
of this implementation detail is to ensure that there is a single
instance of strings which are valid identifiers, so that you don't
go around creating and destroying string instances just to do an
attribute look-up on an object. A few strings which are not valid
as identifiers get swept up into this system:

>>> a = "1"
>>> b = "1"
>>> a is b
True

"Small" integers get a similar treatment:

>>> a = 256
>>> b = 256
>>> a is b
True
>>> a = 257
>>> b = 257
>>> a is b
False

But as has hopefully been made clear, all this is completely an
implementation detail. (Indeed, the range of "interned" integers
changed from 0--99 to -5--256 a few versions ago.) So don't,
under any circumstances, rely on it, even when you understand
what's going on.
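
A small probe for whichever interpreter is at hand. It is a sketch
only: the values come from int() at runtime so that equal constants in
one code block cannot be merged, and no particular output for "is" is
promised.

```python
for text in ("256", "257"):
    a = int(text)
    b = int(text)
    # a == b is always True; the result of "a is b" is whatever the
    # running interpreter happens to do and must not be relied on.
    print(text, a == b, a is b)
```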

--
\S

under construction

Hendrik van Rooyen

Mar 5, 2009, 2:08:33 AM

to pytho...@python.org

"S Arrowsmith" <si...intbox.UUCP> wrote:

> "Small" integers get a similar treatment:
>
> >>> a = 256
> >>> b = 256
> >>> a is b
> True
> >>> a = 257
> >>> b = 257
> >>> a is b
> False

This is weird - I would have thought that the limit
of "small" would be at 255 - the biggest number to
fit in a byte. 256 takes two bytes, so it must be
an arbitrary limit - could have been set at 300,
or 30 000...

- Hendrik

Bruno Desthuilliers

Mar 5, 2009, 7:32:25 AM

Hendrik van Rooyen wrote:

> "S Arrowsmith" <si...intbox.UUCP> wrote:
>
>> "Small" integers get a similar treatment:
>>
>>>>> a = 256
>>>>> b = 256
>>>>> a is b
>> True
>>>>> a = 257
>>>>> b = 257
>>>>> a is b
>> False
>
> This is weird - I would have thought that the limit
> of "small" would be at 255 - the biggest number to
> fit in a byte. 256 takes two bytes, so it must be
> an arbitrary limit

It is, and has changed from version to version.

Bruno Desthuilliers

Mar 5, 2009, 7:40:48 AM

Avetis KAZARIAN wrote:

> Well, it's not about curiosity, it's more about performance.

> Steve Holden wrote:
(snip)

>> So, don't try to translate concepts from one language to another.
>

> I'll try ;]

Also and FWIW:

1/ Python has some very handy tools when it comes to perfs - like a
couple profilers (to identify bottlenecks), or the timeit module (for
quick benchmarks).

2/ Most "best practice" idioms are frequently discussed here

3/ If you have performance problems related to wrong algorithm/data
structure, some of us here _really_ enjoy helping !-)
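
For instance, a quick timeit benchmark of the two operators discussed
in this thread could look like the sketch below (Python 3 assumed; the
string length and repeat count are arbitrary). Keep in mind that the
two statements answer different questions, so the timing alone says
nothing about which one is correct to use.

```python
import timeit

# Two equal long strings built at runtime.
s1 = "".join(["x"] * 100000)
s2 = "".join(["x"] * 100000)

# Pass the names explicitly so the timed statements can see them.
names = {"s1": s1, "s2": s2}
eq_time = timeit.timeit("s1 == s2", globals=names, number=10000)
is_time = timeit.timeit("s1 is s2", globals=names, number=10000)

print("==: %.6f s for 10000 runs" % eq_time)
print("is: %.6f s for 10000 runs" % is_time)
```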

Welcome onboard.

Terry Reedy

Mar 5, 2009, 11:06:48 AM

to pytho...@python.org

Hendrik van Rooyen wrote:

> "S Arrowsmith" <si...intbox.UUCP> wrote:
>
>> "Small" integers get a similar treatment:
>>
>>>>> a = 256
>>>>> b = 256
>>>>> a is b
>> True
>>>>> a = 257
>>>>> b = 257
>>>>> a is b
>> False
>

> This is weird - I would have thought that the limit
> of "small" would be at 255 - the biggest number to
> fit in a byte. 256 takes two bytes, so it must be

> an arbitrary limit - could have been set at 300,
> or 30 000...

'Small' also goes to -10 or so. 256 was included, at minuscule cost,
because it is a relatively common number, being the number of bytes.

Terry Reedy

Mar 5, 2009, 11:36:28 AM

to pytho...@python.org

Terry Reedy wrote:
> Hendrik van Rooyen wrote:
>> "S Arrowsmith" <si...intbox.UUCP> wrote:
>>

>>> "Small" integers get a similar treatment:
>>>
>>>>>> a = 256
>>>>>> b = 256
>>>>>> a is b
>>> True
>>>>>> a = 257
>>>>>> b = 257
>>>>>> a is b
>>> False
>>

>> This is weird - I would have thought that the limit
>> of "small" would be at 255 - the biggest number to fit in a byte. 256
>> takes two bytes, so it must be

Ints take at least 4 bytes. It is commonness of usage that determined
caching. The range was expanded a few years ago in anticipation of the
new bytes type, whose contents are ints, not chars.

>> an arbitrary limit - could have been set at 300,
>> or 30 000...
>
> 'Small' also goes to -10 or so. 256 was included, at minuscule cost,
> because it is a relatively common number, being the number of bytes.

In fact, 3.0.1 starts with 36 internal references to the cached int 256!

>>> import sys
>>> sys.getrefcount(256)
38 # -2 for the function call

>>> sys.getrefcount(257)
2

>>> [sys.getrefcount(i)-2 for i in range(258)]

shows that only 15 cached ints start with more references. 0 has the
most with 724 (and that small actually goes to -5).

tjr