Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Interned Strings

71 views
Skip to first unread message

Dave

unread,
Jan 10, 2006, 5:31:47 PM1/10/06
to pytho...@python.org, zhu_...@yahoo.com
Hello All,

I'm trying to clarify how Python avoids byte by byte
string comparisons most of the time. As I understand,
dictionaries keep strings, their keys (hash values),
and caches of their keys. Caching keys helps to avoid
recalculation of a string's hash value. So, when two
strings need to be compared, only their cached keys
are compared, which improves performance as there is
no need for byte by byte comparison.

Also, there is a global interning dictionary that
keeps interned strings. What I don't understand is why
strings are interned. How does it help with string
comparisons?

Thank you.

__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com

"Martin v. Löwis"

unread,
Jan 10, 2006, 6:32:01 PM1/10/06
to Dave
Dave wrote:
> I'm trying to clarify how Python avoids byte by byte
> string comparisons most of the time. As I understand,
> dictionaries keep strings, their keys (hash values),
> and caches of their keys.

If you are talking about dictionaries with string keys,
you also have the dictionary values, of course.

> Caching keys helps to avoid
> recalculation of a string's hash value.

Correct (s/keys/string hash values/).

> So, when two
> strings need to be compared, only their cached keys
> are compared, which improves performance as there is
> no need for byte by byte comparison.

No. When comparing to strings s1 and s2, the following
operations occur:
1. is s1 and s2 the very same string? If yes, they
are equal.
2. else, do they have the same size, the same first
byte (which might be a null byte), and do they
compare equal, byte-by-byte?
If yes, they are equal, if not, they are not equal.
3. Is it perhaps some other compare operation (<,>

<=, >, >=) that we want to perform? Do the
slow algorithm.

As you can see, the string hash is never consulted when
comparing string objects. It is only consulted to
find the potential dictionary slot in the first place.

> Also, there is a global interning dictionary that
> keeps interned strings. What I don't understand is why
> strings are interned. How does it help with string
> comparisons?

Why you look up a dictionary entry, this happens:
1. compute the key hash.
2. find the corresponding dictionary slot
If the slot is empty, KeyError.
3. compare the slot key with the search key.
If they are equal: value found.
If they are different: collision, go to the
next key.

Interned strings speed up step 1 and step 3. If
you only have interned strings throughout, you always
also have the hash value. Of course, you had to
compute the hash value when adding the string
to the interning dictionary.

The real speedup is in step 3, and based on
the assumption that collisions won't happen:
if you lookup the key (e.g. to find the value
of a global variable), you find the slot using
the computed hash. Then:
1. if the slot is empty, it's a KeyError.
2. if the slot is not empty, you first compare
for pointer equality. As collisions are
supposedly unlikely, this will be an equal
string most of the time. Then, if you have
interning, it even will be the *same* string,
so you only need to compare pointers to find
out they are the same string.

So assuming all strings are interned, this is how
you do the dictionary lookup.
1. fetch the hash value from the key (no need to
compute it - it's already cached)
2. go to the slot in the dict.
3. if the slot is empty (==NULL): KeyError
4. otherwise: success.
As you can see, in that case, there is no need to
compare the string values. If there are collisions,
and if not all strings are interned, the algorithm
gets more complicated, but four items above are
assumed to be the common case.

Regards,
Martin

0 new messages