Comparing strings from the back?

Roy Smith

unread,

Sep 3, 2012, 9:54:01 PM9/3/12

to

There's been a bunch of threads lately about string implementations, and
that got me thinking (which is often a dangerous thing).

Let's assume you're testing two strings for equality. You've already
done the obvious quick tests (i.e they're the same length), and you're
down to the O(n) part of comparing every character.

I'm wondering if it might be faster to start at the ends of the strings
instead of at the beginning? If the strings are indeed equal, it's the
same amount of work starting from either end. But, if it turns out that
for real-life situations, the ends of strings have more entropy than the
beginnings, the odds are you'll discover that they're unequal quicker by
starting at the end.

It doesn't seem un-plausible that this is the case. For example, most
of the filenames I work with begin with "/home/roy/". Most of the
strings I use as memcache keys have one of a small number of prefixes.
Most of the strings I use as logger names have common leading
substrings. Things like credit card and telephone numbers tend to have
much more entropy in the trailing digits.

On the other hand, hostnames (and thus email addresses) exhibit the
opposite pattern.

Anyway, it's just a thought. Has anybody studied this for real-life
usage patterns?

I'm also not sure how this work with all the possible UCS/UTF encodings.
With some of them, you may get the encoding semantics wrong if you don't
start from the front.

Chris Angelico

unread,

Sep 3, 2012, 10:07:22 PM9/3/12

to pytho...@python.org

On Tue, Sep 4, 2012 at 11:54 AM, Roy Smith <r...@panix.com> wrote:
> I'm wondering if it might be faster to start at the ends of the strings
> instead of at the beginning?

> I'm also not sure how this work with all the possible UCS/UTF encodings.
> With some of them, you may get the encoding semantics wrong if you don't
> start from the front.

No problem there; Python uses only fixed-width encodings. Also, any
canonical encoding can be safely compared byte-for-byte; two identical
Unicode strings will be bit-wise identical in (say) UTF-8.

There's issues of cache locality and such that quite possibly mean
it's not going to be faster overall, but it wouldn't be difficult to
tweak the Python sources, recompile, and run some tests. I'm sure
python-dev or python-list will be most happy to discuss some benchmark
figures!

ChrisA

Steven D'Aprano

unread,

Sep 3, 2012, 10:17:30 PM9/3/12

to

On Mon, 03 Sep 2012 21:54:01 -0400, Roy Smith wrote:

> Let's assume you're testing two strings for equality. You've already
> done the obvious quick tests (i.e they're the same length), and you're
> down to the O(n) part of comparing every character.
>
> I'm wondering if it might be faster to start at the ends of the strings
> instead of at the beginning? If the strings are indeed equal, it's the
> same amount of work starting from either end. But, if it turns out that
> for real-life situations, the ends of strings have more entropy than the
> beginnings, the odds are you'll discover that they're unequal quicker by
> starting at the end.

And if the strings have the same end and different beginnings:

running
jumping
cooking

then you'll find out quicker by starting at the beginning.

No general-purpose string comparison operator can make assumptions about
the content of the strings, like "they are nearly always pathnames on
Unix-like systems like /home/user/x and /home/user/y".

I suppose you could have two different operators, == for "test from the
front" and === for "test from the back", and push that decision onto the
user, who will then spend hours angsting about the "most efficient"
equality test and days painstakingly profiling his data to save four
nanoseconds at runtime.

Besides, then somebody will say "Yes, but what about the cases where the
prefix and the suffix are both equal, but the middle will be different?"
and propose a third string-equality operator ==== and then there will be
bloodshed.

On average, string equality needs to check half the characters in the
string. Any individual comparisons may require fewer, or more, than that,
but whichever way you scan the string, some comparisons will take more
and some will take fewer. With no general way of telling ahead of time
which will be which, on average you can't do better than O(N/2) and no
complexity justification for picking "start from the end" from "start
from the start".

--
Steven

Terry Reedy

unread,

Sep 4, 2012, 1:13:18 AM9/4/12

to pytho...@python.org

On 9/3/2012 9:54 PM, Roy Smith wrote:
> There's been a bunch of threads lately about string implementations, and
> that got me thinking (which is often a dangerous thing).
>
> Let's assume you're testing two strings for equality. You've already
> done the obvious quick tests (i.e they're the same length), and you're
> down to the O(n) part of comparing every character.
>
> I'm wondering if it might be faster to start at the ends of the strings
> instead of at the beginning?

I posted the 3.3 str (in)equality compare a few days ago. It first
compares length and then kind, then the actual bytes in whatever chunks.
If the latter uses the C/system mem compare, that is faster than
anything coded in Python or even C.

--
Terry Jan Reedy

Dan Sommers

unread,

Sep 4, 2012, 12:56:51 AM9/4/12

to Steven D'Aprano, pytho...@python.org

On 2012-09-04 at 02:17:30 +0000,

Steven D'Aprano <steve+comp....@pearwood.info> wrote:

> Besides, then somebody will say "Yes, but what about the cases where
> the prefix and the suffix are both equal, but the middle will be
> different?" and propose a third string-equality operator ==== and
> then there will be bloodshed.

Lisp has several equality forms, although they're for different kinds or
levels of equality rather than for different algorithms for determining
that equality. I imagine that a discussion similar to this one spawned
many of those forms. I agree with Steven: decisions such as this
belong to the application developer, not the standard library. Don't
forget the fourth string-equality operator for case-insensitive
comparisons and the fifth string-equality operator for strings that are
likely to be palindromes.

That said, if I really wanted bloodshed, I would propose = for the third
string-equality operator! ;-)

Dan

Mark Lawrence

unread,

Sep 4, 2012, 3:50:58 AM9/4/12

to pytho...@python.org

On 04/09/2012 05:56, Dan Sommers wrote:
>
> That said, if I really wanted bloodshed, I would propose = for the third
> string-equality operator! ;-)
>
> Dan
>

Dan "agent provocateur" Sommers? :)

--
Cheers.

Mark Lawrence.

Mark Lawrence

unread,

Sep 4, 2012, 3:56:11 AM9/4/12

to pytho...@python.org

Good grief, what timing!!! The dust has yet to settle over the
"Flexible string representation, unicode, typography, ..." thread and
you ask this. I'll just cross my fingers and hope that people stick
with facts, realistic benchmarks etc, and not good old fashioned FUD.

--
Cheers.

Mark Lawrence.

Alain Ketterlin

unread,

Sep 4, 2012, 5:58:08 AM9/4/12

to

Roy Smith <r...@panix.com> writes:

> There's been a bunch of threads lately about string implementations, and
> that got me thinking (which is often a dangerous thing).
>
> Let's assume you're testing two strings for equality. You've already
> done the obvious quick tests (i.e they're the same length), and you're
> down to the O(n) part of comparing every character.
>
> I'm wondering if it might be faster to start at the ends of the strings
> instead of at the beginning? If the strings are indeed equal, it's the
> same amount of work starting from either end. But, if it turns out that
> for real-life situations, the ends of strings have more entropy than the
> beginnings, the odds are you'll discover that they're unequal quicker by
> starting at the end.

I guess we all have examples with specific entropy distribution. Going
backwards on long strings can be profitable with file paths. If that is
the case, and if you spend a lot of time comparing strings, why not
store them in reverse?

> It doesn't seem un-plausible that this is the case. For example, most
> of the filenames I work with begin with "/home/roy/". Most of the
> strings I use as memcache keys have one of a small number of prefixes.
> Most of the strings I use as logger names have common leading
> substrings.

In that case I would split the paths, and use some sort of string-id, or
use intern() and then compare components with "is" instead of "==".
(Basically, turning the string of chars into a strings of
ids/ints/pointers.)

> Things like credit card and telephone numbers tend to have much more
> entropy in the trailing digits. On the other hand, hostnames (and thus
> email addresses) exhibit the opposite pattern.

Yes, real-world is a mess.

> Anyway, it's just a thought. Has anybody studied this for real-life
> usage patterns?

I've tried the "intern()" trick several times with lists of strings, and
was satisfied, but this was in specific cases (with a finite set of
strings). For "short" strings of chars (up to, say a hundred of
characters), I've never found anything significantly faster than the
default sweep (that was in C/C++, not python). For longer and/or more
structured strings/lists, making use of the structure is a good idea,
but I see no general pattern here (as you note).

> I'm also not sure how this work with all the possible UCS/UTF encodings.
> With some of them, you may get the encoding semantics wrong if you don't
> start from the front.

If you're testing for equality, I can't see how this could matter, even
with variable-length encodings.

If you're comparing different encodings, then you need different access
methods to random characters (but this doesn't affect the algorithm). If
you're using variable-length encoding, e.g., UTF-8, accessing at a
specific position is not possible.

-- Alain.

Johannes Bauer

unread,

Sep 4, 2012, 12:32:57 PM9/4/12

to

On 04.09.2012 04:17, Steven D'Aprano wrote:

> On average, string equality needs to check half the characters in the
> string.

How do you arrive at that conclusion? When comparing two random strings,
I just derived

n = (256 / 255) * (1 - 256 ^ (-c))

where n is the average number of character comparisons and c. The
rationale as follows: The first character has to be compared in any
case. The second with a probability of 1/256, the third with 1/(256^2)
and so on.

Best regards,
Johannes

--
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?
> Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.
- Karl Kaos über Rüdiger Thomas in dsa <hidbv3$om2$1...@speranza.aioe.org>

Steven D'Aprano

unread,

Sep 4, 2012, 2:07:29 PM9/4/12

to

On Tue, 04 Sep 2012 18:32:57 +0200, Johannes Bauer wrote:

> On 04.09.2012 04:17, Steven D'Aprano wrote:
>
>> On average, string equality needs to check half the characters in the
>> string.
>
> How do you arrive at that conclusion?

Take two non-empty strings of the same length, N. If the strings are
equal, you have to make N comparisons to be sure they are equal (you
check all N pairs of characters). If they are unequal, you have to check
each pair of characters up to the first pair that are different:

def string_equality(astr, bstr):
# Assumes lengths are equal.
for achr, bchr in zip(astr, bstr):
if achr != bchr:
return False
return True

For the unequal case, how many comparisons do you do? Ahead of time, we
know very little about the strings. We don't even know how many possible
characters there are (are they English alphanumeric, ASCII, Greek, Thai,
or from the full range of 1114111 Unicode code points?) or what their
probability distribution is.

A reasonable, conservative assumption is to calculate the largest
possible value of the average for random strings. That largest value
occurs when the alphabet is as small as possible, namely two characters.
In practice, strings come from a larger alphabet, up to 1114111 different
characters for full Unicode strings, so the average for them will be less
than the average we calculate now.

So for unequal strings, the number of comparisons is equally likely to be
1, 2, 3, ..., N. The average then is:

sum([1, 2, 3, ..., N])/N

which by a bit of simple algebra works out to be (N+1)/2, or half the
characters as I said.

(Note that this average assumes the strings are completely random. In
practice, strings tend to come from strongly non-uniform distributions of
characters. For instance, in English texts, 'e' is much more likely than
'q' and most characters are vanishingly rare, and in practice, strings
are likely to be highly non-random.)

--
Steven

Chris Angelico

unread,

Sep 4, 2012, 5:59:43 PM9/4/12

to pytho...@python.org

On Wed, Sep 5, 2012 at 2:32 AM, Johannes Bauer <dfnson...@gmx.de> wrote:
> How do you arrive at that conclusion? When comparing two random strings,
> I just derived
>
> n = (256 / 255) * (1 - 256 ^ (-c))
>
> where n is the average number of character comparisons and c. The
> rationale as follows: The first character has to be compared in any
> case. The second with a probability of 1/256, the third with 1/(256^2)
> and so on.

That would be for comparing two random areas of memory. Python strings
don't have 256 options per character; and in terms of actual strings,
there's so many possibilities. The strings that a program is going to
compare for equality are going to use a vastly restricted alphabet;
for a lot of cases, there might be only a few dozen plausible
characters.

But even so, it's going to scale approximately linearly with the
string length. If they're really random, then yes, there's little
chance that either a 1MB string or a 2MB string will be the same, but
with real data, they might very well have a long common prefix. So
it's still going to be more or less O(n).

ChrisA

Neil Hodgson

unread,

Sep 4, 2012, 10:18:28 PM9/4/12

to

Roy Smith:

> I'm wondering if it might be faster to start at the ends of the strings
> instead of at the beginning? If the strings are indeed equal, it's the
> same amount of work starting from either end.

Most people write loops that go forwards. This leads to the
processor designers prioritizing detection mechanisms like cache
prefetching for that case.

However, its not always the situation: a couple of years ago Intel
contributed a memcpy implementation to glibc that went backwards for a
performance improvement. An explanation from a related patch involves
speculative and committed operations and address bits on some processors
and quickly lost me:
paragraph 3 of
http://lists.freedesktop.org/archives/pixman/2010-August/000423.html

The memcpy patch was controversial as it broke Adobe Flash which
assumed memcpy was safe like memmove.

Neil

MRAB

unread,

Sep 4, 2012, 10:39:01 PM9/4/12

to pytho...@python.org

I don't think I've ever use memcpy, only memmove.

Roy Smith

unread,

Sep 4, 2012, 10:48:20 PM9/4/12

to

In article <-9CdnaqjTK6NKtvN...@westnet.com.au>,

Neil Hodgson <nhod...@iinet.net.au> wrote:

> The memcpy patch was controversial as it broke Adobe Flash

An added benefit!

Peter Otten

unread,

Sep 5, 2012, 4:29:14 AM9/5/12

to pytho...@python.org

Roy Smith wrote:

> There's been a bunch of threads lately about string implementations, and
> that got me thinking (which is often a dangerous thing).
>
> Let's assume you're testing two strings for equality. You've already
> done the obvious quick tests (i.e they're the same length), and you're
> down to the O(n) part of comparing every character.
>
> I'm wondering if it might be faster to start at the ends of the strings
> instead of at the beginning? If the strings are indeed equal, it's the
> same amount of work starting from either end. But, if it turns out that
> for real-life situations, the ends of strings have more entropy than the
> beginnings, the odds are you'll discover that they're unequal quicker by
> starting at the end.
>
> It doesn't seem un-plausible that this is the case. For example, most
> of the filenames I work with begin with "/home/roy/". Most of the
> strings I use as memcache keys have one of a small number of prefixes.
> Most of the strings I use as logger names have common leading
> substrings. Things like credit card and telephone numbers tend to have
> much more entropy in the trailing digits.

But do you actually compare them with each other? Even if you do I'd guess
that Python looks at the hash values most of the time.

> On the other hand, hostnames (and thus email addresses) exhibit the
> opposite pattern.

Nor do english words:

$ python3 reverse_compare2.py compare /usr/share/dict/words --word-len 8

comparing every pair in a sample of 1000 8-char words
taken from '/usr/share/dict/words'

head
1: 477222 ************************************************************
2: 18870 **
3: 2870
4: 435
5: 74
6: 17
7: 12

tail
1: 386633 ************************************************************
2: 74966 ***********
3: 29698 ****
4: 6536 *
5: 1475
6: 154
7: 28
8: 10

Chris Angelico

unread,

Sep 5, 2012, 4:33:04 AM9/5/12

to pytho...@python.org

On Wed, Sep 5, 2012 at 6:29 PM, Peter Otten <__pet...@web.de> wrote:
> comparing every pair in a sample of 1000 8-char words
> taken from '/usr/share/dict/words'
>
> head
> 1: 477222 ************************************************************
> 2: 18870 **

> ...

Not understanding this. What are the statistics, and what (if it's not
obvious from the previous answer) do they prove?

ChrisA

Johannes Bauer

unread,

Sep 5, 2012, 5:17:33 AM9/5/12

to

On 04.09.2012 20:07, Steven D'Aprano wrote:
> A reasonable, conservative assumption is to calculate the largest
> possible value of the average for random strings. That largest value
> occurs when the alphabet is as small as possible, namely two characters.
> In practice, strings come from a larger alphabet, up to 1114111 different
> characters for full Unicode strings, so the average for them will be less
> than the average we calculate now.
>
> So for unequal strings, the number of comparisons is equally likely to be
> 1, 2, 3, ..., N. The average then is:

That is exactly what I don't agree with. For unequal random strings, the
distribution is *not* equally likely to have the same amount of
comparisons. For demonstration, let's look at bitstrings and the string
"1010". There 15 bitstrings of same length which are unequal and I'll
put the number of comparisons needed next to them going left to right:

0000 1
0001 1
0010 1
0011 1
0100 1
0101 1
0110 1
0111 1
1100 2
1101 2
1110 2
1111 2
1000 3
1001 3
1011 4

i.e. about 1.73 (26/15) comparisons in average with random strings
compared to the 2.5 comparisons you end up with. My formula does respect
lazy termination (because the probabilty of higher comparisons gets
exponentially lower).

> sum([1, 2, 3, ..., N])/N
>
> which by a bit of simple algebra works out to be (N+1)/2, or half the
> characters as I said.

Yes, but it's a flawed assumption you're making since you're neglecting
lazy evaluation and early abort of comparison.

Johannes Bauer

unread,

Sep 5, 2012, 5:24:52 AM9/5/12

to

On 04.09.2012 23:59, Chris Angelico wrote:

>> n = (256 / 255) * (1 - 256 ^ (-c))
>>
>> where n is the average number of character comparisons and c. The
>> rationale as follows: The first character has to be compared in any
>> case. The second with a probability of 1/256, the third with 1/(256^2)
>> and so on.
>
> That would be for comparing two random areas of memory. Python strings
> don't have 256 options per character; and in terms of actual strings,
> there's so many possibilities.

The unit of comparison is completely arbitraty and can equally be used
to compare bitstrings if changed accordingly.

> The strings that a program is going to
> compare for equality are going to use a vastly restricted alphabet;
> for a lot of cases, there might be only a few dozen plausible
> characters.

This is very true. But if so, I would like to see the assumtion that
you're making (which might very well be plausible) in order to arrive at
a average of N/2 comparisons (which just seems *way* off).

> But even so, it's going to scale approximately linearly with the
> string length. If they're really random, then yes, there's little
> chance that either a 1MB string or a 2MB string will be the same, but
> with real data, they might very well have a long common prefix. So
> it's still going to be more or less O(n).

I understand what you're saying and surely this is true to some extent
(someone mentioned filenames, which is an excellent example). However in
order to derive a *formula* you need to formularize your assumtions.
With random data, it's definitely not O(n). And even in reality (with a
more limited alphabet) it usually will early abort:

Consider sorting a large dictionary. Sorting is in O(n log(n)), but this
assumes O(1) comparisons! If the comparison operation itself were in
O(n), the total sorting complexity would be O(n^2 log(n)), which is
definitely false. Most comparisons will abort *very* early (after the
first character).

Best regards,
Johannes

--
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?

> Zumindest nicht �ffentlich!
Ah, der neueste und bis heute genialste Streich unsere gro�en
Kosmologen: Die Geheim-Vorhersage.
- Karl Kaos �ber R�diger Thomas in dsa <hidbv3$om2$1...@speranza.aioe.org>

Johannes Bauer

unread,

Sep 5, 2012, 5:43:08 AM9/5/12

to

On 05.09.2012 11:24, Johannes Bauer wrote:

> Consider sorting a large dictionary. Sorting is in O(n log(n)), but this
> assumes O(1) comparisons! If the comparison operation itself were in
> O(n), the total sorting complexity would be O(n^2 log(n)), which is
> definitely false. Most comparisons will abort *very* early (after the
> first character).

I've just written a program to demonstrate this (below it's attached). I
used my aspell dictionary, which contains about 100k words. To have
complexity down I went for only words wich occur most often (in this
case 9 characters or ~13k words). Here's the results (note it's
randomized, so they'll always differ a little, but are pretty stable)

Max Cmp: 100
Sum Cmp: 32111
Avg Cmp: 2.393842254361115
Num words : 13414
Min wordlen: 9
Max wordlen: 9
Avg wordlen: 9.0

Going up in wordlength (only showing part now)

Avg Cmp: 2.2374929735806632
Num words : 10674
Avg wordlen: 10.0

Avg Cmp: 2.261727078891258
Num words : 7504
Avg wordlen: 11.0

Avg Cmp: 2.2335647202939977
Num words : 4898
Avg wordlen: 12.0

Avg Cmp: 2.1743341404358354
Num words : 2891
Avg wordlen: 13.0

Avg Cmp: 2.156782549420586
Num words : 1467
Avg wordlen: 14.0

Avg Cmp: 2.112449799196787
Num words : 747
Avg wordlen: 15.0

So, interestingly, in this real-world case(tm), the complexity does not
scale with O(n). It actually goes down (which is probably due to the
limit amount of words of higher length).

In summary, let's please not debate "feelings" or such. Either
measurements (with clear indiciation on what data they're based on and
how the test was conducted) or derivations (with an explanation of the
assumtions). Feelings can be very misleading in cases like this.

Best regards,
Johannes

#!/usr/bin/python3
import random

class Word():
def __init__(self, s):
self._s = s
self._c = 0

def __lt__(self, other):
if len(self) < len(other):
return True
elif len(self) > len(other):
return False

for i in range(len(self)):
self._c += 1
if self._s[i] < other._s[i]:
return True
return False

def __repr__(self):
return "%s<%d>" % (self._s, self._c)

def __len__(self):
return len(self._s)

def getcmp(self):
return self._c

wordlist = [ ]
for line in open("dict", "r"):
word = Word(line[:-1])
if len(word) == 9:
wordlist.append(word)

random.shuffle(wordlist)
wordlist.sort()

comparisons = [ word.getcmp() for word in wordlist ]
print("Max Cmp:", max(comparisons))
print("Sum Cmp:", sum(comparisons))
print("Avg Cmp:", sum(comparisons) / len(comparisons))

wordlengths = [ len(word) for word in wordlist ]
print("Min wordlen:", min(wordlengths))
print("Max wordlen:", max(wordlengths))
print("Avg wordlen:", sum(wordlengths) / len(wordlengths))

Peter Otten

unread,

Sep 5, 2012, 5:48:38 AM9/5/12

to pytho...@python.org

Chris Angelico wrote:

> On Wed, Sep 5, 2012 at 6:29 PM, Peter Otten <__pet...@web.de> wrote:
>> comparing every pair in a sample of 1000 8-char words
>> taken from '/usr/share/dict/words'
>>
>> head
>> 1: 477222 ************************************************************
>> 2: 18870 **
>> ...

tail
1: 386633 ************************************************************
2: 74966 ***********

> Not understanding this. What are the statistics,

I measured how many chars they have in common for all combinations of 1000
words taken from /usr/share/dict/words.

477222 pairs have one char in common, 18870 pairs have two chars in common
when compared from the *beginning* of the string.

386633 pairs have one char in common, 74966 pairs have two chars in common
when compared from the *end* of the string.

and what (if it's not obvious from the previous answer) do they prove?

They demonstrate that for two words from that particular corpus it is likely
that a[::-1] == b[::-1] has to take more (1.34 versus 1.05 on average)
characters into account than a == b, i. e. comparing from the back should be
slower rather than faster.

If that doesn't help, here's the code ;)

import random
import itertools
from contextlib import contextmanager
from collections import defaultdict

@contextmanager
def open_corpus(args):
with open(args.name) as instream:
words = (line.strip() for line in instream)
#words = (word for word in words if not word.endswith("'s"))
yield words

def pick(items, count):
items = list(items)
return random.sample(items, count)

def count_common(a, b):
i = 0
for i, (x, y) in enumerate(zip(a, b), 1):
if x != y:
break
return i

def corpus(args):
with open_corpus(args) as words:
wordlens = (len(line.strip()) for line in words)
freq = defaultdict(int)
for wordlen in wordlens:
freq[wordlen] += 1
show_histogram(freq)

def show_histogram(freq):
peak = max(freq.values())
freq_width = len(str(peak))
max_freq = max(freq)
len_width = len(str(max_freq))
for i in range(min(freq), max_freq+1):
value = freq[i]
print("{:{}}: {:{}} {}".format(
i, len_width, value,
freq_width, "*" * (60 * value // peak)))

def compare(args):
with open_corpus(args) as words:
words = pick(
(word for word in words if len(word) == args.word_len),
args.sample_size)
n = 0
head_common = defaultdict(int)
tail_common = defaultdict(int)
n_tail = n_head = 0
for n, (a, b) in enumerate(itertools.combinations(words, 2), 1):
cc = count_common(a, b)
n_head += cc
head_common[cc] += 1
ccr = count_common(a[::-1], b[::-1])
n_tail += ccr
tail_common[ccr] += 1

print(n, "combinations")
print("average common chars (head) {:.3}".format(n_head/n))
print("average common chars (tail) {:.3}".format(n_tail/n))

print("comparing every pair in a "
"sample of {sample} {len}-char words\n"
"taken from {name!r}".format(
sample=args.sample_size,
len=args.word_len,
name=args.name))
print()
print("head")
show_histogram(head_common)
print()
print("tail")
show_histogram(tail_common)

def main():
import argparse
parser = argparse.ArgumentParser()
parsers = parser.add_subparsers()

pcorpus = parsers.add_parser("corpus")
pcorpus.add_argument("name")
pcorpus.set_defaults(func=corpus)

pcompare = parsers.add_parser("compare")
pcompare.add_argument("name")
pcompare.add_argument("--sample-size", type=int, default=1000)
pcompare.add_argument("--word-len", type=int, default=8)
pcompare.set_defaults(func=compare)
args = parser.parse_args()
args.func(args)

if __name__ == "__main__":
main()

Steven D'Aprano

unread,

Sep 5, 2012, 10:30:07 AM9/5/12

to

On Wed, 05 Sep 2012 11:43:08 +0200, Johannes Bauer wrote:

> On 05.09.2012 11:24, Johannes Bauer wrote:
>
>> Consider sorting a large dictionary. Sorting is in O(n log(n)), but
>> this assumes O(1) comparisons! If the comparison operation itself were
>> in O(n), the total sorting complexity would be O(n^2 log(n)),

This shows some significant confusion about Big Oh complexity analysis
here. When you say that sorting the list of words is O(N log N), the N
refers to the number of words in the dictionary. But when you speak about
comparisons being O(N), the N doesn't refer to the size of the dict, but
the size of the individual strings being compared. There is no reasonable
comparison algorithm where the effort required to compare two strings
depends on the size of the dictionary they have come from.

Since these are *completely different Ns*, you can't combine them to get
O(N**2 log N) as the overall algorithmic complexity. Doing so is sheer
nonsense. If you state it like this:

sorting the list of words is O(N log N) where N = length of the list
comparing the words is O(M) where M = the average length of the words

then the overall complexity is O(N log N)*O(M), but since M is a constant
for any particular dictionary (its just the average length of the words)
that reduces down to O(N log N).

To say that sorting a dictionary is O(N log N) does *not* imply that
string comparisons are O(1). It only implies that string comparisons
don't depend on the size of the dictionary. Comparing:

s1 = 'a'*30002
s2 = 'a'*30001 + 'b'

takes the same amount of time whether they are in a dictionary of 100
words or one of 100000 words. For the purposes of calculating the
algorithmic complexity of sorting a large list of words, we can treat
comparisons as an elementary operation even though they actually aren't.

>> which is
>> definitely false. Most comparisons will abort *very* early (after the
>> first character).

You are making unjustified assumptions about the distribution of letters
in the words. This might be a list of long chemical compounds where the
words typically differ only in their suffix. It might be a list of people
with titles:

Herr Professor Frederick Schmidt
Herr Professor Frederick Wagner
...

But either way, it doesn't matter for the Big Oh behaviour of sorting the
words.

> I've just written a program to demonstrate this (below it's attached).

I have no idea why you think this program demonstrates anything about
string equality comparisons. What you are counting is not the average
number of comparisons for string equality, but the number of comparisons
when sorting. These are *completely different*, because sorting a list
does not require testing each string against every other string.

Whatever you have measured, it has nothing to do with the Big Oh
complexity of string comparisons.

It seems to me that you have misunderstood the purpose of Big Oh
notation. It gives an asymptotic upper bound on the amount of work done,
ignoring all lower terms and multiplicative factors, not an exact formula.

If something is O(N), then this tells you:

1) the number of algorithmic steps is proportional to N, plus or
minus some terms which are less than N (e.g. sqrt(N), or log(N),
or 1/N, or some constant, etc.);

2) if you ignore those other terms, and keep all other influences
equal, then doubling N will *at worst* double the number of
algorithmic steps;

3) if all those algorithmic steps take exactly the same amount of
time, then doubling N will take twice as much time.

But of course *none* of those assumptions hold for measurements of actual
elapsed time. In real life, operations rarely take constant time, you
don't always see worst case behaviour, and the lower terms are not
necessarily insignificant enough to ignore.

[...]

> So, interestingly, in this real-world case(tm), the complexity does not
> scale with O(n). It actually goes down (which is probably due to the
> limit amount of words of higher length).

I would guess that it probably has more to do with the ability of the
timsort algorithm to exploit local order within a list and avoid
performing comparisons.

> In summary, let's please not debate "feelings" or such.

I have no idea what feelings you are referring to.

--
Steven

Johannes Bauer

unread,

Sep 5, 2012, 10:33:17 AM9/5/12

to

On 05.09.2012 04:18, Neil Hodgson wrote:

> The memcpy patch was controversial as it broke Adobe Flash which
> assumed memcpy was safe like memmove.

Adobe Flash was broken before, making an assumption that is not
guaranteed by the standard. The patch only showed the problem.

Regards,
Johannes

Johannes Bauer

unread,

Sep 5, 2012, 10:51:10 AM9/5/12

to

On 05.09.2012 16:30, Steven D'Aprano wrote:

> Since these are *completely different Ns*, you can't combine them to get
> O(N**2 log N) as the overall algorithmic complexity. Doing so is sheer
> nonsense. If you state it like this:

Yes, you are correct here.

> You are making unjustified assumptions about the distribution of letters
> in the words. This might be a list of long chemical compounds where the
> words typically differ only in their suffix. It might be a list of people
> with titles:

Actually, I'm not. I'm stating exactly what assumptions I'm making to
get my calculation. I'm comparing *random* character strings or bitstrings.

You, on the other hand, are making vague assumptions which you do not
care for formalize and yet you claim that "the number of comparisons is
equally likely to be 1, 2, 3, ..., N. The average then is". Without any
explanation for this. At all.

> Herr Professor Frederick Schmidt
> Herr Professor Frederick Wagner
> ...

Is your assumtion that we're comparing words that have the common prefix
"Herr Professor Frederick "? Why don't you clearly state what you're
comparing. In the example you gave in the other thread you claimed "Note
that this average assumes the strings are completely random.", yet now
you're talking about specific word patterns.

> I have no idea why you think this program demonstrates anything about
> string equality comparisons. What you are counting is not the average
> number of comparisons for string equality, but the number of comparisons
> when sorting. These are *completely different*, because sorting a list
> does not require testing each string against every other string.

You're wrong. It's not counting the number of string comparisons, it's
counting the number of character comparisons. Which is exactly what
we're talking about in this thread, are we not? The sorting of the list
is just to show some example where lots of strings are compared.

>> So, interestingly, in this real-world case(tm), the complexity does not
>> scale with O(n). It actually goes down (which is probably due to the
>> limit amount of words of higher length).
>
> I would guess that it probably has more to do with the ability of the
> timsort algorithm to exploit local order within a list and avoid
> performing comparisons.

Again, it's not counting the number of string comparisons. It's counting
the average number of character comparisons per string comparison. The
total amount of string comparisons has nothing to do with it.

>> In summary, let's please not debate "feelings" or such.
>
> I have no idea what feelings you are referring to.

I'm referring to "the number of comparisons is equally likely to be 1,
2, 3, ..., N. The average then is" without any deduction or rationale.
Without any explanation or assumption.

Regards,
Johannes

--
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?

> Zumindest nicht öffentlich!
Ah, der neueste und bis heute genialste Streich unsere großen
Kosmologen: Die Geheim-Vorhersage.

Peter Otten

unread,

Sep 5, 2012, 11:45:55 AM9/5/12

to pytho...@python.org

Oscar Benjamin wrote:

> On 5 September 2012 10:48, Peter Otten <__pet...@web.de> wrote:

>> def count_common(a, b):

[sorry for seriously broken implementation]

> This function will return 1 if the first character differs. It does not
> count the number of common characters but rather the more relevant
> quantity which is the number of comparisons required to decide if the
> strings are equal.

I remember stumbling over the non-zero frequency for len(word) but somehow
plodded on with beautifying the broken results :(

def count_common(a, b):
"""
>>> count_common("alpha", "beta")
0
>>> count_common("", "")
0
>>> count_common("alpha", "argument")
1
>>> count_common("alpha", "alpha")
5
>>> count_common("alpha", "alphx")
4
"""
i = 0
for x, y in zip(a, b):

if x != y:
break

i += 1
return i

$ python3 reverse_compare2.py compare /usr/share/dict/words

499500 combinations
average common chars (head) 0.0535
average common chars (tail) 0.322

comparing every pair in a sample of 1000 8-char words
taken from '/usr/share/dict/words'

head

0: 477371 ************************************************************
1: 18310 **
2: 3191
3: 523
4: 79
5: 15
6: 10
7: 1

tail
0: 385069 ************************************************************
1: 78742 ************
2: 26734 ****
3: 7462 *
4: 1297
5: 168
6: 22
7: 6

Steven D'Aprano

unread,

Sep 5, 2012, 12:24:27 PM9/5/12

to

On Wed, 05 Sep 2012 16:51:10 +0200, Johannes Bauer wrote:
[...]

>> You are making unjustified assumptions about the distribution of
>> letters in the words. This might be a list of long chemical compounds
>> where the words typically differ only in their suffix. It might be a
>> list of people with titles:
>
> Actually, I'm not. I'm stating exactly what assumptions I'm making to
> get my calculation. I'm comparing *random* character strings or
> bitstrings.

Excuse me, you are not. You are comparing English words which are highly
non-random.

> You, on the other hand, are making vague assumptions which you do not
> care for formalize and yet you claim that "the number of comparisons is
> equally likely to be 1, 2, 3, ..., N. The average then is". Without any
> explanation for this. At all.

I will accept that my explanation was not good enough, but I strongly
disagree that I gave no explanation at all.

>> Herr Professor Frederick Schmidt
>> Herr Professor Frederick Wagner
>> ...
>
> Is your assumtion that we're comparing words that have the common prefix
> "Herr Professor Frederick "?

No, I am pointing out that *your* assumption that most string comparisons
will halt close to the beginning of the string is an invalid assumption.
Your assumption only holds for some non-random strings.

[...]

>> I have no idea why you think this program demonstrates anything about
>> string equality comparisons. What you are counting is not the average
>> number of comparisons for string equality, but the number of
>> comparisons when sorting. These are *completely different*, because
>> sorting a list does not require testing each string against every other
>> string.
>
> You're wrong. It's not counting the number of string comparisons,

I didn't say it was.

> it's counting the number of character comparisons.

Correct. But only out of an extremely limited subset of all possible
string comparisons, namely those very few that happen when sorting lists
of English words using a very specific algorithm, namely timsort.

> Which is exactly what
> we're talking about in this thread, are we not? The sorting of the list
> is just to show some example where lots of strings are compared.

It is not good enough to extract a non-random subset of strings (only
English words) and then decrease the data set even further by only
comparing an extremely non-uniform subset of those strings: specifically
those few string comparisons that happen to occur using timsort.

Whatever figure you have calculated by taking this non-random selection,
it is irrelevant to the question of the average performance of string
equality given random strings.

>>> So, interestingly, in this real-world case(tm), the complexity does
>>> not scale with O(n). It actually goes down (which is probably due to
>>> the limit amount of words of higher length).
>>
>> I would guess that it probably has more to do with the ability of the
>> timsort algorithm to exploit local order within a list and avoid
>> performing comparisons.
>
> Again, it's not counting the number of string comparisons. It's counting
> the average number of character comparisons per string comparison. The
> total amount of string comparisons has nothing to do with it.

I didn't say anything about counting string comparisons. But you are
wrong -- the numbers you get are *strongly* influenced by the number of
string comparisons performed. That is just one of the reasons why the
results you get are biased.

Let me spell it out for you: given a list of five letter words:

['fruit', 'apple', 'claim', 'pears', ... , 'toast']

sorting the list may compare:

'fruit' with 'apple'
'apple' with 'claim'

but almost certainly will not compare:

'apple' with 'toast'

and it definitely won't compare:

'zzzax' with 'zzzaq'

because strings like those never get into the list in the first place.

For the sake of simple calculations, let's pretend that there are only
1000 five-letter strings possible. We want to know how many character-
comparisons are done on average when testing two random five-letter
strings for equality. To calculate this average, you must compare every
string to every other string, *including* itself.

This gives 1000*1000 = one million equality tests to be performed. For
each equality test, we count the number of character comparisons
performed, sum all those counts, and divide by 1e6. That is the average
number of char comparisons for random strings of five letters.

But that's not what you do. First you eliminate 900 out of the 1000
possible strings by only sampling English words -- strings like "xxoij"
cannot possibly be selected. So you are left with the highly non-random
subset of 10000 equality tests.

But you haven't finished. Instead of checking all 10000 equality tests,
you then decide to only check the 2000 tests that happen to randomly
occur when using the timsort algorithm to sort a list.

So from the population of one million possible equality tests, you throw
away the vast majority, then average the *strongly non-random* few that
are remaining. Naturally the result you get is strongly biased, and it
has nothing to do with the average Big Oh behaviour of equality on random
strings.

--
Steven

Oscar Benjamin

unread,

Sep 5, 2012, 6:47:14 PM9/5/12

to pytho...@python.org

In news.gmane.comp.python.general, you wrote:
> On Wed, 05 Sep 2012 16:51:10 +0200, Johannes Bauer wrote:
> [...]
>>> You are making unjustified assumptions about the distribution of
>>> letters in the words. This might be a list of long chemical compounds
>>> where the words typically differ only in their suffix. It might be a
>>> list of people with titles:
>>
>> Actually, I'm not. I'm stating exactly what assumptions I'm making to
>> get my calculation. I'm comparing *random* character strings or
>> bitstrings.
>
> Excuse me, you are not. You are comparing English words which are highly
> non-random.

Evidently we have different understandings of what 'random' means. I don't
think it's unreasonable to say that strings drawn uniformly from the set of
all strings in the English language (having a given number of characters) is
random. The distribution is not uniform over the set of all possible character
strings but it is still random. I think Johannes deliberately chose these
strings to emulate a particular kind of 'real' distribution of strings that
might occur in practise.

>
>
>> You, on the other hand, are making vague assumptions which you do not
>> care for formalize and yet you claim that "the number of comparisons is
>> equally likely to be 1, 2, 3, ..., N. The average then is". Without any
>> explanation for this. At all.
>
> I will accept that my explanation was not good enough, but I strongly
> disagree that I gave no explanation at all.
>
>
>>> Herr Professor Frederick Schmidt
>>> Herr Professor Frederick Wagner
>>> ...
>>
>> Is your assumtion that we're comparing words that have the common prefix
>> "Herr Professor Frederick "?
>
> No, I am pointing out that *your* assumption that most string comparisons
> will halt close to the beginning of the string is an invalid assumption.
> Your assumption only holds for some non-random strings.

I think you have this backwards. The case where this assumption is provably
true is precisely for random strings. To be clear, when I say 'random' in this
context I mean that each character is chosen independently from the same
probability distribution over the possible characters regardless of which
index it has in the string and regardless of what the other characters are
(IID). In this case the probability that comparison terminates at the jth
character decreases exponentially with j. This means that for large strings
the expected number of character comparisons is independent of the number of
characters in the string as the probability of reaching the later parts of the
string is too small for them to have any significant effect. This is provable
and applies regardless of how many possible characters there are and whether
or not each character is equally likely (except for the pathological case
where one character has a probability of 1).

For strings from 'real' distributions it is harder to make statements about
the 'average case' and it is possible to construct situations where the
comparison would regularly need to compare a common prefix. However, to get
asymptotic performance worse than O(1) it is not sufficient to say that there
may be a common prefix such as 'Herr' in the distribution of strings. It is
necessary that, somehow, the common prefix is likely to grow as the size of
the strings grows.

For example, the set of all strings of length N whose first N//2 characters
are always 'a' and whose remaining characters are chosen IID would lead to
O(N) performance. This is why the file paths example chosen at the start of
this thread is a good one. If a program is dealing with a number of large
strings representing file paths then it is not uncommon that many of those
paths would refer to files in the same deeply nested directory and hence
compare equal for a significant number of characters. This could lead to
average case O(N) performance.

I think it's appropriate to compare string comparison with dict insertion:
Best case O(1) (no hash collisions)
Worst case O(N) (collides with every key)
Average case O(1) (as long as you don't use pathological data)

The only difference with string comparison is that there are some conceivable,
non-malicious cases where the pathological data can occur (such as with file
paths). However, I suspect that out of all the different uses of python
strings these cases are a small minority.

In saying that, it's not inconceivable that someone could exploit string
comparison by providing pathological data to make normally O(1) operations
behave as O(N). If I understand correctly it was precisely this kind of
problem with dict insertion/lookup that lead to the recent hash-seed security
update.

Oscar

Steven D'Aprano

unread,

Sep 6, 2012, 4:33:14 AM9/6/12

to

On Wed, 05 Sep 2012 22:47:14 +0000, Oscar Benjamin wrote:

> In news.gmane.comp.python.general, you wrote:
>> On Wed, 05 Sep 2012 16:51:10 +0200, Johannes Bauer wrote: [...]
>>>> You are making unjustified assumptions about the distribution of
>>>> letters in the words. This might be a list of long chemical compounds
>>>> where the words typically differ only in their suffix. It might be a
>>>> list of people with titles:
>>>
>>> Actually, I'm not. I'm stating exactly what assumptions I'm making to
>>> get my calculation. I'm comparing *random* character strings or
>>> bitstrings.
>>
>> Excuse me, you are not. You are comparing English words which are
>> highly non-random.
>
> Evidently we have different understandings of what 'random' means. I
> don't think it's unreasonable to say that strings drawn uniformly from
> the set of all strings in the English language (having a given number of
> characters) is random. The distribution is not uniform over the set of
> all possible character strings but it is still random. I think Johannes
> deliberately chose these strings to emulate a particular kind of 'real'
> distribution of strings that might occur in practise.

Of course for some "real" applications, your strings being compared will
be English words. And for other applications, they will be strings of
numeric digits uniformly distributed between 20000 and 30000. Or taken
from a Gaussian distribution centered around 7000. Or strings made up of
deeply nested sets of parentheses (((((( ... )))))). Whichever special
distribution of strings you pick, I don't think calling them "random" is
terribly useful in context, even if they are generated randomly from some
non-uniform distribution.

But you know, it really doesn't make a difference. Equality testing will
*still* be O(N) since the asymptomatic behaviour for sufficiently large
string comparisons will be bounded by O(N). Multiplicative constants
("half the string" versus "0.001 of the string") do not matter.

I may have been overly-conservative earlier when I said that on average
string equality has to compare half the characters. I thought I had
remembered that from a computer science textbook, but I can't find that
reference now, so possibly I was thinking of something else. (String
searching perhaps?). In any case, the *worst* case for string equality
testing is certainly O(N) (every character must be looked at), and the
*best* case is O(1) obviously (the first character fails to match). But
I'm not so sure about the average case. Further thought is required.

--
Steven

Dave Angel

unread,

Sep 6, 2012, 6:07:38 AM9/6/12

to Steven D'Aprano, pytho...@python.org

On 09/06/2012 04:33 AM, Steven D'Aprano wrote:
> <snip>

>
> I may have been overly-conservative earlier when I said that on
> average string equality has to compare half the characters. I thought
> I had remembered that from a computer science textbook, but I can't
> find that reference now, so possibly I was thinking of something else.
> (String searching perhaps?). In any case, the *worst* case for string
> equality testing is certainly O(N) (every character must be looked
> at), and the *best* case is O(1) obviously (the first character fails
> to match). But I'm not so sure about the average case. Further thought
> is required.

For random strings (as defined below), the average compare time is
effectively unrelated to the size of the string, once the size passes
some point.

Define random string as being a selection from a set of characters, with
replacement. So if we pick some set of characters, say 10 (or 256, it
doesn't really matter), the number of possible strings is 10**N.

The likelihood of not finding a mismatch within k characters is
(1/10)**k The likelihood of actually reaching the end of the random
string is (1/10)**N. (which is the reciprocal of the number of strings,
naturally)

If we wanted an average number of comparisons, we'd have to calculate a
series, where each term is a probability times a value for k.
sum((k * 9*10**-k) for k in range(1, 10))

Those terms very rapidly approach 0, so it's safe to stop after a few.
Looking at the first 9 items, I see a value of 1.1111111

This may not be quite right, but the value is certainly well under 2 for
a population of 10 characters, chosen randomly. And notice that N
doesn't really come into it.

--

DaveA

Oscar Benjamin

unread,

Sep 6, 2012, 7:04:19 AM9/6/12

to pytho...@python.org

It's exactly right. You can obtain this result analytically from
Johannes' formula above. Just replace 256 with 10 to get that the
expected number of comparisons is

(10/9)*(1 - 10**(-N))

The last term shows the dependence on N and is tiny even for N=9.

Oscar

Roy Smith

unread,

Sep 6, 2012, 8:13:07 AM9/6/12

to

In article <50485fca$0$29977$c3e8da3$5496...@news.astraweb.com>,

Steven D'Aprano <steve+comp....@pearwood.info> wrote:

> In any case, the *worst* case for string equality
> testing is certainly O(N) (every character must be looked at), and the
> *best* case is O(1) obviously (the first character fails to match).

The best case is O(0), if either string is empty (ducking and running).

Chris Angelico

unread,

Sep 6, 2012, 8:29:09 AM9/6/12

to pytho...@python.org

No, O(0) would be when the application decides not to compare at all.

ChrisA (also ducking)

Johannes Bauer

unread,

Sep 6, 2012, 9:37:23 AM9/6/12

to

On 05.09.2012 18:24, Steven D'Aprano wrote:
> On Wed, 05 Sep 2012 16:51:10 +0200, Johannes Bauer wrote:
> [...]
>>> You are making unjustified assumptions about the distribution of
>>> letters in the words. This might be a list of long chemical compounds
>>> where the words typically differ only in their suffix. It might be a
>>> list of people with titles:
>>
>> Actually, I'm not. I'm stating exactly what assumptions I'm making to
>> get my calculation. I'm comparing *random* character strings or
>> bitstrings.
>
> Excuse me, you are not. You are comparing English words which are highly
> non-random.

Not in my original post. If you read it again, you will clearly see that
I was talking about purely random strings. And since you like to
nitpick, I'll clarify further: I'm talking about bitstrings in which
every bit of every character has the same probability of occurence, 50%.

You then replied by mentioning probability distributions of characters
in different languages/encodings, to which I tried giving a example as
it might (and does) occur in the real world, like sorting a dictionary.

But the original point is still valid: Sorting of randomized bitstrings
is definitely not O(N), you got that wrong.

>> You, on the other hand, are making vague assumptions which you do not
>> care for formalize and yet you claim that "the number of comparisons is
>> equally likely to be 1, 2, 3, ..., N. The average then is". Without any
>> explanation for this. At all.
>
> I will accept that my explanation was not good enough, but I strongly
> disagree that I gave no explanation at all.

What possible explanation could there be that comparison of purely
random bitstrings yields an equal amount of comparisons for its length?

>>> Herr Professor Frederick Schmidt
>>> Herr Professor Frederick Wagner
>>> ...
>>
>> Is your assumtion that we're comparing words that have the common prefix
>> "Herr Professor Frederick "?
>
> No, I am pointing out that *your* assumption that most string comparisons
> will halt close to the beginning of the string is an invalid assumption.
> Your assumption only holds for some non-random strings.

Definitely not. It *especially* holds true for random strings. For
random strings with an alphabet of size n, the probability that the
first character needs to be compared is n^0, i.e. 1. The probability
that the second character needs to be compared is n^(-1). For the xth
character, it is n^(-x). This is due to the lazy abort in comparison and
it results in the average number of comparisons being extremely short no
matter the bitlength n, or O(log n).

>> it's counting the number of character comparisons.
>
> Correct. But only out of an extremely limited subset of all possible
> string comparisons, namely those very few that happen when sorting lists
> of English words using a very specific algorithm, namely timsort.

Yes, but this was to look at a real-world example (in which way more
comparisons are needed than in the random case). You first were talking
about randomized bitstrings (and so was I), then you brought up
character sets and languages (which, for the original problem, are
entirely irrelevant).

> Whatever figure you have calculated by taking this non-random selection,
> it is irrelevant to the question of the average performance of string
> equality given random strings.

Correct. I gave the estimate for random strings in my very first post.

> For the sake of simple calculations, let's pretend that there are only
> 1000 five-letter strings possible. We want to know how many character-
> comparisons are done on average when testing two random five-letter
> strings for equality. To calculate this average, you must compare every
> string to every other string, *including* itself.
>
> This gives 1000*1000 = one million equality tests to be performed. For
> each equality test, we count the number of character comparisons
> performed, sum all those counts, and divide by 1e6. That is the average
> number of char comparisons for random strings of five letters.

Easy enough. Since you didn't look at my original equation, here are
some numbers and the program attached:

64 combinations for length 6
Char comparisons: 8064
Comparison cnt : 4096
Avg comparison : 1.97

128 combinations for length 7
Char comparisons: 32512
Comparison cnt : 16384
Avg comparison : 1.98

256 combinations for length 8
Char comparisons: 130560
Comparison cnt : 65536
Avg comparison : 1.99

512 combinations for length 9
Char comparisons: 523264
Comparison cnt : 262144
Avg comparison : 2.00

1024 combinations for length 10
Char comparisons: 2095104
Comparison cnt : 1048576
Avg comparison : 2.00

2048 combinations for length 11
Char comparisons: 8384512
Comparison cnt : 4194304
Avg comparison : 2.00

4096 combinations for length 12
Char comparisons: 33546240
Comparison cnt : 16777216
Avg comparison : 2.00

Hopefully now you'll see that your assumption O(n) is dead wrong.

> But that's not what you do. First you eliminate 900 out of the 1000
> possible strings by only sampling English words -- strings like "xxoij"
> cannot possibly be selected. So you are left with the highly non-random
> subset of 10000 equality tests.

This is *exactly* what my original calculation did. Enumerate all
possibilites and divide by the total number.

Regards,
Johannes

#!/usr/bin/python3
import itertools
import random

alphabet = [ c for c in "01" ]

def cmpstr(s1, s2):
c = 0
for i in range(len(s1)):
c += 1
if s1[i] != s2[i]:
break
return c

for strlength in range(1, 16):
combinations = list(itertools.product(*([ alphabet ] * strlength)))

assert(len(combinations) == len(alphabet) ** strlength)
print("%d combinations for length %d" % (len(combinations), strlength))

compcnt = 0
comparisons = 0
for c1 in combinations:
for c2 in combinations:
comparisons += cmpstr(c1, c2)
compcnt += 1
print("Char comparisons: %d" % (comparisons))
print("Comparison cnt : %d" % (compcnt))
print("Avg comparison : %.2f" % (comparisons / compcnt))

Johannes Bauer

unread,

Sep 6, 2012, 9:43:25 AM9/6/12

to

On 06.09.2012 10:33, Steven D'Aprano wrote:

> But you know, it really doesn't make a difference. Equality testing will
> *still* be O(N) since the asymptomatic behaviour for sufficiently large
> string comparisons will be bounded by O(N). Multiplicative constants
> ("half the string" versus "0.001 of the string") do not matter.

Wrong, at least for randomized strings (i.e. every character with the
same probability). O(N) is worst-case, O(log N) is correct for
randomized strings.

Then again, since you were nitpicking about Landau notation earlier this
thread, every function bound by O(log N) is also bound by O(N), since,
as you correctly stated, it's only a upper bound. We should be talking
about the asymptotic sharp bound, i.e. capital Theta.

> I may have been overly-conservative earlier when I said that on average
> string equality has to compare half the characters. I thought I had
> remembered that from a computer science textbook, but I can't find that
> reference now, so possibly I was thinking of something else. (String
> searching perhaps?). In any case, the *worst* case for string equality
> testing is certainly O(N) (every character must be looked at), and the
> *best* case is O(1) obviously (the first character fails to match). But
> I'm not so sure about the average case. Further thought is required.

Yes, worst-case is O(N), best case O(1). Average is O(n log n).

Best regards,

Dave Angel

unread,

Sep 6, 2012, 10:23:45 AM9/6/12

to Johannes Bauer, pytho...@python.org

On 09/06/2012 09:43 AM, Johannes Bauer wrote:
> <snip>

> Yes, worst-case is O(N), best case O(1). Average is O(n log n).

Can't see how you came up with an average of n log(n). Fourteen minutes
before you made this post, you demonstrated it was less than 2 for any N.

And I previously posted that for a character set of size 10, the average
was 1.1111

--

DaveA

Johannes Bauer

unread,

Sep 6, 2012, 10:33:04 AM9/6/12

to

On 06.09.2012 16:23, Dave Angel wrote:
> On 09/06/2012 09:43 AM, Johannes Bauer wrote:
>> <snip>
>> Yes, worst-case is O(N), best case O(1). Average is O(n log n).
>
> Can't see how you came up with an average of n log(n). Fourteen minutes
> before you made this post, you demonstrated it was less than 2 for any N.

O(log n) is what I meant, sorry! Damnit.

> And I previously posted that for a character set of size 10, the average
> was 1.1111

For any given string n and and alphabet size l, the average is:

sum(i = 0..n) (1 / (l ^ n))

So with larger word length, the total complexity constantly increases.
The amount by which it increases however is shrinking exponentially with
the word length. Therefore O(log n).

Sorry about the n log n mistake, argh.

Best regards & thanks for the correction,

Johannes Bauer

unread,

Sep 6, 2012, 10:34:10 AM9/6/12

to

On 06.09.2012 15:43, Johannes Bauer wrote:

> Wrong, at least for randomized strings (i.e. every character with the
> same probability). O(N) is worst-case, O(log N) is correct for
> randomized strings.

^^
Here I write the right thing. Then further below...

> Yes, worst-case is O(N), best case O(1). Average is O(n log n).

...I write the wrong thing. O(log n) is what I meant, as Dave correctly
noticed.

Chris Angelico

unread,

Sep 6, 2012, 10:39:33 AM9/6/12

to pytho...@python.org

On Thu, Sep 6, 2012 at 11:37 PM, Johannes Bauer <dfnson...@gmx.de> wrote:
> Not in my original post. If you read it again, you will clearly see that
> I was talking about purely random strings. And since you like to
> nitpick, I'll clarify further: I'm talking about bitstrings in which
> every bit of every character has the same probability of occurence, 50%.

That sort of string isn't a normal thing to be comparing, though.

Here's an idea. Someone who's doing a lot of arguing in this thread
should take Python, find the string comparison routine, and hack in
some statistics-gathering. Then run *real code* on it. Maybe use this
with one of those web frameworks and run your web site on it for an
hour or two, or fire off some real scripts you really use. Then dump
out the stats at the end. My guess: The bulk of string comparisons
that get to actually comparing byte-for-byte will end up returning
True. Most of the false comparisons will be proven earlier; if I
understand correctly, Python will check for identity (easy true) and
different lengths (easy false). But my guess could turn out to be flat
wrong. In any case, it'll be far FAR more useful than arguing from
totally random, or random word selection, or anything.

Who's game?

ChrisA

Johannes Bauer

unread,

Sep 6, 2012, 10:42:54 AM9/6/12

to

On 06.09.2012 16:23, Dave Angel wrote:

Again playing with the equations and thinking about it again. And I
completely missed your point. It wasn't about n log n vs. log n. Both
are wrong. This surprises me. I was somehow thinking about the limit of
sum (1/n), n -> infty. But since the summands are shrinking
exponentially, the limit is different.

I think the limit of the average comparisons for a given wordlength n
against infinity with alphabet size l is

l / (l - 1)

i.e. for bitstrings it's 2 and for bytestrings it's 256/255.

This would mean string comparison of randomized strings is O(1). Can
that really be true? It looks like it.

Johannes Bauer

unread,

Sep 6, 2012, 11:36:39 AM9/6/12

to

On 06.09.2012 16:39, Chris Angelico wrote:

> In any case, it'll be far FAR more useful than arguing from
> totally random, or random word selection, or anything.
>
> Who's game?

Okay, patched against Python 3.2.3: http://pastebin.com/PRRN53P6

To invoke display of the stats, compare the string "pleasedumpstats" as
LHO, i.e.:

"pleasedumpstats" < ""

Just ran it against a big script of mine which takes the stringified
IMDb text files, parses it and dumps it into a sqlite3 database.
Surprisingly little string comparisons however (the sqlite module
without doubt uses its own routines internally). Although the database
in the end has about 2 million records, these were the stats:

strCmpEq 1017
strCmpLt 2802
strCmpGt 1633
strCmpTc 16219
strCmpCc 8570

which means 5452 comparisons of which 19% were equal and the rest inequal.

Average string length is about 2.97 characters and aborted was in
average after 1.57 characters.

Maybe someone has a test which makes heavier use of string comparison. I
don't want to make up something, however, since this is (by definition)
going to be artificial.

Best regards,
Johannes

--
>> Wo hattest Du das Beben nochmal GENAU vorhergesagt?

> Zumindest nicht �ffentlich!
Ah, der neueste und bis heute genialste Streich unsere gro�en
Kosmologen: Die Geheim-Vorhersage.

Johannes Bauer

unread,

Sep 6, 2012, 11:44:06 AM9/6/12

to

On 06.09.2012 17:36, Johannes Bauer wrote:
> "pleasedumpstats" < ""

And against a XML-reading C code generator that uses mako:

strCmpEq 39670
strCmpLt 2766215
strCmpGt 2744002
strCmpTc 37430821
strCmpCc 14048656

Compared strings: 5549887
Equal: 0.7%
Average wordlength: 6.74 chars
Average comparelength: 2.53 chars

Dave Angel

unread,

Sep 6, 2012, 11:54:58 AM9/6/12

to Johannes Bauer, pytho...@python.org

On 09/06/2012 10:42 AM, Johannes Bauer wrote:
> On 06.09.2012 16:23, Dave Angel wrote:
>> On 09/06/2012 09:43 AM, Johannes Bauer wrote:
>>> <snip>
>>> Yes, worst-case is O(N), best case O(1). Average is O(n log n).
>> Can't see how you came up with an average of n log(n). Fourteen minutes
>> before you made this post, you demonstrated it was less than 2 for any N.
>>
>> And I previously posted that for a character set of size 10, the average
>> was 1.1111
> Again playing with the equations and thinking about it again. And I
> completely missed your point. It wasn't about n log n vs. log n. Both
> are wrong. This surprises me. I was somehow thinking about the limit of
> sum (1/n), n -> infty. But since the summands are shrinking
> exponentially, the limit is different.
>
> I think the limit of the average comparisons for a given wordlength n
> against infinity with alphabet size l is
>
> l / (l - 1)
>
> i.e. for bitstrings it's 2 and for bytestrings it's 256/255.
>
> This would mean string comparison of randomized strings is O(1). Can
> that really be true? It looks like it.
>

(Just lost the internet in a storm, so I'm not sure how long it'll be
before this sends.)

Thanks, that was exactly my point. Since el is at least 2, the average
number of comparisons is no larger than 2, for any value of N. That's
why, when I'm visually comparing MD5 values, I usually only look at the
first few, and last few hex digits.

However, Chris Angelico (at 10:39) pointed out again that totally random
strings aren't real-world equivalent.

He has now proposed that somebody measure real-world cases here, by
patching the string comparison to gather statistics. I think that's
beyond me at this point. I only joined this thread when the cases under
study were well-defined random, and therefore predictable. <g>

--

DaveA

Steven D'Aprano

unread,

Sep 7, 2012, 12:06:42 AM9/7/12

to

On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel wrote:

> On 09/06/2012 04:33 AM, Steven D'Aprano wrote:
>> <snip>
>>
>> I may have been overly-conservative earlier when I said that on average
>> string equality has to compare half the characters. I thought I had
>> remembered that from a computer science textbook, but I can't find that
>> reference now, so possibly I was thinking of something else. (String
>> searching perhaps?). In any case, the *worst* case for string equality
>> testing is certainly O(N) (every character must be looked at), and the
>> *best* case is O(1) obviously (the first character fails to match). But
>> I'm not so sure about the average case. Further thought is required.
>
> For random strings (as defined below), the average compare time is
> effectively unrelated to the size of the string, once the size passes
> some point.

Okay, I've spent some time looking for the reference, and can't find it,
so I'm now sure I must have been conflating it with something else.

After further thought, and giving consideration to the arguments given by
people here, I'm now satisfied to say that for equal-length strings,
string equality is best described as O(N).

1) If the strings are equal, a == b will always compare all N
characters in each string.

2) If the strings are unequal, a == b will *at worst* compare
all N characters.

3) Following usual practice in this field, worst case is the
one which conventionally is meant when discussing Big Oh
behaviour. See, for example, "Introduction To Algorithms"
by Cormen, Leiserson and Rivest.

Also of some interest is the best case: O(1) for unequal strings (they
differ at the first character) and O(N) for equal strings.

Also of interest is the case that has caused the majority of the
discussion, the average case. I am now satisfied that the average number
of comparisons for unequal strings is O(1). To be precise, it is bounded
below by 1 comparison (you always have to compare at least one pair of
characters) and bounded above by 2 comparisons.

(I'm talking about the average here -- the actual number of comparisons
can range all the way up to N, but the average is <= 2.)

If I've done the maths right, the exact value for the average is:

((M-1)*sum( (N-i)*M**i for i in range(0, N) ) + N)/(M**N)

for random strings of length N taken from an alphabet of size M.

For M = 2, that average approaches but never exceeds 2 as N increases;
for M = 3, the average approaches 1.5, for M = 4 it approaches 1.333...
and so forth.

Thanks to all who contributed.

--
Steven

Steven D'Aprano

unread,

Sep 7, 2012, 12:07:23 AM9/7/12

to

On Fri, 07 Sep 2012 00:39:33 +1000, Chris Angelico wrote:

> On Thu, Sep 6, 2012 at 11:37 PM, Johannes Bauer <dfnson...@gmx.de>
> wrote:
>> Not in my original post. If you read it again, you will clearly see
>> that I was talking about purely random strings. And since you like to
>> nitpick, I'll clarify further: I'm talking about bitstrings in which
>> every bit of every character has the same probability of occurence,
>> 50%.
>
> That sort of string isn't a normal thing to be comparing, though.
>
> Here's an idea. Someone who's doing a lot of arguing in this thread
> should take Python, find the string comparison routine, and hack in some
> statistics-gathering. Then run *real code* on it.

Where's the fun in that?

:-P

--
Steven

Oscar Benjamin

unread,

Sep 7, 2012, 3:10:16 PM9/7/12

to pytho...@python.org

On 2012-09-07, Steven D'Aprano <steve+comp....@pearwood.info> wrote:
> <snip>

>
> After further thought, and giving consideration to the arguments given by
> people here, I'm now satisfied to say that for equal-length strings,
> string equality is best described as O(N).
>
> 1) If the strings are equal, a == b will always compare all N
> characters in each string.
>
> 2) If the strings are unequal, a == b will *at worst* compare
> all N characters.
>
> 3) Following usual practice in this field, worst case is the
> one which conventionally is meant when discussing Big Oh
> behaviour. See, for example, "Introduction To Algorithms"
> by Cormen, Leiserson and Rivest.

Would you say, then, that dict insertion is O(N)?

>
> Also of some interest is the best case: O(1) for unequal strings (they
> differ at the first character) and O(N) for equal strings.
>
> Also of interest is the case that has caused the majority of the
> discussion, the average case. I am now satisfied that the average number
> of comparisons for unequal strings is O(1). To be precise, it is bounded
> below by 1 comparison (you always have to compare at least one pair of
> characters) and bounded above by 2 comparisons.

I find this idea of separating into the comparison of equal strings versus the
comparison of unequal strings rather odd. If the strings you compare come from
a distribution where they are guaranteed to be equal (or unequal) then you can
just use the O(0) comparison method.

Since string comparison is only useful if the strings can be equal or unequal,
the average case depends on how often they are equal/unequal as well as the
average complexity of both. For random strings the frequency of equal strings
decreases very fast as N increases so that the comparison of random strings is
O(1).

>
> (I'm talking about the average here -- the actual number of comparisons
> can range all the way up to N, but the average is <= 2.)
>
> If I've done the maths right, the exact value for the average is:
>
> ((M-1)*sum( (N-i)*M**i for i in range(0, N) ) + N)/(M**N)

I'm not sure where the extra N comes from --------^ but otherwise good.

I would have written that as:

(1 - p) * sum(i * p**(i-1) for i in range(1, N+1))

where p is the probability of a match (1/M for M equally likely characters) or
in closed form:

⎛ N ⎞
⎝1 - p ⋅(1 + N ⋅(1 - p))⎠
─────────────────────────
1 - p

>
> for random strings of length N taken from an alphabet of size M.
>
> For M = 2, that average approaches but never exceeds 2 as N increases;
> for M = 3, the average approaches 1.5, for M = 4 it approaches 1.333...
> and so forth.

It approaches 1 / (1 - p) or, if you prefer: M / (M - 1)

Oscar

Oscar Benjamin

unread,

Sep 7, 2012, 3:40:00 PM9/7/12

to pytho...@python.org

On 2012-09-07, Oscar Benjamin <oscar.j....@gmail.com> wrote:
> On 2012-09-07, Steven D'Aprano <steve+comp....@pearwood.info> wrote:
>> <snip>
>

> Since string comparison is only useful if the strings can be equal or unequal,
> the average case depends on how often they are equal/unequal as well as the
> average complexity of both. For random strings the frequency of equal strings
> decreases very fast as N increases so that the comparison of random strings is
> O(1).
>
>>

>> (I'm talking about the average here -- the actual number of comparisons
>> can range all the way up to N, but the average is <= 2.)
>>
>> If I've done the maths right, the exact value for the average is:
>>
>> ((M-1)*sum( (N-i)*M**i for i in range(0, N) ) + N)/(M**N)
>

> I'm not sure where the extra N comes from --------^ but otherwise good.

Ok, I see it's for the case where they're equal.

Oscar

Steven D'Aprano

unread,

Sep 7, 2012, 8:55:34 PM9/7/12

to

On Fri, 07 Sep 2012 19:10:16 +0000, Oscar Benjamin wrote:

> On 2012-09-07, Steven D'Aprano <steve+comp....@pearwood.info>
> wrote:
>> <snip>
>>
>> After further thought, and giving consideration to the arguments given
>> by people here, I'm now satisfied to say that for equal-length strings,
>> string equality is best described as O(N).
>>
>> 1) If the strings are equal, a == b will always compare all N
>> characters in each string.
>>
>> 2) If the strings are unequal, a == b will *at worst* compare
>> all N characters.
>>
>> 3) Following usual practice in this field, worst case is the
>> one which conventionally is meant when discussing Big Oh behaviour.
>> See, for example, "Introduction To Algorithms" by Cormen, Leiserson
>> and Rivest.
>
> Would you say, then, that dict insertion is O(N)?

Pedantically, yes.

But since we're allowed to state (or even imply *wink*) whatever
assumptions we like, we're allowed to assume "in the absence of
significant numbers of hash collisions" and come up with amortized O(1)
for dict insertions and lookups.

(Provided, of course, that your computer has an infinite amount of
unfragmented memory and the OS never starts paging your dict to disk.
Another unstated assumption that gets glossed over when we talk about
complexity analysis -- on real world computers, for big enough N,
*everything* is O(2**N) or worse.)

Big Oh analysis, despite the formal mathematics used, is not an exact
science. Essentially, it is a way of bringing some vague order to hand-
wavy estimates of complexity, and the apparent mathematical rigour is
built on some awfully shaky foundations. But despite that, it actually is
useful.

Coming back to strings... given that in any real-world application, you
are likely to have some string comparisons on equal strings and some on
unequal strings, and more importantly you don't know which are which
ahead of time, which attitude is less likely to give you a nasty surprise
when you run your code?

"I have many millions of 100K strings to compare against other 100K
strings, and string comparisons are O(1) so that will be fast."

"I have many millions of 100K strings to compare against other 100K
strings, and string comparisons are O(N) so that will be slow, better
find another algorithm."

Remember too that "for small enough N, everything is O(1)". Getting hung
up on Big Oh is just as much a mistake as ignoring it completely.

> I find this idea of separating into the comparison of equal strings
> versus the comparison of unequal strings rather odd. If the strings you
> compare come from a distribution where they are guaranteed to be equal
> (or unequal) then you can just use the O(0) comparison method.

If you know that they're (un)equal, you don't need to call == at all.

If you know that "most" strings will be unequal, then you might be
justified in treating comparisons as O(1) "most of the time" and not
stress about the occasional slow call. But of course most of the time you
don't know this, which is why it is appropriate to treat string
comparisons as O(N) rather than O(1), since that's the upper bound.

> Since string comparison is only useful if the strings can be equal or
> unequal, the average case depends on how often they are equal/unequal as
> well as the average complexity of both. For random strings the frequency
> of equal strings decreases very fast as N increases so that the
> comparison of random strings is O(1).

But that is not an upper bound, and Big Oh analysis is strictly defined
in terms of upper bounds.

>> (I'm talking about the average here -- the actual number of comparisons
>> can range all the way up to N, but the average is <= 2.)
>>
>> If I've done the maths right, the exact value for the average is:
>>
>> ((M-1)*sum( (N-i)*M**i for i in range(0, N) ) + N)/(M**N)
>
> I'm not sure where the extra N comes from --------^ but otherwise good.

The extra N comes from the case where you compare string S with itself,
which takes exactly N comparisons.

--
Steven

Oscar Benjamin

unread,

Sep 8, 2012, 7:53:13 AM9/8/12

to pytho...@python.org

On 2012-09-08, Steven D'Aprano <steve+comp....@pearwood.info> wrote:
> On Fri, 07 Sep 2012 19:10:16 +0000, Oscar Benjamin wrote:
>
>> On 2012-09-07, Steven D'Aprano <steve+comp....@pearwood.info>
>> wrote:
>> <snip>
>>

True. I can't think of a situation where I've used string comparisons
directly in any text heavy code. Rather, I would use a dict or a set (or a
regex) and hash(str) is always O(N).

>
>
> Remember too that "for small enough N, everything is O(1)". Getting hung
> up on Big Oh is just as much a mistake as ignoring it completely.
>
>

I can't think of a situation in my own work where O(N) vs O(1) string
comparisons would cause a significant problem (except perhaps in libraries
that I use but didn't write). However, I can find a number of cases where I
compare numpy.ndarrays for equality. For example, I found

if np.all(a == b):

in some code that I recently wrote. Although np.all() short-circuits, a==b
does not so that line forces O(N) behaviour onto a situation where the average
case can be better. Unfortunately numpy doesn't seem to provide a
short-circuit equals() function. array_equal() is what I want but it does the
same as the above. In future, I'll consider using something like

def cmparray(a, b):
return a.shape == b.shape and a.dtype == b.dtype and buffer(a) == buffer(b)

to take advantage of (what I assume are) short-circuit buffer comparisons.

>> Since string comparison is only useful if the strings can be equal or
>> unequal, the average case depends on how often they are equal/unequal as
>> well as the average complexity of both. For random strings the frequency
>> of equal strings decreases very fast as N increases so that the
>> comparison of random strings is O(1).
>
> But that is not an upper bound, and Big Oh analysis is strictly defined
> in terms of upper bounds.

It is an upper bound, but it is an upper bound on the *expectation value*
assuming a particular distribution of inputs, rather than an upper bound on
all possible inputs.

>>> (I'm talking about the average here -- the actual number of comparisons
>>> can range all the way up to N, but the average is <= 2.)

The average is actually bounded by 1 / (1 - p) where p is the probability that
two characters match. This bound can be arbitrarily large as p approaches 1 as
would be the case if, say, one character was much more likely than others. The
particular assumption that you have made p = 1/M where M is the number of
characters is actually the *smallest* possible value of p. For non-uniform
real data (English words for example) p is significantly greater than 1/M but
in a strict bounds sense we should say that 1/M <= p <= 1.

Oscar

Gelonida N

unread,

Sep 8, 2012, 11:50:22 AM9/8/12

to pytho...@python.org

On 09/06/2012 10:33 AM, Steven D'Aprano wrote:
> On Wed, 05 Sep 2012 22:47:14 +0000, Oscar Benjamin wrote:
>
> I may have been overly-conservative earlier when I said that on average
> string equality has to compare half the characters. I thought I had
> remembered that from a computer science textbook, but I can't find that
> reference now, so possibly I was thinking of something else. (String
> searching perhaps?).

Yeah I think you mixed it up with searching an entry in an unsorted list
of length N

That's one of the most common N/2 cases, that one hits daily with many
straight forward implementations

Gelonida N

unread,

Sep 8, 2012, 11:52:10 AM9/8/12

to pytho...@python.org

On 09/07/2012 06:06 AM, Steven D'Aprano wrote:
> On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel wrote:
>
>
> Also of some interest is the best case: O(1) for unequal strings (they
> differ at the first character) and O(N) for equal strings.

The worst case is O(N) or N characters
the average case is O(1) or two characters.

For denial of service attacks or systems, that are NEVER allowed to fail
the worst case is important.

For most other cases the average complexity counts.

However I still wonder for how many applications the complexity of
string comparisons would be the limiting factor.

Duncan Booth

unread,

Sep 10, 2012, 4:59:37 AM9/10/12

to

and of course if you ever do find an application where that worst case
matters there's an easy way round it: just call intern() on all the strings
when they are created.

For the comparison to be the limiting factor you have to be doing a lot of
comparisons on the same string (otherwise creating the string would be the
limiting factor), so at the expense of a single dictionary insertion when
the string is created you can get guaranteed O(1) on all the comparisons.

--
Duncan Booth http://kupuguy.blogspot.com

Steven D'Aprano

unread,

Sep 10, 2012, 9:45:01 AM9/10/12

to

On Mon, 10 Sep 2012 08:59:37 +0000, Duncan Booth wrote:

> Gelonida N <gelo...@gmail.com> wrote:
>
>> On 09/07/2012 06:06 AM, Steven D'Aprano wrote:
>>> On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel wrote:
>>>
>>>
>>> Also of some interest is the best case: O(1) for unequal strings (they
>>> differ at the first character) and O(N) for equal strings.
>>
>> The worst case is O(N) or N characters the average case is O(1) or two
>> characters.
>>
>> For denial of service attacks or systems, that are NEVER allowed to
>> fail the worst case is important.
>>
>> For most other cases the average complexity counts.
>>
>> However I still wonder for how many applications the complexity of
>> string comparisons would be the limiting factor.
>>
>>
> and of course if you ever do find an application where that worst case
> matters there's an easy way round it: just call intern() on all the
> strings when they are created.

There are other good reasons for interning strings, but speeding up
"astring == bstring" is not one of them.

[steve@ando ~]$ python -m timeit -s "s = 'abc'*100000" -s \
> "t = s[:-1] + 'x'" "s == t"
1000 loops, best of 3: 910 usec per loop
[steve@ando ~]$ python -m timeit -s "s = 'abc'*100000" -s \
> "t = s[:-1] + 'x'" -s "intern(s); intern(t)" "s == t"
1000 loops, best of 3: 914 usec per loop

No significant difference.

To state the bleedin' obvious, the computational effort required to
*compare two strings* pre-supposes that those strings already exist, and
is *completely independent* of the complexity of creating the strings.

> For the comparison to be the limiting factor you have to be doing a lot
> of comparisons

Correct.

> on the same string (otherwise creating the string would be the limiting
> factor),

Not necessarily.

Look, it's really hard to come up with a realistic, non-contrived example
where string comparisons are a significant bottleneck in a non-toy
application. So let me first off say that *nearly always* you're not
going to care whether "s == t" looks at all N characters or just the
first 2 (or 20, or 100). This argument is rather academic (the best sort
of argument!). Until you start getting up into truly huge strings, we're
arguing about how many angels can dance on a CPU cache.

But for the record, in principle string comparisons *could* be the
bottleneck. Example: you have 10000 strings, which are each created once
and stored in a list. Then you iterate over the list, comparing every
string against every other string. And due to some weird vagary of the
data, the strings are nearly all equal.

(Why would you do this? I don't know, maybe it's a programmers' challenge
found on the Internet, make up your own scenario...)

Total number of strings created: 10000.
Total number of strings compared: 100000000.

The overhead of creating the strings is trivial compared to the overhead
of comparing them, and since each string is only created once anyway,
interning them is just a waste of time.

> so at the expense of a single dictionary
> insertion when the string is created you can get guaranteed O(1) on all
> the comparisons.

What interning buys you is that "s == t" is an O(1) pointer compare if
they are equal. But if s and t differ in the last character, __eq__ will
still inspect every character. There is no way to tell Python "all
strings are interned, if s is not t then s != t as well".

--
Steven

Oscar Benjamin

unread,

Sep 10, 2012, 10:06:16 AM9/10/12

to pytho...@python.org

I thought that if *both* strings were interned then a pointer comparison could
decide if they were unequal without needing to check the characters.

Have I misunderstood how intern() works?

Oscar

Chris Angelico

unread,

Sep 10, 2012, 10:26:29 AM9/10/12

to pytho...@python.org

On Tue, Sep 11, 2012 at 12:06 AM, Oscar Benjamin
<oscar.j....@gmail.com> wrote:
> On 2012-09-10, Steven D'Aprano <steve+comp....@pearwood.info> wrote:

>> What interning buys you is that "s == t" is an O(1) pointer compare if
>> they are equal. But if s and t differ in the last character, __eq__ will
>> still inspect every character. There is no way to tell Python "all
>> strings are interned, if s is not t then s != t as well".
>>
>

> I thought that if *both* strings were interned then a pointer comparison could
> decide if they were unequal without needing to check the characters.
>
> Have I misunderstood how intern() works?

In a language where _all_ strings are guaranteed to be interned (such
as Lua, I think), you do indeed gain this. Pointer inequality implies
string inequality. But when interning is optional (as in Python), you
cannot depend on that, unless there's some way of recognizing interned
strings. Of course, that may indeed be the case; a simple bit flag
"this string has been interned" would suffice, and if both strings are
interned AND their pointers differ, THEN you can be sure the strings
differ.

I have no idea whether or not CPython version X.Y.Z does this. The
value of such an optimization really depends on how likely strings are
to be interned; for instance, if the compiler automatically interns
all the names of builtins, this could be quite beneficial. Otherwise,
probably not; most Python scripts don't bother interning anything.

ChrisA

Oscar Benjamin

unread,

Sep 10, 2012, 10:32:22 AM9/10/12

to pytho...@python.org

On 2012-09-10, Chris Angelico <ros...@gmail.com> wrote:
> On Tue, Sep 11, 2012 at 12:06 AM, Oscar Benjamin
><oscar.j....@gmail.com> wrote:
>> On 2012-09-10, Steven D'Aprano <steve+comp....@pearwood.info> wrote:

>>> What interning buys you is that "s == t" is an O(1) pointer compare if
>>> they are equal. But if s and t differ in the last character, __eq__ will
>>> still inspect every character. There is no way to tell Python "all
>>> strings are interned, if s is not t then s != t as well".
>>>
>>

>> I thought that if *both* strings were interned then a pointer comparison
>> could decide if they were unequal without needing to check the characters.
>>
>> Have I misunderstood how intern() works?
>
> In a language where _all_ strings are guaranteed to be interned (such
> as Lua, I think), you do indeed gain this. Pointer inequality implies
> string inequality. But when interning is optional (as in Python), you
> cannot depend on that, unless there's some way of recognizing interned
> strings. Of course, that may indeed be the case; a simple bit flag
> "this string has been interned" would suffice, and if both strings are
> interned AND their pointers differ, THEN you can be sure the strings
> differ.
>
> I have no idea whether or not CPython version X.Y.Z does this. The
> value of such an optimization really depends on how likely strings are
> to be interned; for instance, if the compiler automatically interns
> all the names of builtins, this could be quite beneficial. Otherwise,
> probably not; most Python scripts don't bother interning anything.
>

I haven't looked at the source but my understanding was precisely that there
is an intern() bit and that not only the builtins module but all the literals
in any byte-compiled module are interned.

Oscar

Oscar Benjamin

unread,

Sep 10, 2012, 10:43:34 AM9/10/12

to pytho...@python.org

On 2012-09-10, Oscar Benjamin <oscar.j....@gmail.com> wrote:
> On 2012-09-10, Chris Angelico <ros...@gmail.com> wrote:
>> On Tue, Sep 11, 2012 at 12:06 AM, Oscar Benjamin
>><oscar.j....@gmail.com> wrote:
>>> On 2012-09-10, Steven D'Aprano <steve+comp....@pearwood.info> wrote:

>>>> What interning buys you is that "s == t" is an O(1) pointer compare if
>>>> they are equal. But if s and t differ in the last character, __eq__ will
>>>> still inspect every character. There is no way to tell Python "all
>>>> strings are interned, if s is not t then s != t as well".
>>>>
>>>

>>> I thought that if *both* strings were interned then a pointer comparison
>>> could decide if they were unequal without needing to check the characters.
>>>
>>> Have I misunderstood how intern() works?
>>
>> In a language where _all_ strings are guaranteed to be interned (such
>> as Lua, I think), you do indeed gain this. Pointer inequality implies
>> string inequality. But when interning is optional (as in Python), you
>> cannot depend on that, unless there's some way of recognizing interned
>> strings. Of course, that may indeed be the case; a simple bit flag
>> "this string has been interned" would suffice, and if both strings are
>> interned AND their pointers differ, THEN you can be sure the strings
>> differ.
>>
>> I have no idea whether or not CPython version X.Y.Z does this. The
>> value of such an optimization really depends on how likely strings are
>> to be interned; for instance, if the compiler automatically interns
>> all the names of builtins, this could be quite beneficial. Otherwise,
>> probably not; most Python scripts don't bother interning anything.
>>
>
> I haven't looked at the source but my understanding was precisely that there
> is an intern() bit and that not only the builtins module but all the literals
> in any byte-compiled module are interned.
>

s/literals/identifiers/

You can see the interned flag in the PyUnicodeObject struct here:
http://hg.python.org/cpython/file/3ffd6ad93fe4/Include/unicodeobject.h#l303

Oscar

Chris Angelico

unread,

Sep 10, 2012, 10:56:09 AM9/10/12

to pytho...@python.org

On Tue, Sep 11, 2012 at 12:43 AM, Oscar Benjamin
<oscar.j....@gmail.com> wrote:

> On 2012-09-10, Oscar Benjamin <oscar.j....@gmail.com> wrote:
>> I haven't looked at the source but my understanding was precisely that there
>> is an intern() bit and that not only the builtins module but all the literals
>> in any byte-compiled module are interned.
>>
>
> s/literals/identifiers/
>
> You can see the interned flag in the PyUnicodeObject struct here:
> http://hg.python.org/cpython/file/3ffd6ad93fe4/Include/unicodeobject.h#l303

Ah, yep, so that's there. In that case, it's possible to have that
optimization. However, I may be misreading this, but it seems the only
Unicode comparison function is a rich compare, which is unable to take
advantage of a known difference:
http://hg.python.org/cpython/file/b48ef168d8c5/Objects/unicodeobject.c#l6114

Different pointers prove the strings differ, but don't tell you which
is to be sorted earlier. You could use this if you roll your own
comparison in C; or, if you already know the strings are interned, you
can use 'is' / 'is not'. But that seems to be the extent of it.

ChrisA

Dan Goodman

unread,

Sep 10, 2012, 12:07:41 PM9/10/12

to pytho...@python.org

On 04/09/2012 03:54, Roy Smith wrote:
> Let's assume you're testing two strings for equality. You've already
> done the obvious quick tests (i.e they're the same length), and you're
> down to the O(n) part of comparing every character.
>
> I'm wondering if it might be faster to start at the ends of the strings
> instead of at the beginning? If the strings are indeed equal, it's the
> same amount of work starting from either end. But, if it turns out that
> for real-life situations, the ends of strings have more entropy than the
> beginnings, the odds are you'll discover that they're unequal quicker by
> starting at the end.

From the rest of the thread, it looks like in most situations it won't
make much difference as typically very few characters need to be
compared if they are unequal.

However, if you were in a situation with many strings which were almost
equal, the most general way to improve the situation might be store a
hash of the string along with the string, i.e. store (hash(x), x) and
then compare equality of this tuple. Almost all of the time, if the
strings are unequal the hash will be unequal. Or, as someone else
suggested, use interned versions of the strings. This is basically the
same solution but even better. In this case, your startup costs will be
higher (creating the strings) but your comparisons will always be instant.

Dan

Oscar Benjamin

unread,

Sep 10, 2012, 12:33:47 PM9/10/12

to pytho...@python.org

Computing the hash always requires iterating over all characters in the string
so is best case O(N) where string comparison is best case (and often average
case) O(1).

Also, so far as I know the hash value once computed is stored on the string
object itself [1] and used for subsequent string comparisons so there's no
need for you to do that in your code.

Oscar

[1] http://hg.python.org/cpython/file/71d94e79b0c3/Include/unicodeobject.h#l293

Dan Goodman

unread,

Sep 10, 2012, 1:32:48 PM9/10/12

to pytho...@python.org

On 10/09/2012 18:33, Oscar Benjamin wrote:
> Computing the hash always requires iterating over all characters in the string
> so is best case O(N) where string comparison is best case (and often average
> case) O(1).

Yep, but you already have O(N) costs just creating the strings in the
first place, so it's absorbed into that. It's only worth doing if you do
many comparisons.

> Also, so far as I know the hash value once computed is stored on the string
> object itself [1] and used for subsequent string comparisons so there's no
> need for you to do that in your code.

Cool, Python is very clever as always! :)

Dan

Dan Goodman

unread,

Sep 10, 2012, 1:44:46 PM9/10/12

to pytho...@python.org

On 10/09/2012 18:07, Dan Goodman wrote:
> On 04/09/2012 03:54, Roy Smith wrote:
>> Let's assume you're testing two strings for equality. You've already
>> done the obvious quick tests (i.e they're the same length), and you're
>> down to the O(n) part of comparing every character.
>>
>> I'm wondering if it might be faster to start at the ends of the strings
>> instead of at the beginning? If the strings are indeed equal, it's the
>> same amount of work starting from either end. But, if it turns out that
>> for real-life situations, the ends of strings have more entropy than the
>> beginnings, the odds are you'll discover that they're unequal quicker by
>> starting at the end.
>
> From the rest of the thread, it looks like in most situations it won't
> make much difference as typically very few characters need to be
> compared if they are unequal.
>
> However, if you were in a situation with many strings which were almost
> equal, the most general way to improve the situation might be store a
> hash of the string along with the string, i.e. store (hash(x), x) and
> then compare equality of this tuple. Almost all of the time, if the
> strings are unequal the hash will be unequal. Or, as someone else
> suggested, use interned versions of the strings. This is basically the
> same solution but even better. In this case, your startup costs will be
> higher (creating the strings) but your comparisons will always be instant.

Just had another thought about this. Although it's unlikely to be
necessary in practice since (a) it's rarely necessary at all, and (b)
when it is, hashing and optionally interning seems like the better
approach, I had another idea that would be more general. Rather than
starting from the beginning or the end, why not do something like: check
the first and last character, then the len/2 character, then the len/4,
then 3*len/4, then len/8, 3*len/8, etc. You'd need to be a bit clever
about making sure you hit every character but I'm sure someone's already
got an efficient algorithm for this. You could probably even make this
cache efficient by working on cache line length blocks. Almost certainly
entirely unnecessary, but I like the original question and it's a nice
theoretical problem.

Dan

Oscar Benjamin

unread,

Sep 10, 2012, 5:52:47 PM9/10/12

to pytho...@python.org

On 2012-09-10, Dan Goodman <dg.g...@thesamovar.net> wrote:

It's not totally theoretical in the sense that the reasoning applies to all
sequence comparisons. If you needed to compare lists of objects where the
comparison of each pair of elements was an expensive operation then you would
want to think carefully about what order you used. Also in general you can't
hash/intern all sequences.

If I was going to change the order of comparisons for all strings then I would
use a random order. This is essentially how dict gets away with claiming to
have O(1) lookup. There are sequences of inputs that can cause every possible
hash collision to occur but because the hash function acts as a kind of
randomisation filter the pathological sequences are very unlikely to occur
unless someone is going out of their way. The clever way that Python 3.3
prevents someone from even doing this on purpose is just to introduce
additional per-process randomisation.

The difference between dict lookup and string comparison is that string
comparison always compares the characters in the same order and it corresponds
to the natural ordering of the data. This means that some pefectly natural use
cases like comparing file-paths can have close to worst case behaviour. If
string/sequence comparison occurs in a random order then there can be no use
case where the likely strings would induce close to worst case behaviour
unless you really are just comparing lots of almost identical sequences.

Oscar

Duncan Booth

unread,

Sep 11, 2012, 5:41:50 AM9/11/12

to

Steven D'Aprano <steve+comp....@pearwood.info> wrote:

> But for the record, in principle string comparisons *could* be the
> bottleneck. Example: you have 10000 strings, which are each created
> once and stored in a list. Then you iterate over the list, comparing
> every string against every other string. And due to some weird vagary
> of the data, the strings are nearly all equal.
>
> (Why would you do this? I don't know, maybe it's a programmers'
> challenge found on the Internet, make up your own scenario...)
>
> Total number of strings created: 10000.
> Total number of strings compared: 100000000.

which is exactly what I meant by doing a lot of comparisons (10000) on
the same string. Perhaps if I'd phrased it more as "you have to be doing
many more comparisons than string creation operations" it would have
been clearer what I meant.

> The overhead of creating the strings is trivial compared to the
> overhead of comparing them, and since each string is only created once
> anyway, interning them is just a waste of time.

No, you created 10k strings many of which are equal and then did 10k
comparisons on each most of which found 'yes' they are equal. Interning
them would have reduced all the 'true' comparisons to an identity check
at the cost of 1 hash and 1 full comparison per string.

>> so at the expense of a single dictionary
>> insertion when the string is created you can get guaranteed O(1) on
>> all the comparisons.
>
> What interning buys you is that "s == t" is an O(1) pointer compare if
> they are equal. But if s and t differ in the last character, __eq__
> will still inspect every character. There is no way to tell Python
> "all strings are interned, if s is not t then s != t as well".

Right if the strings differ only in the last character, but the rest of
this thread has been about how, for random strings, the not-equal case
is O(1) as well.

Duncan Booth

unread,

Sep 11, 2012, 5:51:54 AM9/11/12

to

Oscar Benjamin <oscar.j....@gmail.com> wrote:

>> What interning buys you is that "s == t" is an O(1) pointer compare
>> if they are equal. But if s and t differ in the last character,
>> __eq__ will still inspect every character. There is no way to tell
>> Python "all strings are interned, if s is not t then s != t as well".
>>
>
> I thought that if *both* strings were interned then a pointer
> comparison could decide if they were unequal without needing to check
> the characters.
>
> Have I misunderstood how intern() works?
>

I don't think you've misunderstood how it work, but so far as I can see the
code doesn't attempt to short circuit the "not equal but interned" case.
The comparison code doesn't look at interning at all, it only looks for
identity as a shortcut.

Terry Reedy

unread,

Sep 11, 2012, 11:55:55 AM9/11/12

to pytho...@python.org

On 9/11/2012 6:40 AM, Oscar Benjamin wrote:
> On 11 September 2012 10:51, Duncan Booth <duncan...@invalid.invalid
> <mailto:duncan...@invalid.invalid>> wrote:
>
> Oscar Benjamin <oscar.j....@gmail.com

> It also doesn't seem to check if the hash values have been set. I guess
> the cached hash value is only used in contexts where the hash is
> explicitly desired.-

I believe the internal use of interning and hash comparison has varied
from release to release. However, the main use of string comparison is
for dict keys, especially the internal dicts for namespaces and
attributes. Since the dict lookup code needs hash values anyway, to find
slots with possible conflicts, I am sure it does not use the generic
comparison operators but starts with hash comparisons.

Terry Jan Reedy

Prasad, Ramit

unread,

Sep 13, 2012, 2:39:17 PM9/13/12

to pytho...@python.org

Dwight Hutto wrote:
> Why don' you just time it,eit lops through incrementing thmax input/

What? Without context I have no idea what this means.

Ramit

--

This email is confidential and subject to important disclaimers and
conditions including on offers for the purchase or sale of
securities, accuracy and completeness of information, viruses,
confidentiality, legal privilege, and legal entity disclaimers,
available at http://www.jpmorgan.com/pages/disclosures/email.

Dwight Hutto

unread,

Sep 13, 2012, 3:37:42 PM9/13/12

to Prasad, Ramit, pytho...@python.org

On Thu, Sep 13, 2012 at 2:39 PM, Prasad, Ramit
<ramit....@jpmorgan.com> wrote:
> Dwight Hutto wrote:
>> Why don' you just time it,eit lops through incrementing thmax input/
>
> What? Without context I have no idea what this means.
>
>
> Ramit

Why don't you read the OP:

Let's assume you're testing two strings for equality. You've already
done the obvious quick tests (i.e they're the same length), and you're
down to the O(n) part of comparing every character.

I'm wondering if it might be faster to start at the ends of the strings
instead of at the beginning? If the strings are indeed equal, it's the
same amount of work starting from either end. But, if it turns out that
for real-life situations, the ends of strings have more entropy than the
beginnings, the odds are you'll discover that they're unequal quicker by
starting at the end.

>

and this one from me:

First include len(string)/2, in order to include starting at the
center of the string, and threading/weaving by 2 processes out.

import timeit

do the the rest, and see which has the fastest time.> --
>

Why don't take the time to read the OP, and ramit in your head?

Remember that you're in the middle of a conversation where the OP is
following as it goes along, so anyone reading the entire set of
postings should get it.

But for people who just want to jump in, and assume that the only
thing that matters is one piece, without reading the entire content of
the conversation, will always have something out of context for them.

--
Best Regards,
David Hutto
CEO: http://www.hitwebdevelopment.com

Mark Lawrence

unread,

Sep 13, 2012, 3:53:55 PM9/13/12

to pytho...@python.org

On 13/09/2012 19:39, Prasad, Ramit wrote:
> Dwight Hutto wrote:
>> Why don' you just time it,eit lops through incrementing thmax input/
>
> What? Without context I have no idea what this means.
>
>
> Ramit
>

You're wasting your time, I've been described as a jackass for having
the audacity to ask for context :)

--
Cheers.

Mark Lawrence.

Dwight Hutto

unread,

Sep 13, 2012, 5:06:23 PM9/13/12

to Joshua Landau, Mark Lawrence, pytho...@python.org

On Thu, Sep 13, 2012 at 4:34 PM, Joshua Landau
<joshua.l...@gmail.com> wrote:

> On 13 September 2012 20:53, Mark Lawrence <bream...@yahoo.co.uk> wrote:
>>
>> On 13/09/2012 19:39, Prasad, Ramit wrote:
>>>
>>> Dwight Hutto wrote:
>>>>
>>>> Why don' you just time it,eit lops through incrementing thmax input/
>>>
>>>
>>> What? Without context I have no idea what this means.
>>>
>>

>> You're wasting your time, I've been described as a jackass for having the
>> audacity to ask for context :)
>
>

> I'm pretty sure you are in the wrong, acting as if what he said didn't make
> sense! Just read it, he obviously was telling you to time it, as eit lops
> are inside thmax input/ which, as you should know if you bothered to read
> the thread, is incrementing.
>
> "don'" is short for "don't", by the way.
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>

It's the fact that I consider the entire conversation the context.
Reading the OP's being the primary, and things they responded
positively to.

There are other things that get mixed up as well, like not hitting the
... in gmail, and so that content doesn't show, or hitting reply,
instead of reply all, and things getting jumbled for the others
involved in the conversation.

Then there is the problem of people saying you posted too much of the
context, or not inline with the OP, just at the end, or top posting.

I try to keep it along the line of what the OP has read, and they know
the context in which it's meant.

Mark Lawrence

unread,

Sep 13, 2012, 5:17:44 PM9/13/12

to pytho...@python.org

On 13/09/2012 21:34, Joshua Landau wrote:
> On 13 September 2012 20:53, Mark Lawrence <bream...@yahoo.co.uk> wrote:
>
>> On 13/09/2012 19:39, Prasad, Ramit wrote:
>>
>>> Dwight Hutto wrote:
>>>
>>>> Why don' you just time it,eit lops through incrementing thmax input/
>>>>
>>>
>>> What? Without context I have no idea what this means.
>>>
>>
>> You're wasting your time, I've been described as a jackass for having the
>> audacity to ask for context :)
>
>
> I'm pretty sure you are in the wrong, acting as if what he said didn't make
> sense! Just read it, he obviously was telling you to time it, as eit lops

> are inside thmax input/ which, as you should know if you *bothered to read
> the thread*, is incrementing.

>
> "don'" is short for "don't", by the way.
>
>
>

I do grovellingly apologize for my appalling breach of netiquette. I am
of course assuming that the rules have changed and that it's now my
responsibility to wade back through maybe a couple of hundred responses
on a long thread to find the context. I also guess that I'm never going
to achieve my ambition of being a pot smoking hippy CEO of a web
development company :(

--
Cheers.

Mark Lawrence.

Dwight Hutto

unread,

Sep 13, 2012, 5:35:46 PM9/13/12

to Mark Lawrence, pytho...@python.org

On Thu, Sep 13, 2012 at 5:17 PM, Mark Lawrence <bream...@yahoo.co.uk> wrote:
> On 13/09/2012 21:34, Joshua Landau wrote:
>>

>> On 13 September 2012 20:53, Mark Lawrence <bream...@yahoo.co.uk> wrote:acci sequence

>>
>>> On 13/09/2012 19:39, Prasad, Ramit wrote:
>>>
>>>> Dwight Hutto wrote:
>>>>
>>>>> Why don' you just time it,eit lops through incrementing thmax input/
>>>>>
>>>>
>>>> What? Without context I have no idea what this means.
>>>>
>>>
>>> You're wasting your time, I've been described as a jackass for having
>>> the
>>> audacity to ask for context :)
>>
>>
>>
>> I'm pretty sure you are in the wrong, acting as if what he said didn't
>> make
>> sense! Just read it, he obviously was telling you to time it, as eit lops
>> are inside thmax input/ which, as you should know if you *bothered to read
>> the thread*, is incrementing.
>>
>>
>> "don'" is short for "don't", by the way.
>>
>>
>>
>
> I do grovellingly apologize for my appalling breach of netiquette. I am of
> course assuming that the rules have changed and that it's now my
> responsibility to wade back through maybe a couple of hundred responses on a
> long thread to find the context.
I also guess that I'm never going to
> achieve my ambition of being a pot smoking hippy CEO of a web development
> company :(

>
>
> --
> Cheers.

Cheers usually implies you're an alcoholic pressing buttons, with the
half of rest of the coders on the net.

So don't give up hope, you might be able to take a medication that
doesn't impair your judgement with the side effects alcohol has,

And of course leaves you without the ability to read any of the
responses to say you were right in certain instances, so something
must be wrong with your mail reader, or your alcoholic mind.

Also, without reading the other posts, you're probably just placing in
a duplicate answer sometimes, which might make you seem like a copy
cat.

Steven D'Aprano

unread,

Sep 13, 2012, 11:39:46 PM9/13/12

to

On Thu, 13 Sep 2012 17:06:23 -0400, Dwight Hutto wrote:

> Then there is the problem of people saying you posted too much of the
> context, or not inline with the OP, just at the end, or top posting.

The solution to "you quoted too much unnecessary verbiage" is not "quote
nothing". It is quote only the parts that are relevant.

> I try to keep it along the line of what the OP has read, and they know
> the context in which it's meant.

You're assuming that people read your posts immediately after they read
the post you replied to. Always imagine that your reply will be read a
week after the post you replied to. Do you still expect the reader to
understand what you're talking about?

--
Steven

alex23

unread,

Sep 13, 2012, 11:48:38 PM9/13/12

to

On Sep 14, 5:37 am, Dwight Hutto <dwightdhu...@gmail.com> wrote:
> Why don't take the time to read the OP, and ramit in your head?

Please, don't be a dick.

Chris Angelico

unread,

Sep 14, 2012, 12:15:55 AM9/14/12

to pytho...@python.org

On Fri, Sep 14, 2012 at 1:39 PM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> You're assuming that people read your posts immediately after they read
> the post you replied to. Always imagine that your reply will be read a
> week after the post you replied to.

And a week is extremely generous too; these posts get archived on the
web. I *frequently* find myself hitting mailing list archives when
researching obscurities. This is also another good reason to post
follow-ups to the list, rather than in private email. You might never
be thanked, but somebody years down the track may find the question
and an associated answer.

ChrisA

Dwight Hutto

unread,

Sep 14, 2012, 12:46:04 AM9/14/12

to alex23, pytho...@python.org

For telling him to ramit into his head that you should read the OP?

alex23

unread,

Sep 14, 2012, 12:54:54 AM9/14/12

to

On Sep 14, 2:46 pm, Dwight Hutto <dwightdhu...@gmail.com> wrote:
> For telling him to ramit into his head that you should read the OP?

Yes. I'm not sure if it was intentionally racist, but you come across
as a bit of a dwight supremacist.

Dwight Hutto

unread,

Sep 14, 2012, 1:38:52 AM9/14/12

to alex23, pytho...@python.org

Please explain any logic whatsoever that would give you that conclusion.

Seems more like propaganda, and you're not very good at it.

I think you're referring to a play on words(ramit). Ain't I so punny.

alex23

unread,

Sep 14, 2012, 2:06:42 AM9/14/12

to

On Sep 14, 3:39 pm, Dwight Hutto <dwightdhu...@gmail.com> wrote:
> Please explain any logic whatsoever that would give you that conclusion.

Well, this:

> I think you're referring to a play on words(ramit).

Using foreign names derogatively is a common tactic of the racist.

> Ain't I so punny.

Not really, no.

Dwight Hutto

unread,

Sep 14, 2012, 4:03:59 AM9/14/12

to alex23, pytho...@python.org

>> I think you're referring to a play on words(ramit).
>
> Using foreign names derogatively is a common tactic of the racist.

Not really. But nice spin on my pun to make me look bad.

Keep trying, and maybe you'll come up with an insult/ propaganda
that's less obvious to the viewer that you're a liar, and a person who
couldn't end this with out throwing a blatant race card.

It's similar to if I said, this is real 'queer' of you to do ya big
pansy, and next you'll be calling me a homophobe.

Try harder, because no one would ever believe I was a racist, and if
they did, it's an uninformed decision based off of cherry picken
phrases out of context.

Very text book propaganda of you though.

alex23

unread,

Sep 14, 2012, 4:20:53 AM9/14/12

to

On Sep 14, 6:04 pm, Dwight Hutto <dwightdhu...@gmail.com> wrote:
> > Using foreign names derogatively is a common tactic of the racist.
>
> Not really. But nice spin on my pun to make me look bad.

It actually *is* common behaviour of racists.

> It's similar to if I said, this is real 'queer' of you to do ya big
> pansy, and next you'll be calling me a homophobe.

Well, *yes*. Because your choice of that terminology as derogatory
shows you view it as derogatory.

> Try harder, because no one would ever believe I was a racist, and if
> they did, it's an uninformed decision based off of cherry picken
> phrases out of context.

Oh, so *now* context is important.

> Very text book propaganda of you though.

I don't think that word means what you think it does.

Dwight Hutto

unread,

Sep 14, 2012, 4:53:01 AM9/14/12

to alex23, pytho...@python.org

On Fri, Sep 14, 2012 at 4:20 AM, alex23 <wuw...@gmail.com> wrote:
> On Sep 14, 6:04 pm, Dwight Hutto <dwightdhu...@gmail.com> wrote:
>> > Using foreign names derogatively is a common tactic of the racist.
>>
>> Not really. But nice spin on my pun to make me look bad.
>
> It actually *is* common behaviour of racists.
>

Not if there name is ramit. What if your name was john? I'd say I'll
be right back, I have to go take a crap on the john. It's a joke about
a name, not where it originates.

>> It's similar to if I said, this is real 'queer' of you to do ya big
>> pansy, and next you'll be calling me a homophobe.
>
> Well, *yes*. Because your choice of that terminology as derogatory
> shows you view it as derogatory.

No it was a loose analogy to show that he's just trying to use
anything he can say to slam me, and everyone can know it wasn't meant
as racist.

>> Try harder, because no one would ever believe I was a racist, and if
>> they did, it's an uninformed decision based off of cherry picken
>> phrases out of context.
>
> Oh, so *now* context is important.

Never said it wasn't. The whole conversation to me is the context, and
the OP usually follows it, and that's who it is usually intended for.

>> Very text book propaganda of you though.
>
> I don't think that word means what you think it does.

from wiki:

Propaganda is a form of communication that is aimed at influencing the
attitude of a community toward some cause or position. Propaganda is
usually repeated and dispersed over a wide variety of media in order
to create the chosen result in audience attitudes.

He wants people to think I'm a racist, and you now want to see that in
me as well. It seems it has propagated, and that I know exactly what
it means.

And all over a silly post that had too little content, at the time, to
have what I said require any other posts, especially if the OP is
following the conversation.

> http://mail.python.org/mailman/listinfo/python-list

Steven D'Aprano

unread,

Sep 14, 2012, 6:16:31 AM9/14/12

to

On Fri, 14 Sep 2012 01:20:53 -0700, alex23 wrote:
> On Sep 14, 6:04 pm, Dwight Hutto <dwightdhu...@gmail.com> wrote:

[snip]

Please don't feed the trolls.

--
Steven

alex23

unread,

Sep 14, 2012, 6:26:08 AM9/14/12

to

On Sep 14, 6:53 pm, Dwight Hutto <dwightdhu...@gmail.com> wrote:
> Not if there name is ramit. What if your name was john? I'd say I'll
> be right back, I have to go take a crap on the john. It's a joke about
> a name, not where it originates.

I'd recommend reading up on white privilege but I'm pretty sure it'd
be a wasted suggestion.

> >> It's similar to if I said, this is real 'queer' of you to do ya big
> >> pansy, and next you'll be calling me a homophobe.
>
> > Well, *yes*. Because your choice of that terminology as derogatory
> > shows you view it as derogatory.
>
> No it was a loose analogy

You can't use something as an example to support your case, and then
dismiss it as "a loose analogy" when it undermines it.

> he's just trying to use anything he can say to slam me everyone

> can know it wasn't meant as racist.

The "anything" I'm "using" is what *you* have said. I'm not trying to
"slam" you, I don't even know who you are. I just have a very short
fuse for rudeness and an even shorter one for racism, even if it *was*
intended in a "hyuck hyuck, I'm so punny" way. Ignorant racism is
still racism.

> The whole conversation to me is the context

And yet:

> He wants people to think I'm a racist, and you now want to see that in
> me as well. It seems it has propagated, and that I know exactly what
> it means.

Again, so much for context. There is no "he" and me, I'm the person
who made the original accusation and then followed up on it. That's
not propagation, so it's not propaganda.

Now, someone who starts a new thread to have a conversation *with
themselves* in some bizarre piece of performance art that seems
intended to brand as petty the *requests that he actually follow list
etiquette*...*that* is someone I'd consider a propagandist.

Dwight Hutto

unread,

Sep 14, 2012, 7:36:24 AM9/14/12

to alex23, pytho...@python.org

> I'd recommend reading up on white privilege but I'm pretty sure it'd
> be a wasted suggestion.

Not really, I tend to like interdisciplinary study. But I'm a little
of everything if you like Darwin.

>
>> >> It's similar to if I said, this is real 'queer' of you to do ya big
>> >> pansy, and next you'll be calling me a homophobe.
>>
>> > Well, *yes*. Because your choice of that terminology as derogatory
>> > shows you view it as derogatory.
>>
>> No it was a loose analogy
>
> You can't use something as an example to support your case, and then
> dismiss it as "a loose analogy" when it undermines it.
>

How did it undermine it? It's the same thing, just transferred to
another well known group subject to violent bigotry.

I used a play on words in a response, and because his names foreign
that makes me a racist. There's no logic in that, other than to bring
me down, and use the worst thing you can say...playing the race card.

>> he's just trying to use anything he can say to slam me everyone
>> can know it wasn't meant as racist.
>
> The "anything" I'm "using" is what *you* have said. I'm not trying to
> "slam" you, I don't even know who you are. I just have a very short
> fuse for rudeness and an even shorter one for racism,

It wasn't rude in terms of these things that have been said about me.
Then lengthen your fuse, because I'm not a racist, I just play with
words a lot, including foreign names.

even if it *was*
> intended in a "hyuck hyuck, I'm so punny" way. Ignorant racism is
> still racism.

Still trying to propagate a thought that I'm racist based on a guy who
for all I know is white, and uses the A.K.A ramit.

I have no freakin clue if ramit is an ethnic name or nickname...we're
on the internet tweedledick(badumchee, and hyuck,hyuck,hyuck)

>
>> The whole conversation to me is the context
>
> And yet:
>
>> He wants people to think I'm a racist, and you now want to see that in
>> me as well. It seems it has propagated, and that I know exactly what
>> it means.
>
> Again, so much for context. There is no "he" and me, I'm the person
> who made the original accusation and then followed up on it. That's
> not propagation, so it's not propaganda.

You're still trying to prove the point, when we all know I'm not a racist.

>
> Now, someone who starts a new thread to have a conversation *with
> themselves* in some bizarre piece of performance art that seems
> intended to brand as petty the *requests that he actually follow list
> etiquette*...*that* is someone I'd consider a propagandist.

> --

You mean like you taking over this thread to call me a racist, and go
on ad nauseam about it, when everyone can see what you're doing is
trying to prove a point you know is wrong?

Go back to debate 101, and flunk your professor if he passed you.

Dwight Hutto

unread,

Sep 14, 2012, 7:43:46 AM9/14/12

to Steven D'Aprano, pytho...@python.org

[snip]

> Please don't feed the trolls.

You're down here under the bridge with the rest of us trolls too, Steven. 24/7

Prasad, Ramit

unread,

Sep 14, 2012, 5:32:26 PM9/14/12

to pytho...@python.org

Dwight Hutto wrote:
> On Thu, Sep 13, 2012 at 5:17 PM, Mark Lawrence <bream...@yahoo.co.uk>
> wrote:
> > On 13/09/2012 21:34, Joshua Landau wrote:
> >>
> >> On 13 September 2012 20:53, Mark Lawrence <bream...@yahoo.co.uk>
> wrote:acci sequence
> >>
> >>> On 13/09/2012 19:39, Prasad, Ramit wrote:
> >>>
> >>>> Dwight Hutto wrote:
> >>>>
> >>>>> Why don' you just time it,eit lops through incrementing thmax input/
> >>>>>
> >>>>
> >>>> What? Without context I have no idea what this means.
> >>>>
> >>>
> >>> You're wasting your time, I've been described as a jackass for having
> >>> the
> >>> audacity to ask for context :)
> >>

Mark,
Sometimes it is the style in which something is asked, and not what
specifically is asked. ;)

> >>
> >>
> >> I'm pretty sure you are in the wrong, acting as if what he said didn't
> >> make
> >> sense! Just read it, he obviously was telling you to time it, as eit lops
> >> are inside thmax input/ which, as you should know if you *bothered to read
> >> the thread*, is incrementing.
> >>
> >>
> >> "don'" is short for "don't", by the way.
> >
> > I do grovellingly apologize for my appalling breach of netiquette. I am of
> > course assuming that the rules have changed and that it's now my
> > responsibility to wade back through maybe a couple of hundred responses on a
> > long thread to find the context.
> I also guess that I'm never going to
> > achieve my ambition of being a pot smoking hippy CEO of a web development
> > company :(
> > --
> > Cheers.

> Cheers usually implies you're an alcoholic pressing buttons, with the
> half of rest of the coders on the net.

Dwight/David:
Not necessarily true if you happen to be from the UK (or maybe just England).
It seems to be a cultural signature equivalent to "Thanks"--at least in the
UK (not sure about other places).

[snip]

Ramit

Prasad, Ramit

unread,

Sep 14, 2012, 6:43:36 PM9/14/12

to pytho...@python.org

Dwight Hutto wrote:
> On Fri, Sep 14, 2012 at 4:20 AM, alex23 <wuw...@gmail.com> wrote:

> > On Sep 14, 6:04 pm, Dwight Hutto <dwightdhu...@gmail.com> wrote:
> >> > Using foreign names derogatively is a common tactic of the racist.
> >>
> >> Not really. But nice spin on my pun to make me look bad.
> >
> > It actually *is* common behaviour of racists.
> >
>

> Not if there name is ramit. What if your name was john? I'd say I'll
> be right back, I have to go take a crap on the john. It's a joke about
> a name, not where it originates.

Okay, so maybe not racist but instead offensive juvenile humor?
I suppose that is better...

>
> >> It's similar to if I said, this is real 'queer' of you to do ya big
> >> pansy, and next you'll be calling me a homophobe.

Um, yes! You are using 'queer' and 'pansy' as a derogatory comparison
which falls under the definition for homophobia.

"Homophobia is a range of negative attitudes and feelings toward homosexuality or people who are identified or perceived as being lesbian, gay, bisexual or transgender (LGBT). Definitions refer variably to antipathy, contempt, prejudice, aversion, irrational fear, and hatred."
~ First two sentences on Wikipedia

> >
> > Well, *yes*. Because your choice of that terminology as derogatory
> > shows you view it as derogatory.

>

> No it was a loose analogy to show that he's just trying to use
> anything he can say to slam me, and everyone can know it wasn't meant
> as racist.

Since I was unsure myself if you were trying to be offensive or racist,
I would disagree with "everyone can know it wasn't meant as racist".

>
> >> Try harder, because no one would ever believe I was a racist, and if
> >> they did, it's an uninformed decision based off of cherry picken
> >> phrases out of context.
> >
> > Oh, so *now* context is important.
>

> Never said it wasn't. The whole conversation to me is the context, and
> the OP usually follows it, and that's who it is usually intended for.
>

[ snip ]

Steve Howell

unread,

Sep 14, 2012, 8:51:07 PM9/14/12

to

On Sep 6, 4:04 am, Oscar Benjamin <oscar.j.benja...@gmail.com> wrote:
> On Thu, 06 Sep 2012 06:07:38 -0400, Dave Angel <d...@davea.name> wrote:
> > For random strings (as defined below), the average compare time is
> > effectively unrelated to the size of the string, once the size
> passes
> > some point.
> > Define random string as being a selection from a set of characters,
> with
> > replacement. So if we pick some set of characters, say 10 (or 256,
> it
> > doesn't really matter), the number of possible strings is 10**N.
> > The likelihood of not finding a mismatch within k characters is
> > (1/10)**k The likelihood of actually reaching the end of the
> random
> > string is (1/10)**N. (which is the reciprocal of the number of
> strings,
> > naturally)
> > If we wanted an average number of comparisons, we'd have to
> calculate a
> > series, where each term is a probability times a value for k.
> > sum((k * 9*10**-k) for k in range(1, 10))
> > Those terms very rapidly approach 0, so it's safe to stop after a
> few.
> > Looking at the first 9 items, I see a value of 1.1111111
> > This may not be quite right, but the value is certainly well under
> 2 for
> > a population of 10 characters, chosen randomly. And notice that N
> > doesn't really come into it.
>
> It's exactly right. You can obtain this result analytically from
> Johannes' formula above. Just replace 256 with 10 to get that the
> expected number of comparisons is
>
> (10/9)*(1 - 10**(-N))

The math here is simple, but of course the average running time is
also driven by usage patterns.

Let's say you have two independently produced strings with N number of
random bits. The average amount of bit comparisons that it takes to
determine that the strings are different is between 1 and 2 checks for
any value of N, for the same reason that the average number of times
you need to flip a coin before either coming up heads or exhausting
your N tries is always less than 2, and pretty darn close to 2 for
even relatively small values of N.

def compute_checks_for_different_strings(n):
x = 1
p = 0.5
for i in range(0, n):
x += p * i
p = p * 0.5
print n, x

for n in [10, 100, 1000, 10000]:
compute_checks_for_different_strings(n)

10 1.9892578125
100 2.0
1000 2.0
10000 2.0

Now let's say your string comparison function is used to verify 100-
bit passwords, and 99% of the time the password is from a legit user
who types it correctly. Except for the 1% of a time when somebody fat
fingers the copy/paste of their password, every strcmp call is gonna
have to compare 100 pairs of characters, or N pairs. If the usage
pattern says that 99% of the time you're simply verifying that two
strings are equal, then the number of bits is the main driving factor
for running time.

Dwight Hutto

unread,

Sep 14, 2012, 11:10:41 PM9/14/12

to Prasad, Ramit, pytho...@python.org

On Fri, Sep 14, 2012 at 6:43 PM, Prasad, Ramit

<ramit....@jpmorgan.com> wrote:
> Dwight Hutto wrote:

>> On Fri, Sep 14, 2012 at 4:20 AM, alex2find-work-home/3 <wuw...@gmail.com> wrote:
>> > On Sep 14, 6:04 pm, Dwight Hutto <dwightdhu...@gmail.com> wrote:
>> >> > Using foreign names derogatively is a common tactic of the racist.
>> >>
>> >> Not really. But nice spin on my pun to make me look bad.
>> >
>> > It actually *is* common behaviour of racists.
>> >
>>

>> Not if there name is ramit. What if your name was john? I'd say I'll
>> be right back, I have to go take a crap on the john. It's a joke about
>> a name, not where it originates.
>
> Okay, so maybe not racist but instead offensive juvenile humor?
> I suppose that is better...

No, just shop talk. You hit me and hit I logarithmically hit back
..."butterfly effect".

>
>>
>> >> It's similar to if I said, this is real 'queer' of you to do ya big
>> >> pansy, and next you'll be calling me a homophobe.
>

> Um, yes! You are using 'queer' and 'pansy' as a derogatory comparison
> which falls under the definition for homophobia.
>
> "Homophobia is a range of negative attitudes and feelings toward homosexuality or people who are identified or perceived as being lesbian, gay, bisexual or transgender (LGBT). Definitions refer variably to antipathy, contempt, prejudice, aversion, irrational fear, and hatred."
> ~ First two sentences on Wikipedia

No, analogy, and a continued attack on a subject we know is propaganda.

>
>> >
>> > Well, *yes*. Because your choice of that terminology as derogatory
>> > shows you view it as derogatory.
>
>>

>> No it was a loose analogy to show that he's just trying to use
>> anything he can say to slam me, and everyone can know it wasn't meant
>> as racist.
>
> Since I was unsure myself if you were trying to be offensive or racist,
> I would disagree with "everyone can know it wasn't meant as racist".
>

If you're unsure if it was racist, you should err on the side of
caution. There are many nicknames on the net. That could be an ethnic
name, or an A.K.A.. Non-logical, pure speculation, that is biased that
your minority needs more anit-you.

Don't engage in conversations you're sure to lose.

>>
>> >> Try harder, because no one would ever believe I was a racist, and if
>> >> they did, it's an uninformed decision based off of cherry picken
>> >> phrases out of context.
>> >
>> > Oh, so *now* context is important.
>>

>> Never said it wasn't. The whole conversation to me is the context, and
>> the OP usually follows it, and that's who it is usually intended for.
>>

alex23

unread,

Sep 16, 2012, 9:11:49 PM9/16/12

to

On Sep 15, 1:10 pm, Dwight Hutto <dwightdhu...@gmail.com> wrote:
> On Fri, Sep 14, 2012 at 6:43 PM, Prasad, Ramit

> > Since I was unsure myself if you were trying to be offensive or racist,
> > I would disagree with "everyone can know it wasn't meant as racist".
>
> If you're unsure if it was racist, you should err on the side of
> caution.

If your comments are mistakable as racism, maybe *you* should be more
cautious and *not make them*.

Chris Angelico

unread,

Sep 17, 2012, 12:05:38 AM9/17/12

to pytho...@python.org

That applies to the most obvious examples (for instance, there's a
line in a 19th century opera that uses a six-letter word starting with
'n' to refer to a dark-skinned person - for obvious reasons, that line
is usually changed in modern performances, even though it was
descriptive and not offensive when the opera was written - like
referring to a Caucasian). However, there are many things which could
be misinterpreted as racist, and I would hate to see people leaned
hard on to make their speech entirely politically correct. Use common
sense, on both sides. Don't be offensive, and don't be offended.

ChrisA

alex23

unread,

Sep 17, 2012, 2:06:11 AM9/17/12

to

On Sep 17, 2:05 pm, Chris Angelico <ros...@gmail.com> wrote:
> I would hate to see people leaned
> hard on to make their speech entirely politically correct. Use common
> sense, on both sides. Don't be offensive, and don't be offended.

While I would like to agree, this would require there to be no power
disparities between the parties involved, which isn't always the case.
I hate the term "politically correct", it always seems to be used in
an eye-rolling, "what a load of nonsense" way, probably because I
believe there's at least some validity in the Sapir/Whorf hypothesis
that language dictates thought.

If Ramit's name was Barbie and e was told to "get back to your dream
home" or some other guff, it would be quite rightly viewed as sexist,
whether intentional or not. Such behaviour has been called out on this
list in the past, and I'm genuinely surprised to see so little
reaction to this.

Ethan Furman

unread,

Sep 17, 2012, 4:35:01 PM9/17/12

to pytho...@python.org

*plonk*

Neil Hodgson

unread,

Sep 17, 2012, 7:14:05 PM9/17/12

to

Ethan Furman:
> *plonk*

I can't work out who you are plonking. While more than one of the
posters on this thread seem worthy of a good plonk, by not including
sufficient context, you've left me feeling puzzled. Is there a guideline
for this in basic netiquette?

Neil

Ethan Furman

unread,

Sep 18, 2012, 11:12:32 AM9/18/12

to pytho...@python.org

You're right, my apologies. Dwight Hutto is the one I plonked. His
signal to noise ratio seems incredibly low.

alex23 can definitely be abrasive, but his ratio is high.

(Now I'm wondering what that says about me as far as who I agree with
and why... I'll have to think about that.)

~Ethan~

Dwight Hutto

unread,

Sep 18, 2012, 11:55:41 AM9/18/12

to Ethan Furman, pytho...@python.org

> You're right, my apologies. Dwight Hutto is the one I plonked. His signal
> to noise ratio seems incredibly low.

Is this just because of the lack of context, or because of the content
of my response to the OP following the question?

And with all the netiquette talk being passed around, why isn't this a
new thread instead of hijacking the OP's original question, and
engaging in bickering about netiquette that goes
on all the time?

And it seems redundant to argue this topic other than to comment, and
let it pass, unless there's seems to be a legitimate reason why
someone 'disobeyed' posting policy.

>
> alex23 can definitely be abrasive, but his ratio is high.
>
> (Now I'm wondering what that says about me as far as who I agree with and
> why... I'll have to think about that.)
>
> ~Ethan~

> --
> http://mail.python.org/mailman/listinfo/python-list

Dwight Hutto

unread,

Sep 18, 2012, 11:59:24 AM9/18/12

to Ethan Furman, pytho...@python.org

>>> sufficient context, you've left me feeling puzzled. Is there a guideline for
>>> this in basic netiquette?
>>

www.woodgate.org/FAQs/netiquette.html

Dwight Hutto

unread,

Sep 18, 2012, 12:17:40 PM9/18/12

to Ethan Furman, pytho...@python.org

>
> You're right, my apologies. Dwight Hutto is the one I plonked.

You can call me David. I go by my middle name.

And it seem to me I made some valid points about a few simple trimming
of postings, that didn't seem necessary in the context of a small
quick conversation.

Chris Angelico

unread,

Sep 18, 2012, 12:20:29 PM9/18/12

to pytho...@python.org

On Wed, Sep 19, 2012 at 2:17 AM, Dwight Hutto <dwight...@gmail.com> wrote:
>>
>> You're right, my apologies. Dwight Hutto is the one I plonked.
> You can call me David. I go by my middle name.

You're most often going to be addressed by the name that's given in
your post headers. In this case "David" has been reduced to an
initial, and is visible only in your email address, whereas "Dwight"
is right there in your real-name.

ChrisA

Dwight Hutto

unread,

Sep 18, 2012, 4:40:12 PM9/18/12

to Chris Angelico, pytho...@python.org

> You're most often going to be addressed by the name that's given in
> your post headers. In this case "David" has been reduced to an
> initial, and is visible only in your email address, whereas "Dwight"

My sig says David, but it was just to let him know he can call me by
my used name.

Mark Lawrence

unread,

Sep 18, 2012, 7:48:03 PM9/18/12

to pytho...@python.org

On 18/09/2012 21:40, Dwight Hutto wrote:
>> You're most often going to be addressed by the name that's given in
>> your post headers. In this case "David" has been reduced to an
>> initial, and is visible only in your email address, whereas "Dwight"
> My sig says David, but it was just to let him know he can call me by
> my used name.
>
>

Any particular him?

--
Cheers.

Mark Lawrence.