Strange behavior

light...@gmail.com

Aug 14, 2012, 11:38:15 AM

Hi, I am migrating from PHP to Python and I am slightly confused.

I am writing a function that takes a startingList, finds all the strings in the list that begin with 'x', removes those strings and puts them into an xOnlyList.

However if you run the code you will notice only one of the strings beginning with 'x' is removed from the startingList.
If I comment out 'startingList.remove(str);' the code runs with both strings beginning with 'x' being put in the xOnlyList.
Using the print statement I noticed that the second string that begins with 'x' isn't even identified by the function. Why does this happen?

def testFunc(startingList):
    xOnlyList = [];
    for str in startingList:
        if (str[0] == 'x'):
            print str;
            xOnlyList.append(str)
            startingList.remove(str) #this seems to be the problem
    print xOnlyList;
    print startingList
testFunc(['xasd', 'xjkl', 'sefwr', 'dfsews'])

#Thanks for your help!

Alain Ketterlin

Aug 14, 2012, 11:59:42 AM

light...@gmail.com writes:

> However if you run the code you will notice only one of the strings
> beginning with 'x' is removed from the startingList.

>

> def testFunc(startingList):
>     xOnlyList = [];
>     for str in startingList:
>         if (str[0] == 'x'):
>             print str;
>             xOnlyList.append(str)
>             startingList.remove(str) #this seems to be the problem
>     print xOnlyList;
>     print startingList
> testFunc(['xasd', 'xjkl', 'sefwr', 'dfsews'])
>
> #Thanks for your help!

Try with ['xasd', 'sefwr', 'xjkl', 'dfsews'] and you'll understand what
happens. Also, have a look at:

http://docs.python.org/reference/compound_stmts.html#the-for-statement

You can't modify the list you're iterating on, better use another list
to collect the result.
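To make the skipped element visible, here is a small sketch (the helper name trace_removal is made up for illustration). The for loop keeps an internal index into the list; removing the current item shifts every later item one place to the left, so the next index lands one item further on than expected.

```python
def trace_removal(items):
    """Return the items the loop actually visits while removing 'x' words."""
    visited = []
    for word in items:
        visited.append(word)
        if word.startswith('x'):
            items.remove(word)  # shifts all later items left by one
    return visited

words = ['xasd', 'xjkl', 'sefwr', 'dfsews']
visited = trace_removal(words)
print(visited)  # 'xjkl' is never visited
print(words)    # so 'xjkl' is still in the list
```

With the reordered list ['xasd', 'sefwr', 'xjkl', 'dfsews'] the item that gets skipped is 'sefwr', which is harmless, so both 'x' strings happen to be removed.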

-- Alain.

P/S: str is a builtin, you'd better avoid assigning to it.

Terry Reedy

Aug 14, 2012, 3:05:43 PM

to pytho...@python.org

On 8/14/2012 11:59 AM, Alain Ketterlin wrote:
> light...@gmail.com writes:
>
>> However if you run the code you will notice only one of the strings
>> beginning with 'x' is removed from the startingList.
>
>>
>> def testFunc(startingList):
>>     xOnlyList = [];
>>     for str in startingList:
>>         if (str[0] == 'x'):
>>             print str;
>>             xOnlyList.append(str)
>>             startingList.remove(str) #this seems to be the problem
>>     print xOnlyList;
>>     print startingList
>> testFunc(['xasd', 'xjkl', 'sefwr', 'dfsews'])
>>
>> #Thanks for your help!
>
> Try with ['xasd', 'sefwr', 'xjkl', 'dfsews'] and you'll understand what
> happens. Also, have a look at:
>
> http://docs.python.org/reference/compound_stmts.html#the-for-statement
>
> You can't modify the list you're iterating on,

Except he obviously did ;-).
(Changing the size of a set or dict while iterating over it raises RuntimeError.)

Indeed, people routinely *replace* items while iterating.

def squarelist(lis):
    for i, n in enumerate(lis):
        lis[i] = n*n
    return lis

print(squarelist([0,1,2,3,4,5]))
# [0, 1, 4, 9, 16, 25]

Removals can be handled by iterating in reverse. This works even with
duplicates because if the item removed is not the one tested, the one
tested gets retested.

def removeodd(lis):
    for n in reversed(lis):
        if n % 2:
            lis.remove(n)
        print(n, lis)

ll = [0,1, 5, 5, 4, 5]
removeodd(ll)
>>>
5 [0, 1, 5, 4, 5]
5 [0, 1, 4, 5]
5 [0, 1, 4]
4 [0, 1, 4]
1 [0, 4]
0 [0, 4]

> better use another list to collect the result.

If there are very many removals, a new list will be faster, even if one
needs to copy the new list back into the original: k removals from a
list of length n cost O(k*n), versus O(n) for a new list plus the copy.
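As a rough illustration of the two approaches (the function names below are invented for this sketch), the first variant calls list.remove once per odd item, and each call scans and shifts the list; the second builds the survivors in one pass and copies them back with a single slice assignment.

```python
def remove_odd_by_remove(lis):
    # Each list.remove() is a linear scan plus a shift: O(n) per removal.
    for n in reversed(lis):
        if n % 2:
            lis.remove(n)

def remove_odd_by_rebuild(lis):
    # One pass to collect the even items, one slice assignment to copy back.
    lis[:] = [n for n in lis if n % 2 == 0]

a = [0, 1, 5, 5, 4, 5]
b = list(a)
remove_odd_by_remove(a)
remove_odd_by_rebuild(b)
print(a, b)
```

Both leave the same contents in the original list object; the difference only shows up in running time as the list and the number of removals grow.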

> P/S: str is a builtin, you'd better avoid assigning to it.

Agreed. People have actually posted code doing something like

...
list = [1,2,3]
...
z = list(x)
...
and wondered and asked why it does not work.
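A minimal sketch of what goes wrong in that situation (wrapped in a made-up function, shadowing_demo, so the built-in stays usable outside it):

```python
def shadowing_demo():
    list = [1, 2, 3]          # rebinds the name 'list' inside this function
    try:
        return list('abc')    # now "calls" a list object, not the built-in type
    except TypeError as exc:
        return str(exc)

message = shadowing_demo()
print(message)
```

The error message complains that a 'list' object is not callable, which is confusing until one notices the earlier assignment.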

--
Terry Jan Reedy

Virgil Stokes

Aug 14, 2012, 3:40:10 PM

to pytho...@python.org

You might find the following useful:

def testFunc(startingList):
    xOnlyList = []; j = -1
    for xl in startingList:
        if (xl[0] == 'x'):
            xOnlyList.append(xl)
        else:
            j += 1
            startingList[j] = xl
    if j == -1:
        startingList = []
    else:
        del startingList[j:-1]

    return(xOnlyList)

testList1 = ['xasd', 'xjkl', 'sefwr', 'dfsews']
testList2 = ['xasd', 'xjkl', 'xsefwr', 'xdfsews']
testList3 = ['xasd', 'jkl', 'sefwr', 'dfsews']
testList4 = ['asd', 'jkl', 'sefwr', 'dfsews']

xOnlyList = testFunc(testList1)
print 'xOnlyList = ',xOnlyList
print 'testList = ',testList1
xOnlyList = testFunc(testList2)
print 'xOnlyList = ',xOnlyList
print 'testList = ',testList2
xOnlyList = testFunc(testList3)
print 'xOnlyList = ',xOnlyList
print 'testList = ',testList3
xOnlyList = testFunc(testList4)
print 'xOnlyList = ',xOnlyList
print 'testList = ',testList4
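A note for anyone trying the first version with other inputs: the final cleanup step looks fragile. With ['a', 'x'] the slice j:-1 deletes the kept item and leaves ['x'] behind, and when every string starts with 'x' the assignment startingList = [] only rebinds the local name, so the caller's list is unchanged. The sketch below is one possible variant, not the posted code; the name extract_x_in_place and the use of startswith are assumptions made for illustration.

```python
def extract_x_in_place(words):
    """Move 'x' words to a new list; compact the rest to the front of words."""
    x_only = []
    j = -1                      # index of the last kept (non-'x') word
    for word in words:
        if word.startswith('x'):
            x_only.append(word)
        else:
            j += 1
            words[j] = word     # always writes at or before the read position
    del words[j + 1:]           # drop the stale tail; also covers j == -1
    return x_only

ends_with_x = ['a', 'x']
all_x = ['xasd', 'xjkl']
print(extract_x_in_place(ends_with_x), ends_with_x)
print(extract_x_in_place(all_x), all_x)
```

Deleting from j + 1 to the end removes exactly the leftover items after the compacted prefix, so no special case is needed when nothing was kept.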

And here is another version using list comprehension that I prefer

testList1 = ['xasd', 'xjkl', 'sefwr', 'dfsews']
testList2 = ['xasd', 'xjkl', 'xsefwr', 'xdfsews']
testList3 = ['xasd', 'jkl', 'sefwr', 'dfsews']
testList4 = ['asd', 'jkl', 'sefwr', 'dfsews']

def testFunc2(startingList):
    return([x for x in startingList if x[0] == 'x'],
           [x for x in startingList if x[0] != 'x'])

xOnlyList,testList = testFunc2(testList1)
print xOnlyList
print testList
xOnlyList,testList = testFunc2(testList2)
print xOnlyList
print testList
xOnlyList,testList = testFunc2(testList3)
print xOnlyList
print testList
xOnlyList,testList = testFunc2(testList4)
print xOnlyList
print testList

light...@gmail.com

Aug 14, 2012, 3:20:24 PM

I got my answer by reading your posts and referring to: http://docs.python.org/reference/compound_stmts.html#the-for-statement
(particularly the shaded grey box)

I guess I should have (obviously) looked at the docs before posting here, but I'm a noob.

Thanks for your help.

Chris Angelico

Aug 14, 2012, 5:55:58 PM

to pytho...@python.org

On Wed, Aug 15, 2012 at 1:38 AM, <light...@gmail.com> wrote:
> def testFunc(startingList):
>     xOnlyList = [];
>     for str in startingList:
>         if (str[0] == 'x'):
>             print str;
>             xOnlyList.append(str)
>             startingList.remove(str) #this seems to be the problem
>     print xOnlyList;
>     print startingList
> testFunc(['xasd', 'xjkl', 'sefwr', 'dfsews'])

Other people have explained the problem with your code. I'll take this
example as a way of introducing you to one of Python's handy features
- it's an idea borrowed from functional languages, and is extremely
handy. It's called the "list comprehension", and can be looked up in
the docs under that name:

def testFunc(startingList):
    xOnlyList = [strng for strng in startingList if strng[0] == 'x']
    startingList = [strng for strng in startingList if strng[0] != 'x']
    print(xOnlyList)
    print(startingList)

It's a compact notation for building a list from another list. (Note
that I changed "str" to "strng" to avoid shadowing the built-in name
"str", as others suggested.)

(Unrelated side point: Putting parentheses around the print statements
makes them compatible with Python 3, in which 'print' is a function.
Unless something's binding you to Python 2, consider working with the
current version - Python 2 won't get any more features added to it any
more.)

Python's an awesome language. You may have to get your head around a
few new concepts as you shift thinking from PHP's, but it's well worth
while.

Chris Angelico

Steven D'Aprano

Aug 14, 2012, 8:19:55 PM

On Tue, 14 Aug 2012 21:40:10 +0200, Virgil Stokes wrote:

> You might find the following useful:
>
> def testFunc(startingList):
>     xOnlyList = []; j = -1
>     for xl in startingList:
>         if (xl[0] == 'x'):

That's going to fail if the starting list contains an empty string. Use
xl.startswith('x') instead.

>             xOnlyList.append(xl)
>         else:
>             j += 1
>             startingList[j] = xl

Very cunning, but I have to say that your algorithm fails the "is this
obviously correct without needing to study it?" test. Sometimes that is
unavoidable, but for something like this, there are simpler ways to solve
the same problem.

>     if j == -1:
>         startingList = []
>     else:
>         del startingList[j:-1]
>     return(xOnlyList)

> And here is another version using list comprehension that I prefer

> def testFunc2(startingList):
>     return([x for x in startingList if x[0] == 'x'],
>            [x for x in startingList if x[0] != 'x'])

This walks over the starting list twice, doing essentially the same thing
both times. It also fails to meet the stated requirement that
startingList is modified in place, by returning a new list instead.
Here's an example of what I mean:

py> mylist = mylist2 = ['a', 'x', 'b', 'xx', 'cx']  # two names for one list
py> result, mylist = testFunc2(mylist)
py> mylist
['a', 'b', 'cx']
py> mylist2  # should be same as mylist
['a', 'x', 'b', 'xx', 'cx']

Here is the obvious algorithm for extracting and removing words starting
with 'x'. It walks the starting list only once, and modifies it in place.
The only trick needed is list slice assignment at the end.

def extract_x_words(words):
    words_with_x = []
    words_without_x = []
    for word in words:
        if word.startswith('x'):
            words_with_x.append(word)
        else:
            words_without_x.append(word)
    words[:] = words_without_x  # slice assignment
    return words_with_x

The only downside of this is that if the list of words is so enormous
that you can fit it in memory *once* but not *twice*, this may fail. But
the same applies to the list comprehension solution.
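The difference between rebinding a name and assigning to a slice can be seen in a short sketch (the function names rebind and assign_slice are invented for this illustration):

```python
def rebind(words):
    # Creates a new list and binds the local name to it; the caller's
    # list object is never touched.
    words = [w for w in words if not w.startswith('x')]

def assign_slice(words):
    # Replaces the contents of the existing list object in place.
    words[:] = [w for w in words if not w.startswith('x')]

first = alias = ['a', 'x', 'b', 'xx', 'cx']  # two names for one list
rebind(first)
after_rebind = list(alias)   # snapshot: nothing has changed yet
assign_slice(first)
print(after_rebind)
print(alias, alias is first)
```

After the slice assignment both names still refer to the same object, and that object now holds only the words without a leading 'x'.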

--
Steven

Alain Ketterlin

Aug 15, 2012, 5:50:34 AM

Chris Angelico <ros...@gmail.com> writes:

> Other people have explained the problem with your code. I'll take this
> example as a way of introducing you to one of Python's handy features
> - it's an idea borrowed from functional languages, and is extremely
> handy. It's called the "list comprehension", and can be looked up in
> the docs under that name,
>
> def testFunc(startingList):
>     xOnlyList = [strng for strng in startingList if strng[0] == 'x']
>     startingList = [strng for strng in startingList if strng[0] != 'x']
>     print(xOnlyList)
>     print(startingList)
>
> It's a compact notation for building a list from another list. (Note
> that I changed "str" to "strng" to avoid shadowing the built-in name
> "str", as others suggested.)

Fully agree with you: list comprehension is, imo, the most useful
program construct ever. Extremely useful.

But not when it makes the program traverse the same list twice, where
one traversal is enough.

-- Alain.

Alain Ketterlin

Aug 15, 2012, 5:57:59 AM

light...@gmail.com writes:

> I got my answer by reading your posts and referring to:
> http://docs.python.org/reference/compound_stmts.html#the-for-statement
> (particularly the shaded grey box)

Note that the problem is not specific to Python (if you erase the current
element when traversing an STL list in C++ you'll run into trouble as
well, usually a crash).

> I guess I should have (obviously) looked at the doc's before posting
> here; but im a noob.

Python has several surprising features. I think it is a good idea to
take some time to read the language reference, from cover to cover
(before or after the various tutorials, depending on your background).

-- Alain.

Virgil Stokes

Aug 16, 2012, 7:18:59 AM

to pytho...@python.org

On 15-Aug-2012 02:19, Steven D'Aprano wrote:

> On Tue, 14 Aug 2012 21:40:10 +0200, Virgil Stokes wrote:
>
>> You might find the following useful:
>>
>> def testFunc(startingList):
>>     xOnlyList = []; j = -1
>>     for xl in startingList:
>>         if (xl[0] == 'x'):
>
> That's going to fail if the starting list contains an empty string. Use
> xl.startswith('x') instead.

Yes, but this was by design (tacitly assumed that startingList was both a list and non-empty).

>>             xOnlyList.append(xl)
>>         else:
>>             j += 1
>>             startingList[j] = xl
>
> Very cunning, but I have to say that your algorithm fails the "is this
> obviously correct without needing to study it?" test. Sometimes that is
> unavoidable, but for something like this, there are simpler ways to solve
> the same problem.

Sorry, but I am not sure what you mean here.

>>     if j == -1:
>>         startingList = []
>>     else:
>>         del startingList[j:-1]
>>     return(xOnlyList)
>>
>> And here is another version using list comprehension that I prefer
>>
>> def testFunc2(startingList):
>>     return([x for x in startingList if x[0] == 'x'],
>>            [x for x in startingList if x[0] != 'x'])
>
> This walks over the starting list twice, doing essentially the same thing
> both times. It also fails to meet the stated requirement that
> startingList is modified in place, by returning a new list instead.

This can meet the requirement that startingList is modified in place via the call to this function (see the attached code).

> Here's an example of what I mean:
>
> py> mylist = mylist2 = ['a', 'x', 'b', 'xx', 'cx']  # two names for one list
> py> result, mylist = testFunc2(mylist)
> py> mylist
> ['a', 'b', 'cx']
> py> mylist2  # should be same as mylist
> ['a', 'x', 'b', 'xx', 'cx']

Yes, I had a typo in my original posting --- sorry about that!


> Here is the obvious algorithm for extracting and removing words starting
> with 'x'. It walks the starting list only once, and modifies it in place.
> The only trick needed is list slice assignment at the end.
>
> def extract_x_words(words):
>     words_with_x = []
>     words_without_x = []
>     for word in words:
>         if word.startswith('x'):
>             words_with_x.append(word)
>         else:
>             words_without_x.append(word)
>     words[:] = words_without_x  # slice assignment
>     return words_with_x

Suppose words was not a list --- you have tacitly assumed that words is a list.


> The only downside of this is that if the list of words is so enormous
> that you can fit it in memory *once* but not *twice*, this may fail. But
> the same applies to the list comprehension solution.

But this is not the only downside if speed is important --- it is slower than the list comprehension method (see the results that follow).

Here is a summary of three algorithms (algorithm-1, algorithm-2, algorithm-2A) that I tested (see attached code). Note, algorithm-2A was obtained by removing the slice assignment in the above code and modifying the return as follows

def extract_x_words(words):
    words_with_x = []
    words_without_x = []
    for word in words:
        if word.startswith('x'):
            words_with_x.append(word)
        else:
            words_without_x.append(word)
    #words[:] = words_without_x  # slice assignment
    return words_with_x, words_without_x

Of course, one needs to modify the call for "in-place" update of startingList as follows:

    xOnlyList,startingList = extract_x_words(startingList)

Here is a summary of my timing results obtained for 3 different algorithms for lists with 100,000 strings of length 4 in each list:

Method                               Average (SD) time in seconds
algorithm-1 (list comprehension)     0.11630 (0.0014)
algorithm-2 (S. D'Aprano)            0.17594 (0.0014)
algorithm-2A (modified S. D'Aprano)  0.18217 (0.0023)

These values were obtained from 100 independent runs (MC simulations) on lists that contain 100,000 strings. Approximately 50% of these strings contained a leading 'x'. Note that the results show that algorithm-2 (suggested by S. D'Aprano) is approximately 51% slower than algorithm-1 (list comprehensions) and algorithm-2A (simple modification of algorithm-2) is approximately 57% slower than algorithm-1. Why is algorithm-2A slower than algorithm-2?

I would be interested in seeing code that is faster than algorithm-1 --- any suggestions are welcomed. And of course, if there are any errors in my attached code please inform me of them and I will try to correct them as soon as possible. Note, some of the code is actually irrelevant for the original "Strange behavior" post.

Have a good day!

testList4.py

Peter Otten

Aug 16, 2012, 9:02:40 AM

to pytho...@python.org

Virgil Stokes wrote:

>>> def testFunc(startingList):
>>>     xOnlyList = []; j = -1
>>>     for xl in startingList:
>>>         if (xl[0] == 'x'):
>> That's going to fail if the starting list contains an empty string. Use
>> xl.startswith('x') instead.

> Yes, but this was by design (tacitly assumed that startingList was both a
> list and non-empty).

You misunderstood: it will fail if the list contains an empty string, not if
the list itself is empty:

>>> words = ["alpha", "", "xgamma"]
>>> [word for word in words if word[0] == "x"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: string index out of range

The startswith() version:

>>> [word for word in words if word.startswith("x")]
['xgamma']

Also possible:

>>> [word for word in words if word[:1] == "x"]
['xgamma']

> def testFunc1(startingList):
>     '''
>     Algorithm-1
>     Note:
>     One should check for an empty startingList before
>     calling testFunc1 -- If this possibility exists!
>     '''
>     return([x for x in startingList if x[0] == 'x'],
>            [x for x in startingList if x[0] != 'x'])
>
>

> I would be interested in seeing code that is faster than algorithm-1

In pure Python? Perhaps the messy variant:

def test_func(words):
    nox = []
    append = nox.append
    withx = [x for x in words if x[0] == 'x' or append(x)]
    return withx, nox
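For readers puzzled by the "or append(x)" part, here is the same trick spelled out with comments. This is a sketch, not the posted code: the name split_by_prefix and the prefix parameter are invented, and startswith is used so that empty strings are handled too.

```python
def split_by_prefix(words, prefix='x'):
    rest = []
    append = rest.append          # look the bound method up only once
    # For a non-matching word the left operand is False, so append(w) runs.
    # list.append returns None, which is falsy, so that word is filtered
    # out of 'matching' while landing in 'rest' as a side effect.
    matching = [w for w in words if w.startswith(prefix) or append(w)]
    return matching, rest

result = split_by_prefix(['xasd', 'jkl', '', 'xsefwr'])
print(result)
```

The list is walked only once, at the price of a comprehension with a side effect, which is why it was introduced as "the messy variant".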

Virgil Stokes

Aug 16, 2012, 10:31:47 AM

to pytho...@python.org

Very nice Peter,

Here are the new results for timing with your method added (algorithm-3).

Method                                     Average (SD) time in seconds
algorithm-1 (list comprehension)           0.11774 (0.002968)
algorithm-2 (S. D'Aprano)                  0.17573 (0.003385)
algorithm-2A (modified S. D'Aprano)        0.18116 (0.003081)
algorithm-3 (improved list comprehension)  0.06639 (0.001728)

Algorithm-3 is 43% faster than algorithm-1. Again, the code used to obtain these results is attached.

Thanks Peter for your contribution

testList4.py

Steven D'Aprano

Aug 16, 2012, 1:40:43 PM

On Thu, 16 Aug 2012 13:18:59 +0200, Virgil Stokes wrote:

> On 15-Aug-2012 02:19, Steven D'Aprano wrote:
>> On Tue, 14 Aug 2012 21:40:10 +0200, Virgil Stokes wrote:
>>
>>> You might find the following useful:
>>>
>>> def testFunc(startingList):
>>>     xOnlyList = []; j = -1
>>>     for xl in startingList:
>>>         if (xl[0] == 'x'):
>> That's going to fail if the starting list contains an empty string. Use
>> xl.startswith('x') instead.
>
> Yes, but this was by design (tacitly assumed that startingList was both
> a list and non-empty).

As Peter already pointed out, I said it would fail if the list contains
an empty string, not if the list was empty.

>>>             xOnlyList.append(xl)
>>>         else:
>>>             j += 1
>>>             startingList[j] = xl
>>
>> Very cunning, but I have to say that your algorithm fails the "is this
>> obviously correct without needing to study it?" test. Sometimes that is
>> unavoidable, but for something like this, there are simpler ways to
>> solve the same problem.
>
> Sorry, but I am not sure what you mean here.

In a perfect world, you should be able to look at a piece of code, read
it once, and see whether or not it is correct. That is what I mean by
"obviously correct". For example, if I have a function that takes an
argument, doubles it, and prints the result:

def f1(x):
    print(2*x)

that is obviously correct. Whereas this is not:

import sys

def f2(x):
    y = (x + 5)**2 - (x + 4)**2
    sys.stdout.write(str(y - 9) + '\n')

because you have to study it to see whether or not it works correctly.

Not all programs are simple enough to be obviously correct. Sometimes you
have no choice but to write something which requires cleverness to get
the right result. But this is not one of those cases. You should almost
always prefer simple code over clever code, because the greatest expense
in programming (time, effort and money) is to make code correct.

Most code does not need to be fast. But all code needs to be correct.

[...]

> This can meet the requirement that startingList is modified in place via
> the call to this function (see the attached code).

Good grief! See, that's exactly the sort of thing I'm talking about.
Without *detailed* study of your attached code, how can I possibly know
what it does or whether it does it correctly?

Your timing code calculates the mean using a recursive algorithm. Why
don't you calculate the mean the standard way: add the numbers and divide
by the total? What benefit do you gain from a more complicated algorithm
when a simple one will do the job just as well?

You have spent a lot of effort creating a complicated, non-obvious piece
of timing code, with different random seeds for each run, and complicated
ways of calculating timing statistics... but unfortunately the most
important part of any timing test, the actual *timing*, is not done
correctly. Consequently, your code is not correct.

With an average time of a fraction of a second, none of those timing
results are trustworthy, because they are vulnerable to interference from
other processes, the operating system, and other random noise. You spend
a lot of time processing the timing results, but it is Garbage In,
Garbage Out -- the results are not trustworthy, and if they are correct,
it is only by accident.

Later in your post, you run some tests, and are surprised by the result:

> Why is algorithm-2A slower than algorithm-2?

It isn't slower. It is physically impossible, since 2A does *less* work
than 2. This demonstrates that you are actually taking a noisy
measurement: the values you get have random noise, and you don't make any
effort to minimise that noise. Hence GIGO.

The right way to test small code snippets is with the timeit module. It
is carefully written to overcome as much random noise as possible. But
even there, the authors of the timeit module are very clear that you
should not try to calculate means, let alone higher order statistics like
standard deviation. The only statistic which is trustworthy is to run as
many trials as you can afford, and select the minimum value.

So here is my timing code, which is much shorter and simpler and doesn't
try to do too much. You do need to understand the timeit.Timer class:

timeit.Timer creates a timer object; timer.repeat does the actual timing.
The specific arguments to them are not vital to understand, but you can
read the documentation if you wish to find out what they mean.

First, I define the two functions. I compare similar functions that have
the same effect. Neither modifies the input argument in place. Copy and
paste the following block into an interactive interpreter:

# Start block

def f1(startingList):
    return ([x for x in startingList if x[0] == 'x'],
            [x for x in startingList if x[0] != 'x'])

# Note that the above function is INCORRECT, it will fail if a string is
# empty; nevertheless I will use it for timing purposes anyway.

def f2(startingList):
    words_without_x = []
    words_with_x = []
    for word in startingList:
        if word.startswith('x'):
            words_with_x.append(word)
        else:
            words_without_x.append(word)
    return (words_with_x, words_without_x)

# Set up some test data. There's no point being too clever about this.
# Keep it simple.

import random
data = ['aa', 'bb', 'cb', 'xa', 'xb', 'xc']*1000000
random.shuffle(data)

# Set up two timers.
from timeit import Timer
setup = "from __main__ import data, f1, f2"
t1 = Timer("a, b = f1(data)", setup)
t2 = Timer("a, b = f2(data)", setup)

# and run the timers
best1 = min(t1.repeat(number=1, repeat=10))
best2 = min(t2.repeat(number=1, repeat=10))

# End block

On my computer, here are the results. Yours may differ.

best1: 3.5199968814849854
best2: 3.515479803085327

No significant difference. And that is to be expected: the bulk of the
time is spent building up two lists of three million items each.

So let's run it again with less data:

data = data[:10000]
best1 = min(t1.repeat(number=200, repeat=10))/200
best2 = min(t2.repeat(number=200, repeat=10))/200

which gives results:

best1: 0.0037816047668457033
best2: 0.005841898918151856

The double list comp solution is faster, but it's also incorrect -- it
fails if there is an empty string in the list. What happens if we replace
it with a version that doesn't have the empty string bug?

def f1(startingList):
    return ([x for x in startingList if x.startswith('x')],
            [x for x in startingList if not x.startswith('x')])

best1 = min(t1.repeat(number=200, repeat=10))/200
best2 = min(t2.repeat(number=200, repeat=10))/200

which gives these results:

best1: 0.008604295253753662
best2: 0.005863149166107178

So there's the first lesson: it's easy to be fast if you don't mind
writing buggy code.

Can we do better? Try this:

def f3(startingList):
    words_with_x = []
    words_without_x = []
    append_with = words_with_x.append
    append_without = words_without_x.append
    for word in iter(startingList):
        if word[:1] == 'x':
            append_with(word)
        else:
            append_without(word)
    return (words_with_x, words_without_x)

t3 = Timer('a, b = f3(data)', 'from __main__ import f3, data')
best3 = min(t3.repeat(number=200, repeat=10))/200

And the result:

best3: 0.0033271098136901855

which is even faster than your original version.

Or is it? No, I can't conclude that. The difference between the original
f1 function (0.00378s) and my f3 function (0.00332s) is too small to be
sure it is real from just ten trials of each. A better statistician than
me could probably estimate the number of trials needed to be confident
that one is better than the other.

But then, with a difference that small, who cares? In the real world, a
difference that small is lost in the noise. Because of the noise,
probably 50% of the time the slower code will finish first.

[...]

> Suppose words was not a list --- you have tacitly assumed that words is
> a list.

Actually, no I have not. I have assumed it is an iterable object, such as
a list, a tuple, or an iterator. So what? You have done the same thing.
Doing an isinstance type check at the beginning of both functions will
just slow them both down by the same amount.
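A quick sketch of that point, using a made-up function name (split_words): a version that only loops over its argument and returns two new lists accepts any iterable. Only the in-place variant, which ends with a slice assignment to its argument, needs a mutable sequence such as a list.

```python
def split_words(words):
    with_x, without_x = [], []
    for word in words:            # only requires that words can be iterated
        if word.startswith('x'):
            with_x.append(word)
        else:
            without_x.append(word)
    return with_x, without_x

from_tuple = split_words(('xa', 'b'))
from_generator = split_words(w for w in ['xa', 'b', 'xc'])
print(from_tuple)
print(from_generator)
```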

--
Steven