I recently read about augmented assignments and that (with l1, l2
being lists)
l1.extend(l2)
is more efficient than
l1 = l1 + l2
because unnecessary copy operations can be avoided. Now my question is
if there's a similar thing for breaking a list into two parts. Let's
say I want to remove from l1 everything from and including position 10
and store it in l2. Then I can write
l2 = l1[10:]
del l1[10:]
But since I'm assigning a slice the elements will be copied.
Basically, I'm looking for something like l1.pop(10,len(l1)) which
returns and removes a whole chunk of data. Is there such a thing (and
if not, why not?)
Cheers,
Martin.
The idiom is:
>>> l1, l2 = l1[:10], l1[10:]
Don't know if it's optimized or not. If it's not, it could probably
be. This is a really common idiom.
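A quick sketch of that idiom, for concreteness (the contents are just illustrative):

```python
l1 = list(range(20))

# Split at position 10: l1 keeps the head, l2 gets the tail.
l1, l2 = l1[:10], l1[10:]

print(l1)  # [0, 1, ..., 9]
print(l2)  # [10, 11, ..., 19]
```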
--
Jonathan Gardner
jgar...@jonathangardner.net
> Hello,
>
> I recently read about augmented assignments and that (with l1, l2 being
> lists)
>
> l1.extend(l2)
>
> is more efficient than
>
> l1 = l1 + l2
>
> because unnecessary copy operations can be avoided. Now my question is
> if there's a similar thing for breaking a list into two parts. Let's say
> I want to remove from l1 everything from and including position 10 and
> store it in l2. Then I can write
>
> l2 = l1[10:]
> del l1[10:]
>
> But since I'm assigning a slice the elements will be copied.
Yes, list slicing can't make any sort of assumptions about what you're
going to do next, so when you take a slice, it has to copy the data.
> Basically,
> I'm looking for something like l1.pop(10,len(l1)) which returns and
> removes a whole chunk of data. Is there such a thing (and if not, why
> not?)
Such a list pop would have to copy the data too. How else could it work?
What exactly is the problem you are trying to solve?
If you are unhappy that copy-and-remove a slice from a list takes two
lines instead of one ("won't somebody think of the shortage of
newlines!!!" *wink*), then just write a helper function and call that:
def copy_and_remove(alist, slice):
    tmp = alist[slice]
    del alist[slice]
    return tmp
l2 = copy_and_remove(l1, slice(10, None))
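A quick sanity check of that helper (repeated here so the snippet stands alone; the parameter is renamed to `sl` to avoid shadowing the built-in `slice`):

```python
def copy_and_remove(alist, sl):
    # Copy the requested slice, then delete it from the original in place.
    tmp = alist[sl]
    del alist[sl]
    return tmp

l1 = list(range(15))
l2 = copy_and_remove(l1, slice(10, None))
print(l1)  # [0, 1, ..., 9] -- mutated in place
print(l2)  # [10, 11, 12, 13, 14] -- the removed tail
```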
If you are unhappy that copying slices from lists copies data, well
that's just the way things are. Don't make the mistake of trying to
optimize your application before you actually know what bits need
optimizing.
Python lists are arrays of pointers to objects, so copying a slice is
fast: it doesn't have to copy the objects, just pointers. Deleting from
the end of the list is also quick, because you don't have to move memory,
just clear some pointers and change the length field.
Splitting such an array without copying data is, essentially, impossible.
Python lists aren't linked lists.
But as I said, most of the time copying slices doesn't matter: the time
taken is unlikely to be a serious factor in practice. If it is (and
you've profiled your code, right? you're not just guessing what you think
is fast and slow?) then there may be ways to optimize your code to avoid
needing to copy slices, but we'd need to know what you're doing with the
data.
--
Steven
> On Sat, Feb 20, 2010 at 4:55 PM, marwie <mar...@gmx.de> wrote:
[...]
>> l2 = l1[10:]
>> del l1[10:]
>>
>> But since I'm assigning a slice the elements will be copied. Basically,
>> I'm looking for something like l1.pop(10,len(l1)) which returns and
>> removes a whole chunk of data. Is there such a thing (and if not, why
>> not?)
>>
>>
> The idiom is:
>
>>>> l1, l2 = l1[:10], l1[10:]
No, that has different behaviour to what the Original Poster wants, AND
it does two copies instead of one:
(1) copy l1[:10] and l1[10:]
(2) assign the names l1 and l2 to them
(3) if, and only if, there are no other references to the original l1, it
gets deleted (garbage collected).
What the OP is doing is quite different:
(1) copy l1[:10]
(2) assign the name l2 to it
(3) resize l1 in place to the first 10 items.
What the OP wants is:
(1) assign the name l2 to l1[:10] without copying
(2) resize l1 in place to the first 10 items without affecting l2.
--
Steven
Well, to split a C array I would simply set l2 to point to l1[10] and
then change the length of l1 (which I store somewhere else). No copying
of elements needed. I would have assumed that Python can do something
like this with its internal arrays of pointers, too.
Anyway, this was more a question about coding style. I use l1.extend(l2)
or l1 += l2 rather than l1 = l1 + l2 because it's as readable and
possibly faster. I was simply wondering if something similar exists for
splitting lists.
Assuming you meant l1[10:] instead of l1[:10], that's what I wanted,
yes.
> On 21 Feb., 02:30, Steven D'Aprano <st...@REMOVE-THIS-
> cybersource.com.au> wrote:
>> Python lists are arrays of pointers to objects, so copying a slice is
>> fast: it doesn't have to copy the objects, just pointers. Deleting from
>> the end of the list is also quick, because you don't have to move
>> memory, just clear some pointers and change the length field.
>>
>> Splitting such an array without copying data is, essentially,
>> impossible. Python lists aren't linked lists.
>
> Well, to split a C array I would simply set l2 to point to l1[10] and
> then change the length of l1 (which I store somewhere else). No copying
> of elements needed. I would have assumed that python can do something
> like this with its internal arrays of pointers, too.
Python lists aren't C arrays either.
Python lists are *objects*. Everything in Python is an object.
Consequently, Python lists have a header, which includes the len. You
don't store the length somewhere else, you store it in the object: this
makes for less complexity. You can't just point l2 at an arbitrary index
in an array and expect it to work; it needs a header.
Additionally, Python lists are over-allocated so that appends are fast. A
list of (say) 1000 items might be over-allocated to (say) 1024 items, so
that you can do 24 appends before the array is full and the array needs
to be resized. This means that Python list appending is amortized O(1)
instead of O(N). You can't just change the length blindly, you need to
worry about the over-allocation.
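The over-allocation is easy to observe from Python itself: `sys.getsizeof` reports the list header plus its pointer buffer, and it grows in jumps rather than on every append. A rough sketch:

```python
import sys

lst = []
sizes = []
for _ in range(1000):
    lst.append(None)
    sizes.append(sys.getsizeof(lst))

# The buffer is resized far less often than we append: only a handful
# of distinct sizes show up across a thousand appends.
print(len(set(sizes)), "distinct sizes for", len(lst), "appends")
```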
I'm sympathetic to your concern: I've often felt offended that doing
something like this:
x = SomeReallyBigListOrString
for item in x[1:]:
    process(item)
has to copy the entire list or string (less the first item). But
honestly, I've never found a situation where it actually mattered.
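For what it's worth, the standard library does offer a copy-free way to loop over a slice: `itertools.islice` iterates lazily instead of building a new list. A sketch, with a running total standing in for whatever `process` does:

```python
from itertools import islice

x = list(range(100000))  # stand-in for SomeReallyBigListOrString
total = 0
for item in islice(x, 1, None):  # same items as x[1:], but no copy is made
    total += item
print(total)
```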
> Anyway, this was more a question about coding style. I use l1.extend(l2)
> or l1 += l2 rather than l1 = l1 + l2 because it's as readable and
> possibly faster.
I really, really hate augmented assignment for everything other than ints
and floats because you can't predict what it will do. Take this for
example:
>>> mylist = [1, 2, 3, 4]
>>> same_again = mylist
>>> mylist.append(5)
>>> same_again
[1, 2, 3, 4, 5]
mylist and same_again are the same object, so append and extend behave
predictably and simply. Rebinding with + is simple too:
>>> mylist = mylist + [-1, -2, -3]
>>> mylist
[1, 2, 3, 4, 5, -1, -2, -3]
>>> same_again
[1, 2, 3, 4, 5]
Because you have re-bound the name mylist to a new list, same_again
doesn't get modified. Nice.
But what about mylist += something else? Is it an in-place modification,
like extend and append, or a rebinding, like +? Who can remember? The
answer will depend on whether mylist is a builtin list, or a subclass.
And then there's this mystery:
>>> mylist = [1, 2, 3]
>>> mytuple = (mylist, None)
>>> mytuple[0] += [4]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
>>> mylist
[1, 2, 3, 4]
So in fact, += is *always* a rebinding, but *sometimes* it is an in-place
modification as well. Yuck.
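To spell out the difference, a small sketch:

```python
a = [1, 2]
b = a
a += [3]        # list has __iadd__: extends in place, then rebinds a to the same object
print(b)        # [1, 2, 3] -- the alias sees the change

t = (10, 20)
u = t
t += (30,)      # tuple has no __iadd__: a brand-new tuple is built and bound to t
print(u)        # (10, 20) -- the alias is untouched
```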
--
Steven
x = SomeReallyBigListOrString
for item in x.items(1):
    process(item)
For ten items, though, is it really faster to muck around with array
lengths than just copying the data over? Array copies are extremely
fast on modern processors.
Programmer time and processor time being what it is, I still think my
answer is the correct one. No, it's not what he says he wants, but it
is what he needs.
--
Jonathan Gardner
jgar...@jonathangardner.net
That's about the best you can do with Python lists.
> Basically, I'm looking for something like l1.pop(10,len(l1)) which
> returns and removes a whole chunk of data. Is there such a thing (and
> if not, why not?)
Numpy arrays can share underlying data like that when you take
slices. For instance, this probably works the way you want:
import numpy
a = numpy.array([1, 2, 3, 4, 5, 6])
b = a[:3]
c = a[3:]
None of the actual data was copied here.
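You can see the sharing directly (assuming numpy is installed): writing through one of the slices changes the original array, and each slice reports the original as its base.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])
b = a[:3]   # basic slicing returns a view, not a copy
c = a[3:]
c[0] = 99   # writes into a's underlying buffer
print(a)                          # [ 1  2  3 99  5  6]
print(b.base is a, c.base is a)   # True True
```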
Carl Banks
> On Sat, Feb 20, 2010 at 5:41 PM, Steven D'Aprano
> <st...@remove-this-cybersource.com.au> wrote:
>>
>> What the OP wants is:
>>
>> (1) assign the name l2 to l1[:10] without copying (2) resize l1 in
>> place to the first 10 items without affecting l2.
>>
>>
> For ten items, though, is it really faster to muck around with array
> lengths than just copying the data over? Array copies are extremely fast
> on modern processors.
My mistake, he actually wants l1[10:] copied. So you might be copying a
huge amount of data, not just ten items.
Or if you prefer, replace 10 by 100000000.
But in either case, I agree that *in practice* it's rarely an actual
problem.
> Programmer time and processor time being what it is, I still think my
> answer is the correct one. No, it's not what he says he wants, but it is
> what he needs.
You missed the point that your suggestion gives radically different
behaviour to what the OP indicated he is using. He mutates the list in
place, you rebind the list's name. That has *very* different semantics.
>>> a = b = [1, 2, 3, 4]
>>> del a[2:] # mutate in place
>>> b
[1, 2]
>>>
>>> a = b = [1, 2, 3, 4]
>>> a = a[2:] # rebind the name
>>> b
[1, 2, 3, 4]
The OP may not care about the difference, but if he does require the
first behaviour, then your solution doesn't help.
--
Steven
Ok, I see your point. However, other data representations might still
be able to optimize such a multi-element pop. I'm thinking of deques,
for example.
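For what it's worth, a deque does make removals at either end cheap, though it still pops one element at a time rather than detaching a whole chunk. A sketch:

```python
from collections import deque

d = deque(range(20))
head = [d.popleft() for _ in range(10)]  # each popleft is O(1); no block of pointers is shifted
print(head)      # [0, 1, ..., 9]
print(list(d))   # [10, 11, ..., 19]
```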
> I'm sympathetic to your concern: I've often felt offended that doing
> something like this:
>
> x = SomeReallyBigListOrString
> for item in x[1:]:
> process(item)
>
> has to copy the entire list or string (less the first item). But
> honestly, I've never found a situation where it actually mattered.
Good grief, it copies that, too? I assumed that the copying is at
least delayed until the assignment and that x[1:] is represented by
some wrapper that just shuffles the indices around (much like
the .indices method that MRAB suggested).
Maybe copying chunks of data really isn't as much of an issue as it
used to be (and maybe I'm getting old). The application I have in mind
has to do with symbolic computation, and expressions would be
represented by Python lists. It's too early to do any profiling (in
fact, it's at the "deciding if Python is the right language to
implement it" stage), but it will have to handle expressions with many
terms (i.e. long lists), and in the end taking these lists apart and
putting them back together in a different order is the only thing the
code will do. That explains my interest in performance issues
related to Python lists.
Anyway, thanks for your help.
Martin.
Hmm, that might be worth looking into. Thanks.
Actually, my example is a bit wrong! My suggestion should have said that
the arguments of list.items would resemble those of range:
x = SomeReallyBigListOrString
for item in x.items(1, None): # [1 : None], or just [1 : ]
    process(item)
Having a third 'stride' argument would also be possible.
When you remove 10 elements off of a list of size 1000, none of the
objects themselves are moved, but the pointers to the objects are all
moved, so k*990 bytes get moved backward, where k is the size of a
pointer (4 or 8 typically on modern computers).
There is no mechanism to advance a pointer forward when you delete or
pop from the top.
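A rough timing sketch of that asymmetry (absolute numbers will vary by machine):

```python
from timeit import timeit

setup = "lst = list(range(100000))"
# Deleting from the front shifts ~100,000 pointers back each time;
# deleting from the back just clears ten slots and shrinks the length.
front = timeit("del lst[:10]; lst.extend(range(10))", setup=setup, number=200)
back = timeit("del lst[-10:]; lst.extend(range(10))", setup=setup, number=200)
print(front, back)  # front is typically much larger
```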
I proposed the following patch to implement an advance-the-pointer
scheme, but it is unlikely to be accepted in the near future:
http://bugs.python.org/file16034/no_mem_penalty.diff
http://bugs.python.org/issue7784
You can find alternative data structures that might serve your needs
better, such as collections.deque.
If you are interested in how Python lists work internally, you mostly
want to look at listobject.c:
http://svn.python.org/view/python/trunk/Objects/listobject.c?view=markup
In particular, look at list_resize(), list_slice(), and
list_ass_slice().
>> x = SomeReallyBigListOrString
>> for item in x[1:]:
>>     process(item)
>>
>> has to copy the entire list or string (less the first item). But
>> honestly, I've never found a situation where it actually mattered.
>
> Good grief, it copies that, too? I assumed that the copying is at least
> delayed until the assignment and that x[1:] is represented by some
> wrapper that just shuffles the indices around (much like the .indices
> method that MRAB suggested).
Such complexity doesn't happen for free. It costs developer time, more
complexity means more bugs, and more runtime overhead, especially for
small lists. 99.9% of programs operate on small amounts of data, and
"small" gets bigger every year.
(I was reading one of my old Uni text books the other day, and one of the
projects was an application to index all the words in a text file. The
author decided to use disk storage rather than in-memory data structures
because he was hoping to read files up to 60,000 words, and considered it
unlikely that any PCs would have enough memory to store 60,000 words!)
Unless you actually profile the code, you're as likely to be *slowing
Python down* by such extra sophistication as you are to speed it up.
Copying blocks of pointers is really fast on modern CPUs.
As a general rule, Python aims for simplicity of implementation rather
than clever, but fragile, tricks.
> Maybe copying chunks of data really isn't that much of an issue as it
> used to be (and maybe I'm getting old). The application I have in mind
> has to do with symbolic computation, and expressions would be
> represented by python lists. It's too early to do any profiling (in
> fact, it's at the "deciding if python is the right language to implement
> it" stage), but it will have to handle expressions with many terms (i.e.
> long lists), and in the end taking these lists apart and putting them
> back together in different order is the only thing the code will do.
Are you likely to have expressions with a hundred or so MILLION terms?
If so, then you should start thinking about clever tricks to minimise
copying. Otherwise, you're engaged in premature optimization, which is
the root of all computer science evil :)
--
Steven
If you need to scale in that direction, it's time to install a database.
>> Programmer time and processor time being what it is, I still think my
>> answer is the correct one. No, it's not what he says he wants, but it is
>> what he needs.
>
> You missed the point that your suggestion gives radically different
> behaviour to what the OP indicated he is using. He mutates the list in
> place, you rebind the list's name. That has *very* different semantics.
>
>
> The OP may not care about the difference, but if he does require the
> first behaviour, then your solution doesn't help.
>
Correct.
--
Jonathan Gardner
jgar...@jonathangardner.net