Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Slow comparison between two lists

0 views
Skip to first unread message

Jani Tiainen

unread,
Oct 23, 2008, 8:03:26 AM10/23/08
to
I have rather simple 'Address' object that contains streetname,
number, my own status and x,y coordinates for it. I have two lists
both containing approximately 30000 addresses.

I've defined __eq__ method in my class like this:

def __eq__(self, other):
return self.xcoord == other.xcoord and \
self.ycoord == other.ycoord and \
self.streetname == other.streetname and \
self.streetno == other.streetno

But it turns out to be very, very slow.

Then I setup two lists:

list_external = getexternal()
list_internal = getinternal()

Now I need get all all addresses from 'list_external' that are not in
'list_internal', and mark them as "new".

I did it like this:

for addr in list_external:
if addr not in list_internal:
addr.status = 1 # New address

But in my case running that loop takes about 10 minutes. What I am
doing wrong?

--

Jani Tiainen

Hrvoje Niksic

unread,
Oct 23, 2008, 8:19:26 AM10/23/08
to
Jani Tiainen <red...@gmail.com> writes:

> for addr in list_external:
> if addr not in list_internal:
> addr.status = 1 # New address
>
> But in my case running that loop takes about 10 minutes. What I am
> doing wrong?

The nested loop takes time proportional to the product of the number
of elements in the loops. To avoid it, convert the inner loop to a
set, which can be looked up in constant time:

internal = set(list_internal)
for addr in list_external:
if addr in internal:
addr.status = 1

Peter Otten

unread,
Oct 23, 2008, 8:24:25 AM10/23/08
to
Jani Tiainen wrote:

Even if list_external and list_internal contain the same items you need
about len(list_external)*(len(list_internal)/2), or 45 million comparisons.
You can bring that down to 30000*some_small_factor if you make your
addresses hashable and put the internal ones into a dict or set

def __hash__(self):
return hash((self.xcoord, self.yccord, self.streetname, self.streetno))
def __eq__(self, other):
# as above

Then

set_internal = set(list_internal)
for addr in list_external:
if addr not in set_internal:
addr.status = 1

Note that the attributes relevant for hash and equality must not change
during this process.

Peter

bearoph...@lycos.com

unread,
Oct 23, 2008, 8:27:34 AM10/23/08
to
Hrvoje Niksic:
> internal = set(list_internal)
...

To do that the original poster may have to define a __hash__ and
__eq__ methods in his/her class.

Bye,
bearophile

Bruno Desthuilliers

unread,
Oct 23, 2008, 8:33:38 AM10/23/08
to
Jani Tiainen a écrit :

mmm... not using sets ?

Jani Tiainen

unread,
Oct 23, 2008, 8:46:05 AM10/23/08
to

Very complete answer, thank you very much.

I tried that hash approach and sets but it seemed to get wrong results
first time and it was all due my hash method.

Now it takes like 2-3 seconds to do all that stuff and result seem to
be correct. Apparently I have lot to learn about Python... :)

--

Jani Tiainen

Hrvoje Niksic

unread,
Oct 23, 2008, 9:25:02 AM10/23/08
to
bearoph...@lycos.com writes:

>> internal = set(list_internal)
> ...
>
> To do that the original poster may have to define a __hash__ and
> __eq__ methods in his/her class.

You're right. The OP states he implements __eq__, so he also needs a
matching __hash__, such as:

def __hash__(self, other):
return (hash(self.xcoord) ^ hash(self.ycoord) ^
hash(self.streetname) ^ hash(self.streetno))

bearoph...@lycos.com

unread,
Oct 23, 2008, 9:57:54 AM10/23/08
to
Hrvoje Niksic:

> You're right.  The OP states he implements __eq__, so he also needs a
> matching __hash__, such as:
>
>     def __hash__(self, other):
>         return (hash(self.xcoord) ^ hash(self.ycoord) ^
>                 hash(self.streetname) ^ hash(self.streetno))

The hash function by Otten is better because it considers the order of
the items too (while I think the xor doesn't):

return hash((self.xcoord, self.yccord, self.streetname,
self.streetno))

Bye,
bearophile

Steven D'Aprano

unread,
Oct 23, 2008, 5:49:37 PM10/23/08
to
On Thu, 23 Oct 2008 05:03:26 -0700, Jani Tiainen wrote:

> But in my case running that loop takes about 10 minutes. What I am doing
> wrong?

Others have already suggested you have a O(N**2) algorithm. Here's an
excellent article that explains more about them:

http://www.joelonsoftware.com/articles/fog0000000319.html


The article is biased towards low-level languages like C. In Python the
way to avoid them is to use sets or dicts, or to keep the list sorted and
use the bisect module.


--
Steven

0 new messages