I've defined __eq__ method in my class like this:
def __eq__(self, other):
return self.xcoord == other.xcoord and \
self.ycoord == other.ycoord and \
self.streetname == other.streetname and \
self.streetno == other.streetno
But it turns out to be very, very slow.
Then I setup two lists:
list_external = getexternal()
list_internal = getinternal()
Now I need get all all addresses from 'list_external' that are not in
'list_internal', and mark them as "new".
I did it like this:
for addr in list_external:
if addr not in list_internal:
addr.status = 1 # New address
But in my case running that loop takes about 10 minutes. What I am
doing wrong?
--
Jani Tiainen
> for addr in list_external:
> if addr not in list_internal:
> addr.status = 1 # New address
>
> But in my case running that loop takes about 10 minutes. What I am
> doing wrong?
The nested loop takes time proportional to the product of the number
of elements in the loops. To avoid it, convert the inner loop to a
set, which can be looked up in constant time:
internal = set(list_internal)
for addr in list_external:
if addr in internal:
addr.status = 1
Even if list_external and list_internal contain the same items you need
about len(list_external)*(len(list_internal)/2), or 45 million comparisons.
You can bring that down to 30000*some_small_factor if you make your
addresses hashable and put the internal ones into a dict or set
def __hash__(self):
return hash((self.xcoord, self.yccord, self.streetname, self.streetno))
def __eq__(self, other):
# as above
Then
set_internal = set(list_internal)
for addr in list_external:
if addr not in set_internal:
addr.status = 1
Note that the attributes relevant for hash and equality must not change
during this process.
Peter
To do that the original poster may have to define a __hash__ and
__eq__ methods in his/her class.
Bye,
bearophile
mmm... not using sets ?
Very complete answer, thank you very much.
I tried that hash approach and sets but it seemed to get wrong results
first time and it was all due my hash method.
Now it takes like 2-3 seconds to do all that stuff and result seem to
be correct. Apparently I have lot to learn about Python... :)
--
Jani Tiainen
>> internal = set(list_internal)
> ...
>
> To do that the original poster may have to define a __hash__ and
> __eq__ methods in his/her class.
You're right. The OP states he implements __eq__, so he also needs a
matching __hash__, such as:
def __hash__(self, other):
return (hash(self.xcoord) ^ hash(self.ycoord) ^
hash(self.streetname) ^ hash(self.streetno))
The hash function by Otten is better because it considers the order of
the items too (while I think the xor doesn't):
return hash((self.xcoord, self.yccord, self.streetname,
self.streetno))
Bye,
bearophile
> But in my case running that loop takes about 10 minutes. What I am doing
> wrong?
Others have already suggested you have a O(N**2) algorithm. Here's an
excellent article that explains more about them:
http://www.joelonsoftware.com/articles/fog0000000319.html
The article is biased towards low-level languages like C. In Python the
way to avoid them is to use sets or dicts, or to keep the list sorted and
use the bisect module.
--
Steven