so obv i want fast inserts. i want to delete duplicates too.
which data structure is best for this?
>>> s=set()
>>> s.add('a')
>>> s.add('b')
>>> s
set(['a', 'b'])
>>> s.add('a')
>>> s
set(['a', 'b'])
>>> s.add('c')
>>> s
set(['a', 'c', 'b'])
>>>
it does remove duplicates, but it is not ordered. to order it you can
use:
>>> l=list(s)
>>> l.sort()
>>> l
['a', 'b', 'c']
hth,
Rares
sets don't seem to be so good because there is no way to iterate them.
s.pop() removes and returns an arbitrary element from s; raises
KeyError if empty
i don't want to remove it, i just want to get the string that is stored
there.
> sets don't seem to be so good because there is no way to iterate them.
Err:
In [82]: for x in set(['a', 'b', 'c']):
....: print x
....:
a
c
b
Ciao,
Marc 'BlackJack' Rintsch
Use a list, and the bisect module to keep it sorted:
py> import bisect
py> def insert_nodupes(slist, item):
...     i = bisect.bisect_left(slist, item)
...     if i >= len(slist) or slist[i] != item:
...         slist.insert(i, item)
...
py> items = []
py> insert_nodupes(items, 'link to some site')
py> insert_nodupes(items, 'another link')
py> insert_nodupes(items, 'third site')
py> insert_nodupes(items, 'another link')
py> items
['another link', 'link to some site', 'third site']
--
Gabriel Genellina
This can be condensed to l = sorted(s)
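For instance, collapsing duplicates and sorting in one step:
>>> sorted(set(['b', 'a', 'c']))
['a', 'b', 'c']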
Keep the data redundantly in two data structures. Use collections.deque() to
append and remove as in a queue, and set() to find duplicates.
Again, no ordering, but very fast insert/delete/dup-check.
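A rough sketch of that combination (the names here are just made up for
illustration):

from collections import deque

queue = deque()   # visit order: fast appends and pops at both ends
seen = set()      # fast duplicate check

def enqueue(url):
    # queue a url only the first time it shows up
    if url not in seen:
        seen.add(url)
        queue.append(url)

def next_url():
    # raises IndexError once the queue is empty
    return queue.popleft()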
Stefan
> notnor...@yahoo.se wrote:
> >> i'm writing a webcrawler.
>> after visiting a new site i want to store it in alphabetical order.
>>
> >> so obv i want fast inserts. i want to delete duplicates too.
>>
> >> which data structure is best for this?
>
> Keep the data redundantly in two data structures. Use collections.deque() to
> append and remove as in a queue, and set() to find duplicates.
>
what about heapq for sorting?
--
Rodrigo Lazo (rlazo)
That's worth doing if you need the data to be sorted after each insert.
If the OP just needs the data to be sorted at the end, using a data
structure with fast inserts (like a set) and then sorting at the end will
likely be faster.
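Roughly, with made-up values:

seen = set()
for link in ('http://site-a.example', 'http://site-b.example', 'http://site-a.example'):
    seen.add(link)        # O(1) insert, duplicates collapse for free
ordered = sorted(seen)    # pay the sorting cost once, at the end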
i meant something like set[pos]: not iterating over the set, but accessing
a specific position in it.
now i have to do:
s = toSet.pop()
toSet.add(s)
if i want to get the url of the next item in a set, how would i do
that?
i can do this with a list:
def scrapeSitesX(startAddress, toList, listPos):
    return scrapeSites(toList[listPos], toList, listPos+1)
to use recursion with a list.
i can work around that but it doesn't get as elegant.
If you need to access arbitrary elements, use a list instead of a set
(but you'll get slower inserts). OTOH, if you just need to be able to get
the next item from the set, you can use an iterator:
> now i have to do:
> s = toSet.pop()
> toSet.add(s)
i = iter(toSet)
s = i.next()
> if i want to get the url of the next item in a set, how would i do that?
> i can do this with a list:
>
> def scrapeSitesX(startAddress, toList, listPos):
>     return scrapeSites(toList[listPos], toList, listPos+1)
> to use recursion with a list.
> i can work around that but it doesn't get as elegant.
Do you have to use recursion here? Why not:
for site in toList:
    scrape(site)
I think you ought to re-examine your requirements. Why do you need to
store in alphabetical order? (And, what does that even mean for a web
site?) Personally, I'd probably just use a simple dict for in-memory
storage, and use a database for external storage. In that case,
adding a site or page to the "seen" structure is a simple matter of
doing
urls[key] = url
Dicts are basically hash tables, so this insert is amortized O(1) and
you get dupe-detection for free, provided the key is unique. If you
want to display the contents in alphabetical order by "key," you can
do this by:
keyz = sorted(urls.keys())
for key in keyz:
    print urls[key]
The problem with this is that you pay the sorting cost every time you
want to display the results. If that is a problem, you can maintain
an in-memory index (and rely on the database to do it for you in the
case of externally stored results).
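If you do end up wanting that in-memory index, one sketch (names made up)
is to keep a sorted list of keys next to the dict, using the bisect module
mentioned earlier:

import bisect

urls = {}      # key -> url: amortized O(1) insert, dupe detection for free
index = []     # the keys, kept in sorted order

def remember(key, url):
    if key not in urls:
        urls[key] = url
        bisect.insort(index, key)   # O(log n) search plus an O(n) list shift

def dump():
    for key in index:               # already alphabetical, no sort per display
        print urls[key]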
i have tried both.
anyway, how do i concatenate sets? which i have to do if i use two
functions like you suggest.
and also, when iterating a set, is there any other way to avoid
exceptions than to have a counter and compare it to len(set)?
is there no method that can check if iter != None?
nevermind, i figured out how to concatenate.
but there is no builtin method for concatenating, right?
def joinSets(set1, set2):
    for i in set2:
        set1.add(i)
    return set1
def scrapeSites(startAddress):
    site = startAddress
    sites = set()
    iterator = iter(sites)
    pos = 0
    while pos < 10:  #len(sites):
        newsites = scrapeSite(site)
        joinSets(sites, newsites)
        pos += 1
        site = iterator.next()
    return sites
def scrapeSite(address):
    toSet = set()
    site = urllib.urlopen(address)
    for row in site:
        obj = url.search(row)
        if obj != None:
            toSet.add(obj.group())
    return toSet
wtf? i'm not multithreading or anything, so how can the size change here?
> what about heapq for sorting?
A heap is the data structure to use for 'fast (nearly) sorted inserts'.
But heapq does not support (as far as I know) deletion of duplicates.
A custom heap class could do that, of course.
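A rough sketch of pairing heapq with a set for the duplicate check
(names made up):

import heapq

heap = []     # heapq keeps the smallest item at heap[0]
seen = set()  # duplicate check

def push(url):
    if url not in seen:
        seen.add(url)
        heapq.heappush(heap, url)   # O(log n) insert

def pop_smallest():
    # O(log n); returns the alphabetically first url still in the heap
    return heapq.heappop(heap)

Popping items off one by one then gives the urls in sorted order.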
> def joinSets(set1, set2):
>     for i in set2:
>         set1.add(i)
>     return set1
Use the | operator, or |=
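For example (displaying with sorted() because the set's own order is
arbitrary):

py> s = set(['a', 'b'])
py> s |= set(['b', 'c'])   # in-place union, same as s.update(...)
py> sorted(s)
['a', 'b', 'c']
py> sorted(set(['a', 'b']) | set(['b', 'c']))
['a', 'b', 'c']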
> Traceback (most recent call last):
> File "C:/Python25/Progs/WebCrawler/spider2.py", line 47, in <module>
> x = scrapeSites("http://www.yahoo.com")
> File "C:/Python25/Progs/WebCrawler/spider2.py", line 31, in
> scrapeSites
> site = iterator.next()
> RuntimeError: Set changed size during iteration
You will need two sets: the one you're iterating over, and another
collecting new urls. Once you finish iterating the first, continue with
the new ones; stop when it's empty.
> def scrapeSites(startAddress):
>     site = startAddress
>     sites = set()
>     iterator = iter(sites)
>     pos = 0
>     while pos < 10:  #len(sites):
>         newsites = scrapeSite(site)
>         joinSets(sites, newsites)
>         pos += 1
>         site = iterator.next()
>     return sites
Try this (untested):
def scrapeSites(startAddress):
    allsites = set()               # all links found so far
    pending = set([startAddress])  # pending sites to examine
    while pending:
        newsites = set()           # new links
        for site in pending:
            newsites |= scrapeSite(site)
        pending = newsites - allsites
        allsites |= newsites
    return allsites
> wtf? i'm not multithreading or anything, so how can the size change here?
You modified the set you were iterating over. Another example of the same problem:
d = {'a': 1, 'b': 2, 'c': 3}
for key in d:
    d[key+key] = 0
--
Gabriel Genellina
You change the size of "sites" here, which means you can no longer use
the iterator.
> wtf? i'm not multithreading or anything, so how can the size change here?
How about:
def scrape_sites(start_address):
    sites = set([start_address])
    sites_scraped = set()
    # This will run until it doesn't find any new sites, which may be
    # a long time
    while len(sites) > 0:
        new_sites = set()
        for site in sites:
            new_sites.update(scrape_site(site))
        sites_scraped.update(sites)
        sites = new_sites.difference(sites_scraped)
    return sites_scraped
Nobody has mentioned balanced-tree data structures, so you might want
to look them up.