
which datastructure for fast sorted insert?


notnor...@yahoo.se

May 25, 2008, 2:37:00 AM
I'm writing a web crawler.
After visiting a new site I want to store it in alphabetical order.

So obviously I want fast inserts. I want to delete duplicates too.

Which data structure is best for this?

Rares Vernica

May 25, 2008, 2:56:10 AM
Use a set to store them:

>>> s=set()
>>> s.add('a')
>>> s.add('b')
>>> s
set(['a', 'b'])
>>> s.add('a')
>>> s
set(['a', 'b'])
>>> s.add('c')
>>> s
set(['a', 'c', 'b'])
>>>

It does remove duplicates, but it is not ordered. To order it you can
use:

>>> l=list(s)
>>> l.sort()
>>> l
['a', 'b', 'c']

hth,
Rares

notnor...@yahoo.se

May 25, 2008, 3:10:45 AM

Sets don't seem to be so good, because there is no way to iterate them.

s.pop(): remove and return an arbitrary element from s; raises
KeyError if empty

I don't want to remove anything, I just want to get the string that is
stored there.

Marc 'BlackJack' Rintsch

May 25, 2008, 3:32:49 AM
On Sun, 25 May 2008 00:10:45 -0700, notnorwegian wrote:

> Sets don't seem to be so good, because there is no way to iterate them.

Err:

In [82]: for x in set(['a', 'b', 'c']):
....: print x
....:
a
c
b

Ciao,
Marc 'BlackJack' Rintsch

Gabriel Genellina

May 25, 2008, 12:05:31 PM
to pytho...@python.org

Use a list, and the bisect module to keep it sorted:

py> import bisect
py> def insert_nodupes(slist, item):
...     i = bisect.bisect_left(slist, item)
...     if i >= len(slist) or slist[i] != item:
...         slist.insert(i, item)
...
py> items = []
py> insert_nodupes(items, 'link to some site')
py> insert_nodupes(items, 'another link')
py> insert_nodupes(items, 'third site')
py> insert_nodupes(items, 'another link')
py> items
['another link', 'link to some site', 'third site']

--
Gabriel Genellina

Terry Reedy

May 25, 2008, 12:59:17 PM
to pytho...@python.org

"Rares Vernica" <ra...@ics.uci.edu> wrote in message
news:y1r1w3q...@ics.uci.edu...
| >>> l=list(s)
| >>> l.sort()

This can be condensed to l = sorted(s)

Stefan Behnel

May 25, 2008, 1:46:47 PM
to notnor...@yahoo.se

Keep the data redundantly in two data structures. Use collections.deque() to
append and remove as in a queue, and set() to find duplicates.

Again, no ordering, but very fast insert/delete/dup-check.
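
A minimal sketch of that combination (the helper name enqueue_url is
made up here): the deque keeps the visit order, the set gives O(1)
duplicate checks.

from collections import deque

queue = deque()  # URLs in visit order
seen = set()     # fast membership test for duplicates

def enqueue_url(url):
    # only queue URLs we have not seen before
    if url not in seen:
        seen.add(url)
        queue.append(url)

enqueue_url('http://a.example/')
enqueue_url('http://b.example/')
enqueue_url('http://a.example/')  # duplicate, silently skipped
print queue.popleft()             # -> http://a.example/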

Stefan

Rodrigo Lazo

May 25, 2008, 2:02:47 PM
to pytho...@python.org
Stefan Behnel <stef...@behnel.de> writes:

> notnor...@yahoo.se wrote:
>> I'm writing a web crawler.
>> After visiting a new site I want to store it in alphabetical order.
>>
>> So obviously I want fast inserts. I want to delete duplicates too.
>>
>> Which data structure is best for this?
>
> Keep the data redundantly in two data structures. Use collections.deque() to
> append and remove as in a queue, and set() to find duplicates.
>

what about heapq for sorting?

--

Rodrigo Lazo (rlazo)

I V

May 25, 2008, 4:07:15 PM
On Sun, 25 May 2008 13:05:31 -0300, Gabriel Genellina wrote:
> Use a list, and the bisect module to keep it sorted:

That's worth doing if you need the data to be sorted after each insert.
If the OP just needs the data to be sorted at the end, using a data
structure with fast inserts (like a set) and then sorting at the end will
likely be faster.
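
A sketch of that pattern (crawl_all() is a stand-in for whatever
produces the URLs):

seen = set()
for url in crawl_all():  # hypothetical URL source
    seen.add(url)        # O(1) insert; duplicates vanish for free
ordered = sorted(seen)   # one O(n log n) sort at the very end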

notnor...@yahoo.se

May 25, 2008, 6:49:16 PM


I meant like set[pos]: not iterate, but access a specific position in
the set.

Now I have to do:
s = toSet.pop()
toSet.add(s)

If I want to get the URL of the next item in a set, how would I do
that?
I can do this with a list:

def scrapeSitesX(startAddress, toList, listPos):
    return scrapeSites(toList[listPos], toList, listPos + 1)

to use recursion with a list.
I can work around that, but it doesn't get as elegant.

I V

May 25, 2008, 7:30:49 PM
On Sun, 25 May 2008 15:49:16 -0700, notnorwegian wrote:
> I meant like set[pos]: not iterate, but access a specific position in
> the set.

If you need to access arbitrary elements, use a list instead of a set
(but you'll get slower inserts). OTOH, if you just need to be able to get
the next item from the set, you can use an iterator:

> Now I have to do:
> s = toSet.pop()
> toSet.add(s)

i = iter(toSet)

s = i.next()
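
Note that i.next() raises StopIteration once the set is exhausted; a
sketch of handling that without keeping a counter:

i = iter(toSet)
try:
    while True:
        s = i.next()
        # ... process s ...
except StopIteration:
    pass  # iterator exhausted, nothing left to read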

> If I want to get the URL of the next item in a set, how would I do
> that?
> I can do this with a list:
>
> def scrapeSitesX(startAddress, toList, listPos):
>     return scrapeSites(toList[listPos], toList, listPos + 1)
>
> to use recursion with a list.
> I can work around that, but it doesn't get as elegant.

Do you have to use recursion here? Why not:

for site in toList:
    scrape(site)

miller...@gmail.com

May 25, 2008, 8:08:39 PM

I think you ought to re-examine your requirements. Why do you need to
store in alphabetical order? (And, what does that even mean for a web
site?) Personally, I'd probably just use a simple dict for in-memory
storage, and use a database for external storage. In that case,
adding a site or page to the "seen" structure is a simple matter of
doing

urls[key] = url

Dicts are basically hash tables, so this insert is amortized O(1) and
you get dupe-detection for free, provided the key is unique. If you
want to display the contents in alphabetical order by "key," you can
do this by:

keyz = sorted(urls.keys())
for key in keyz:
    print urls[key]

The problem with this is that you pay the sorting cost every time you
want to display the results. If that is a problem, you can maintain
an in-memory index (and rely on the database to do it for you in the
case of externally stored results).
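
One sketch of such an in-memory index, combining the dict with the
bisect idea from earlier in the thread (add_url is a made-up helper):

import bisect

urls = {}   # key -> url
index = []  # keys, kept sorted at all times

def add_url(key, url):
    # O(1) dupe check via the dict, O(n) insert keeps the index sorted
    if key not in urls:
        urls[key] = url
        bisect.insort(index, key)

# display in alphabetical order, no sort needed here
for key in index:
    print urls[key]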

notnor...@yahoo.se

May 25, 2008, 9:04:35 PM


I have tried both.

Anyway, how do I concatenate sets? Which I have to, if I use two
functions like you suggest.

And also, when iterating a set, is there any other way to avoid
exceptions than to have a counter and compare to len(set)?
Is there no method that can check if iter != None?

notnor...@yahoo.se

May 25, 2008, 9:14:51 PM

Never mind, I figured out how to concatenate.

But there's no builtin method for concatenating, right?

notnor...@yahoo.se

May 25, 2008, 9:42:06 PM
Traceback (most recent call last):
  File "C:/Python25/Progs/WebCrawler/spider2.py", line 47, in <module>
    x = scrapeSites("http://www.yahoo.com")
  File "C:/Python25/Progs/WebCrawler/spider2.py", line 31, in scrapeSites
    site = iterator.next()
RuntimeError: Set changed size during iteration


def joinSets(set1, set2):
    for i in set2:
        set1.add(i)
    return set1

def scrapeSites(startAddress):
    site = startAddress
    sites = set()
    iterator = iter(sites)
    pos = 0
    while pos < 10:  # len(sites):
        newsites = scrapeSite(site)
        joinSets(sites, newsites)
        pos += 1
        site = iterator.next()
    return sites

def scrapeSite(address):
    toSet = set()
    site = urllib.urlopen(address)
    for row in site:
        obj = url.search(row)
        if obj != None:
            toSet.add(obj.group())
    return toSet


WTF? I'm not multithreading or anything, so how can the size change here?


sturlamolden

May 25, 2008, 10:15:18 PM
On May 25, 8:02 pm, Rodrigo Lazo <rlazo....@gmail.com> wrote:

> what about heapq for sorting?

A heap is the data structure to use for 'fast (nearly) sorted inserts'.
But heapq does not support (as far as I know) deletion of duplicates.
A custom heap class could do that, of course.
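
A sketch of such a custom class, pairing heapq with a set for the
duplicate check (the name UniqueHeap is invented here):

import heapq

class UniqueHeap(object):
    def __init__(self):
        self._heap = []        # ordered pops
        self._members = set()  # O(1) duplicate detection

    def push(self, item):
        # ignore items already in the heap
        if item not in self._members:
            self._members.add(item)
            heapq.heappush(self._heap, item)

    def pop(self):
        # remove and return the smallest item
        item = heapq.heappop(self._heap)
        self._members.discard(item)
        return item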

Gabriel Genellina

May 25, 2008, 10:25:33 PM
to pytho...@python.org
On Sun, 25 May 2008 22:42:06 -0300, <notnor...@yahoo.se> wrote:

> def joinSets(set1, set2):
>     for i in set2:
>         set1.add(i)
>     return set1

Use the | operator, or |=
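
For example, the whole joinSets function collapses to:

set1 |= set2        # in-place union, same effect as joinSets
set3 = set1 | set2  # or build a new set containing both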

> Traceback (most recent call last):
>   File "C:/Python25/Progs/WebCrawler/spider2.py", line 47, in <module>
>     x = scrapeSites("http://www.yahoo.com")
>   File "C:/Python25/Progs/WebCrawler/spider2.py", line 31, in scrapeSites
>     site = iterator.next()
> RuntimeError: Set changed size during iteration

You will need two sets: the one you're iterating over, and another collecting new urls. Once you finish iterating the first, continue with the new ones; stop when it's empty.

> def scrapeSites(startAddress):
>     site = startAddress
>     sites = set()
>     iterator = iter(sites)
>     pos = 0
>     while pos < 10:  # len(sites):
>         newsites = scrapeSite(site)
>         joinSets(sites, newsites)
>         pos += 1
>         site = iterator.next()
>     return sites

Try this (untested):

def scrapeSites(startAddress):
    allsites = set()               # all links found so far
    pending = set([startAddress])  # pending sites to examine
    while pending:
        newsites = set()           # new links
        for site in pending:
            newsites |= scrapeSite(site)
        pending = newsites - allsites
        allsites |= newsites
    return allsites

> wtf? im not multithreading or anything so how can the size change here?

You modified the set you were iterating over. Another example of the same problem:

d = {'a': 1, 'b': 2, 'c': 3}
for key in d:
    d[key + key] = 0
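
A common way around it (a sketch) is to iterate over a snapshot of the
keys instead:

for key in list(d):   # list(d) copies the keys up front
    d[key + key] = 0  # now the dict can safely grow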

--
Gabriel Genellina

I V

May 25, 2008, 10:42:58 PM
On Sun, 25 May 2008 18:42:06 -0700, notnorwegian wrote:
> def scrapeSites(startAddress):
>     site = startAddress
>     sites = set()
>     iterator = iter(sites)
>     pos = 0
>     while pos < 10:  # len(sites):
>         newsites = scrapeSite(site)
>         joinSets(sites, newsites)

You change the size of "sites" here, which means you can no longer use
the iterator.

> wtf? im not multithreading or anything so how can the size change here?

How about:

def scrape_sites(start_address):
    sites = set([start_address])
    sites_scraped = set()
    # This will run until it doesn't find any new sites, which may be
    # a long time
    while len(sites) > 0:
        new_sites = set()
        for site in sites:
            new_sites.update(scrape_site(site))
        sites_scraped.update(sites)
        sites = new_sites.difference(sites_scraped)
    return sites_scraped  # 'sites' is empty once the loop exits


Paul Rubin

May 26, 2008, 4:54:56 PM

Nobody has mentioned balanced-tree data structures so you might want
to look them up.
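
The standard library has no balanced tree, but the idea is easy to
sketch with a plain (unbalanced) binary search tree: each insert lands
in sorted position, duplicates drop out for free, and an in-order walk
yields the keys sorted. A balanced variant (AVL, red-black, treap)
would rebalance on insert to guarantee O(log n); this sketch omits
that.

class Node(object):
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    # sorted insert that silently ignores duplicates
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    # key == root.key: duplicate, nothing to do
    return root

def inorder(root):
    # generator yielding keys in sorted order
    if root is not None:
        for k in inorder(root.left):
            yield k
        yield root.key
        for k in inorder(root.right):
            yield k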
