urlTree include and regex


mnem0

Jul 27, 2010, 9:44:52 PM
to UrlNet Python Library
Hi,
I just installed urlnet and Python on my PC. I usually code in Perl
and am totally new to Python. I tried to understand the includeexclude
example and wrote my own script based on several of the provided
examples.

I am trying to crawl a forest made of two roots located on two different
domains, and I want to limit the crawl to these domains. Let's
say these two domains are domain1.com and domain2.com. I tried this
code in order to include all URLs found that belong to domain1.com and
domain2.com:

#************************
#!/usr/bin/env python
#rmei.py

from urlnet.urltree import UrlTree

import re

urlsDeLuc = (
'http://quoniam.univ-tln.fr',
'http://quoniam.info',
)

net = UrlTree(_maxLevel=50)

incl_patternlist = [
'quoniam'
]

net.SetProperty('include_patternlist',incl_patternlist)

success = net.BuildUrlForest(Urls=urlsDeLuc)
if success:
    net.WritePajekFile('lucForest1', 'lucForest1')


#*********************

With this code I only crawl 4 URLs. I suppose that the filtering
excludes relative URLs found during the crawl, since they don't
contain any domain name and so fail my include condition.
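
The hypothesis above can be checked outside urlnet with the standard library alone. The sketch below is only an illustration: whether urlnet applies its include patterns before or after resolving relative links is not confirmed in this thread, and the helper names (is_included, resolve_and_check) and the sample path /publications/list.html are made up for the example. It shows that a relative link fails a domain-based pattern until it is resolved against the page it was found on, and that patterns anchored to the host are stricter than a bare 'quoniam' substring.

```python
import re
from urllib.parse import urljoin  # Python 3; in Python 2 this lives in urlparse

# Hypothetical include patterns, anchored to the two hosts from the script.
# The optional group also accepts subdomains such as www.quoniam.info.
INCLUDE_PATTERNS = [
    re.compile(r'^https?://([^/]+\.)?quoniam\.univ-tln\.fr([:/]|$)'),
    re.compile(r'^https?://([^/]+\.)?quoniam\.info([:/]|$)'),
]


def is_included(url):
    """Return True if url matches at least one include pattern."""
    return any(p.search(url) for p in INCLUDE_PATTERNS)


def resolve_and_check(base, link):
    """Resolve link against the page it was found on, then test it."""
    absolute = urljoin(base, link)
    return absolute, is_included(absolute)


base = 'http://quoniam.info/index.html'

# A relative link carries no host, so a domain-based pattern cannot match it.
print(is_included('/publications/list.html'))

# Once resolved against its base page, the same link matches.
print(resolve_and_check(base, '/publications/list.html'))

# A foreign host is rejected even though 'quoniam' appears in its path.
print(resolve_and_check(base, 'http://example.org/quoniam'))
```

If the crawler does filter links before resolving them, a pattern that also accepts links starting with a slash would be one thing to try, but that is a guess and not something the urlnet documentation quoted here confirms.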

Could anyone tell me how to properly handle regexes in order to
include/exclude following the rules stated above?
Best regards,
mnem0

Dale Hunscher

Jul 30, 2010, 6:13:16 PM
to urlnet-pyt...@googlegroups.com
mnem0,

Apologies for the delayed response. I have been swamped at work, but I
will try to find time to look at this over the weekend.

Dale Hunscher
---
Dale A. Hunscher, MSI
CTO, Cielo Medsolutions LLC
3520 Green Ct.
Suite 150
Ann Arbor, MI 48105
Office: (734) 827-1000 x5679
Fax: (734) 661-2668
http://www.cielomedsolutions.com

