Ignore pages that contain specific text


diddy

Sep 15, 2009, 11:28:03 AM
to UrlNet Python Library
Hi,

Is it possible to only look at web pages where certain words/phrases
exist? E.g., if I do a web search for, say, "swine flu" and create a
network from this, is there any way I can harvest links only from pages
that contain the phrase "swine flu" somewhere in the text of the page?


Many thanks, Dave.


Dale Hunscher

Sep 15, 2009, 11:49:10 AM
to urlnet-pyt...@googlegroups.com
I've been looking into this for other reasons, but I don't have working code yet. More news when that happens...

Dale A. Hunscher, MSI
Sr. Business Systems Analyst
University of Michigan Medical School
734-678-5178

diddy

Sep 22, 2009, 8:37:03 AM
to UrlNet Python Library
Cool. I think this could give the code a really unique edge.
Basically, you want to harvest links only from pages that contain (or
don't contain) one or more given words or phrases. This would keep the
network analysis "on topic".

Dave.
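
The include/exclude idea described above can be sketched independently of UrlNet. This is only an illustration of the filtering rule, not UrlNet's implementation: the function name `page_matches`, its parameters, the "any include pattern is enough" rule, and the case-insensitive default are all assumptions made for the example.

```python
import re

def page_matches(text, include_patterns=(), exclude_patterns=(),
                 flags=re.IGNORECASE):
    """Return True if links should be harvested from a page with this text.

    Hypothetical helper: patterns are regular expressions, matched
    case-insensitively by default.
    """
    # Collapse runs of whitespace so a phrase split across lines still matches.
    normalized = " ".join(text.split())
    # Reject the page if any exclude pattern is found.
    if any(re.search(p, normalized, flags) for p in exclude_patterns):
        return False
    # With no include patterns, every non-excluded page passes.
    if not include_patterns:
        return True
    # Otherwise at least one include pattern must be present.
    return any(re.search(p, normalized, flags) for p in include_patterns)
```

A crawler would call something like this on each fetched page's text and only follow that page's outbound links when it returns True. Requiring all include patterns instead of any one is an equally reasonable choice; swap `any` for `all` in the last line.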

Dale Hunscher

Sep 25, 2009, 12:48:32 AM
to urlnet-pyt...@googlegroups.com
Dave,

See attached ZIP.

The example program includeexclude.py should get you started. I haven't tested this a lot, but I wanted to get it to you ASAP to get your feedback on it.





urlnetandexamples.zip

diddy

Sep 28, 2009, 11:22:34 AM
to UrlNet Python Library
Hi - this is great news, thanks. I ran the script below with "social
media" as my required phrase, but the output still seemed to include
references to pages that do not contain this phrase in the page text.
Am I doing something wrong?

# urlforest1.py
from urlnet.urltree import UrlTree
import re # for search flags

net = UrlTree(_useHostNameForDomainName=True, _maxLevel=1)

# Words or phrases a page must contain. New ones can be added by
# copying and modifying the 'social media' line.
incl_patternlist = [
    'social media',
]

net.SetProperty('include_patternlist', incl_patternlist)

# Root URLs for the forest (placeholder).
urlforest = (
    'list_of_urls',
)

# List passed to SetIgnorableText (placeholder).
ignorableText = [
    'list_of_text_to_ignore',
]

net.SetIgnorableText(ignorableText)

success = net.BuildUrlForest(Urls=urlforest)
if success:
    net.WritePajekFile('urlforest', 'urlforest')
    net.WritePajekNetworkFile('urltree2', 'urltree2domains', urlNet=False)
    net.WritePairNetworkFile('urltree1',
                             'urltree1domains',
                             urlNet=False,  # do domains instead
                             uniquePairs=True,
                             delimiter=' ')
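
When a page seems to slip through a phrase filter, one thing that may be worth checking is whether the phrase occurs only in the markup (a meta tag, an attribute, a script) rather than in the visible text. How UrlNet itself matches patterns is not stated in this thread, so the following is just a standalone diagnostic sketch using the Python 3 standard library; `VisibleText` and `phrase_locations` are hypothetical names.

```python
import re
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def phrase_locations(html, phrase):
    """Report whether phrase occurs in the raw HTML and in the visible text."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    # Join text nodes and collapse whitespace before searching.
    visible = " ".join(" ".join(parser.parts).split())
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    return {
        "raw_html": bool(pattern.search(html)),
        "visible_text": bool(pattern.search(visible)),
    }
```

If a page reports True for `raw_html` but False for `visible_text`, a filter that searches the raw page source would include it even though a reader would not see the phrase anywhere on the page.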

Dale Hunscher

Sep 29, 2009, 9:38:27 AM
to urlnet-pyt...@googlegroups.com
Is there any chance you could send me an example URL of a page you expected to be excluded? That way I can debug. If you don't want it to go to the entire list you can email it to my GMail account directly (dalehu...@gmail.com). This code is really green, so I'm not too surprised if there is a bug.





diddy

Sep 29, 2009, 10:22:38 AM
to UrlNet Python Library, Dale Hunscher
Hi Dale - I have emailed you the file I used.

Many thanks, Dave.