searchengine1.py


diddy

Jul 14, 2009, 8:21:05 AM
to UrlNet Python Library
Hi,

I'm trying to run the example script searchengine1.py. It seems to
do something, in the sense that it outputs a date-stamped folder with
these files inside:

*.txt.html
*.txt.txt
searchengine1.log
*.paj
*_domains.gdf
*_urls.gdf

All the files are empty apart from the *.txt.html file, which is a
cached version of the Google search page. So, I can tell it went out
and did the search, but then didn't do anything with it.

Any thoughts?

D.

Dale Hunscher

Jul 14, 2009, 10:41:09 AM
to urlnet-pyt...@googlegroups.com
Hello D.,

You found a bug in searchenginetree.py that escaped my unit testing.
It seems to always expect there to be a list of ignorable text
strings, which if found in a URL cause it to be ignored. You can work
around this by adding a call to net.SetIgnorableText as shown in the
lines below, right after the constructor:

net = GoogleTree(_maxLevel=1,
                 _workingDir=workingDir,
                 _resultLimit=20,
                 _probabilityVector = probabilityByPositionStopSmokingClicks,
                 _probabilityVectorGenerator = vectorGenerator)

# workaround for bug in searchenginetree: always expects ignorable text
net.SetIgnorableText([])

Let me know if this doesn't work for you. I'll put a bug report in the
issue list.

Thanks!

Dale
--
Dale A. Hunscher, MSI
Sr. Business Systems Analyst
University of Michigan Medical School
734-678-5178

diddy

Jul 17, 2009, 3:56:06 AM
to UrlNet Python Library
Great, thanks. I haven't tested it yet, but will do.

Also, is there any way the searchengine1.py example can actually
exclude all Google references? Google always comes out as the highest
result.

Many thanks, Dave.

diddy

Jul 17, 2009, 4:29:57 AM
to UrlNet Python Library
Hi,

One more question. This is what I am using in searchengine1.py
(below). I want it to exclude all self-references to Google, but the
output still contains Google domains. Am I doing something wrong?


# searchengine1.py
import sys
import os

from urlnet.googletree import GoogleTree
import urlnet.log
from urlnet.clickprobabilities import probabilityByPositionStopSmokingClicks
from urlnet.searchenginetree import computeDescendingStraightLineProbabilityVector,\
    computeEqualProbabilityVector
from urlnet.urlutils import GetTimestampString

# textToIgnore = ['google','doubleclick',]
ignorableText = ['google','doubleclick',]


def main():
    # uncomment one of the vectorGenerator assignments below

    # vectorGenerator = computeEqualProbabilityVector
    vectorGenerator = computeDescendingStraightLineProbabilityVector

    """
    We are going to make a subdirectory under
    the working directory that will be different each run.
    """
    from urlnet.urlutils import GetConfigValue
    from os.path import join

    baseDir = GetConfigValue('workingDir')

    # make unique directory to write to
    timestamp = GetTimestampString()
    workingDir = join(baseDir,timestamp)
    oldDir = os.getcwd()

    myLog = None

    try:
        try:
            os.mkdir(baseDir)
        except Exception, e:
            pass #TODO: make sure it's because the dir already exists
        try:
            os.mkdir(workingDir)
        except Exception, e:
            pass #TODO: make sure it's because the dir already exists
        os.chdir(workingDir)
        myLog = urlnet.log.Log('main')
        urlnet.log.logging=True
        #log.trace=True
        urlnet.log.altfd=open('searchengine1.log','w')
    except Exception,e:
        myLog.Write(str(e)+'\n')
        goAhead = False

    net = GoogleTree(_maxLevel=1,
                     _workingDir=workingDir,
                     _resultLimit=100,
                     _probabilityVector = probabilityByPositionStopSmokingClicks,
                     _probabilityVectorGenerator = vectorGenerator)
    """

    # workaround for bug in searchenginetree: always expects ignorable text

    net.SetIgnorableText([])

    # uncomment these lines if you want to see what results the top-level query returns.

    ##################################################
    ######## get and view the result set URLs ########
    ##################################################
    (queryURL,url,Urls) = net.GetSEResultSet('"true blood"')
    print queryURL
    print Urls
    """

    """
    # comment out the lines below (from BuildUrlForestWithPhantomRoot
    # through the WriteGuessFile calls) if you just want to see
    # the result set and have activated the above lines.
    """

    #########################################################
    ######## build a forest and output some networks ########
    #########################################################
    net.BuildUrlForestWithPhantomRoot('"true blood"')
    #net.SetProperty('getTitles',True)

    net.WritePajekFile('searchengine1','searchengine1' \
        #,useTitles=True \
        )
    net.WriteGuessFile('searchengine1_urls'\
        #,useTitles=True\
        ) # url network
    net.WriteGuessFile('searchengine1_domains',False \
        #,useTitles=True \
        ) #domain network

    # tidy up
    if urlnet.log.altfd:
        urlnet.log.altfd.close()
        urlnet.log.altfd = None

    os.chdir(oldDir)


if __name__ == '__main__':
    main()
    sys.exit(0)





Dale Hunscher

Jul 17, 2009, 4:27:16 PM
to urlnet-pyt...@googlegroups.com
You're on the right track, 95% of the way there.

You need to change the line:

net.SetIgnorableText([])

To read

net.SetIgnorableText(ignorableText)

This will exclude all URLs with google in them, but in my experience
the ones you want to eliminate are more specific, so the kind of thing
I use is something like this:

ignorableText = \
    ['video.google.com',
     'books.google.com',
     'news.google.com',
     'maps.google.com',
     'images.google.com',
     'blogsearch.google.com',
     'mail.google.com',
     'fusion.google.com',
     'google.com/intl',
     'google.com/search',
     'google.com/accounts',
     'google.com/preferences',
     'doubleclick',]

That's to replace the line where you set the ignorableText variable
earlier in the program, of course. I tried this and it seems to work.
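
To make the difference between the broad and the specific list concrete, here is a minimal sketch. It assumes the ignorable text is matched as a plain substring anywhere in the URL, which is how the feature is described earlier in this thread; urlnet's actual matching code is not shown here, and is_ignorable and the sample URLs are hypothetical names made up for illustration.

```python
# Illustrative only: is_ignorable() is a hypothetical helper, not part of urlnet.
def is_ignorable(url, ignorable_text):
    """Return True if any fragment in ignorable_text occurs anywhere in url."""
    return any(fragment in url for fragment in ignorable_text)


broad = ['google', 'doubleclick']
specific = ['news.google.com', 'google.com/search', 'doubleclick']

# Made-up sample URLs.
urls = [
    'http://news.google.com/news?q=true+blood',
    'http://code.google.com/p/example/',
    'http://www.hbo.com/true-blood/',
]

kept_broad = [u for u in urls if not is_ignorable(u, broad)]
kept_specific = [u for u in urls if not is_ignorable(u, specific)]

print(kept_broad)     # only the hbo.com URL survives
print(kept_specific)  # the code.google.com and hbo.com URLs survive
```

Under that assumption, the broad list drops every URL containing "google", while the specific list removes only the self-referencing Google pages.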

Oddly enough, I just presented a poster at NetSci 2009 in Venice that
complained about exactly this problem: how heavily Google references
itself in its search results.

I'm attaching a version of searchengine1.py with these changes, to
save you the trouble of cut-and-paste:

Cheers,
Dale
searchengine1.py

diddy

Jul 21, 2009, 10:27:14 AM
to UrlNet Python Library
Super - thanks so much! Can you tell I'm not a programmer?!

D.

diddy

Sep 8, 2009, 11:07:04 AM
to UrlNet Python Library
Hi,

I tried to run "writepairfile.py" and got this error message. Did I
do something wrong?

>>>
Traceback (most recent call last):
  File "E:\URLnet\examples\writepairfile.py", line 6, in <module>
    net.WritePairNetworkFile('urltree1', 'urltree1urls', urlNet = True)
AttributeError: UrlTree instance has no attribute 'WritePairNetworkFile'

Dale Hunscher

Sep 8, 2009, 11:14:46 AM
to urlnet-pyt...@googlegroups.com
It sounds like you didn't copy the files from the urlnet directory in examples.zip to the right place. They should overwrite the contents of your original urlnet directory, wherever that may be. If you are running out of IDLE, you should close IDLE and restart after you copy the urlnet files to the right place.
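
One way to check whether an outdated copy of a package is the one being imported is to print the file the module was loaded from and test for the missing method. The sketch below is generic Python, not part of urlnet; describe_attribute is a hypothetical helper, and it is demonstrated on a standard-library module so that it runs anywhere.

```python
import importlib


def describe_attribute(module_name, class_name, attr_name):
    """Return (file the module was loaded from, whether the class has attr_name)."""
    module = importlib.import_module(module_name)
    cls = getattr(module, class_name)
    return getattr(module, '__file__', '(built-in)'), hasattr(cls, attr_name)


# Stand-in names from the standard library so the sketch is runnable as-is.
# For the case in this thread you would pass something like
# ('urlnet.urltree', 'UrlTree', 'WritePairNetworkFile') instead.
path, found = describe_attribute('json.decoder', 'JSONDecoder', 'decode')
print(path)   # location of the module file that was actually imported
print(found)  # True if the attribute exists on the class
```

If the printed path points at the old install location, or the second value is False, the updated files did not land where Python is looking. An interpreter that is already running also keeps previously imported modules in memory, which is why restarting IDLE matters.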

diddy

Sep 10, 2009, 9:37:43 AM
to UrlNet Python Library
I think my posts keep disappearing!

Thanks for the update - it now works. Do you have a solution to the
subdomain issue?

Very best, Dave.

Dale Hunscher

Sep 10, 2009, 10:19:00 AM
to urlnet-pyt...@googlegroups.com
Hi Dave,

See the attached example. Line 3 is the key: set the _useHostNameForDomainName argument to the UrlTree constructor to True, and the host name (e.g. bongo.blogspot.com) will be used instead of domain name (blogspot.com) in the domain networks. I'll be including this example in the next release.
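
For readers without the attachment, the distinction between host name and domain name can be sketched with the standard library alone. This is not how urlnet derives either value (that code is not shown in this thread); naive_domain_name is a hypothetical helper that simply keeps the last two labels of the host, which is wrong for suffixes such as .co.uk, and the URL is made up.

```python
from urllib.parse import urlparse


def host_name(url):
    """Full host name of the URL, e.g. bongo.blogspot.com."""
    return urlparse(url).hostname


def naive_domain_name(url):
    """Last two labels of the host name (a deliberate simplification)."""
    labels = host_name(url).split('.')
    return '.'.join(labels[-2:])


url = 'http://bongo.blogspot.com/2009/09/some-post.html'  # made-up URL
print(host_name(url))          # bongo.blogspot.com
print(naive_domain_name(url))  # blogspot.com
```

With domain names, every blog hosted on blogspot.com collapses into a single node of the domain network; with host names, each blog gets its own node.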





urltree4blogs.py

Dale Hunscher

Sep 10, 2009, 10:22:24 AM
to urlnet-pyt...@googlegroups.com
I forgot to mention: regarding your disappearing posts, I realized that I am getting your posts in two separate GMail threads. The post you thought had disappeared was in the other thread (sent about 5 hours ago). I saw it earlier and didn't have time to respond then. Weird...






diddy

Sep 15, 2009, 9:37:44 AM
to UrlNet Python Library
Thanks again Dale :)

Hmm, odd. OK, I'll keep an eye on where my posts are ending up.
Bizarre.

One other question for you (maybe I should start a new thread?):

Is it possible to look only at web pages where certain words or
phrases exist? For example, if I do a web search for, say, "swine
flu", is there any way I can harvest links only from pages that
contain the phrase "swine flu" somewhere in the text of the page?

Many thanks, Dave.

diddy

Sep 15, 2009, 10:25:32 AM
to UrlNet Python Library
Hi,

I tried the subdomain fix above in a urlforest script, but the .net
output still seems to contain just the domain and not the subdomain.
Here is the script:

# urlforest1.py
from urlnet.urltree import UrlTree
net = UrlTree(_useHostNameForDomainName = True)

urlforest = (
'www.some_website.com',
)

net = UrlTree(_maxLevel=1)


ignorableText = \
['video.google.com',
'blogsearch.google.com',
'google.com',
'www.stumbleupon.com/submit',
'cache',
'google',
'74.125.77.132',
'209.85.229.132',
'#',
'statcounter.',
'/analytics/',
'onestat',
'doubleclick',
'swicki',
'eurekster',
'yahoo.com',
'submit?',
'quantcast',
'ads2.',
'overture.',
'/rss/',
'/rdf/',
'/feed/',
'feeddigest',
'sitemeter',
'clustrmaps',
'adbureau',
'zeus.com',
'products/acrobat',
'hon.ch',
'feedburner.com',
'://help.',
'businesswire',
'/faq.',
'sys-con.com',
'jigsaw.w3c.org',
'/categories',
'sitemap',
'site-map',
'site_map',
'rss.xml',
'misoso.com',
'adjuggler.com',
'skype.com',
'validator.w3c.org',
'digg.com/submit',
'addthis.com',
'feedblitz',
'del.icio.us/post',
'feeddigest',
'feedster',
'/about/',
'careers',
'employment',
'sitemap',
'site-map',
'aolstore.com',
'aolsyndication.com',
'/privacy/',
'/privacy.',
'twitter.com/?status',
'twitter.com/home?status',
'/help/',
'phpbb',
'crawlability',
'w3.org',
'4networking',
'www.adtech.com',
'technorati',
'/submit?',
'/share.php',
'adserver',
'invisionboard',
'reddit.com/submit',
'www.myspace.com/Modules/PostTo/Pages/',
'www.facebook.com/share.php?',
'www.facebook.com/sharer.php?',
'www.linkedin.com/shareArticle?',
'doubleclick',]

net.SetIgnorableText(ignorableText)


success = net.BuildUrlForest(Urls=urlforest)
if success:
    net.WritePajekFile('urlforest', 'urlforest')
    net.WritePajekNetworkFile('urltree2', 'urltree2domains', urlNet = False)

Dale Hunscher

Sep 15, 2009, 11:13:45 AM
to urlnet-pyt...@googlegroups.com
Hi Dave,

I'll have to take a look at this after work today - I'm swamped! Sounds like a bug. I'm just about to start a major test of new code I've been adding (6 or 7 months' worth), and I will make sure the tests cover this so it gets caught if it re-emerges after I fix it.

I'm glad you're using the ignorableText feature. I don't know if I mentioned it or not, but when I did the example I realized it would be a nice, if not strictly necessary, addition for real-life usage.





Dale Hunscher

Sep 15, 2009, 11:47:44 AM
to urlnet-pyt...@googlegroups.com
Hi again,

I actually got a chance to read through it, and suggest you do two things:

1. remove the line that says

net = UrlTree(_maxLevel=1)

2. change the line that says


net = UrlTree(_useHostNameForDomainName = True)

to include the _maxLevel argument, like this:

net = UrlTree(_useHostNameForDomainName = True,_maxLevel=1)

The line I'm telling you to remove overrides the earlier line: it assigns a brand-new UrlTree to net, so _useHostNameForDomainName goes back to its default value of False.
 
This should make it work.
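
The effect is plain Python name rebinding and can be reproduced without urlnet. In the sketch below, FakeTree is a hypothetical stand-in for UrlTree that only records its constructor arguments; the default of 2 for _maxLevel is invented for the illustration, while the False default for _useHostNameForDomainName is the one stated above.

```python
class FakeTree:
    """Hypothetical stand-in for UrlTree; it only stores its constructor arguments."""

    def __init__(self, _maxLevel=2, _useHostNameForDomainName=False):
        # _maxLevel=2 is a made-up default, used only for this sketch.
        self.maxLevel = _maxLevel
        self.useHostNameForDomainName = _useHostNameForDomainName


# Two separate constructions: the second object replaces the first,
# so the host-name setting from the first call is lost.
net = FakeTree(_useHostNameForDomainName=True)
net = FakeTree(_maxLevel=1)
print(net.useHostNameForDomainName)  # False

# One construction carrying both arguments keeps both settings.
fixed = FakeTree(_useHostNameForDomainName=True, _maxLevel=1)
print(fixed.useHostNameForDomainName, fixed.maxLevel)  # True 1
```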






diddy

Sep 18, 2009, 6:39:02 AM
to UrlNet Python Library
Great - that seems to work fine :)

D.