Infinite Loop Crawling (Query Strings?)

Windter

Oct 28, 2012, 9:09:34 PM
to scrapy...@googlegroups.com
I'm sure this is a common issue. My crawler never stops, and I believe it is because of the randomly generated query strings that are appended when a link is clicked. For example:


I found from [http://stackoverflow.com/questions/8567171/scrapy-query-string-removal] that you can use urlparse to clean up the URL, but my code doesn't work. Currently I have:


from urlparse import urlparse

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


def clean_url(url):
    # Rebuild the URL from scheme, host and path only,
    # dropping the query string and fragment.
    o = urlparse(url)
    cleaned_url = o.scheme + "://" + o.netloc + o.path
    return cleaned_url


class fmpSpider(CrawlSpider):
    name = "fmp"
    allowed_domains = ["ece.utexas.edu"]
    start_urls = [
        "http://www.ece.utexas.edu",
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=(r'ece\.utexas\.edu',), process_value=clean_url),
             callback='parse_items', follow=True),
    )
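
For reference, calling the function by hand on an absolute URL does what I expect (the URL below is just an illustration, not necessarily one of my real crawled pages):

>>> clean_url("http://pharos.ece.utexas.edu/wiki/index.php?from=20121028231450&hideliu=1")
'http://pharos.ece.utexas.edu/wiki/index.php'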

Whatever is being returned by my function clean_url is not correct. Help is much appreciated!

- Windter

Windter

Oct 29, 2012, 1:59:35 AM
to scrapy...@googlegroups.com
I felt I needed to be more specific and clear about my issue:

When I am not using the clean_url function, this is what happens:

My crawler starts at http://www.ece.utexas.edu and is allowed to follow any links on the domain ece.utexas.edu. When it reaches this page [http://www.pharos.ece.utexas.edu/wiki/index.php?title=Special:UserLogin&returnto=Main+Page] it never leaves, and so it goes into an infinite loop. All the response URLs look something like:

"http://pharos.ece.utexas.edu/wiki/index.php?from=20121028231450&hideliu=1&hidemyself=1&target=Pharos_Tutorials&.........................................."


Now, after I add the clean_url function (from my original post), my crawler finishes prematurely. Here are some of the first debug lines in the shell:

2012-10-29 00:45:57-0500 [fmp] DEBUG: Crawled (200) <GET http://www.ece.utexas.edu> (referer: None)
2012-10-29 00:45:57-0500 [fmp] DEBUG: Crawled (404) <GET http://www.ece.utexas.edu/:///sitemap/> (referer: http://www.ece.utexas.edu)
2012-10-29 00:45:57-0500 [fmp] DEBUG: Crawled (404) <GET http://www.ece.utexas.edu/:///it/cadence.cfm> (referer: http://www.ece.utexas.edu)
2012-10-29 00:45:57-0500 [fmp] DEBUG: Crawled (404) <GET http://www.ece.utexas.edu/:///aboutece/news_detail.cfm> (referer: http://www.ece.utexas.edu)
2012-10-29 00:45:57-0500 [fmp] DEBUG: Crawled (404) <GET http://www.ece.utexas.edu/:///index.cfm> (referer: http://www.ece.utexas.edu)
...
...

The URLs being crawled are obviously wrong. I can paste in my parse method if need be.


Thanks,
Windter

Pablo Hoffman

Oct 30, 2012, 1:42:11 AM
to scrapy...@googlegroups.com
That clean_url function removes all URL arguments, which is probably too aggressive and not what you want.

You'd want something that removes *certain* URL arguments, the ones that are causing the infinite loop.

See the url_query_cleaner() function (from the w3lib.url module) for that. w3lib is already a Scrapy dependency, so you can import that function directly in your Scrapy spider.
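
Something along these lines could work as the process_value callback (untested sketch; keeping only the wiki's 'title' parameter is just an example, adjust the list for your site):

from w3lib.url import url_query_cleaner

def clean_url(url):
    # Keep only the 'title' query argument and drop all the others
    # (the parameter list here is only an example).
    return url_query_cleaner(url, ('title',))

and then, in the rule:

Rule(SgmlLinkExtractor(allow=(r'ece\.utexas\.edu',), process_value=clean_url),
     callback='parse_items', follow=True)

You can also pass remove=True if you'd rather list the arguments to drop instead of the ones to keep.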


Windter Pan

Oct 31, 2012, 4:03:41 PM
to scrapy...@googlegroups.com
Thanks, Pablo! Will look into it.