Infinite Loop Crawling (Query Strings?)

Windter

Oct 28, 2012, 9:09:34 PM
to scrapy...@googlegroups.com
I'm sure this is a common issue. My crawler never stops, and I believe it is because of the randomly generated query strings that are appended when a link is clicked. For example:


I found from [http://stackoverflow.com/questions/8567171/scrapy-query-string-removal] that you can use urlparse to clean up the URL, but my code doesn't work. Currently I have:


from urlparse import urlparse

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


def clean_url(url):
    # Rebuild the URL from scheme, host and path only,
    # dropping the query string and fragment.
    o = urlparse(url)
    cleaned_url = o.scheme + "://" + o.netloc + o.path
    return cleaned_url


class fmpSpider(CrawlSpider):
    name = "fmp"
    allowed_domains = ["ece.utexas.edu"]
    start_urls = [
        "http://www.ece.utexas.edu",
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=(r'ece\.utexas\.edu',), process_value=clean_url),
             callback='parse_items', follow=True),
    )
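
For reference, calling the function by hand on an absolute URL does what I expect (the URL below is just an illustration, not necessarily one of my real crawled pages):

>>> clean_url("http://pharos.ece.utexas.edu/wiki/index.php?from=20121028231450&hideliu=1")
'http://pharos.ece.utexas.edu/wiki/index.php'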

Whatever is being returned by my function clean_url is not correct. Help is much appreciated!

- Windter

Windter

Oct 29, 2012, 1:59:35 AM
to scrapy...@googlegroups.com
I felt I needed to be more specific and clear about my issue:

When I am not using the clean_url function, this is what happens:

My crawler starts at http://www.ece.utexas.edu and is allowed to follow any links on the domain ece.utexas.edu. When it reaches this page [http://www.pharos.ece.utexas.edu/wiki/index.php?title=Special:UserLogin&returnto=Main+Page] it never leaves, and so it goes into an infinite loop. All the response URLs look something like:

"http://pharos.ece.utexas.edu/wiki/index.php?from=20121028231450&hideliu=1&hidemyself=1&target=Pharos_Tutorials&.........................................."


Now, after I add the clean_url function (from my original post), my crawler finishes prematurely. Here are some of the first debug lines in the shell:

2012-10-29 00:45:57-0500 [fmp] DEBUG: Crawled (200) <GET http://www.ece.utexas.edu> (referer: None)
2012-10-29 00:45:57-0500 [fmp] DEBUG: Crawled (404) <GET http://www.ece.utexas.edu/:///sitemap/> (referer: http://www.ece.utexas.edu)
2012-10-29 00:45:57-0500 [fmp] DEBUG: Crawled (404) <GET http://www.ece.utexas.edu/:///it/cadence.cfm> (referer: http://www.ece.utexas.edu)
2012-10-29 00:45:57-0500 [fmp] DEBUG: Crawled (404) <GET http://www.ece.utexas.edu/:///aboutece/news_detail.cfm> (referer: http://www.ece.utexas.edu)
2012-10-29 00:45:57-0500 [fmp] DEBUG: Crawled (404) <GET http://www.ece.utexas.edu/:///index.cfm> (referer: http://www.ece.utexas.edu)
...
...

The URLs being crawled are obviously wrong. I can paste in my parse method if need be.


Thanks,
Windter

Pablo Hoffman

Oct 30, 2012, 1:42:11 AM
to scrapy...@googlegroups.com
That clean_url function removes all URL arguments, which is probably too aggressive and not what you want.

You'd want something that removes *certain* URL arguments, the ones that are causing the infinite loop.

See the url_query_cleaner() function (from the w3lib.url module) for that. w3lib is already a Scrapy dependency, so you can import that function directly in your Scrapy spider.
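
Something along these lines could work as the process_value callback (untested sketch; keeping only the wiki's 'title' parameter is just an example, adjust the list for your site):

from w3lib.url import url_query_cleaner

def clean_url(url):
    # Keep only the 'title' query argument and drop all the others
    # (the parameter list here is only an example).
    return url_query_cleaner(url, ('title',))

and then, in the rule:

Rule(SgmlLinkExtractor(allow=(r'ece\.utexas\.edu',), process_value=clean_url),
     callback='parse_items', follow=True)

You can also pass remove=True if you'd rather list the arguments to drop instead of the ones to keep.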


Windter Pan

Oct 31, 2012, 4:03:41 PM
to scrapy...@googlegroups.com
Thanks, Pablo! Will look into it.