Scrapy XPath returns empty list


Scrapy_lover

Jun 27, 2012, 6:26:42 PM
to scrapy...@googlegroups.com

I'm using a Scrapy crawl spider and trying to parse the fetched pages to select every input tag that meets the following criteria:

  • input type: must be text, password, or file
  • input id: if it is not found, select the input name instead
Me@Me-pc:~$ scrapy shell http://testaspnet.vulnweb.com/default.aspx
2012-06-28 00:18:40+0200 [scrapy] INFO: Scrapy 0.14.4 started (bot: scrapybot)
2012-06-28 00:18:40+0200 [scrapy] DEBUG: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-06-28 00:18:40+0200 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-06-28 00:18:40+0200 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-06-28 00:18:40+0200 [scrapy] DEBUG: Enabled item pipelines:
2012-06-28 00:18:40+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-06-28 00:18:40+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-06-28 00:18:40+0200 [default] INFO: Spider opened
2012-06-28 00:18:41+0200 [default] DEBUG: Crawled (200) <GET http://testaspnet.vulnweb.com/default.aspx> (referer: None)
[s] Available Scrapy objects:
[s]   hxs        <HtmlXPathSelector xpath=None data=u'<html><head><title>acublog news</title><'>
[s]   item       {}
[s]   request    <GET http://testaspnet.vulnweb.com/default.aspx>
[s]   response   <200 http://testaspnet.vulnweb.com/default.aspx>
[s]   settings   <CrawlerSettings module=None>
[s]   spider     <BaseSpider 'default' at 0xa8eb3ec>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

>>> hxs.select("//input[(@id or @name) and (@type = 'text' or @type = 'password' or @type = 'file')]/text() ").extract()
[]




Scrapy_lover

Jun 27, 2012, 6:30:30 PM
to scrapy...@googlegroups.com
I searched a lot before posting, but I still get an empty list.
Any help, please?

Steven Almeroth

Jun 28, 2012, 12:00:50 AM
to scrapy...@googlegroups.com
The current response for that resource only contains two input elements, both with type="hidden", so the empty result list is correct.

2012-06-27 20:28:11-0500 [scrapy] INFO: Scrapy 0.15.1 started (bot: scrapybot)

>>> hxs.select("//input")
[<HtmlXPathSelector ... data=u'<input type="hidden" name="__VIEWSTATE" '>, 
<HtmlXPathSelector ... data=u'<input type="hidden" name="__EVENTVALIDA'>]

As for the id/name fallback, you can do that outside of XPath, in the Python code, with something like:

index = id or name
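
To make the `id or name` idea concrete, here is a minimal sketch of the whole filter done in plain Python. It is not the Scrapy selector API: it uses the standard-library `html.parser` so it runs without Scrapy installed, and the names `InputCollector`, `input_keys`, `WANTED_TYPES`, and the `SAMPLE` markup are all made up for illustration (the sample is not the real page from the question).

```python
from html.parser import HTMLParser

# Input types the original question wants to keep.
WANTED_TYPES = {"text", "password", "file"}


class InputCollector(HTMLParser):
    """Collect an identifier for each <input> whose type is wanted."""

    def __init__(self):
        super().__init__()
        self.keys = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        if (attributes.get("type") or "").lower() not in WANTED_TYPES:
            return
        # Prefer the id; fall back to the name when the id is missing or empty.
        key = attributes.get("id") or attributes.get("name")
        if key:
            self.keys.append(key)


def input_keys(html):
    """Return the id (or, failing that, the name) of each wanted input."""
    parser = InputCollector()
    parser.feed(html)
    parser.close()
    return parser.keys


# Hypothetical markup, only to exercise the function.
SAMPLE = """
<form>
  <input type="hidden" name="__VIEWSTATE" value="x">
  <input type="text" id="user" name="username">
  <input type="password" name="pass">
  <input type="file">
</form>
"""

print(input_keys(SAMPLE))  # ['user', 'pass']
```

The hidden input is dropped by the type filter, and the file input is dropped because it has neither an id nor a name. Note also that an input element has no text children, so an XPath ending in `/text()` yields nothing even for inputs that do match; the values have to be read from the attributes, as above.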

Scrapy_lover

Jun 28, 2012, 2:21:54 PM
to scrapy...@googlegroups.com
Thanks a lot, you are the man :)

Nikhil Somaru

Jul 2, 2012, 1:21:10 AM
to scrapy...@googlegroups.com
Can I suggest this XPath checker for Firefox? Beware: it generates XPath 2.0 expressions, which are not compatible with Scrapy (XPath 1.0), but you can modify them appropriately and see the results on the fly.

Also, beware of things like JavaScript: content it generates in the browser is not in the response Scrapy downloads. I always test my XPath expressions against a page generated from view(response) in the shell.

Firebug + FirePath also serve me well.

Nikhil Somaru

Jul 2, 2012, 1:22:10 AM
to scrapy...@googlegroups.com