Parse several scrapy's select()


jaja...@gmail.com

Feb 6, 2016, 2:40:59 PM
to scrapy-users
I seem to be having an issue with this spider. I am new to Python and to Scrapy, so I might be missing some fundamentals; any direction here would really help.
Below is the code so far. I'm sure it has quite a few errors; I keep trying to fix them, but I'm not getting very far. I am trying to get the spider to go to a page that has a table with the date and the article title, with the link embedded in the title, as you can see from the pic. Then, once it has the info from one row, it should move on to the next.

I figured that the best way to select the right sections was to use Scrapy's select() to dig deeper into each node, since the date is in its own HTML class and the URL and title are in another:


So, I used times = hxs.select('//td[@class="stime3"]') to get the date and sites = hxs.select('//td[@class="article"]') to get the title and URL.

from scrapy.spider import BaseSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.contrib.linkextractors import LinkExtractor
from dirbot.items import WebsiteLoader
from scrapy.http import Request
from scrapy.http import HtmlResponse


class DindexSpider(BaseSpider):
    name = "dindex"
    USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36"
    allowed_domains = ["newslookup.com"]
    start_urls = ["http://www.newslookup.com/Business/"]

    rules = (
        Rule(LinkExtractor(allow="newslookup.com/Business/"), callback="parse", follow=True),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        self.log("Scraping: " + response.url)

        times = hxs.select('//td[@class="stime3"]')
        for time in times:
            il = WebsiteLoader(response=response, selector=time)
            il.add_xpath('publish_date', 'text()')
            item = il.load_item()
            yield Request(url=time, callback=self.parse_article)

    def parse_article(self, response):
        hxs = HtmlXPathSelector(response)
        self.log("scraping: " + response.url)

        sites = hxs.select('//td[@class="article"]')
        for site in sites:
            il = WebsiteLoader(response=response, selector=site)
            il.add_xpath('name', 'a/text()')
            il.add_xpath('url', 'a/@href')
            item = il.load_item()
            yield Request(url=times, callback=self.parse_item)

    def parse_item(self, response):
        item = response.meta['item']
        yield il.load_item()

Now, I may have the logic completely wrong and I hope that someone can lead me in the right direction... 

One of the errors I get when I run it is:

2016-02-06 12:21:22 [scrapy] ERROR: Spider error processing <GET http://www.newslookup.com/Business/> (referer: None)
Traceback (most recent call last):
 File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\utils\defer.py", line 102, in iter_errback
   yield next(it)
 File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\spidermiddlewares\offsite.py", line 28, in process_spider_output
   for x in result:
 File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
   return (_set_referer(r) for r in result or ())
 File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
   return (r for r in result or () if _filter(r))
 File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\spidermiddlewares\depth.py", line 54, in <genexpr>
   return (r for r in result or () if _filter(r))
 File "C:\virtualenvs\[TextIndexer]\Scripts\example\dindex\dirbot-mysql\dirbot\spiders\dindex.py", line 32, in parse
   yield Request(url=time, callback=self.parse_article)
 File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\http\request\__init__.py", line 24, in __init__
   self._set_url(url)
 File "C:\Python27\lib\site-packages\scrapy-1.0.4-py2.7.egg\scrapy\http\request\__init__.py", line 57, in _set_url
   raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
TypeError: Request url must be str or unicode, got HtmlXPathSelector:
2016-02-06 12:21:22 [scrapy] INFO: Closing spider (finished)

I am really not sure what's wrong, other than that it's coming from line 32, where it yields the request.

Any help, direction would be greatly appreciated.

Thanks

Steven Almeroth

Mar 4, 2016, 11:48:49 PM
to scrapy-users
TypeError: Request url must be str or unicode, got HtmlXPathSelector:

Try:

  yield Request(url=time.extract()[0], callback=self.parse_article)