Crawling multiple links with Scrapy


shehrumbk

Nov 4, 2016, 14:37:19
to scrapy-users
Hey guys, I'm new to Scrapy and trying to implement a broad crawl. My goal is to visit all internal links of any given website, avoid duplicates, and save the body text.
E.g. say there is a website example.com. I want to visit all static URLs of example.com, then do the same for another domain, and so on. The same rule should apply to every link I retrieve from my database: don't traverse links that have already been visited, and end the crawl of a website once there are no more static links left to visit. Can anyone guide me on how to achieve this?

What I tried: 

import MySQLdb
import scrapy
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor


class PHscrapy(scrapy.Spider):
    name = "PHscrapy"

    def start_requests(self):
        # Seed the crawl with the websites stored in the database
        db = MySQLdb.connect("localhost", "****", "****", "***")
        cursor = db.cursor()
        cursor.execute("SELECT website FROM SHOPPING")
        links = cursor.fetchall()
        for url in links:
            yield scrapy.Request(url=url[0], meta={'base_url': url[0]}, callback=self.parse)

    def parse(self, response):
        base_url = response.meta['base_url']
        # Save the body text of the page
        yield {'url': response.url, 'body': response.text}
        # Follow only links under the same base URL; Scrapy's built-in
        # dupefilter skips requests for URLs that were already visited
        for link in LxmlLinkExtractor(allow=(base_url + '/*',), unique=True,
                                      canonicalize=True).extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse, meta=response.meta)
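
For comparison, here is a minimal CrawlSpider sketch of the same idea, assuming the start websites are known up front (the example.com / example.org domains below are placeholders). With CrawlSpider the rules attribute is actually honoured, the offsite middleware keeps the crawl inside allowed_domains, and the built-in dupefilter drops URLs that have already been requested:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BodyTextSpider(CrawlSpider):
    name = "bodytext"
    # Placeholder seeds; in practice these could come from the database
    start_urls = ["http://example.com", "http://example.org"]
    allowed_domains = ["example.com", "example.org"]

    # Follow every internal link; the offsite middleware filters external
    # domains and the dupefilter drops already-visited URLs.
    # Note: a CrawlSpider rule callback must not be named "parse".
    rules = (
        Rule(LinkExtractor(unique=True, canonicalize=True),
             callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        # Save the page body text together with its URL
        yield {"url": response.url, "body": response.text}

The main difference from the scrapy.Spider version above is that CrawlSpider drives the link following through its rules, so parse() must not be overridden.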