Trying to read from message queue, not parsing response in make_requests_from_url loop

Jeremy D

Jun 16, 2016, 1:43:31 PM
to scrapy...@googlegroups.com
I have this question on SO, but no answers unfortunately, so I figured I'd try my luck here.


I'm trying to get Scrapy to grab a URL from a message queue and then scrape that URL. I have the loop going just fine and grabbing the URL from the queue, but it never enters the parse() method once it has a URL; it just continues to loop (and sometimes the URL comes back around even though I've deleted it from the queue...).

While it's running in the terminal, if I CTRL+C and force it to end, it enters the parse() method and crawls the page, then ends. I'm not sure what's wrong here. Scrapy needs to be running at all times to catch a URL as it enters the queue. Has anyone got ideas, or done something like this?


import time

import boto.sqs
from scrapy import Spider


class my_Spider(Spider):
    name = "my_spider"
    allowed_domains = ['domain.com']

    def __init__(self):
        super(my_Spider, self).__init__()
        self.url = None

    def start_requests(self):
        while True:
            # Crawl the url from the queue
            yield self.make_requests_from_url(self._pop_queue())

    def _pop_queue(self):
        # Grab the url from the queue
        return self.queue()

    def queue(self):
        url = None
        while url is None:
            conf = {
                "sqs-access-key": "",
                "sqs-secret-key": "",
                "sqs-queue-name": "crawler",
                "sqs-region": "us-east-1",
                "sqs-path": "sqssend"
            }
            # Connect to AWS
            conn = boto.sqs.connect_to_region(
                conf.get('sqs-region'),
                aws_access_key_id=conf.get('sqs-access-key'),
                aws_secret_access_key=conf.get('sqs-secret-key')
            )
            q = conn.get_queue(conf.get('sqs-queue-name'))
            # receive_message() returns a list of Message objects (possibly empty)
            message = conn.receive_message(q)
            # Didn't get a message back, wait.
            if not message:
                time.sleep(10)
                url = None
            else:
                url = message
        if url is not None:
            message = url[0]
            message_body = str(message.get_body())
            message.delete()
            self.url = message_body
            return self.url

    def parse(self, response):
        ...
        yield item

Neverlast N

Jun 16, 2016, 8:24:02 PM
to scrapy...@googlegroups.com
Thanks for bringing this up. I answered on SO. As a methodology, I would say: try to make the simplest working thing possible and then build up towards the more complex code you have, and see at which point it breaks. Is it when you add an API call? Is it when you return something? What I did was to replace your queue() with this, and it seems to work:

    def queue(self):
        # (needs `import random` at the top of the module)
        return 'http://www.example.com/?{}'.format(random.randint(0, 100000))
What can we infer from this?
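
Presumably the inference is this: the stub above returns a URL instantly, whereas the original queue() blocks inside start_requests() on the SQS poll and the time.sleep(10). Because that generator is consumed in the same thread as Scrapy's Twisted reactor, nothing else (downloading, calling parse()) can run while it is blocked. Below is a minimal sketch that reproduces the symptom without any AWS dependency; it is not code from the thread, and the class name, URL and timings are made up.

import random
import time

from scrapy import Request, Spider


class BlockingDemoSpider(Spider):
    name = "blocking_demo"

    def start_requests(self):
        while True:
            # Stand-in for queue(): blocks until a "message" shows up.
            url = self._blocking_poll()
            yield Request(url, callback=self.parse)

    def _blocking_poll(self):
        while True:
            # While this sleeps, the Twisted reactor is frozen, so the request
            # yielded on the previous iteration cannot be downloaded and
            # parse() does not run (until the generator is interrupted, e.g.
            # with CTRL+C, which matches the symptom in the first post).
            time.sleep(10)
            if random.random() < 0.1:  # pretend a message occasionally arrives
                return 'http://www.example.com/?{}'.format(random.randint(0, 100000))

    def parse(self, response):
        self.logger.info("parsed %s", response.url)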




Jeremy D

Jun 17, 2016, 12:53:44 AM
to scrapy...@googlegroups.com
That does seem to work. Using deferToThread I run into the same problem, where it doesn't get into the parse() method until the program closes. I'm open to other ideas for how to organically get a URL for Scrapy to crawl that isn't through a message queue, though this seems like the most sensible option, if I can get it to work.

This is pretty messy, but here's what I have. (I've never used deferToThread, or much threading in general for that matter, so I may be doing this wrong.)

Full pastebin here (exactly what I have, minus AWS creds): http://pastebin.com/4cebXyTc

    def start_requests(self):
        self.logger.error("STARTING QUEUE")
        while True:
            # deferToThread() returns a Deferred immediately; self.queue runs
            # in a thread-pool thread rather than blocking here directly.
            queue = deferToThread(self.queue)
            self.logger.error(self.cpuz_url)
            if self.cpuz_url is None:
                time.sleep(10)
                continue
            yield Request(self.cpuz_url, self.parse)

I've then changed my queue() function to have a try/except after it gets the message back from the queue:

        try:
            message = message[0]
            message_body = message.get_body()
            self.logger.error(message_body)
            message_body = str(message_body).split(',')
            message.delete()
            self.cpuz_url = message_body[0]
            self.uid = message_body[1]
        except:
            self.logger.error(message)
            self.logger.error(self.cpuz_url)
            self.cpuz_url = None
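
(A note for later readers, not part of the original post: deferToThread() returns a Deferred immediately and runs self.queue in a thread-pool thread; its result is normally consumed through a callback that the reactor fires later. With the while True / time.sleep(10) loop above still blocking the reactor, Scrapy can't make progress even once a URL arrives, so the symptom stays the same. A minimal standalone sketch of the usual callback idiom, with made-up names and URL:)

from twisted.internet import reactor
from twisted.internet.threads import deferToThread


def poll_queue():
    # Pretend this is the blocking SQS poll; it runs in a thread-pool thread.
    return "http://www.example.com/"


def on_url(url):
    print("got url from thread: %s" % url)
    reactor.stop()


d = deferToThread(poll_queue)
d.addCallback(on_url)  # fires only once the reactor gets a chance to run
reactor.run()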


Dimitris Kouzis - Loukas

Jun 18, 2016, 4:21:15 PM
to scrapy-users
Updated the SO answer with a functional example. Cheers.
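
For readers finding this later: the SO answer itself isn't quoted in the thread. For completeness, here is a sketch of one commonly used pattern for feeding Scrapy from an external queue. It is not necessarily what the linked answer does, and it assumes 2016-era Scrapy and boto APIs: avoid blocking start_requests() altogether, poll SQS briefly from the spider_idle signal, schedule any resulting request on the engine, and raise DontCloseSpider so the spider stays alive while the queue is empty. The class name and the empty credentials are placeholders.

import boto.sqs
from scrapy import Request, Spider, signals
from scrapy.exceptions import DontCloseSpider


class SqsFedSpider(Spider):
    name = "sqs_fed_spider"
    allowed_domains = ['domain.com']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(SqsFedSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    def __init__(self, *args, **kwargs):
        super(SqsFedSpider, self).__init__(*args, **kwargs)
        conn = boto.sqs.connect_to_region(
            'us-east-1',
            aws_access_key_id='',       # credentials elided
            aws_secret_access_key='',
        )
        self.sqs_queue = conn.get_queue('crawler')

    def spider_idle(self):
        # Runs in the reactor thread whenever the spider has nothing to do,
        # so keep the poll short (no long blocking sleep here).
        for message in self.sqs_queue.get_messages(num_messages=1):
            url = str(message.get_body())
            message.delete()
            self.crawler.engine.crawl(Request(url, callback=self.parse), self)
        # Keep the spider alive even when the queue is momentarily empty;
        # spider_idle will fire again on the next idle heartbeat.
        raise DontCloseSpider

    def parse(self, response):
        self.logger.info("parsed %s", response.url)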