Shared Job Queue with PostgreSQL


k bez

Dec 25, 2016, 6:49:50 PM
to scrapy-users
I have started to implement a custom job queue so I can move from the default SQLite to PostgreSQL.
I use the setting SPIDER_QUEUE_CLASS = 'mysite.scraper.PostgreSQLQueue', but scrapyd seems to ignore it and creates/uses the default SQLite DB.
Does SPIDER_QUEUE_CLASS no longer work?
Thanks in advance.

Nikolaos-Digenis Karagiannis

Dec 26, 2016, 3:45:25 AM
to scrapy-users
Hi,

You probably looked into some old documentation.
This setting was introduced before scrapy separated from scrapyd.
Job queues are now implemented in scrapyd.
A pickaxe search confirms this:
https://github.com/scrapy/scrapy/commit/75e2c3eb338ea03e487907fa8c99bb12317e9435
This was a point where many release notes are missing,
notes clarifying what was removed from scrapy,
because the "separation of scrapyd" alone doesn't say much.

Do you use scrapyd?
Unfortunately, the job queue class is no longer configurable.
It shouldn't be hard, however, to patch scrapyd,
either to make it configurable or to run your own fork.

Check out scrapyd's repository https://github.com/scrapy/scrapyd/
and if you come up with something
don't hesitate to open a PR or an issue with suggestions.
We'll be glad to help;
scrapyd does need its components to become less tightly coupled,
and making the job queue configurable would contribute to this.

k bez

Dec 27, 2016, 10:59:12 AM
to scrapy-users
OK, I patched the sqlite.py module and the SqlitePriorityQueue class, and it seems to work fine with PostgreSQL.
But I have some questions, if someone can answer them.
1. SQLite uses the parameter "check_same_thread=False" for the connection. I think the PostgreSQL connection is thread-safe, so I didn't use anything like this.
2. What is the use of the remove and clear methods of the SqlitePriorityQueue class? The queue worked fine before I patched them, and I couldn't find any call to them.
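
For illustration, a minimal sketch of what such a connection swap might look like, assuming psycopg2; the DSN and the spider_queue table are placeholders, not the actual patch:

    # Sketch only: psycopg2 in place of sqlite3; names are made up.
    import psycopg2

    # sqlite3 needs check_same_thread=False so scrapyd's threads can share
    # one connection; psycopg2 connections are thread-safe (threadsafety
    # level 2: connections may be shared, cursors may not), so there is
    # no equivalent flag to pass.
    conn = psycopg2.connect("dbname=scrapyd user=scrapyd host=localhost")

    with conn.cursor() as cur:
        cur.execute(
            "CREATE TABLE IF NOT EXISTS spider_queue "
            "(id serial PRIMARY KEY, priority real, message text)"
        )
    conn.commit()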

Nikolaos-Digenis Karagiannis

Dec 27, 2016, 4:00:50 PM
to scrapy...@googlegroups.com
1. Yes, just keep the connection in autocommit.
Also, be aware that you will have to ensure your pg server's availability.
2. Leftovers. They shouldn't matter.
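
In psycopg2 that is a single attribute (shown against the sketch connection from the previous message):

    # Every execute() then runs in its own implicit transaction and is
    # committed immediately, so the queue code needs no explicit
    # commit()/rollback() handling of its own.
    conn.autocommit = True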


k bez

Dec 27, 2016, 6:14:43 PM
to scrapy-users
I didn't use autocommit, but it seems to work fine without it on the development server.
There are some rollbacks in the scrapyd code too, so I don't know whether autocommit is a good idea.
Why is autocommit important?

Thanks in advance.

k bez

Dec 30, 2016, 2:23:59 PM
to scrapy-users
I patched two production servers with the changes.
I use one server/scrapyd instance to populate the shared queue with new records and consume them, and the second one only consumes records from the shared queue.
So I have two scrapyd instances on different servers consuming records from the same shared queue.
This scenario worked well for some hours, but now I have run into this bug:

https://github.com/scrapy/scrapyd/issues/40

My error log:
    Traceback (most recent call last):
      File "/root/mysite/local/lib/python2.7/site-packages/twisted/internet/base.py", line 825, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/root/mysite/local/lib/python2.7/site-packages/twisted/internet/task.py", line 239, in __call__
        d = defer.maybeDeferred(self.f, *self.a, **self.kw)
      File "/root/mysite/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 149, in maybeDeferred
        result = f(*args, **kw)
      File "/root/mysite/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1331, in unwindGenerator
        return _inlineCallbacks(None, gen, Deferred())
    --- <exception caught here> ---
      File "/root/mysite/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1185, in _inlineCallbacks
        result = g.send(result)
      File "/root/mysite/local/lib/python2.7/site-packages/scrapyd/poller.py", line 24, in poll
        returnValue(self.dq.put(self._message(msg, p)))
      File "/root/mysite/local/lib/python2.7/site-packages/scrapyd/poller.py", line 33, in _message
        d = queue_msg.copy()
    exceptions.AttributeError: 'NoneType' object has no attribute 'copy'

Instead of two spiders like in the original bug, I guess I hit the same bug because of two scrapyd instances sharing the same spider queue.
Is there something I can do to avoid it?

Nikolaos-Digenis Karagiannis

Jan 3, 2017, 5:25:21 AM
to scrapy-users
Hi,
I thought I replied to your question about the autocommit,
but it seems I didn't hit "post".
Since in PostgreSQL you can do "DELETE .... RETURNING ...",
you can merge the "select" and "delete" queries into a single query,
without the need to execute any Python code in the middle of the transaction.
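
A sketch of such a pop(), reusing the hypothetical spider_queue table from earlier (DELETE cannot take ORDER BY/LIMIT directly, so the row is picked in a subquery):

    import json

    def pop(conn):
        # Delete the highest-priority row and return its payload in one
        # statement, so no Python code runs between "select" and "delete".
        with conn.cursor() as cur:
            cur.execute(
                "DELETE FROM spider_queue WHERE id = ("
                "  SELECT id FROM spider_queue"
                "  ORDER BY priority DESC, id ASC"
                "  LIMIT 1"
                ") RETURNING message"
            )
            row = cur.fetchone()
        return json.loads(row[0]) if row else None

With two scrapyd instances polling at once, one of the two deletes can still come back empty, so the caller still has to tolerate a None result; that ties into the second issue below.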

But I looked into the matter in haste,
and now I see more problems with the builtin queue.
The selected row can't disappear while the transaction is not yet committed;
that would violate the isolation principle.
The code that retries if the row went missing
is written as if it's meant for autocommit.
What should happen in case of concurrent access
is the sqlite database getting locked, possibly leading to a deadlock.

Regarding the second issue:
queue.pop can return `None`
if all rows vanish between the queue.count call
and the queue.pop call.
This is probably your case.
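
A sketch of the kind of guard that addresses this, following the code path in the traceback above (not necessarily the exact patch that lands for #40):

    from twisted.internet.defer import inlineCallbacks, maybeDeferred, returnValue
    from scrapyd.poller import QueuePoller

    class TolerantQueuePoller(QueuePoller):
        @inlineCallbacks
        def poll(self):
            if self.dq.pending:
                # a job is already waiting to be handed out
                return
            for project, queue in self.queues.items():
                count = yield maybeDeferred(queue.count)
                if count:
                    msg = yield maybeDeferred(queue.pop)
                    # Another scrapyd instance may have taken the job between
                    # count() and pop(), so pop() can legitimately return None.
                    if msg is not None:
                        returnValue(self.dq.put(self._message(msg, project)))

If the poller isn't exposed as a setting in your scrapyd version, this too has to live in a patch or a fork, like the queue itself.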

Let's continue on the issue tracker,
tickets #40 and #197,
because we are talking about bugs here.

k bez

Jan 12, 2017, 5:25:06 AM
to scrapy-users
After fixing bug #40, multiple scrapyd instances with one spider work fine.
I am trying to implement a new feature now.
I have a shared pending queue like this:
1 spider1 job1
2 spider2 job1
3 spider1 job2
4 spider1 job3
5 spider2 job2
Is it possible to tell scrapyd to process only specific spider(s) and ignore the rest of the pending records (spiders)?
I want scrapyd/server1 to process spider1, scrapyd/server2 to process spider2, etc.

Nikolaos-Digenis Karagiannis

Jan 13, 2017, 11:12:43 AM
to scrapy-users
This logic is part of your application,
so you should write your spider queue in such a way
(e.g. make it read some config that defines filters for each server).
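
A sketch of one way to do that, reusing the hypothetical spider_queue table from earlier; allowed_spiders is a made-up per-server config value, not a real scrapyd setting, and the spider name is assumed to live under the "name" key of the JSON message, which is where scrapyd's builtin queue puts it:

    import json

    # Pop only jobs whose spider is in this server's whitelist.
    def pop(conn, allowed_spiders):
        with conn.cursor() as cur:
            cur.execute(
                "DELETE FROM spider_queue WHERE id = ("
                "  SELECT id FROM spider_queue"
                "  WHERE message::json->>'name' = ANY(%s)"
                "  ORDER BY priority DESC, id ASC"
                "  LIMIT 1"
                ") RETURNING message",
                (list(allowed_spiders),),
            )
            row = cur.fetchone()
        return json.loads(row[0]) if row else None

server1 would then call it with ['spider1'] and server2 with ['spider2'], so each scrapyd instance only ever deletes rows for its own spiders.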