Communication with other software components & post-processing results


Nick Gilmour

Mar 6, 2018, 2:23:37 PM
to pyspider-users
Hi all,

1. I would like to control pyspider (start, stop, set the start URL, modify crawl_config options) from the outside, e.g. from a Jupyter notebook. Similar questions have been asked here:


https://groups.google.com/forum/#!searchin/pyspider-users/program|sort:date/pyspider-users/wLoSfhFpu7k/ibIquf_0KQAJ

According to one answer there is an XML-RPC interface and a send_message method which can be used for this purpose. In the examples I have found, this is done either via the CLI or from another project. But how can it be done, e.g. from a Jupyter notebook? When I do this:

from xmlrpclib import ServerProxy   # xmlrpclib is the Python 2 name; on Python 3 it is xmlrpc.client
client = ServerProxy("http://user:pwd@localhost:5000/")
client.send_message("project", "url")

I get this error:
ProtocolError: <ProtocolError for user:pwd@localhost:5000/: 405 METHOD NOT ALLOWED>

2. I would like to process the results after pyspider has finished and the results are in a DB. How can I trigger such a process?

3. Could someone give some guidance on how to modify pyspider to accomplish these things in case they are not implemented?


Regards,
Nick

Nick Gilmour

Mar 18, 2018, 10:32:19 AM
to pyspider-users, Roy Binux
1. I would like to control pyspider (start, stop, set the start URL, modify crawl_config options) from the outside, e.g. from a Jupyter notebook.
I have managed to connect to the scheduler over port 23333 (port 5000 was a silly mistake...) from a Jupyter notebook, and I can see the supported methods, but I can't really do much.
With scheduler.get_active_tasks() all I get is a dict containing a project name and a boolean.
So how does it work? What can be done?
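For reference, a minimal sketch of that connection from a notebook cell, assuming the default scheduler RPC address 127.0.0.1:23333 and Python 3 (where the module is xmlrpc.client; on Python 2 it is xmlrpclib):

from xmlrpc.client import ServerProxy  # Python 3; on Python 2 use xmlrpclib

scheduler = ServerProxy("http://localhost:23333/")
print(scheduler.system.listMethods())   # introspection: the method names listed further down in this thread
print(scheduler.get_active_tasks())     # a summary of recently active tasks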

Besides that, I have seen that the fetcher should be accessible via XML-RPC on port 24444, but in the console I can only see the scheduler:
"scheduler.xmlrpc listening on 127.0.0.1:23333"
and when I try to connect to the fetcher I get:
"ConnectionRefusedError: [Errno 111] Connection refused"
Why is that?

2. I would like to process the results after pyspider has finished and the results are in a DB. How can I trigger such a process?
As far as I have seen, it can be done on the DB side with a trigger mechanism.
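One alternative to a DB trigger would be to poll the scheduler over the same XML-RPC connection and start post-processing once its queue drains. This is only a sketch under the assumption that size() reports the number of pending tasks (it appears in the method list shown later in this thread) and that an empty queue roughly means the crawl is finished:

import time
from xmlrpc.client import ServerProxy

scheduler = ServerProxy("http://localhost:23333/")

def postprocess_results():
    # Placeholder: read the resultdb (MySQL/MongoDB/SQLite, whatever is configured)
    # and run your own processing here.
    print("queue is empty, starting post-processing")

def wait_until_idle(poll_seconds=30):
    # Assumption: size() is the scheduler's pending-task count; 0 is treated as "crawl finished".
    while scheduler.size() > 0:
        time.sleep(poll_seconds)

wait_until_idle()
postprocess_results()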

3. Could someone give some guidance on how to modify pyspider to accomplish these things in case they are not implemented?
Is it possible to extend the XML-RPC methods to achieve this? Are there other alternatives? Any suggestions?

Regards,
Nick

Roy Binux

Mar 18, 2018, 5:05:56 PM
to Nick Gilmour, pyspider-users
1. Yes, you can use XML-RPC to communicate with the scheduler; `send_message` is implemented via the scheduler's XML-RPC interface: https://github.com/binux/pyspider/blob/87337e7ce8a19677109a95b202ce6c77ba448af1/pyspider/run.py#L728

The XML-RPC interface of the fetcher is not enabled by default.


Nick Gilmour

Mar 18, 2018, 6:28:52 PM
to Roy Binux, pyspider-users
@Roy Thanks for the quick response!

1. Yes, you can use XML-RPC to communicate with the scheduler; `send_message` is implemented via the scheduler's XML-RPC interface.
3. You could extend the XML-RPC methods.

Sorry, I can't follow.
These are the methods that are supported by the server:

0 : _quit
1 : counter
2 : get_active_tasks
3 : get_projects_pause_status
4 : newtask
5 : send_task
6 : size
7 : system.listMethods
8 : system.methodHelp
9 : system.methodSignature
10 : update_project
11 : webui_update

How could I use `send_message`?

Could you give me some directions on how I should proceed in order to start or stop a pyspider project externally?

Regards,
Nick

Roy Binux

Mar 20, 2018, 1:31:27 PM
to Nick Gilmour, pyspider-users
Read the source code of send_message here: https://github.com/binux/pyspider/blob/87337e7ce8a19677109a95b202ce6c77ba448af1/pyspider/run.py#L728
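For readers following along: the linked command essentially wraps the message in a task whose callback is _on_message and hands it to the scheduler's send_task RPC method, so roughly the following should work from a notebook (a sketch paraphrased from that source; double-check it against your pyspider version, and note that "myproject" is a placeholder name):

from xmlrpc.client import ServerProxy
from pyspider.libs import utils

scheduler = ServerProxy("http://localhost:23333/")

def send_message(project, message):
    # Mirrors what `pyspider send_message` does: a fake task pointing at
    # data:,on_message whose processing triggers the project's on_message handler.
    return scheduler.send_task({
        "taskid": utils.md5string("data:,on_message"),
        "project": project,
        "url": "data:,on_message",
        "fetch": {"save": ("__command__", message)},
        "process": {"callback": "_on_message"},
    })

send_message("myproject", "hello from jupyter")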

To start or stop projects, just update the projectdb. You can inform the scheduler via the update_project method.
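A sketch of that flow, under two assumptions not spelled out in this thread: that flipping the project's status field in the projectdb (the webui uses values such as RUNNING and STOP) is what starts or stops it, and that pyspider.database.connect_database is pointed at the same projectdb the running scheduler uses (the connection URL below is a placeholder):

from xmlrpc.client import ServerProxy
from pyspider.database import connect_database

projectdb = connect_database("sqlite+projectdb:///data/project.db")  # placeholder URL
scheduler = ServerProxy("http://localhost:23333/")

def set_project_status(project, status):
    # Change the status in projectdb, then tell the scheduler to re-read its projects.
    projectdb.update(project, {"status": status})
    scheduler.update_project()

set_project_status("myproject", "RUNNING")   # start
# set_project_status("myproject", "STOP")    # stop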

Nick Gilmour

Mar 20, 2018, 1:40:52 PM
to Roy Binux, pyspider-users
Great, thanks!