Communication with other software components & post-processing results


Nick Gilmour

Mar 6, 2018, 2:23:37 PM
to pyspider-users
Hi all,

1. I would like to control pyspider (start, stop, set the start URL, modify crawl_config options) from the outside, e.g. from a Jupyter notebook. Similar questions have been asked here:


https://groups.google.com/forum/#!searchin/pyspider-users/program|sort:date/pyspider-users/wLoSfhFpu7k/ibIquf_0KQAJ

According to one answer there is an XML-RPC interface and a send_message method which can be used for this purpose. In the examples I have found, this is done either via the CLI or from another project. But how can it be done, e.g. from a Jupyter notebook? When I do this:

from xmlrpclib import ServerProxy   # xmlrpclib is the Python 2 name; on Python 3 it is xmlrpc.client
client = ServerProxy("http://user:pwd@localhost:5000/")
client.send_message("project", "url")

I get this error:
ProtocolError: <ProtocolError for user:pwd@localhost:5000/: 405 METHOD NOT ALLOWED>

2. I would like to process the results after pyspider has finished and the results are in a DB. How can I trigger such a process?

3. Could someone give some guidance on how to modify pyspider to accomplish these things in case they are not implemented?


Regards,
Nick

Nick Gilmour

Mar 18, 2018, 10:32:19 AM
to pyspider-users, Roy Binux
1. I would like to control pyspider (start, stop, set the start URL, modify crawl_config options) from the outside, e.g. from a Jupyter notebook.
I have managed to connect to the scheduler over port 23333 (port 5000 was a silly mistake...) from a Jupyter notebook, and I can see the supported methods, but I can't really do much.
With scheduler.get_active_tasks() all I get is a dict containing a project name and a boolean.
So how does it work? What can be done?
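For reference, a minimal sketch of that connection from a notebook cell, assuming the default scheduler RPC address 127.0.0.1:23333 and Python 3 (where the module is xmlrpc.client; on Python 2 it is xmlrpclib):

from xmlrpc.client import ServerProxy  # Python 3; on Python 2 use xmlrpclib

scheduler = ServerProxy("http://localhost:23333/")
print(scheduler.system.listMethods())   # introspection: the method names listed further down in this thread
print(scheduler.get_active_tasks())     # a summary of recently active tasks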

Besides that, I have seen that the fetcher should be accessible via XML-RPC on port 24444, but in the console I can only see the scheduler:
"scheduler.xmlrpc listening on 127.0.0.1:23333"
and when I try to connect to the fetcher I get:
"ConnectionRefusedError: [Errno 111] Connection refused"
Why is that?

2. I would like to process the results after pyspider has finished and the results are in a DB. How can I trigger such a process?
As far as I have seen, it can be done on the DB side with a trigger mechanism.
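One alternative to a DB trigger would be to poll the scheduler over the same XML-RPC connection and start post-processing once its queue drains. This is only a sketch under the assumption that size() reports the number of pending tasks (it appears in the method list shown later in this thread) and that an empty queue roughly means the crawl is finished:

import time
from xmlrpc.client import ServerProxy

scheduler = ServerProxy("http://localhost:23333/")

def postprocess_results():
    # Placeholder: read the resultdb (MySQL/MongoDB/SQLite, whatever is configured)
    # and run your own processing here.
    print("queue is empty, starting post-processing")

def wait_until_idle(poll_seconds=30):
    # Assumption: size() is the scheduler's pending-task count; 0 is treated as "crawl finished".
    while scheduler.size() > 0:
        time.sleep(poll_seconds)

wait_until_idle()
postprocess_results()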

3. Could someone give some guidance on how to modify pyspider to accomplish these things in case they are not implemented?
Is it possible to extend the XML-RPC methods to achieve this? Are there other alternatives? Any suggestions?

Regards,
Nick

Roy Binux

Mar 18, 2018, 5:05:56 PM
to Nick Gilmour, pyspider-users
1. Yes, you can use XML-RPC to communicate with the scheduler; `send_message` is implemented via the scheduler's XML-RPC interface: https://github.com/binux/pyspider/blob/87337e7ce8a19677109a95b202ce6c77ba448af1/pyspider/run.py#L728

The XML-RPC interface of the fetcher is not enabled by default.


Nick Gilmour

Mar 18, 2018, 6:28:52 PM
to Roy Binux, pyspider-users
@Roy Thanks for the quick response!

1. Yes, you can use XML-RPC to communicate with the scheduler; `send_message` is implemented via the scheduler's XML-RPC interface.
3. You could extend the XML-RPC methods.

Sorry, I can't follow.
These are the methods that are supported by the server:

0 : _quit
1 : counter
2 : get_active_tasks
3 : get_projects_pause_status
4 : newtask
5 : send_task
6 : size
7 : system.listMethods
8 : system.methodHelp
9 : system.methodSignature
10 : update_project
11 : webui_update

How could I use `send_message`?

Could you give me some directions on how I should proceed in order to start or stop a pyspider project externally?

Regards,
Nick

Roy Binux

Mar 20, 2018, 1:31:27 PM
to Nick Gilmour, pyspider-users
Read the source code of send_message here: https://github.com/binux/pyspider/blob/87337e7ce8a19677109a95b202ce6c77ba448af1/pyspider/run.py#L728
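For readers following along: the linked command essentially wraps the message in a task whose callback is _on_message and hands it to the scheduler's send_task RPC method, so roughly the following should work from a notebook (a sketch paraphrased from that source; double-check it against your pyspider version, and note that "myproject" is a placeholder name):

from xmlrpc.client import ServerProxy
from pyspider.libs import utils

scheduler = ServerProxy("http://localhost:23333/")

def send_message(project, message):
    # Mirrors what `pyspider send_message` does: a fake task pointing at
    # data:,on_message whose processing triggers the project's on_message handler.
    return scheduler.send_task({
        "taskid": utils.md5string("data:,on_message"),
        "project": project,
        "url": "data:,on_message",
        "fetch": {"save": ("__command__", message)},
        "process": {"callback": "_on_message"},
    })

send_message("myproject", "hello from jupyter")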

To start or stop projects, just update the projectdb. You can inform the scheduler via the update_project method.
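A sketch of that flow, under two assumptions not spelled out in this thread: that flipping the project's status field in the projectdb (the webui uses values such as RUNNING and STOP) is what starts or stops it, and that pyspider.database.connect_database is pointed at the same projectdb the running scheduler uses (the connection URL below is a placeholder):

from xmlrpc.client import ServerProxy
from pyspider.database import connect_database

projectdb = connect_database("sqlite+projectdb:///data/project.db")  # placeholder URL
scheduler = ServerProxy("http://localhost:23333/")

def set_project_status(project, status):
    # Change the status in projectdb, then tell the scheduler to re-read its projects.
    projectdb.update(project, {"status": status})
    scheduler.update_project()

set_project_status("myproject", "RUNNING")   # start
# set_project_status("myproject", "STOP")    # stop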

Nick Gilmour

Mar 20, 2018, 1:40:52 PM
to Roy Binux, pyspider-users
Great, thanks!