Interact with pyspider via other program / command line


HD

Jul 18, 2015, 4:33:19 AM
to pyspide...@googlegroups.com
Hi,

Is it possible to interact with pyspider via the command line or other Python code (without using the WebUI)?

MyCode -------(URL2Spider)----------> pyspider ---------(Result_URLs)------->MyCode2

I want my code to send a URL to pyspider, which then crawls the pages.
Then I want to feed pyspider's results back to another program (or the same one).

Is that possible?
(If so, can you give me more info on how, e.g. a link or maybe even sample code?)



P.S.: Sorry for this complete beginner question about pyspider, which I just came across; I didn't find an answer on the web.

Roy Binux

Jul 20, 2015, 10:35:14 PM
to HD, pyspide...@googlegroups.com
Hi,

Actually, I didn't design pyspider for this kind of usage.

pyspider has three main components: the scheduler, the fetcher and the processor (plus the result_worker). The scheduler manages and tracks the tasks, the fetcher fetches the pages, and the processor executes your code. Compared with other scraping frameworks, the scheduler is the most important part; the fetcher and processor are much the same as in other frameworks and easy to implement (as they mostly depend on your script).

In your use case the scheduler is not necessary, since the tasks are not managed by pyspider. I think you could use the fetcher and processor separately and imitate the pack schema. Or just write a new fetcher and processor with the libraries used in pyspider (tornado and pyquery).

Best,
Roy Binux
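
The standalone fetcher-plus-processor idea could look roughly like the sketch below. It is not pyspider code: it swaps tornado and pyquery for the Python standard library (urllib and html.parser) so that it runs without extra packages, and the names fetch, process and LinkAndTitleParser are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkAndTitleParser(HTMLParser):
    """Collects the <title> text and every link whose href starts with "http"."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            # Keep absolute links only, like an a[href^="http"] selector would.
            if href and href.startswith("http"):
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url, timeout=10):
    """Hypothetical 'fetcher': download one page and return its HTML as text."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def process(url, html):
    """Hypothetical 'processor': return (result dict, links to follow)."""
    parser = LinkAndTitleParser()
    parser.feed(html)
    return {"url": url, "title": parser.title.strip()}, parser.links
```

Calling process(url, fetch(url)) handles one page; with no scheduler involved, deciding which of the returned links to fetch next, and when, is left to the calling program.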

--
You received this message because you are subscribed to the Google Groups "pyspider-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pyspider-user...@googlegroups.com.
To post to this group, send email to pyspide...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pyspider-users/0a9028e5-75d1-42ef-a68c-74a0f847bc81%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

jensvon...@googlemail.com

Sep 17, 2015, 4:21:19 AM
to pyspider-users
Hi,

I would like to use the scheduler too, but I don't want to add the sites to crawl via the WebUI; I want to add them via the command line or my Python code.
Is there no way to do this?

Afterwards I would like to use the results in my own code. (I think this should be easy, because I can access the resultdb directly.)

Regards
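
Reading results straight from the resultdb might look like the following sketch. It assumes the default SQLite backend, with one table per project named resultdb_<project> and each result stored as JSON text in a result column; check your own database before relying on that layout. read_results is a hypothetical helper, not part of pyspider.

```python
import json
import re
import sqlite3


def read_results(conn, project):
    """Return a list of (url, result) pairs for one project.

    Assumed layout: table "resultdb_<project>" with columns
    taskid, url, result (JSON text) and updatetime.
    """
    # Table names cannot be passed as SQL parameters, so validate the name first.
    if not re.fullmatch(r"\w+", project):
        raise ValueError("unexpected project name: %r" % project)
    rows = conn.execute('SELECT url, result FROM "resultdb_%s"' % project)
    return [(url, json.loads(result)) for url, result in rows]
```

It would be called with an open connection, for example sqlite3.connect("data/result.db"), where the file path is also an assumption about the default data directory.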

Roy Binux

Sep 17, 2015, 7:00:16 PM
to jensvon...@googlemail.com, pyspider-users
Yes, you can use `pyspider send_message project message` to send a message to a project via the CLI, then receive the message in the handler's on_message callback.
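
To send the message from Python code rather than from a shell, one option is to wrap the same CLI call with subprocess. A minimal sketch, assuming the pyspider executable is on PATH and can reach your running instance; both function names are hypothetical:

```python
import subprocess


def build_send_message_cmd(project, message):
    """Return the argv list for `pyspider send_message project message`."""
    return ["pyspider", "send_message", project, message]


def send_message(project, message):
    """Run the CLI command; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_send_message_cmd(project, message), check=True)
```

Passing the arguments as a list avoids shell quoting problems when the message contains quotes or spaces.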


jensvon...@googlemail.com

Mar 27, 2016, 12:56:35 PM
to pyspider-users, jensvon...@googlemail.com, r...@binux.me

What I am trying to achieve is simply to add a URL to my pyspider project that should be crawled completely (all links on the page should be crawled afterwards).

From the console:
pyspider send_message MYPROJECTNAME "http://MYURL.com"

or
pyspider send_message MYPROJECTNAME '{"url": "http://www.MYURL.ch"}'

And on the pyspider side:

    def on_message(self, project, msg):
        for each in msg('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

or

    def on_message(self, project, msg):
        self.crawl(msg, callback=self.detail_page)

This results in the error:

[I 160327 18:44:58 processor:199] process MYPROJECTNAME:0a6b3271c7ff97a5a38728d7ede43d2c data:,on_message -> [200] len:10 -> result:None fol:1 msg:0 err:None

[E 160327 18:44:58 scheduler:170] unknown project: MYPROJECTNAME




Roy Binux

Mar 27, 2016, 1:06:14 PM
to jensvon...@googlemail.com, pyspider-users
What's the error?
msg is a string when it is sent from the command line.
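
Because msg arrives as a plain string, a JSON message such as '{"url": "..."}' has to be decoded by the handler itself. A hypothetical helper (url_from_message is not a pyspider function) that accepts either a bare URL or a JSON object with a "url" key could look like this:

```python
import json


def url_from_message(msg):
    """Return the URL carried by a message string sent from the command line."""
    text = msg.strip()
    if text.startswith("{"):
        try:
            data = json.loads(text)
        except ValueError:
            # Not valid JSON after all; fall back to the raw text.
            return text
        if isinstance(data, dict) and "url" in data:
            return data["url"]
    return text
```

Inside the handler it would be used as self.crawl(url_from_message(msg), callback=self.detail_page).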

jensvon...@googlemail.com

Mar 27, 2016, 7:17:34 PM
to pyspider-users, jensvon...@googlemail.com, r...@binux.me
Hi, it seems that I messed up the database somehow (not pyspider's doing, my own DB statements). I reinstalled pyspider, created a new project, and it worked.

Thanks for the help.

This is how I did it (it's very simple).

In the pyspider project:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2016-03-28 01:02:46
# Project: TEST11

import sys  # used by on_start below

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        sys.stderr.write("START\n")
        #self.crawl('127.0.0.1', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }

    
    # Called when a message arrives, e.g. from `pyspider send_message`; msg is the URL string.
    def on_message(self, project, msg):
        self.crawl(msg, callback=self.index_page)

And then simply from the command line (shell/terminal):

pyspider send_message project_name "url"
e.g.:
pyspider send_message TEST11 "http://www.google.com/"






ecom4...@gmail.com

Mar 15, 2020, 1:31:44 PM
to pyspider-users
Hello.

I copy-pasted this logic from this issue, but it is not working.

As I understand it, you can send a message from the command line to PROJECTNAME with the URL to start crawling.
But... with these lines in on_start:

    def on_start(self):
        sys.stderr.write("START\n")
        #self.crawl('127.0.0.1', callback=self.index_page)
[...]

    def on_message(self, project, msg):
        self.crawl(msg, callback=self.index_page)

it is not working.

What should I do?

ecom4...@gmail.com

Mar 15, 2020, 10:00:22 PM
to pyspider-users

Hey.

I thought of something that could also help.

Is it possible to do something like this:

create a list of URLs in a CSV file and then import them into a project?
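
One way to approach the CSV idea is to read the URLs in Python first and then hand each one to the project. A minimal sketch, assuming the URLs sit in one column of the file; urls_from_csv is a hypothetical helper, not a pyspider feature:

```python
import csv


def urls_from_csv(csv_file, column=0):
    """Return the http(s) URLs found in one column of an open CSV file."""
    urls = []
    for row in csv.reader(csv_file):
        if len(row) > column:
            value = row[column].strip()
            # Skip header cells and anything else that is not an absolute URL.
            if value.startswith("http"):
                urls.append(value)
    return urls
```

Each returned URL could then be sent with `pyspider send_message`, one call per URL, or passed to self.crawl inside the project script.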

Roy Binux

Mar 15, 2020, 10:25:00 PM
to ecom4...@gmail.com, pyspider-users
Why do you say it is not working? If you are expecting START to be triggered, it won't be. Only on_message will be called.


Joseph

Mar 16, 2020, 12:16:00 PM
to pyspider-users
So... What should I do?

Sorry, I do not understand how to send the message and make the project start using the URL sent.

Roy Binux

Mar 16, 2020, 11:28:42 PM
to Joseph, pyspider-users
Send the URL to the on_message callback; self.crawl submits the task. That's it. You don't need to "start" anything.
