Cannot go to next page


Nick Gilmour

Jun 7, 2017, 5:41:02 PM
to pyspider-users
Hi all,

I have a website where I have to pass some parameters in the URL. I have defined a params dict, and it works fine for the first page. One parameter in the dict controls where the results start, like an index offset. No matter what I do, I cannot increment this value to get the next results; I always get only the first ten. How can I solve this problem?

Is there some kind of a trigger I can use?
Even this example, in which a counter is incremented every time the function is called:
def myfun(s, i=[0]):
    print(s)
    i[0] += 1  # the mutable default argument is evaluated ONCE, at definition time
    return i[0]
from here:

does not work within pyspider. I don't understand why.

Roy Binux

Jun 7, 2017, 6:37:06 PM
to Nick Gilmour, pyspider-users

pyspider is designed to be distributed, so your requests might be processed by different handler instances.


--
You received this message because you are subscribed to the Google Groups "pyspider-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pyspider-user...@googlegroups.com.
To post to this group, send email to pyspide...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pyspider-users/CAH-droyyw%2BncnqS2VqsSbAMMAKNW7yGU0vS1AjBBeGDAXCVvyA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Nick Gilmour

Jun 7, 2017, 7:01:48 PM
to Roy Binux, pyspider-users
I currently work only with one box. Maybe I wasn't specific enough. I have a params dict like this:

parameters_dict = {...,

'offset': '',
...
}

so offset is a variable and I change it like this:

parameters_dict['offset'] = myoffset

and I want to increment the value of offset every time self.crawl is called:

self.crawl(Handler.url, callback=self.index_page, params=parameters_dict, connect_timeout=120, timeout=120)

It should be trivial but I still cannot figure it out.



On Thu, Jun 8, 2017 at 12:47 AM, Roy Binux <r...@binux.me> wrote:

E.g. you have two boxes running pyspider. The first URL is handled by box A, where you increase the counter to 1. A second URL is handled by box B, which will still see the counter at 0.


On Wed, 7 Jun 2017, 15:40 Nick Gilmour, <nicke...@gmail.com> wrote:
Thanks for the reply but I don't really understand what you mean.

Could you be more specific?


Roy Binux

Jun 7, 2017, 7:04:50 PM
to Nick Gilmour, pyspider-users

That was only an example. Even with one box, pyspider may recreate the script context at any time, so you shouldn't keep state like that in the script.



Nick Gilmour

Jun 7, 2017, 7:14:34 PM
to Roy Binux, pyspider-users
What do you mean by:
pyspider will recreate the script context at any time

So, what should I do?

Sorry for my naive questions...


Roy Binux

Jun 7, 2017, 7:19:52 PM
to Nick Gilmour, pyspider-users

A script is basically just a string. It needs to be executed before it can handle any requests. Each time it is executed, a brand new context is created, and all variables are reset to their default values.
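
The effect described above can be reproduced in plain Python. The sketch below is only an illustration and is not pyspider's actual loading code; `SCRIPT` and `run_once` are hypothetical names. It executes the same source string either once (shared context) or once per call (fresh context) and compares the counter values.

```python
# Illustration only: SCRIPT and run_once are hypothetical names, and this is
# not how pyspider itself loads a project script.
SCRIPT = """
counter = [0]

def bump():
    counter[0] += 1
    return counter[0]
"""


def run_once(source):
    """Execute the script text in a brand new namespace and call bump() once."""
    namespace = {}
    exec(source, namespace)
    return namespace["bump"]()


# Same context: the script is executed once, so the counter keeps growing.
shared = {}
exec(SCRIPT, shared)
same_context = [shared["bump"](), shared["bump"](), shared["bump"]()]

# Fresh context per call: the counter starts from 0 every time.
fresh_context = [run_once(SCRIPT), run_once(SCRIPT), run_once(SCRIPT)]

print(same_context)   # [1, 2, 3]
print(fresh_context)  # [1, 1, 1]
```

A counter stored in a module-level variable, a class attribute, or a mutable default argument behaves like the second case whenever the script is executed again.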



Roy Binux

Jun 7, 2017, 7:21:27 PM
to Nick Gilmour, pyspider-users

Generate all the links at once, in a single callback.
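
A minimal sketch of that idea, assuming the total number of results is known in advance (the thread does not say whether it is). `build_page_params`, the parameter names, and the total of 35 are placeholders for illustration.

```python
def build_page_params(base_params, total_results, rows_per_page):
    """Return one params dict per result page, each with its own offset."""
    pages = []
    for offset in range(0, total_results, rows_per_page):
        params = dict(base_params)  # copy, so the pages do not share one dict
        params["offset"] = offset
        params["rows"] = rows_per_page
        pages.append(params)
    return pages


# Placeholder values: 35 results in total, 10 per page.
base = {"a": "true", "pagination": "true", "encode": "true"}
all_pages = build_page_params(base, total_results=35, rows_per_page=10)
print([p["offset"] for p in all_pages])  # [0, 10, 20, 30]

# Inside a pyspider handler, the same loop would schedule every page from a
# single callback, for example:
#     for params in build_page_params(base, total, rows):
#         self.crawl(url, params=params, callback=self.index_page)
```

Because every request is created in one place, nothing has to be remembered between callbacks.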

Nick Gilmour

Jun 7, 2017, 7:38:01 PM
to Roy Binux, pyspider-users
Good, but how can I do that? To generate the links, don't I still have to send many requests, with a new parameter value each time? Can I use e.g. the requests library to do this?


Roy Binux

Jun 7, 2017, 7:40:34 PM
to Nick Gilmour, pyspider-users

I don't know what you're trying to do. Can you provide some sample code?



Nick Gilmour

Jun 7, 2017, 8:12:35 PM
to Roy Binux, pyspider-users
I have removed and replaced some code. I hope this is enough.

    url = 'myurl'

    links_offset = 0
    links_pro_page = 10

    parameters_dict = {
        'a': 'true',
        'pagination': 'true',
        'offset': '',
        'rows': '',
        'encode': 'true',
    }

    parameters_dict['offset'] = links_offset
    parameters_dict['rows'] = links_pro_page

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl(Handler.url, callback=self.index_page, params=Handler.parameters_dict, connect_timeout=120, timeout=120)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            filtered_links1 = ml.pos_filter_num(each.attr.href)
            filtered_links2 = ml.neg_filter_kw(filtered_links1)
            self.crawl(filtered_links2, callback=self.detail_page, connect_timeout=120, timeout=120)

        parameters_dict_mod = Handler.parameters_dict
        parameters_dict_mod['offset'] = links_offset
        parameters_dict_mod['rows'] = Handler.links_pro_page

        self.crawl(Handler.url, callback=self.index_page, params=parameters_dict_mod, connect_timeout=120, timeout=120)
        
        
    @config(priority=2)
    def detail_page(self, response):
        return {
            "status_code": response.status_code,
            "headers": response.headers,
            "error": response.error,
            "time": response.time,
            "ok": response.ok,
            "encoding": response.encoding,
            "url": response.url,
            "title": response.doc('title').text(),
            "text": response.text,
            "save": response.save
        }


Roy Binux

Jun 7, 2017, 11:48:13 PM
to Nick Gilmour, pyspider-users

links_offset never changed
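
One possible way to make the offset change without relying on state in the script is to send it along with each request via the `save` argument of `self.crawl` and read it back from `response.save` in the callback. The sketch below is not a real pyspider handler: `RecordingHandler` is a hypothetical stand-in whose `crawl` only records its arguments so the example runs on its own, and `index_page` takes the saved offset directly instead of a response object.

```python
class RecordingHandler:
    """Stand-in for a pyspider handler: crawl() only records what was scheduled."""

    url = "myurl"  # placeholder, as in the post
    rows = 10

    def __init__(self):
        self.scheduled = []

    def crawl(self, url, **kwargs):
        # In pyspider this would queue a task; here we just record the arguments.
        self.scheduled.append((url, kwargs))

    def page_params(self, offset):
        # Build a fresh dict per request instead of mutating a shared one.
        return {"a": "true", "pagination": "true", "offset": offset,
                "rows": self.rows, "encode": "true"}

    def on_start(self):
        self.crawl(self.url, params=self.page_params(0),
                   save={"offset": 0}, callback=self.index_page)

    def index_page(self, saved_offset):
        # In pyspider the value would be read from response.save["offset"].
        next_offset = saved_offset + self.rows
        self.crawl(self.url, params=self.page_params(next_offset),
                   save={"offset": next_offset}, callback=self.index_page)


handler = RecordingHandler()
handler.on_start()
handler.index_page(0)   # callback for the page at offset 0
handler.index_page(10)  # callback for the page at offset 10
offsets = [kwargs["params"]["offset"] for _, kwargs in handler.scheduled]
print(offsets)  # [0, 10, 20]
```

The offset travels with the task itself, so it does not matter which handler instance runs the callback or whether the script context was recreated in between.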



Nick Gilmour

Jun 8, 2017, 7:05:01 PM
to Roy Binux, pyspider-users
Yes, I tried many things and removed them all since nothing was working. In any case, it now somehow works: I have a while loop in the on_start function, and I can change the offset in params. But now I have another issue: the while condition needs a string from response.text, and response.text is not available within on_start. How can I solve this problem?
If I return response.text from index_page, can I somehow receive it and use it in on_start?


Roy Binux

Jun 9, 2017, 12:56:52 PM
to Nick Gilmour, pyspider-users

Can you put the loop in index_page, or make it a function and call it with response.text?



Nick Gilmour

Jun 9, 2017, 3:18:28 PM
to Roy Binux, pyspider-users
I have put the loop into index_page, and I see that response.text is always the same, namely that of the first page (the start URL). That makes sense if pyspider doesn't fetch each page directly but builds the URLs first. The counter and the params dict seem to be correct. Besides that, the results look weird in the Web UI: the order is something like index, 10 details, index, index, and so on.

If I put the loop in on_start, I see all the index pages at once, and if I click on one of them, the correct result pages are fetched. This seems to be the way to go. I have also found a way to exit the loop, but it is crude: I pass a counter variable from on_start to crawl via save and kill the script with quit() or sys.exit() if no more links are found. But I assume there must be a better way.

Any other suggestions?
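
One gentler alternative to quit() or sys.exit() is to simply not schedule another page once a page comes back empty. The sketch below assumes that "no links on the page" is the right stopping condition, which the thread does not confirm; `next_offset` and the simulated `fake_site` data are hypothetical.

```python
def next_offset(link_count, current_offset, rows_per_page):
    """Return the offset of the next page, or None when paging should stop."""
    if link_count == 0:  # empty page: assume we ran past the last result
        return None
    return current_offset + rows_per_page


# Simulated number of links found at each offset, for a made-up site
# with 25 results and 10 rows per page.
fake_site = {0: 10, 10: 10, 20: 5, 30: 0}

visited = []
offset = 0
while offset is not None:
    visited.append(offset)
    offset = next_offset(fake_site[offset], offset, 10)

print(visited)  # [0, 10, 20, 30]

# In a pyspider callback the same check would decide whether to call
# self.crawl for the next page; returning without scheduling anything
# ends the pagination without killing the script.
```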


Roy Binux

Jun 9, 2017, 3:29:46 PM