Running pyspider

En Ware

Dec 5, 2018, 10:03:09 AM

to pyspider-users

Hello,

I used to run Scrapy, but I found out I need message queues, and I also like what pyspider has to offer. Here are my questions.

1. When I try the IMDb tutorial, I click Run, which gives me a follow. Then I click play and the screen seems to go grey, but nothing happens. I am running on localhost, port 5000, and it is almost as if it is stuck in a loop. What am I doing wrong?

I am using a virtualenv with PyPy 3.5. Can I not use PyPy 3.5 as my Python install?

I also see that the last commit was 6 days ago, changing Travis to Python 3.5. I am assuming the project is still active?

2. Is there a forum or IRC channel where I can talk about pyspider, other than just the Google Group?

I look forward to running pyspider.

- nixfreak

En Ware

Dec 5, 2018, 10:17:16 AM

to pyspider-users

OK, so I answered my own question: PyPy 3.5.3 doesn't work with pyspider, at least not when you're using the web UI. I installed 3.6.3 instead in a brand new virtualenv, and the same tutorial loaded right up.

Awesome!
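For anyone hitting the same grey screen, a quick way to confirm which interpreter a virtualenv is really running is sketched below. It uses only the standard library; the helper name `describe_interpreter` is made up for this example, and the only evidence that PyPy is the cause is the report in this thread.

```python
import platform
import sys


def describe_interpreter():
    """Return (implementation, version) for the running interpreter."""
    # python_implementation() returns e.g. "CPython" or "PyPy".
    implementation = platform.python_implementation()
    version = "%d.%d.%d" % sys.version_info[:3]
    return implementation, version


if __name__ == "__main__":
    impl, version = describe_interpreter()
    print("Running %s %s" % (impl, version))
    if impl != "CPython":
        # Assumption based on this thread only: the web UI failed under PyPy 3.5.3.
        print("Not CPython; if the pyspider web UI hangs, try a CPython virtualenv.")
```

Run it with the virtualenv activated, before starting pyspider, to see what the environment actually resolves to.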

En Ware

Dec 5, 2018, 12:07:27 PM

to pyspider-users

Posting some code. I'm just trying to scrape IMDb.

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2018-12-05 09:39:49
# Project: imdb_tutorial

from pyspider.libs.base_handler import *
import re


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://www.imdb.com/search/title?count=100&title_type=feature,tv_series,tv_movie', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            if re.match(r"http://www.imdb.com/title/tt\d+/$", each.attr.href):
                self.crawl(each.attr.href, callback=self.detail_page)
        self.crawl(response.doc('.next-page').attr.href, callback=self.index_page)
        self.crawl(response.doc('.prev-page').attr.href, callback=self.index_page)

    @config(priority=2)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('.lister-item-header a').text(),
            "date": response.doc('.text-muted').text(),
        }

****************************

So right now I am able to see the next-page, prev-page, and index page links, but I am not able to extract the title and date information.

I'm using response.doc('.lister-item-header a').text()

Can someone tell me what I am doing wrong?

Roy Binux

Dec 5, 2018, 1:05:49 PM

to En Ware, pyspider-users

You are using a selector for the index page to extract data from a detail page.
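To illustrate the point, here is a minimal standard-library sketch of why `.lister-item-header a` finds nothing in `detail_page`. The two HTML snippets are made-up, heavily simplified stand-ins (not real IMDb markup), and `HeaderLinkText` and `header_titles` are names invented for this example. In pyspider itself the equivalent would be to call `response.doc('.lister-item-header a')` inside `index_page`, or to use selectors that actually exist on the detail page inside `detail_page`.

```python
from html.parser import HTMLParser


class HeaderLinkText(HTMLParser):
    """Collect the text of <a> tags nested in class="lister-item-header" elements.

    Simplification: assumes well-nested HTML without void tags such as <br>.
    """

    def __init__(self):
        super().__init__()
        self.header_depth = 0  # > 0 while inside a lister-item-header element
        self.in_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.header_depth:
            self.header_depth += 1
            if tag == "a":
                self.in_link = True
                self.titles.append("")
        elif "lister-item-header" in classes:
            self.header_depth = 1

    def handle_endtag(self, tag):
        if self.header_depth:
            self.header_depth -= 1
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.titles[-1] += data


def header_titles(html):
    """Rough stand-in for response.doc('.lister-item-header a') text extraction."""
    parser = HeaderLinkText()
    parser.feed(html)
    return parser.titles


# Hypothetical index (search results) page: the list classes are present.
INDEX_HTML = """
<div class="lister-item">
  <h3 class="lister-item-header"><a href="/title/tt0000001/">First Film</a></h3>
  <p class="text-muted">2018</p>
</div>
<div class="lister-item">
  <h3 class="lister-item-header"><a href="/title/tt0000002/">Second Film</a></h3>
  <p class="text-muted">2017</p>
</div>
"""

# Hypothetical detail page: different markup, no list classes at all.
DETAIL_HTML = """
<div class="title_wrapper"><h1>First Film</h1></div>
"""

if __name__ == "__main__":
    print(header_titles(INDEX_HTML))   # the selector matches on the index page
    print(header_titles(DETAIL_HTML))  # nothing to match on the detail page
```

The same selector that returns titles on the index page returns an empty result on the detail page, so either move the extraction into `index_page` or inspect a detail page in the web UI and pick selectors from its own markup.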
