Slow processing speed


manbuheiniu

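Feb 1, 2015, 11:24:21 PM
to pyspide...@googlegroups.com

Ever since I added the database-saving feature, data processing has become very slow. It used to handle about 1,700 tasks in five minutes; now it handles only about 60 in five minutes. The server load does not look high anywhere. How can I track down where the bottleneck is? The image above shows the monitoring of the server's metrics. The MySQL database runs on a separate server, and its metrics do not show a high load either.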

Roy Binux

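Feb 1, 2015, 11:48:46 PM
to manbuheiniu, pyspide...@googlegroups.com
Please provide details:
1. How it is deployed
2. The current data volume
3. The status of each queue
4. Whether the logs show anything abnormal
5. The size and performance of the target site, and the configured rate and burst
6. How the "database-saving feature" is implemented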


--
You received this message because you are subscribed to the Google Groups "pyspider-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pyspider-user...@googlegroups.com.
To post to this group, send email to pyspide...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pyspider-users/03042922-e5da-4b41-889e-2f5d1e489044%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

manbuheiniu

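Feb 2, 2015, 12:59:27 AM
to pyspide...@googlegroups.com, qiaozh...@gmail.com, r...@binux.me
1. The program is deployed on a single machine and uses the default queue. The database is on a separate server; I connect to it and save the data in on_result.
2. The current data volume is about 70,000.
3. I don't know how to check the status of each queue. I am using the queues that come with the system.
4. The logs show no errors.
5. The target site's performance is not the bottleneck. It is chinaunix. I am also running an instance on another server, and it processes about 2,000 in five minutes. Previously rate and burst were 2 and 6, and it processed about 60 in five minutes; now they are 4 and 8, and it processes about 70 in five minutes.
6. For the saving feature I wrote a database connection class and call it directly in on_result to save the data. The database-saving code is attached below.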
'''
A simple example of saving pyspider results to a database.
Usage:
    1. Put this file in pyspider/pyspider/database/mysql/ and name it mysqldb.py.
    2. Edit the database settings in this file and create the corresponding database and tables.
    3. In your script, import this code with: from pyspider.database.mysql.mysqldb import SQL
    4. Override on_result, instantiate SQL and call replace (the first argument of replace is the table name, the second is the result). A simple example:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2015-01-26 13:12:04
# Project: jianke

from pyspider.libs.base_handler import *
from pyspider.database.mysql.mysqldb import SQL


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://www.test.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('p.pic a[href^="http"]').items():
            print(each.attr.href)

    @config(priority=2)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('HTML>BODY#presc>DIV.main>DIV.prices_box.wid980.clearfix>DIV.detail_box>DL.assort.tongyong>DD>A').text(),
        }

    def on_result(self, result):
        # print(result)
        if not result or not result['title']:
            return
        sql = SQL()
        sql.replace('info', **result)
'''
from six import itervalues
import mysql.connector
from mysql.connector import errorcode  # needed for the error codes checked in connect()
from datetime import date, datetime, timedelta


class SQL:

    username = 'pyspider'   # database user name
    password = 'pyspider'   # database password
    database = 'result'     # database name
    host = 'localhost'      # database host address
    connection = ''
    connect = True
    placeholder = '%s'

    def __init__(self):
        if self.connect:
            SQL.connect(self)

    def escape(self, string):
        # Quote a table or column name with backticks.
        return '`%s`' % string

    def connect(self):
        config = {
            'user': SQL.username,
            'password': SQL.password,
            'host': SQL.host
        }
        if SQL.database != None:
            config['database'] = SQL.database

        try:
            cnx = mysql.connector.connect(**config)
            SQL.connection = cnx
            return True
        except mysql.connector.Error as err:
            if (err.errno == errorcode.ER_ACCESS_DENIED_ERROR):
                print("The credentials you provided are not correct.")
            elif (err.errno == errorcode.ER_BAD_DB_ERROR):
                print("The database you provided does not exist.")
            else:
                print("Something went wrong: {}".format(err))
            return False

    def replace(self, tablename=None, **values):
        if SQL.connection == '':
            print("Please connect first")
            return False

        tablename = self.escape(tablename)
        if values:
            _keys = ", ".join(self.escape(k) for k in values)
            _values = ", ".join([self.placeholder, ] * len(values))
            sql_query = "REPLACE INTO %s (%s) VALUES (%s)" % (tablename, _keys, _values)
        else:
            sql_query = "REPLACE INTO %s DEFAULT VALUES" % tablename

        cur = SQL.connection.cursor()
        try:
            if values:
                cur.execute(sql_query, list(itervalues(values)))
            else:
                cur.execute(sql_query)
            SQL.connection.commit()
            return True
        except mysql.connector.Error as err:
            print("An error occurred: {}".format(err))
            return False





manbuheiniu

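Feb 2, 2015, 1:05:33 AM
to pyspide...@googlegroups.com, qiaozh...@gmail.com, r...@binux.me
I am attaching the output log. Looking at it closely, it seems to be working in a single-threaded fashion. The select step is scheduler:632, the download step is tornado_fetcher:232, the processing step is processor:153 and the dispatch step is scheduler:581. It is always these same ID numbers; they never change. And whenever they appear, all four appear together.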
[D 150202 14:01:32 scheduler:252] ignore newtask chinaunix_blog:688202e08759ba590a7c0c2446a7be99 http://blog.chinaunix.net/uid/30098782.html
[D 150202 14:01:32 scheduler:252] ignore newtask chinaunix_blog:f757e8e6e02d0dcc44c1ddb7b3c2a86b http://blog.chinaunix.net/uid/30097826.html
[D 150202 14:01:32 scheduler:533] ignore newtask chinaunix_blog:395cc91920f3284d3b5310799a2a9f2a http://blog.chinaunix.net/uid/30057524.html
[D 150202 14:01:32 scheduler:533] ignore newtask chinaunix_blog:1ed0c83ecc15e23cc7e3e8b1da2ec62d http://blog.chinaunix.net/uid/30088444.html
[D 150202 14:01:32 scheduler:533] ignore newtask chinaunix_blog:db108cade096c2a09616f5da100401ea http://blog.chinaunix.net/uid/30107309.html
[D 150202 14:01:32 scheduler:533] ignore newtask chinaunix_blog:d643508a4477b1004a0cbfa8f69912cf http://blog.chinaunix.net/uid/1827018.html
[D 150202 14:01:32 scheduler:533] ignore newtask chinaunix_blog:3cd56e2c05358b3244a729ec5616f5ff http://blog.chinaunix.net/uid-7374279-id-4813735.html
[D 150202 14:01:32 scheduler:533] ignore newtask chinaunix_blog:95673ec2a7cf60be5e14abd2064423ee http://blog.chinaunix.net/uid-301743-id-4813354.html
[D 150202 14:01:32 scheduler:533] ignore newtask chinaunix_blog:6433631ba71c45d92d5a624dedc0982a http://blog.chinaunix.net/uid-13328506-id-4809491.html
[D 150202 14:01:32 scheduler:533] ignore newtask chinaunix_blog:22d9f9a2cdfdaf3d42ee94061c7ea28c http://blog.chinaunix.net/uid-14528823-id-4808877.html
[D 150202 14:01:32 scheduler:533] ignore newtask chinaunix_blog:5613a37cc5bae7aae01f33e9558fbc88 http://blog.chinaunix.net/uid-509190-id-4807958.html
[D 150202 14:01:32 scheduler:533] ignore newtask chinaunix_blog:e6119897ffaed66a440d5766dc22645f http://blog.chinaunix.net/uid-24780853-id-4425130.html
[D 150202 14:01:32 scheduler:533] ignore newtask chinaunix_blog:84ca467684da257eff942c3afcb79658 http://blog.chinaunix.net/uid-24780853-id-4425129.html
[D 150202 14:01:32 scheduler:533] ignore newtask chinaunix_blog:0bc550bfeef1ff816564f527d8dbe056 http://blog.chinaunix.net/uid-24780853-id-4091413.html
[D 150202 14:01:32 scheduler:533] ignore newtask chinaunix_blog:36054be7a85b718716b9e2123a8cbead http://blog.chinaunix.net/uid-24780853-id-4043233.html
[D 150202 14:01:32 scheduler:533] ignore newtask chinaunix_blog:d85f3b2d2514962578b0ebfcf85fa420 http://blog.chinaunix.net/uid-24780853-id-4043226.html
[D 150202 14:01:32 scheduler:533] ignore newtask chinaunix_blog:0d70720a030f7f8cd1ed1af844f9c330 http://blog.chinaunix.net/uid/0.html
[I 150202 14:01:36 scheduler:632] select chinaunix_blog:9a2df5be4bc757426e8d86600fb98fd2 http://blog.chinaunix.net/uid/29679056.html
[I 150202 14:01:36 tornado_fetcher:232] [200] http://blog.chinaunix.net/uid/29679056.html 0.10s
[I 150202 14:01:36 processor:153] process chinaunix_blog:9a2df5be4bc757426e8d86600fb98fd2 http://blog.chinaunix.net/uid/29679056.html -> [200] len:45943 -> result:None fol:10 msg:0 err:None
[I 150202 14:01:36 _internal:87] 106.186.112.20 - - [02/Feb/2015 14:01:36] "GET /counter?time=1d&type=sum HTTP/1.1" 200 -
[I 150202 14:01:36 _internal:87] 106.186.112.20 - - [02/Feb/2015 14:01:36] "GET /counter?time=1h&type=sum HTTP/1.1" 200 -
[I 150202 14:01:36 scheduler:581] task done chinaunix_blog:9a2df5be4bc757426e8d86600fb98fd2 http://blog.chinaunix.net/uid/29679056.html
[D 150202 14:01:36 scheduler:533] ignore newtask chinaunix_blog:9a2df5be4bc757426e8d86600fb98fd2 http://blog.chinaunix.net/uid/29679056.html
[D 150202 14:01:36 scheduler:533] ignore newtask chinaunix_blog:a8730a147bd5cc023d40aba5d18768eb http://blog.chinaunix.net/uid-24789255-id-198226.html
[D 150202 14:01:36 scheduler:533] ignore newtask chinaunix_blog:e3a73eaff72c29ead1a2e0e2332afae9 http://blog.chinaunix.net/uid-24789255-id-4288187.html
[D 150202 14:01:36 scheduler:533] ignore newtask chinaunix_blog:7d37148a6a64c5f881e26a773792c836 http://blog.chinaunix.net/uid-24789255-id-4289076.html
[D 150202 14:01:36 scheduler:533] ignore newtask chinaunix_blog:3cd56e2c05358b3244a729ec5616f5ff http://blog.chinaunix.net/uid-7374279-id-4813735.html
[D 150202 14:01:36 scheduler:533] ignore newtask chinaunix_blog:95673ec2a7cf60be5e14abd2064423ee http://blog.chinaunix.net/uid-301743-id-4813354.html
[D 150202 14:01:36 scheduler:533] ignore newtask chinaunix_blog:6433631ba71c45d92d5a624dedc0982a http://blog.chinaunix.net/uid-13328506-id-4809491.html
[D 150202 14:01:36 scheduler:533] ignore newtask chinaunix_blog:22d9f9a2cdfdaf3d42ee94061c7ea28c http://blog.chinaunix.net/uid-14528823-id-4808877.html
[D 150202 14:01:36 scheduler:533] ignore newtask chinaunix_blog:5613a37cc5bae7aae01f33e9558fbc88 http://blog.chinaunix.net/uid-509190-id-4807958.html
[D 150202 14:01:36 scheduler:533] ignore newtask chinaunix_blog:072494f77a2fa2375917cc7c24986991 http://blog.chinaunix.net/uid-29679056-id-4274466.html
[I 150202 14:01:37 scheduler:632] select chinaunix_blog:6d19290d4b2faed9d77dd46313e47b2c http://blog.chinaunix.net/uid/301541.html
[I 150202 14:01:37 tornado_fetcher:232] [200] http://blog.chinaunix.net/uid/301541.html 0.11s
[I 150202 14:01:37 processor:153] process chinaunix_blog:6d19290d4b2faed9d77dd46313e47b2c http://blog.chinaunix.net/uid/301541.html -> [200] len:44969 -> result:None fol:10 msg:0 err:None
[I 150202 14:01:37 scheduler:581] task done chinaunix_blog:6d19290d4b2faed9d77dd46313e47b2c http://blog.chinaunix.net/uid/301541.html
[D 150202 14:01:37 scheduler:533] ignore newtask chinaunix_blog:6d19290d4b2faed9d77dd46313e47b2c http://blog.chinaunix.net/uid/301541.html
[D 150202 14:01:37 scheduler:533] ignore newtask chinaunix_blog:a8730a147bd5cc023d40aba5d18768eb http://blog.chinaunix.net/uid-24789255-id-198226.html
[D 150202 14:01:37 scheduler:533] ignore newtask chinaunix_blog:e3a73eaff72c29ead1a2e0e2332afae9 http://blog.chinaunix.net/uid-24789255-id-4288187.html
[D 150202 14:01:37 scheduler:533] ignore newtask chinaunix_blog:7d37148a6a64c5f881e26a773792c836 http://blog.chinaunix.net/uid-24789255-id-4289076.html
[D 150202 14:01:37 scheduler:533] ignore newtask chinaunix_blog:3cd56e2c05358b3244a729ec5616f5ff http://blog.chinaunix.net/uid-7374279-id-4813735.html
[D 150202 14:01:37 scheduler:533] ignore newtask chinaunix_blog:95673ec2a7cf60be5e14abd2064423ee http://blog.chinaunix.net/uid-301743-id-4813354.html
[D 150202 14:01:37 scheduler:533] ignore newtask chinaunix_blog:6433631ba71c45d92d5a624dedc0982a http://blog.chinaunix.net/uid-13328506-id-4809491.html
[D 150202 14:01:37 scheduler:533] ignore newtask chinaunix_blog:22d9f9a2cdfdaf3d42ee94061c7ea28c http://blog.chinaunix.net/uid-14528823-id-4808877.html
[D 150202 14:01:37 scheduler:533] ignore newtask chinaunix_blog:5613a37cc5bae7aae01f33e9558fbc88 http://blog.chinaunix.net/uid-509190-id-4807958.html
[D 150202 14:01:37 scheduler:533] ignore newtask chinaunix_blog:c6a4e98c05e1669f2df01a2dda8aed2c http://blog.chinaunix.net/uid-301541-id-2441918.html





manbuheiniu

Feb 2, 2015, 1:08:11 AM
to pyspide...@googlegroups.com, qiaozh...@gmail.com, r...@binux.me
The program is started with: ./run.py -c localhost.ini
Contents of localhost.ini:
{
    "webui": {
        "port": 5001,
        "username": "admin",
        "password": "123456",
        "need_auth": "true"
    }
}





Binux

Feb 2, 2015, 1:40:07 AM
to pyspide...@googlegroups.com, qiaozh...@gmail.com, r...@binux.me
    def __init__(self):
        if self.connect:
            SQL.connect(self)

Saving each result creates a new MySQL connection; self.connect is always True here.
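
One way to avoid reconnecting for every result is to open the connection once and reuse it. The following is a minimal sketch, not code from pyspider: `ResultWriter` and `connect_func` are hypothetical names, and the connection is assumed to follow the Python DB-API (`cursor()`, `execute()`, `commit()`), as `mysql.connector` connections do.

```python
class ResultWriter(object):
    """Opens the database connection once and reuses it for every save.

    `connect_func` (hypothetical name) is any zero-argument callable that
    returns a DB-API connection, e.g. a lambda around mysql.connector.connect.
    """

    def __init__(self, connect_func):
        self._connect_func = connect_func
        self._connection = None  # created lazily on first use

    def connection(self):
        # Connect only if no connection has been opened yet.
        if self._connection is None:
            self._connection = self._connect_func()
        return self._connection

    def replace(self, tablename, **values):
        # Fix the column order once so names and parameters line up.
        keys = sorted(values)
        columns = ", ".join("`%s`" % k for k in keys)
        placeholders = ", ".join(["%s"] * len(keys))
        query = "REPLACE INTO `%s` (%s) VALUES (%s)" % (
            tablename, columns, placeholders)
        conn = self.connection()
        cur = conn.cursor()
        try:
            cur.execute(query, [values[k] for k in keys])
            conn.commit()
        finally:
            cur.close()
        return query
```

In a handler script, the writer could be created once at module level, for example `writer = ResultWriter(lambda: mysql.connector.connect(user='pyspider', password='pyspider', host='localhost', database='result'))`, with `on_result` calling `writer.replace('info', **result)`. A long-lived connection can still be dropped by the server, so real code would also need to reconnect on failure.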

manbuheiniu

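Feb 2, 2015, 2:07:33 AM
to pyspide...@googlegroups.com, qiaozh...@gmail.com, r...@binux.me
Yes, a MySQL connection is created every time a result is saved. That check is there so that the database is connected automatically when SQL is instantiated, which lets me use the instance to save results right away. If it were not True, I would have to call the connect method after every instantiation. I am a beginner and don't yet know how to reuse database connections with a connection pool. I just removed the database-saving feature and found that the processing speed did not improve. With the same code, a new project on the same server processes about 900 in five minutes whether or not it saves results, so it seems that a large data volume also affects speed. I'll let it run like this for now and look into reusing the database connection later. Thanks.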


Roy Binux

Feb 2, 2015, 2:10:04 AM
to manbuheiniu, pyspide...@googlegroups.com
What rate/burst did you set?

------
Also: this is email, so please try to include all the information you can in a single message.

manbuheiniu

Feb 2, 2015, 2:17:09 AM
to pyspide...@googlegroups.com, qiaozh...@gmail.com, r...@binux.me

The two projects currently running use the same code; only the data volume differs.




Roy Binux

Feb 2, 2015, 2:18:46 AM
to manbuheiniu, pyspide...@googlegroups.com
This setting means one request per second, so of course you get 60 per minute!
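
To make the rate and burst settings concrete: a common way to implement this kind of limit is a token bucket, where `rate` tokens are added per second up to a maximum of `burst`, and each request consumes one token. The sketch below illustrates that idea only. It is not pyspider's scheduler code, and `TokenBucket`, `count_allowed` and their parameters are names assumed for the example.

```python
class TokenBucket(object):
    """Illustrative token bucket: `rate` tokens per second, at most `burst` stored."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)  # start full, so a burst can go out at once
        self.last = now

    def try_acquire(self, now):
        # Refill according to the time elapsed, capped at `burst`.
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def count_allowed(rate, burst, seconds, step=0.25):
    """Count requests let through in `seconds` when one is attempted every `step` seconds."""
    bucket = TokenBucket(rate, burst)
    allowed = 0
    ticks = int(round(seconds / step))
    for i in range(ticks + 1):
        if bucket.try_acquire(i * step):
            allowed += 1
    return allowed
```

Over a long window the number of requests let through is roughly `rate * seconds` plus the initial burst, so a rate of 1 gives about 60 per minute and about 300 per five minutes, whatever the burst value is.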

Roy Binux

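Feb 2, 2015, 2:20:15 AM
to manbuheiniu, pyspide...@googlegroups.com
Sorry, did you say 60 per 5 minutes?
The progress bar contains three numbers: new in the last 5 minutes, processed in the last 5 minutes, and failed in the last 5 minutes. You have to hover the mouse over it to see the processed number.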

manbuheiniu

Feb 2, 2015, 2:27:55 AM
to pyspide...@googlegroups.com, qiaozh...@gmail.com, r...@binux.me
First project: 5m of 81 tasks: new(6.2%): 5 success(93.8%): 76 retry(0.0%): 0 failed(0.0%): 0
Second project: 5m of 730 tasks: new(60.3%): 440 success(39.7%): 290 retry(0.0%): 0 failed(0.0%): 0


Roy Binux

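Feb 2, 2015, 2:48:41 AM
to manbuheiniu, pyspide...@googlegroups.com
Are you running both projects on the same pyspider instance and comparing them with each other? They affect each other.
In active_tasks and in the task details you can see the fetch and process times. The fetcher is asynchronous, so it can be ignored; 1 / process is the upper limit of the processor's throughput.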
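
The arithmetic behind that upper limit can be sketched as follows. This assumes a single processor that handles tasks one at a time and a per-task process time read from the task details; the function name and the sample numbers are made up for illustration.

```python
def max_tasks_per_window(process_seconds, rate, window_seconds=300):
    """Upper bound on the number of tasks finished within a window.

    process_seconds: average time the processor spends on one task, in seconds
    rate: configured requests per second
    window_seconds: length of the window (300 s = five minutes)
    """
    processor_limit = 1.0 / process_seconds  # tasks per second the processor can take
    effective = min(processor_limit, rate)   # the slower stage sets the pace
    return effective * window_seconds
```

For example, a process time of 0.5 s caps the processor at 2 tasks per second, which is 600 tasks per five minutes, no matter how high rate is set.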

manbuheiniu

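Feb 2, 2015, 3:01:30 AM
to pyspide...@googlegroups.com, qiaozh...@gmail.com, r...@binux.me
OK, it no longer reports errors, so I'll let it keep running for now. I suspect the VPS may simply be underpowered. When I have time I'll try it on a higher-spec host. Thanks.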


joseph

Mar 26, 2020, 1:37:58 PM
to pyspider-users

Hi Roy.

I am scraping Yelp and get banned after a while, even with rate/burst set to 0.05/1.

I want to leave the system running for a long time, and I am wondering whether it would be possible to implement something like a random sleep in pyspider scripts.


I import these:

from time import sleep
from random import randint

MAX_SLEEP = 20000  # in milliseconds

Then I declared this function, both inside and outside of the handler:

# ----- SLEEP FUNCTION -----
def r_sleep():
    """Sleep for a random time between 10 seconds and MAX_SLEEP milliseconds."""
    length = randint(10000, MAX_SLEEP) / 1000.0
    print("Safety Random Sleep has started for {0} sec".format(length))
    sleep(length)
    print("Safety Random Sleep is over")


It is meant to be called as r_sleep().

But whether I put it inside the handler, outside it, or after the processing steps, it always gives me an error.

Where could I put this r_sleep() call so that it runs before crawling details_page, where the details of each business to scrape are sent?

Any direction would be very welcome.
Thanks.

Roy Binux

Mar 26, 2020, 4:47:36 PM
to joseph, pyspider-users
You shouldn't use sleep to control the speed. Rate will do the job.
Rate means queries per second; 0.05, for example, means 1 query per 20 seconds.

Reducing the frequency and hoping not to be blocked is naive; you have to use some proxies.
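
As a rough illustration of proxy rotation, the helper below hands out proxies from a list in round-robin order. It is a sketch under assumptions: `make_proxy_picker` is a hypothetical helper, the addresses are placeholders, and the commented call assumes the `proxy` argument of `self.crawl` is available in your pyspider version.

```python
import itertools


def make_proxy_picker(proxies):
    """Return a function that yields the given proxies in round-robin order."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    cycle = itertools.cycle(proxies)

    def next_proxy():
        return next(cycle)

    return next_proxy


# Placeholder addresses, assumed for the example only.
PROXIES = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]
next_proxy = make_proxy_picker(PROXIES)

# Inside a handler, each request could then go out through the next proxy:
#     self.crawl(url, callback=self.detail_page, proxy=next_proxy())
```

Round-robin keeps the load even across proxies. The rotation state lives in the process that runs the script, so several processes would each keep their own position in the list.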


ecom4...@gmail.com

Mar 26, 2020, 10:20:39 PM
to pyspider-users
Ok.

That is my next challenge: working on proxies.

Would RabbitMQ be useful for that?

Or do you have any suggestions for proxy rotation these days?

Thanks a lot.

Your program is so nice.
