Simulate a Search Bot


Chris

Mar 18, 2010, 8:25:45 PM
to scrapy-users
Hello Scrapy Users,

A friend of mine is using Python and BeautifulSoup for crawling.

He told me that with Python it is possible to simulate a search bot. This
can be useful for scraping pages that cut the connection after too many
requests, unless the requests come from a search bot.

So I was wondering if anyone knows whether simulating a search bot is
possible with Scrapy, which also runs on Python. How would this work?

Thanks.

Daniel Graña

Mar 18, 2010, 10:59:44 PM
to scrapy...@googlegroups.com
Does simulating a search bot mean:
* faking the user-agent? YES, scrapy can do this
* being gentle and following robots.txt rules? YES, scrapy can do this
* being nice and requesting at low rates? YES, scrapy can also do this

anything else? probably YES ;)

don't hesitate to join #scrapy at freenode!
dan




Rolando Espinoza La Fuente

Mar 18, 2010, 11:02:21 PM
to scrapy...@googlegroups.com

I'm not sure, but maybe your friend is referring to the user-agent?

If so, you can change the USER_AGENT setting in your project's settings.py:

USER_AGENT = "Googlebot/2.1 ( http://www.google.com/bot.html)"

Even though you can change the user-agent to "appear as Googlebot", the
remote servers will still log your IP and can notice that the requests
didn't come from Google's IPs.

Side note: basic scraping with Python usually consists of downloading pages
with urllib or curl and parsing the HTML with BeautifulSoup to extract links
and data. Scrapy uses libxml2 for XML/HTML parsing, which is more efficient
than BeautifulSoup.

* http://doc.scrapy.org/faq.html#how-does-scrapy-compare-to-beautifulsoul-or-lxml
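
As a rough illustration, extracting links with the XPath selectors in the Scrapy releases of that time looked something like this (the helper name is just for illustration; newer versions expose selectors differently):

from scrapy.selector import HtmlXPathSelector

def extract_links(response):
    # return the href attribute of every <a> element on the page
    hxs = HtmlXPathSelector(response)
    return hxs.select('//a/@href').extract()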

Regards,

Rolando

Daniel Graña

Mar 18, 2010, 11:43:20 PM
to scrapy...@googlegroups.com
On Thu, Mar 18, 2010 at 11:59 PM, Daniel Graña <dan...@gmail.com> wrote:
Does simulating a search bot mean:
* faking the user-agent? YES, scrapy can do this
 
Rolando already mentioned how to change the user-agent, so I am going to expand on the other topics:
http://doc.scrapy.org/topics/settings.html#user-agent
You can also override the user-agent per spider using the user_agent attribute on the spider class, as in the sketch below.
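
For example, a minimal sketch (the spider name and URL are hypothetical, and the BaseSpider import matches the Scrapy releases around the time of this thread; newer versions use scrapy.Spider instead):

from scrapy.spider import BaseSpider

class ExampleSpider(BaseSpider):
    name = "example"
    start_urls = ["http://www.example.com/"]

    # overrides the project-wide USER_AGENT setting for this spider only
    user_agent = "Googlebot/2.1 (+http://www.google.com/bot.html)"

    def parse(self, response):
        self.log("visited %s" % response.url)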

* being gentle and following robots.txt rules? YES, scrapy can do this

The robots.txt middleware is included by default, but you need to enable it with the setting described here:
http://doc.scrapy.org/topics/settings.html#robotstxt-obey
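
Something like this in settings.py (ROBOTSTXT_OBEY is the setting from the link above):

# settings.py
ROBOTSTXT_OBEY = True   # download each site's robots.txt and skip disallowed URLs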
 
* being nice and requesting at low rates? YES, scrapy can also do this

see:  http://doc.scrapy.org/topics/settings.html#download-delay
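
For example, in settings.py (the delay value is just illustrative; RANDOMIZE_DOWNLOAD_DELAY is a related setting that adds jitter, depending on your Scrapy version):

# settings.py
DOWNLOAD_DELAY = 2.0               # wait roughly 2 seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True    # vary the delay so the crawl looks less mechanical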

And in case you want to make crawling from a single spider appear to come from multiple IPs,
you can use a list of proxies and dispatch requests by setting request.meta['proxy'] from
a custom downloader middleware, along these lines:
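A rough sketch of such a middleware (the class name and proxy URLs are hypothetical; request.meta['proxy'] is the part Scrapy actually honours):

import random

class RandomProxyMiddleware(object):
    # hypothetical proxy list, replace with your own proxies
    PROXIES = [
        "http://proxy1.example.com:8080",
        "http://proxy2.example.com:8080",
    ]

    def process_request(self, request, spider):
        # pick a proxy per request; the downloader routes the request through it
        request.meta['proxy'] = random.choice(self.PROXIES)

You would then enable it through the DOWNLOADER_MIDDLEWARES setting in settings.py.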

good luck
dan

Christian S

Mar 19, 2010, 9:36:59 AM
to scrapy...@googlegroups.com

YOU ARE ALL GREAT!

Thank you so much for your great support and for sharing your wealth of knowledge!

I'll look into all the links and advice you gave me...

Best wishes,

Chris

 
