A friend of mine is using Python and BeautifulSoup for crawling.
He told me that with Python it is possible to simulate a search bot. This
can be useful for scraping pages that cut the connection after too many
requests unless they come from a search bot.
So I was wondering if anyone knows whether simulating a search bot is
possible with Scrapy, which also runs on Python. How would this work?
Thanks.
I'm not sure, but maybe your friend is referring to the user agent?
If so, you can change the USER_AGENT variable in your project's settings.py:
USER_AGENT = "Googlebot/2.1 (http://www.google.com/bot.html)"
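If you want to keep the project-wide default and only spoof the user agent for one spider, a per-spider override along these lines should also work (assuming a reasonably recent Scrapy version; the spider name and URL are just placeholders):

import scrapy

class FakeBotSpider(scrapy.Spider):
    name = "fakebot"                          # placeholder name
    start_urls = ["http://example.com/"]      # placeholder target

    # custom_settings overrides the project settings for this spider only
    custom_settings = {
        "USER_AGENT": "Googlebot/2.1 (+http://www.google.com/bot.html)",
    }

    def parse(self, response):
        # just log the page title to confirm the request went through
        self.logger.info(response.css("title::text").get())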
Even though you can change the user agent to "appear" to be the Google bot, the
remote servers will still log your IP and can notice that you didn't come from
Google's IP ranges.
Side note: basic scraping with Python usually consists of downloading pages with
urllib or curl and parsing the HTML with BeautifulSoup to extract links and data.
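As a rough sketch of that approach (Python 3 with bs4 here; example.com is just a stand-in URL):

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# fetch the page, optionally with a spoofed user agent
req = Request("http://example.com/",
              headers={"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"})
html = urlopen(req).read()

# parse the HTML and pull out every link on the page
soup = BeautifulSoup(html, "html.parser")
links = [a.get("href") for a in soup.find_all("a")]
print(links)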
Scrapy uses libxml2 for xml/html parsing, which is more efficient than
BeautifulSoup.
* http://doc.scrapy.org/faq.html#how-does-scrapy-compare-to-beautifulsoul-or-lxml
Regards,
Rolando
Does "simulate a search bot" mean:
* faking the user agent? YES, Scrapy can do this
* being gentle and following robots.txt rules? YES, Scrapy can do this
* being nice and requesting at low rates? YES, Scrapy can also do this
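All three boil down to a few entries in your project's settings.py, roughly like this (the values are only illustrative):

USER_AGENT = "Googlebot/2.1 (+http://www.google.com/bot.html)"   # fake user agent

ROBOTSTXT_OBEY = True     # fetch and respect each site's robots.txt
DOWNLOAD_DELAY = 2.0      # wait about 2 seconds between requests to the same site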
YOU ARE ALL GREAT!
Thank you so much for your great support and for sharing your abundance of knowledge!
I'll look into all the links and advice you gave me...
Best wishes,
Chris