Mechanize function in Scrapy

Sayth Renshaw

Mar 11, 2014, 5:53:48 AM

to scrapy...@googlegroups.com

Having completed and toyed with the tutorial, I have something I don't understand. What happens when my base URL features links and content that change daily?

I don't want all the data, only specific documents when they are updated on the page.

From the base URL, the path to the link for the page I want to scrape is body/div/div/div/div/table/tbody/tr/td/p/a.

So I want to navigate down that path for State and location details when they update. Will Scrapy allow me to do that, or do I need to employ something like Mechanize (https://pypi.python.org/pypi/mechanize/)?

Sayth

Pablo Hoffman

Apr 18, 2014, 11:40:17 AM

to scrapy-users

One way to do that is to keep track (in a disk file, for example) of already-seen URLs and content (along with their hashes) and check every scraped item against those in an item pipeline [1], dropping [2] the ones that were already seen.

[1] http://doc.scrapy.org/en/latest/topics/item-pipeline.html

[2] http://doc.scrapy.org/en/latest/topics/exceptions.html#scrapy.exceptions.DropItem
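
A minimal sketch of the pipeline idea described above, under some assumptions: items are dict-like and carry a `url` field, the hash covers the whole item (so a page whose content changed counts as new), and seen hashes are stored one per line in a text file. The class name `SeenItemsPipeline` and the file name `seen_hashes.txt` are made up for illustration. The fallback `DropItem` class only exists so the sketch can run without Scrapy installed.

```python
import hashlib
import json
import os

try:
    from scrapy.exceptions import DropItem
except ImportError:
    class DropItem(Exception):
        """Stand-in so the sketch runs without Scrapy installed."""


class SeenItemsPipeline:
    """Drop items whose URL + content hash was already recorded."""

    def __init__(self, path="seen_hashes.txt"):
        # Hypothetical default file name; any writable path works.
        self.path = path
        self.seen = set()
        self.file = None

    def open_spider(self, spider):
        # Load hashes recorded by earlier runs, then open for appending.
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as f:
                self.seen = {line.strip() for line in f if line.strip()}
        self.file = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        key = self.fingerprint(item)
        if key in self.seen:
            raise DropItem("already seen: %s" % dict(item).get("url"))
        # Remember the new hash in memory and on disk.
        self.seen.add(key)
        self.file.write(key + "\n")
        self.file.flush()
        return item

    @staticmethod
    def fingerprint(item):
        # Hash every field, so changed content at the same URL is "new".
        payload = json.dumps(dict(item), sort_keys=True, default=str)
        return hashlib.sha1(payload.encode("utf-8")).hexdigest()
```

To try it in a project, the class would be listed in the `ITEM_PIPELINES` setting, using whatever module path your project actually has. Unchanged documents are then dropped on each daily run, and only new or updated ones reach later pipelines or the exported output.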
