Noob question - rescraping for new data

Mike Hewitt

Aug 10, 2015, 8:34:53 AM

to beautifulsoup

I am just looking into using Beautiful Soup as the scraper/crawler for a search engine.

Generally speaking, is it possible to schedule scrapes (perhaps just a cron job on the server)? And can they be set up so that only new data is scraped, i.e. so you don't have to recrawl the whole website every time?

Travis N

Aug 18, 2015, 7:33:47 AM

to beautifulsoup

Also a noob here, but maybe this helps...

When run from the command line, curl has an option (-z or --time-cond) that makes the request conditional on a timestamp, so the file is only transferred if it has been modified since that time. I would guess that pycurl can do the same. The following links look like they might help:

http://www.programcreek.com/python/example/53130/pycurl.INFO_FILETIME

From the first link above:

"m['filetime'] = self.handle.getinfo(pycurl.INFO_FILETIME)"

http://pycurl.sourceforge.net/doc/curlobject.html#pycurl.Curl.getinfo
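
For what it's worth, here is a rough sketch of the same "only fetch if changed" idea using just the Python standard library rather than pycurl. It sends an If-Modified-Since header, which is what curl's -z option does under the hood, and treats a 304 reply as "nothing new". The function names are made up for illustration, and it only helps if the site's server actually honors If-Modified-Since; many dynamic pages don't.

```python
import email.utils
import urllib.error
import urllib.request


def build_conditional_request(url, last_fetch_epoch):
    """Build a GET request that asks the server to skip unchanged pages.

    last_fetch_epoch is the time of the previous scrape, in seconds since
    the Unix epoch (a hypothetical value you would store between runs).
    """
    # HTTP dates must be expressed in GMT.
    stamp = email.utils.formatdate(last_fetch_epoch, usegmt=True)
    return urllib.request.Request(url, headers={"If-Modified-Since": stamp})


def fetch_if_modified(url, last_fetch_epoch):
    """Return the page body as bytes, or None if the server answers 304."""
    request = build_conditional_request(url, last_fetch_epoch)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: nothing new to parse
            return None
        raise
```

If fetch_if_modified returns bytes, you would hand them to Beautiful Soup as usual; if it returns None, skip that page. Scheduling could then be an ordinary cron entry that runs the script, as suggested in the original question.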
