Noob question - rescraping for new data


Mike Hewitt

Aug 10, 2015, 8:34:53 AM
to beautifulsoup

I am just looking into using Beautiful Soup as a scraper/crawler for a search engine.

Generally speaking, is it possible to schedule scrapes (just a server cron job, perhaps?), and can they be set up so that only new data is scraped, i.e. so that you don't have to recrawl a whole website every time?

Travis N

Aug 18, 2015, 7:33:47 AM
to beautifulsoup
Also a noob here, but maybe this helps...

curl, when run from the command line, has an option (-z or --time-cond) that makes the download conditional on a timestamp: the file is only transferred if it has been modified after the given time. I would guess that pycurl can do the same. The following links look like they might help:


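From the links above:
"m['filetime'] = self.handle.getinfo(pycurl.INFO_FILETIME)"
(As far as I know, pycurl only fills in INFO_FILETIME if OPT_FILETIME was set to 1 before the request; otherwise it returns -1.)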
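
For what it's worth, the same "only fetch if changed" idea can be sketched with just the Python standard library, no pycurl needed. This is a rough sketch, not a tested crawler: the function names and the `opener` parameter are made up for illustration, and it assumes the server honours the `If-Modified-Since` header (many dynamic sites ignore it and always send the full page).

```python
import urllib.error
import urllib.request
from email.utils import formatdate


def build_conditional_request(url, last_fetch_ts):
    """Build a GET request that asks the server to skip unchanged pages.

    last_fetch_ts is a Unix timestamp of the previous successful fetch
    (hypothetical bookkeeping that the crawler would have to store itself).
    """
    request = urllib.request.Request(url)
    # HTTP dates must be in GMT, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
    request.add_header("If-Modified-Since", formatdate(last_fetch_ts, usegmt=True))
    return request


def fetch_if_modified(url, last_fetch_ts, opener=urllib.request.urlopen):
    """Return the page body as bytes, or None if the server answers 304."""
    request = build_conditional_request(url, last_fetch_ts)
    try:
        with opener(request) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: nothing new to scrape
            return None
        raise
```

A script run from cron could call `fetch_if_modified` for each known URL and only hand the result to Beautiful Soup when it is not `None`, storing the current time as the new `last_fetch_ts` for that URL afterwards.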