How to restrict the time horizon (like Jan/2015-Dec/2015) of crawling on a specific website by python code?~

22 views
Skip to first unread message

gangwan...@gmail.com

unread,
May 16, 2016, 8:16:23 AM5/16/16
to Common Crawl
I just learnt about Common Crawl for two days and not familiar with its details. Now I want to restrict the time within the whole year of 2015 focusing on forbes.com's news by python. I am not sure how to set the Common Crawl Index to satisfy this limitation in python code. Could anyone give me some examples for a similar target for refrence ? Thanks a lot !!!!

Tom Morris

unread,
May 16, 2016, 10:39:28 PM5/16/16
to common...@googlegroups.com
On Mon, May 16, 2016 at 8:16 AM, <gangwan...@gmail.com> wrote:
I just learnt about Common Crawl for two days and not familiar with its details. Now I want to restrict the time within the whole year of 2015 focusing on forbes.com's news by python. I am not sure how to set the Common Crawl Index to satisfy this limitation in python code. Could anyone give me some examples for a similar target for refrence ? Thanks a lot !!!!

There are 10 separate crawls spaced a month, more or less,  apart for 2015. They are each indexed separately, so you'll need to query all each of the indexes. e.g.


You can find the list of indexes and their corresponding API endpoints here: http://index.commoncrawl.org/

Before diving too deeply into your project, you probably want to verify that coverage of forbes.com looks like it's broad/deep enough for your needs.

Tom

Reply all
Reply to author
Forward
0 new messages