Hi Quanyi,
unfortunately, the 2008 - 2012 crawls use a different data format.
Consolidating the format is on our to-do list, but it is not done yet
and will take us some more time.
The best starting point for the format and location of the pre-2013
crawls is the old Common Crawl wiki [1]. You may also look through
discussions in this forum dating back to the early years.
Keep in mind that the data location has changed since then. You'll
find all necessary information on the Common Crawl website.
> web data before 2008
There are other web archives dating back to 2008 or before.
I'm not aware of any exhaustive list, but [2,3] are two projects that
come to mind. Please have a look at [4] for more pointers.
You should contact the maintainers of those web archives directly.
Best,
Sebastian
[1]
https://commoncrawl.atlassian.net/wiki/spaces/CRWL/pages/2850886/About+the+Data+Set
[2]
https://lemurproject.org/clueweb09.php/
[3]
http://dbpubs.stanford.edu:8091/~testbed/doc2/WebBase/
[4]
https://netpreserve.org/
On 9/24/25 12:48, Quanyi Hong wrote:
> Hi everyone! First of all, Common Crawl is amazing—huge thanks to everyone
> who’s built and maintained this project.
>
> I’m trying to get *Common Crawl data from 2008–2013*, but I couldn’t find