Looking for 2008–2013 Common Crawl data


Quanyi Hong

Sep 24, 2025, 6:50:17 AM
to Common Crawl
Hi everyone! First of all, Common Crawl is amazing—huge thanks to everyone who’s built and maintained this project.  

I’m trying to get Common Crawl data from 2008–2013, but I couldn’t find any clear download links or mirrors online. Does anyone know how to access those years?

Also, I’m wondering if it’s possible to fill in web data before 2008—are there any ready-made datasets or practical methods for collecting as many early web snapshots as possible?

Any tips, links, or experiences would be super appreciated. Thanks a lot!

Quanyi



Sebastian Nagel

Sep 24, 2025, 7:57:07 AM
to common...@googlegroups.com
Hi Quanyi,

Unfortunately, the 2008-2012 crawls use a different data format than
the later crawls. Consolidating the formats is on our list, but it is
not ready yet and will take us some more time.

The best starting point for the format and location of the pre-2013
crawls is the old Common Crawl wiki [1]. You may also look through
discussions in this forum dating back to the early years.
Keep in mind that the data location has changed since then. You'll
find all the necessary information on the Common Crawl website.

> web data before 2008

There are other web archives dating back to 2008 or before.

I'm not aware of any exhaustive list. [2] and [3] are two projects that
come to mind. Please have a look at [4] for more pointers.
You should also get in contact with the maintainers of these web archives.

Best,
Sebastian


[1] https://commoncrawl.atlassian.net/wiki/spaces/CRWL/pages/2850886/About+the+Data+Set
[2] https://lemurproject.org/clueweb09.php/
[3] http://dbpubs.stanford.edu:8091/~testbed/doc2/WebBase/
[4] https://netpreserve.org/
