Inquiry About Accessing Specific Website Data from the Common Crawl Archive

Sanchit Singh

Jan 24, 2025, 8:13:10 PM

to Common Crawl

Dear Common Crawl Team,

I hope this message finds you well! First, I’d like to extend my appreciation for the incredible work your team does in maintaining and providing access to such a valuable resource for researchers and developers worldwide.

I am currently working on a project that involves e-commerce data, and I’m exploring the possibility of retrieving all product pages from a specific website (e.g., Best Buy) that exist in the Common Crawl database. Is there a way to identify and access all the relevant data for a particular website from the Crawl Archive?

Any guidance on how to approach this—such as tools, techniques, or resources to filter and retrieve specific website data—would be greatly appreciated.

Thank you so much for your time and the fantastic service you provide. I look forward to hearing from you!

Best regards,
Sanchit Singh

Thom Vaughan

Jan 26, 2025, 6:26:18 AM

to Common Crawl

Hi Sanchit,

I see that you got an answer to this from Greg on our Discord server, and thought I'd post that here for the benefit of the Google group:

> We have 2 indexes that you can use to locate all of the pages on a webhost, and then you can use that information to download the actual content. One way to do that is with this tool https://github.com/cocrawler/cdx_toolkit -- which will spit out the pages into a WARC file, and then you'll want to use a WARC library like https://github.com/webrecorder/warcio to extract the actual content.

Further supporting links:

> https://commoncrawl.org/blog/index-to-warc-files-and-urls-in-columnar-format
> https://github.com/commoncrawl/whirlwind-python/
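
For readers who want to see the shape of the index-then-download flow Greg describes, here is a minimal sketch using only the Python standard library. It is not the cdx_toolkit or warcio API; those tools handle this for you. The helper names (build_index_query, parse_index_lines, warc_range_request) are made up for illustration, and the crawl ID "CC-MAIN-2024-51" is only an example; the current list of crawl IDs should be available at https://index.commoncrawl.org/collinfo.json. The sketch assumes the index server returns one JSON object per line with "filename", "offset", and "length" fields.

```python
import json
from urllib.parse import urlencode

# Assumed endpoints: the public CDX index server and the data host.
INDEX_SERVER = "https://index.commoncrawl.org"
DATA_SERVER = "https://data.commoncrawl.org"


def build_index_query(crawl_id, url_pattern):
    """Build a CDX index query URL for one crawl (hypothetical helper).

    A pattern such as "bestbuy.com/*" asks for every capture under that host.
    """
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_SERVER}/{crawl_id}-index?{params}"


def parse_index_lines(text):
    """Parse the index response: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def warc_range_request(record):
    """Turn one index record into a (url, headers) pair for a ranged GET.

    The byte range is inclusive, so the last byte is offset + length - 1.
    """
    offset = int(record["offset"])
    length = int(record["length"])
    url = f"{DATA_SERVER}/{record['filename']}"
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    return url, headers


if __name__ == "__main__":
    # Example crawl ID only; substitute one from collinfo.json.
    print(build_index_query("CC-MAIN-2024-51", "bestbuy.com/*"))
```

Fetching the query URL, then issuing the ranged request for each record, should yield one gzipped WARC record per capture, which a WARC library such as warcio can then parse. Large hosts may return many pages of index results, which is one reason to prefer cdx_toolkit or the columnar index for bulk work.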

Good luck with your data project.

TV
