Golang tool/package for Common Crawl data extraction


Rustem Kamalov

Jun 1, 2023, 1:21:26 AM
to Common Crawl
Hi guys!

I've developed a solution that can help you extract data from Common Crawl :)
You can use it as a separate tool, or import it into your Go project.
GitHub: https://github.com/karust/gogetcrawl

I hope the package will be useful to someone!

LUCKY OKEWORO

Jun 1, 2023, 4:39:03 AM
to common...@googlegroups.com
This is amazing, great work

--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/common-crawl/cd219df8-36db-4e4b-a986-0b72e7ead05cn%40googlegroups.com.

Sebastian Nagel

Jun 2, 2023, 9:05:39 AM
to common...@googlegroups.com
Hi Rustem,

thanks! I've added the tool to the list of code examples, tools and
libraries on the Common Crawl website:
https://commoncrawl.org/the-data/examples/

Best,
Sebastian

LUCKY OKEWORO

Jun 3, 2023, 3:00:14 PM
to common...@googlegroups.com
This is wonderful

tauseef jamal

Jun 3, 2023, 3:56:54 PM
to common...@googlegroups.com

Wesam Al-Nabki

Jun 4, 2023, 7:30:23 AM
to Common Crawl
Many thanks for this great tool. 

As I understand it, the `--ext` option lets us download files or HTML pages. Can it also be used to download the WET (clean text) files of a website? Also, can we set a date range with "from" and "to" timestamps?

Regards,
Wesam 

Rustem Kamalov

Jun 4, 2023, 7:06:35 PM
to Common Crawl
Hi Wesam,

You are right, the file extension to search for and download is set via `--ext` (the same can be done with `--filter` and a MIME type).
I've updated the tool, so you can now set a date range using the `--from` and `--to` arguments.

For WARC archives, the Index server's JSON response includes "length" and "offset" fields, which let us download a specific page.
Currently, I'm not sure how to do the same for WET, because the WET archives would have different offsets...
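
To make the "length" and "offset" idea concrete, here is a minimal Go sketch of fetching one WARC record with an HTTP Range request. It is not the tool's actual code. It assumes the index JSON line carries `filename`, `offset` and `length` as strings and that the archive is served from a base URL such as `https://data.commoncrawl.org/`; the type and function names and the sample line in `main` are hypothetical.

```go
package main

import (
	"compress/gzip"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"strconv"
)

// indexRecord holds the fields of one index JSON line used here.
// Assumption: the index server returns the numbers as JSON strings.
type indexRecord struct {
	Filename string `json:"filename"`
	Offset   string `json:"offset"`
	Length   string `json:"length"`
}

// parseIndexLine decodes a single JSON line of the index response.
func parseIndexLine(line string) (indexRecord, error) {
	var rec indexRecord
	err := json.Unmarshal([]byte(line), &rec)
	return rec, err
}

// rangeHeader builds the HTTP Range value covering exactly one record.
// Byte ranges are inclusive, so the last byte is offset+length-1.
func rangeHeader(rec indexRecord) (string, error) {
	off, err := strconv.ParseInt(rec.Offset, 10, 64)
	if err != nil {
		return "", fmt.Errorf("bad offset: %w", err)
	}
	n, err := strconv.ParseInt(rec.Length, 10, 64)
	if err != nil {
		return "", fmt.Errorf("bad length: %w", err)
	}
	if off < 0 || n <= 0 {
		return "", fmt.Errorf("invalid offset/length: %d/%d", off, n)
	}
	return fmt.Sprintf("bytes=%d-%d", off, off+n-1), nil
}

// fetchRecord downloads and decompresses one WARC record.
// baseURL is an assumption, e.g. "https://data.commoncrawl.org/".
func fetchRecord(baseURL string, rec indexRecord) ([]byte, error) {
	rng, err := rangeHeader(rec)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodGet, baseURL+rec.Filename, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", rng)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	// Each record is stored as its own gzip member, so the slice
	// can be decompressed on its own.
	zr, err := gzip.NewReader(resp.Body)
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

func main() {
	// Made-up sample line; a real one comes from the index server.
	line := `{"url": "https://example.com/", "filename": "crawl-data/example.warc.gz", "offset": "100", "length": "50"}`
	rec, err := parseIndexLine(line)
	if err != nil {
		panic(err)
	}
	rng, err := rangeHeader(rec)
	if err != nil {
		panic(err)
	}
	fmt.Println(rng) // prints "bytes=100-149"
}
```

The same approach would not carry over to WET files as-is, since the offsets in the index refer to the WARC file named in `filename`.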

Regards,
Rustem

On Sunday, June 4, 2023 at 14:30:23 UTC+3, wesam....@gmail.com wrote: