Downloading URLs from Common Crawl


Dr. Nisha T N

Feb 6, 2021, 5:36:04 AM
to Common Crawl
Hi researchers,

I am a researcher in information security, currently working on phishing attack detection.

Can anyone guide me on how to download URLs from the Common Crawl dataset?

Tom Alby

Feb 6, 2021, 5:42:26 AM
to Common Crawl
I have just gone through that, and I suggest using the Athena interface; it was the fastest way for me: https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
It costs a few dollars: I paid about $7 for extensive querying and for downloading the whole database with several additional columns.
Tom
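
For readers following the Athena route, here is a minimal sketch of building the kind of SQL statement one might paste into the Athena console. It is not Tom's actual query. It assumes the table was registered as "ccindex"."ccindex" as described in the linked post, and that the columns url, url_host_registered_domain, crawl and subset are available there. The helper name build_url_query and the crawl label used in the usage lines are illustrative only.

```python
import re

# Subsets assumed to exist in the columnar index (per the linked post).
ALLOWED_SUBSETS = {"warc", "crawldiagnostics", "robotstxt"}


def build_url_query(crawl: str, limit: int = 1000, subset: str = "warc") -> str:
    """Return an Athena SQL string listing URLs from a single crawl.

    Hypothetical helper: the table and column names are assumptions
    taken from the columnar index setup, not a verified schema.
    """
    # Validate inputs so nothing unexpected is spliced into the SQL text.
    if not re.fullmatch(r"CC-MAIN-\d{4}-\d{2}", crawl):
        raise ValueError(f"unexpected crawl label: {crawl!r}")
    if subset not in ALLOWED_SUBSETS:
        raise ValueError(f"unknown subset: {subset!r}")
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")

    # Filtering on crawl and subset restricts the scan to one partition,
    # which keeps the amount of data read (and therefore the cost) down.
    return (
        "SELECT url, url_host_registered_domain\n"
        'FROM "ccindex"."ccindex"\n'
        f"WHERE crawl = '{crawl}'\n"
        f"  AND subset = '{subset}'\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    # Example crawl label; replace it with the crawl you want to query.
    print(build_url_query("CC-MAIN-2021-04", limit=100))
```

Athena bills by the amount of data scanned, so it is worth starting with a small LIMIT and only the columns you need before exporting a full URL list.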

Dr. Nisha T N

Feb 6, 2021, 5:44:25 AM
to common...@googlegroups.com
Thank you very much, Tom. Let me try it.
