**Beginner** HELP


Thirumalai Raj R

9 Jan 2018, 2:06:53 am
to Common Crawl
Hey

I'm new to Common Crawl and WARC. Please help me extract the URLs for one particular domain from the "December Index".
Also, which is the more efficient choice for this, Python or Java?

Thanks
Thirumalai Raj, R.

Sebastian Nagel

9 Jan 2018, 3:13:20 am
to common...@googlegroups.com
Hi,

The easiest way is to use the parameter matchType=domain, e.g.,
http://index.commoncrawl.org/CC-MAIN-2017-51-index?url=example.com&matchType=domain
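A minimal Python sketch of the lookup above, assuming the index server is asked for `output=json` so that it returns one JSON object per line with a `url` field. The helper names `build_query_url` and `extract_urls` are made up for illustration, and the sample response is shortened, invented data.

```python
import json
from urllib.parse import urlencode

# Index endpoint taken from the example URL above (December 2017 crawl).
INDEX_ENDPOINT = "http://index.commoncrawl.org/CC-MAIN-2017-51-index"


def build_query_url(domain, endpoint=INDEX_ENDPOINT):
    """Build a domain-wide index query (hypothetical helper)."""
    params = {"url": domain, "matchType": "domain", "output": "json"}
    return endpoint + "?" + urlencode(params)


def extract_urls(response_text):
    """Collect the 'url' field from a response with one JSON object per line."""
    urls = []
    for line in response_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        urls.append(record["url"])
    return urls


# Invented sample of what two response lines might look like.
sample = (
    '{"urlkey": "com,example)/", "url": "http://example.com/", "status": "200"}\n'
    '{"urlkey": "com,example)/a", "url": "http://example.com/a", "status": "200"}\n'
)
print(build_query_url("example.com"))
print(extract_urls(sample))
```

To run it against the live index, the response text could be fetched with `urllib.request.urlopen(build_query_url("example.com")).read().decode("utf-8")` and passed to `extract_urls`.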

For a large domain you have to use the pagination API, see
https://github.com/ikreymer/pywb/wiki/CDX-Server-API#pagination-api
or use the CDX index client
https://github.com/ikreymer/cdx-index-client
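A rough sketch of how the pagination API could be driven from Python, assuming the behaviour described on the linked wiki page: a request with `showNumPages=true` returns a small JSON object containing a `pages` count, and each page is then requested with `page=N` starting from 0. The helper names are hypothetical and no request is actually sent here.

```python
import json
from urllib.parse import urlencode

# Same endpoint as in the matchType=domain example (December 2017 crawl).
INDEX_ENDPOINT = "http://index.commoncrawl.org/CC-MAIN-2017-51-index"


def num_pages_url(domain, endpoint=INDEX_ENDPOINT):
    """URL that asks the server how many result pages exist for the domain."""
    params = {"url": domain, "matchType": "domain", "showNumPages": "true"}
    return endpoint + "?" + urlencode(params)


def parse_num_pages(response_text):
    """Read the 'pages' count from the showNumPages response."""
    return int(json.loads(response_text)["pages"])


def page_query_urls(domain, pages, endpoint=INDEX_ENDPOINT):
    """One query URL per result page, numbered 0 .. pages-1."""
    return [
        endpoint + "?" + urlencode(
            {"url": domain, "matchType": "domain", "output": "json", "page": page}
        )
        for page in range(pages)
    ]


# Invented example of a showNumPages response.
pages = parse_num_pages('{"pageSize": 5, "blocks": 12, "pages": 3}')
for query_url in page_query_urls("example.com", pages):
    print(query_url)
```

Fetching the pages one at a time keeps each response small. The CDX index client linked above is a ready-made alternative to writing this loop by hand.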

If you want to extract not only URLs but also content (WARC records), please see this prior post
https://groups.google.com/d/msg/common-crawl/pQ34q-_EARU/FLFtvTfXAwAJ
or try to run this code snippet with the result from the index lookup
https://gist.github.com/sebastian-nagel/18d479bf203b328d2dded46639dd68d8
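As a hedged illustration (not necessarily what the linked gist does): each index result carries `filename`, `offset` and `length` fields, which is enough to request a single gzipped WARC record with an HTTP Range header. The download prefix below is an assumption and may differ from the one in use at the time of this thread. The helper names and the sample record are made up.

```python
import gzip

# Assumed download prefix for Common Crawl data files; adjust if it differs.
DATA_PREFIX = "https://data.commoncrawl.org/"


def record_request(index_record, prefix=DATA_PREFIX):
    """Return (url, headers) for fetching one WARC record by byte range."""
    offset = int(index_record["offset"])
    length = int(index_record["length"])
    url = prefix + index_record["filename"]
    # HTTP byte ranges are inclusive, hence the -1 on the end position.
    headers = {"Range": "bytes={}-{}".format(offset, offset + length - 1)}
    return url, headers


def decode_record(raw_bytes):
    """Decompress one gzipped WARC record into text."""
    return gzip.decompress(raw_bytes).decode("utf-8", errors="replace")


# Invented index result; the filename is a placeholder, not a real path.
example = {"filename": "crawl-data/example.warc.gz", "offset": "100", "length": "50"}
url, headers = record_request(example)
print(url, headers)
```

The returned URL and headers can be passed to any HTTP client, e.g. `urllib.request.Request(url, headers=headers)`, and the response body handed to `decode_record`.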

Best,
Sebastian

Thirumalai Raj R

9 Jan 2018, 4:54:29 am
to Common Crawl
Great. Thanks 