Common Crawl
**Beginner** HELP
Thirumalai Raj R
9 Jan 2018, 2:06:53 am
to Common Crawl
Hey,
I'm new to Common Crawl (CC) and WARC. Please help me extract the URLs for one particular domain from the "December Index".
Also, which would be the more efficient choice for this: Python or Java?
Thanks,
Thirumalai Raj, R.
Sebastian Nagel
9 Jan 2018, 3:13:20 am
to common...@googlegroups.com
Hi,
the easiest way is to use the parameter matchType=domain, e.g.,
http://index.commoncrawl.org/CC-MAIN-2017-51-index?url=example.com&matchType=domain
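As a rough illustration, the same lookup could be scripted in Python along these lines. This is only a sketch: the helper names are made up for the example, and it assumes the index server accepts `output=json` and returns one JSON object per line with a `url` field.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Index endpoint taken from the URL above (December 2017 crawl).
INDEX_ENDPOINT = "http://index.commoncrawl.org/CC-MAIN-2017-51-index"


def build_domain_query(domain, endpoint=INDEX_ENDPOINT):
    """Build the index URL for all captures of a domain and its subdomains."""
    # output=json is an assumption: it asks for one JSON object per line.
    params = {"url": domain, "matchType": "domain", "output": "json"}
    return endpoint + "?" + urlencode(params)


def parse_urls(response_text):
    """Pull the 'url' field out of each non-empty JSON line."""
    return [
        json.loads(line)["url"]
        for line in response_text.splitlines()
        if line.strip()
    ]


def fetch_domain_urls(domain):
    """Hypothetical convenience wrapper: query the index and return the URLs."""
    with urlopen(build_domain_query(domain)) as response:
        return parse_urls(response.read().decode("utf-8"))
```

Calling `fetch_domain_urls("example.com")` would hit the live index, so for a large domain the paginated variant is the safer choice.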
If it's a large domain, you have to use the pagination API, see
https://github.com/ikreymer/pywb/wiki/CDX-Server-API#pagination-api
or use the CDX index client
https://github.com/ikreymer/cdx-index-client
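For the pagination route, a minimal sketch might look like the following. It assumes the `showNumPages=true` and `page=N` parameters described on the linked wiki page, that pages are numbered from 0, and that the page-count response is a JSON object with a `pages` field; the function names are hypothetical.

```python
import json
from urllib.parse import urlencode

INDEX_ENDPOINT = "http://index.commoncrawl.org/CC-MAIN-2017-51-index"


def num_pages_query(domain, endpoint=INDEX_ENDPOINT):
    """URL that asks the server how many result pages the query has."""
    params = {"url": domain, "matchType": "domain", "showNumPages": "true"}
    return endpoint + "?" + urlencode(params)


def parse_num_pages(response_text):
    """Read the page count from the response (assumed to have a 'pages' field)."""
    return int(json.loads(response_text)["pages"])


def page_queries(domain, num_pages, endpoint=INDEX_ENDPOINT):
    """Yield one query URL per result page, numbered from 0."""
    for page in range(num_pages):
        params = {
            "url": domain,
            "matchType": "domain",
            "output": "json",
            "page": page,
        }
        yield endpoint + "?" + urlencode(params)
```

Each URL from `page_queries` can then be fetched one at a time, which keeps every single response small. The CDX index client linked above handles this for you.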
If you want to extract not only URLs but also content (WARC records), please see this prior post
https://groups.google.com/d/msg/common-crawl/pQ34q-_EARU/FLFtvTfXAwAJ
or try to run this code snippet with the result from the index lookup
https://gist.github.com/sebastian-nagel/18d479bf203b328d2dded46639dd68d8
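To give an idea of what fetching a record involves, here is a hedged sketch (not necessarily what the linked gist does). It assumes each index result carries `filename`, `offset` and `length` fields, that every record is stored as its own gzip member, and that the WARC files are served over HTTP from the prefix below; the prefix and the helper names are assumptions for illustration.

```python
import gzip
from urllib.request import Request, urlopen

# Assumption: public download prefix for Common Crawl data files.
DATA_PREFIX = "https://data.commoncrawl.org/"


def byte_range(offset, length):
    """Build an HTTP Range header value covering exactly one record."""
    offset, length = int(offset), int(length)  # index values may be strings
    return "bytes={}-{}".format(offset, offset + length - 1)


def fetch_record(index_entry, prefix=DATA_PREFIX):
    """Download one WARC record and return its decompressed bytes."""
    request = Request(
        prefix + index_entry["filename"],
        headers={
            "Range": byte_range(index_entry["offset"], index_entry["length"])
        },
    )
    with urlopen(request) as response:
        # The requested byte range is assumed to be one complete gzip member.
        return gzip.decompress(response.read())
```

The returned bytes would contain the WARC headers followed by the HTTP response headers and the page content.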
Best,
Sebastian
Thirumalai Raj R
9 Jan 2018, 4:54:29 am
to Common Crawl
Great. Thanks