to Common Crawl
I'm looking for some Facebook-based URLs. Normally I'd use regular expressions, but the CDX server API for search doesn't seem to allow those.
Is there a way to do it? Or a substitute?
Dominik Stadler
Dec 14, 2015, 5:42:39 PM
to common...@googlegroups.com
Hi,
It's not too hard to download the CDX index files and apply the regex
yourself. See, for example, the tool "DownloadURLIndex" that is included
in a little pet project of mine at
https://github.com/centic9/CommonCrawlDocumentDownload, which searches
the CDX index files for URLs with matching extensions/MIME types.
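For anyone who wants to try the "apply the regex yourself" route without that tool, here is a minimal sketch in Python. It is not how DownloadURLIndex works internally; it assumes each index line has the layout `urlkey timestamp {json}` with a `url` field inside the JSON, and the function name `matching_records` is made up for this example.

```python
import json
import re


def matching_records(lines, pattern):
    """Yield the JSON record of every index line whose URL matches pattern.

    Assumption: each line looks like 'urlkey timestamp {json}' and the
    JSON object carries the original URL under the key "url".
    """
    regex = re.compile(pattern)
    for line in lines:
        parts = line.rstrip("\n").split(" ", 2)
        if len(parts) != 3:
            continue  # skip lines that do not fit the assumed layout
        try:
            record = json.loads(parts[2])
        except json.JSONDecodeError:
            continue  # skip lines whose payload is not valid JSON
        if regex.search(record.get("url", "")):
            yield record


# Made-up sample lines in the assumed layout, for illustration only.
sample_lines = [
    'com,facebook)/events/123 20151125000000 {"url": "https://www.facebook.com/events/123", "mime": "text/html"}',
    'com,example)/ 20151125000001 {"url": "http://example.com/", "mime": "text/html"}',
]
hits = list(matching_records(sample_lines, r"facebook\.com/events/"))
```

For a downloaded shard, the same function can be fed from `gzip.open(path, "rt", encoding="utf-8")`, since the index files are gzip-compressed text.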
to common...@googlegroups.com
On Mon, Dec 14, 2015 at 2:19 PM, Visvanathan T <tvish...@gmail.com> wrote:
I'm looking for some facebook based urls. Normally, I'd use regular expressions, but the CDX server API for search doesn't seem to allow those?
Is there a way to do it? Or a substitute?
You can use the filter query parameter with filter=~url:<regex>, although I'm not sure whether it will time out if you're attempting to scan a large portion of the index (e.g. all of the .com TLD).
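As a rough illustration of that filter parameter, the sketch below only builds the query URL and leaves the actual HTTP request to you. The endpoint, the crawl label `CC-MAIN-2015-48`, the `output=json` parameter, and the example regex are assumptions for this example; check them against the index you actually want to query.

```python
from urllib.parse import urlencode

# Assumed endpoint and crawl label; substitute the index you want to search.
CDX_ENDPOINT = "https://index.commoncrawl.org/CC-MAIN-2015-48-index"


def build_cdx_query(url_pattern, url_regex, endpoint=CDX_ENDPOINT):
    """Return a CDX server query URL that filters results by a URL regex."""
    params = {
        "url": url_pattern,             # narrows the scan, e.g. a domain prefix
        "filter": "~url:" + url_regex,  # "~" asks for a regex match on the url field
        "output": "json",               # assumed: one JSON object per result line
    }
    return endpoint + "?" + urlencode(params)


query = build_cdx_query("facebook.com/*", r".*/events/.*")
```

Keeping the `url` pattern as narrow as possible should reduce how much of the index the server has to scan, which is where the timeout concern comes from.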
In its current state, it uses a single stream to download, decompress, and analyze the index, but it could be enhanced to do all 300 shards in parallel or just process a subset of the shards or use requests-cache to cache the downloads or ... Well, you get the idea.
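To make the "all shards in parallel" idea concrete, here is a small Python sketch using a thread pool. It is not taken from the tool being discussed: `scan_shards` and `fake_scan` are hypothetical names, the shard file names follow an assumed `cdx-00000.gz` pattern, and the per-shard work (download, decompress, match) is left to whatever function you pass in.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_shards(shard_names, scan_one, max_workers=8):
    """Run scan_one(shard_name) for every shard concurrently.

    scan_one is a caller-supplied function that returns a list of hits
    for one shard. Results are flattened in the order of shard_names.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_shard = pool.map(scan_one, shard_names)
        return [hit for hits in per_shard for hit in hits]


# Assumed naming scheme for the 300 shards; verify it against the real index.
all_shards = ["cdx-%05d.gz" % i for i in range(300)]


def fake_scan(shard_name):
    # Stand-in for the real per-shard work; pretends only shard 7 has a hit.
    return [shard_name] if shard_name == "cdx-00007.gz" else []


# Processing only a subset is just a matter of slicing the shard list.
subset_hits = scan_shards(all_shards[:10], fake_scan, max_workers=4)
```

Threads fit here because the work is mostly waiting on downloads; if the regex matching dominates instead, a process pool may be the better choice.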