Regex Matching on Common Crawl


Visvanathan T

Dec 14, 2015, 2:19:31 PM
to Common Crawl
I'm looking for some Facebook-based URLs. Normally I'd use regular expressions, but the CDX server API for search doesn't seem to allow them.
Is there a way to do it, or a substitute?


Dominik Stadler

Dec 14, 2015, 5:42:39 PM
to common...@googlegroups.com
Hi,

It's not too hard to download the CDX index files and apply the regex
yourself. See e.g. the tool "DownloadURLIndex", which is included in a
little pet project of mine at
https://github.com/centic9/CommonCrawlDocumentDownload and searches
the CDX index files for URLs with matching extensions/MIME types.
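The core of that approach can be sketched in a few lines of Python. This is a minimal illustration, not the actual DownloadURLIndex code: each CDX index line is a SURT key, a timestamp, and a JSON blob, and once you have the (gzip-compressed) shard files downloaded and decompressed, filtering them by a URL regex is straightforward. The sample lines below are made up for illustration.

```python
import json
import re

def filter_cdx_lines(lines, url_pattern):
    """Yield CDX index records whose original URL matches the regex.

    Each CDX line has the form "<SURT key> <timestamp> <JSON blob>";
    the JSON blob carries the original "url" field. In practice the
    lines would come from gzip-decompressed cdx-*.gz shard files.
    """
    pattern = re.compile(url_pattern)
    for line in lines:
        # Split off the SURT key and timestamp; the remainder is JSON.
        _, _, payload = line.split(" ", 2)
        record = json.loads(payload)
        if pattern.search(record["url"]):
            yield record

# Hypothetical sample lines in the CDX index format:
sample = [
    'com,facebook)/zuck 20151214000000 {"url": "https://www.facebook.com/zuck", "mime": "text/html"}',
    'com,example)/ 20151214000000 {"url": "http://example.com/", "mime": "text/html"}',
]
matches = list(filter_cdx_lines(sample, r"^https?://(www\.)?facebook\.com/"))
```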

Dominik.

Tom Morris

Dec 14, 2015, 5:59:35 PM
to common...@googlegroups.com
On Mon, Dec 14, 2015 at 2:19 PM, Visvanathan T <tvish...@gmail.com> wrote:
> I'm looking for some facebook based urls. Normally, I'd use regular expressions, but the CDX server API for search doesn't seem to allow those?
> Is there a way to do it? Or a substitute?

You can use the filter query parameter with filter=~url:<regex>, although I'm not sure whether it will time out if you're attempting to scan a large portion of the index (e.g. all of the .com TLD).
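Building such a query could look like the sketch below. The collection name is an assumption (check index.commoncrawl.org for the collections that actually exist), and the coarse url prefix plus the regex filter shown are just illustrative values; only the filter=~url:<regex> syntax comes from the suggestion above.

```python
from urllib.parse import urlencode

# Hypothetical collection name; the index server lists the real ones.
CDX_API = "http://index.commoncrawl.org/CC-MAIN-2015-48-index"

params = {
    "url": "facebook.com/*",                      # coarse prefix match to narrow the scan
    "filter": r"~url:.*facebook\.com/pages/.*",   # server-side regex filter (illustrative)
    "output": "json",
}
query_url = CDX_API + "?" + urlencode(params)
```

Fetching query_url (e.g. with urllib or requests) then returns one JSON record per matching capture.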

If that doesn't work, you could adapt the small Python program that I posted back in April to scan the URL index: https://gist.github.com/tfmorris/ab89ed13e2e52830aa6c

In its current state, it uses a single stream to download, decompress, and analyze the index, but it could be enhanced to process all 300 shards in parallel, handle just a subset of the shards, use requests-cache to cache the downloads, or ... well, you get the idea.
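The shard-parallel enhancement could be sketched with a thread pool along these lines. scan_shard here is a hypothetical stand-in for the gist's download/decompress/regex step, not code from the gist itself:

```python
from concurrent.futures import ThreadPoolExecutor

def scan_shard(shard_id):
    # Stand-in: the real version would stream and decompress the
    # cdx-{shard_id:05d}.gz shard and apply the URL regex to each line.
    return shard_id, 0  # (shard id, match count) placeholder

shard_ids = range(300)  # the index is split into ~300 shards
with ThreadPoolExecutor(max_workers=8) as pool:
    # map() preserves shard order while the downloads run concurrently
    results = list(pool.map(scan_shard, shard_ids))
```

Threads fit here because the work is I/O-bound (network download dominates); a process pool would only pay off if the regex scan itself became the bottleneck.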

Tom