How to use the data for my use case


hemant thakkar

Oct 29, 2018, 2:29:56 PM
to Common Crawl
Hi,

I have gone through the docs, but it is not clear how to accomplish the following use case.

Use case:
Given a set of URLs (pointing to home page of various Websites), I want to retrieve the corresponding Jobs or Careers page URL if one exists for the given site.

I would appreciate any suggestions or pointers.

Thanks,
Hemant

Sebastian Nagel

Oct 30, 2018, 5:42:20 AM
to common...@googlegroups.com
Hi Hemant,

if you can identify the job pages by their URLs,
I would recommend using the (columnar) URL index.

First, you need to define a regular expression
to identify job pages. A very simple one:
(job|career|employ|openings|opportunities)

Then you can use the Common Crawl URL index, e.g.,
to find job pages for the domain "museums.ca" in the
latest monthly crawl:

https://index.commoncrawl.org/CC-MAIN-2018-43-index?url=museums.ca&matchType=domain&filter=~url:.*(job|career|employ|openings|opportunities)&output=json

Have a look at the API documentation [1] for how to iterate over the results.
You may also use Greg's toolkit [2] to download the page content.
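The index query above can also be scripted. A minimal Python sketch (assuming the `requests` library; the JSON field names `url`, `filename`, `offset`, `length` follow the CDX server's JSON output, see [1]):

```python
import json
import requests

JOB_RE = "(job|career|employ|openings|opportunities)"

def build_index_query(domain, crawl="CC-MAIN-2018-43"):
    """Build the index endpoint URL and query parameters for a domain."""
    api = "https://index.commoncrawl.org/%s-index" % crawl
    params = {
        "url": domain,
        "matchType": "domain",
        "filter": "~url:.*%s" % JOB_RE,
        "output": "json",
    }
    return api, params

def job_page_records(domain, crawl="CC-MAIN-2018-43"):
    """Fetch matching captures; the server returns one JSON object per line."""
    api, params = build_index_query(domain, crawl)
    resp = requests.get(api, params=params)
    resp.raise_for_status()
    return [json.loads(line) for line in resp.text.splitlines()]

# Example:
#   for rec in job_page_records("museums.ca"):
#       print(rec["url"], rec["filename"], rec["offset"], rec["length"])
```

Each returned record carries the WARC filename, offset, and length needed to fetch the page content later.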

If you have an AWS account you can query the columnar index with Athena [3].
Same as above but for the 3 latest crawls:

SELECT url,
       warc_filename,
       warc_record_offset,
       warc_record_length
FROM "ccindex"."ccindex"
WHERE (crawl = 'CC-MAIN-2018-43'
       OR crawl = 'CC-MAIN-2018-39'
       OR crawl = 'CC-MAIN-2018-34')
  AND subset = 'warc'
  AND regexp_like(url_path, '(job|career|employ|openings|opportunities)')
  AND url_host_registered_domain = 'museums.ca'

This returns 904 results, which you can download based on the WARC filename, offset, and
length. See [4,5] for how to do this and for more examples.
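Fetching one of those records can be sketched in Python with an HTTP Range request (the bucket base URL below is an assumption; use whichever public endpoint the project currently documents, and note each record in the archives is a single gzip member):

```python
import gzip
import io
import requests

# Base URL of the public Common Crawl data bucket (an assumption here).
DATA_URL = "https://data.commoncrawl.org/"

def byte_range(offset, length):
    """HTTP Range header value for one record: inclusive start and end."""
    return "bytes=%d-%d" % (offset, offset + length - 1)

def decompress_record(gz_bytes):
    """Each WARC record in the crawl archives is one gzip member."""
    return gzip.GzipFile(fileobj=io.BytesIO(gz_bytes)).read()

def fetch_warc_record(warc_filename, offset, length):
    """Download one WARC record (WARC headers + HTTP response) by range."""
    resp = requests.get(DATA_URL + warc_filename,
                        headers={"Range": byte_range(offset, length)})
    resp.raise_for_status()
    return decompress_record(resp.content)

# Example (filename, offset, length come from one row of the result set;
# the path shown is hypothetical):
#   record = fetch_warc_record(
#       "crawl-data/CC-MAIN-2018-43/segments/.../....warc.gz",
#       offset=123456, length=7890)
```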

As noted, you need an AWS account to use Athena, but it's cheap (less than one cent for the
query above).


Best,
Sebastian


[1] https://github.com/webrecorder/pywb/wiki/CDX-Server-API#api-reference
[2] https://github.com/cocrawler/cdx_toolkit
[3] https://aws.amazon.com/athena/
[4] https://github.com/commoncrawl/cc-index-table
[5] http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

hemant thakkar

Oct 30, 2018, 2:04:31 PM
to Common Crawl
Hi Sebastian,
Thank you for the detailed explanation and help. I will go through the approach you have outlined.
Regards,
Hemant

Chris Brooks

Jan 9, 2019, 11:30:36 PM
to Common Crawl
Is there a way to use the Common Crawl URL index to search all websites within a top-level-domain (TLD, like ".com" for example)?

It seems to work for smaller TLDs, so a query like this one will return all URLs on the .va TLD that match the regex /.*news.*/   

If I wanted to run a similar query across larger TLDs (or across the entire crawl), would I be better off using the columnar index with Athena?  Practically speaking (like, for less than USD $1 and in less than an hour), could I say "show me every URL in the crawl that matches a particular string"?

This is fascinating -- thank you!
-Chris

Sebastian Nagel

Jan 10, 2019, 2:54:49 AM
to common...@googlegroups.com
Hi Chris,

> But, not for larger TLDs, like .com or .net.

For larger result sets you need to use the pagination API [1]
or one of the CDX clients [2,3].

However, please do not do this for .com, which accounts for about 50% of
all captures. It's much faster to download the entire
URL index (or the 50% of index files which hold the .com TLD)
and process it offline.
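For result sets that do fit the pagination API, the loop can be sketched in Python (assuming the `requests` library; `showNumPages` and `page` are the CDX server's pagination parameters, see [1]):

```python
import json
import requests

API = "https://index.commoncrawl.org/CC-MAIN-2018-43-index"

def page_params(base, page):
    """Return a copy of the query parameters targeting one result page."""
    merged = dict(base)
    merged["page"] = str(page)
    return merged

def paged_query(params):
    """Yield every result record of a query, page by page."""
    # 1. Ask the server how many pages the result set spans.
    count = requests.get(API, params=dict(params, showNumPages="true"))
    count.raise_for_status()
    num_pages = json.loads(count.text)["pages"]
    # 2. Fetch each page in turn; one JSON object per line.
    for page in range(num_pages):
        resp = requests.get(API, params=page_params(params, page))
        resp.raise_for_status()
        for line in resp.text.splitlines():
            yield json.loads(line)

# Example: captures under the .va TLD whose URL contains "news".
#   for rec in paged_query({"url": "*.va", "matchType": "domain",
#                           "filter": "~url:.*news.*", "output": "json"}):
#       print(rec["url"])
```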

> better off using the columnar index with Athena?

Yes, if you only need a smaller subset identified by a regex.
See [4,5] for examples.

> for less than USD $1

Yes. Athena is cheap if only a small amount of data is read. My recommendations:
- restrict the query to one or a few "crawl" and "subset" partitions
- select only the output fields you really need
- for testing, first run the SELECT on a single domain or a small TLD

> and in less than an hour

Depending on how much data is read: anywhere from 10 seconds to a few minutes.

Best,
Sebastian


[1] https://github.com/webrecorder/pywb/wiki/CDX-Server-API#pagination-api
[2] https://github.com/ikreymer/cdx-index-client
[3] https://pypi.org/project/cdx-toolkit/
[4] https://groups.google.com/d/msg/common-crawl/EBYaos2Yk1M/aSklsmCQBwAJ
[5] https://groups.google.com/d/msg/common-crawl/FveBjZxthaY/cFLKAvnfBAAJ

Chris Brooks

Jan 11, 2019, 3:03:27 PM
to Common Crawl
Thank you Sebastian, I really appreciate the reply!
-Chris