Hi Hemant,
if you can identify job pages by their URL,
I would recommend using the (columnar) URL index.
First, you need to define a regular expression
to identify job pages. A very simple one:
(job|career|employ|openings|opportunities)
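Before running it against the index, the pattern can be tried out locally. Below is a minimal Python sketch; the sample URLs and the helper name looks_like_job_page are made up for illustration and do not come from the crawl:

```python
import re

# The simple pattern from above, compiled once for reuse.
JOB_PATTERN = re.compile(r"(job|career|employ|openings|opportunities)")


def looks_like_job_page(url):
    """Return True if the pattern matches anywhere in the URL."""
    return JOB_PATTERN.search(url) is not None


# Hypothetical sample URLs, for illustration only.
sample_urls = [
    "https://museums.ca/site/careers",
    "https://museums.ca/about",
    "https://example.org/employment/openings",
]

for url in sample_urls:
    print(url, looks_like_job_page(url))
```

Since the pattern is unanchored and case-sensitive, it also matches substrings such as "employee" and misses "Jobs" with a capital letter, so expect some false positives and negatives.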
Then you can use the Common Crawl URL index, e.g.,
to find job pages for the domain "museums.ca" in the
latest monthly crawl:
https://index.commoncrawl.org/CC-MAIN-2018-43-index?url=museums.ca&matchType=domain&filter=~url:.*(job|career|employ|openings|opportunities)&output=json
Have a look at the API documentation [1] for how to iterate over the results.
You may also use Greg's toolkit [2] to download the page content.
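As a rough sketch of the same query from Python, using only the standard library: the function names are my own, the "page" parameter is the pagination parameter described in [1], and I assume that output=json returns one JSON object per line. Treat it as a starting point, not as a finished client.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs
from urllib.request import urlopen

# Index endpoint and filter copied from the example URL above.
INDEX_ENDPOINT = "https://index.commoncrawl.org/CC-MAIN-2018-43-index"
JOB_FILTER = "~url:.*(job|career|employ|openings|opportunities)"


def build_query_url(domain, page=None):
    """Build the index query URL for one domain (optionally one result page)."""
    params = {
        "url": domain,
        "matchType": "domain",
        "filter": JOB_FILTER,
        "output": "json",
    }
    if page is not None:
        params["page"] = page  # pagination parameter, see [1]
    return INDEX_ENDPOINT + "?" + urlencode(params)


def parse_results(body):
    """Parse a response body, assuming one JSON object per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def fetch_page(domain, page=0):
    """Fetch and parse one page of results (performs a network request)."""
    with urlopen(build_query_url(domain, page)) as response:
        return parse_results(response.read().decode("utf-8"))


if __name__ == "__main__":
    for record in fetch_page("museums.ca"):
        print(record.get("url"), record.get("filename"),
              record.get("offset"), record.get("length"))
```

To iterate over all results, increase the page number until the server returns no more records, and please pause between requests to be gentle with the index server.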
If you have an AWS account you can query the columnar index with Athena [3].
The same query as above, but for the three latest crawls:
SELECT url,
warc_filename,
warc_record_offset,
warc_record_length
FROM "ccindex"."ccindex"
WHERE (crawl = 'CC-MAIN-2018-43'
OR crawl = 'CC-MAIN-2018-39'
OR crawl = 'CC-MAIN-2018-34')
AND subset = 'warc'
AND regexp_like(url_path, '(job|career|employ|openings|opportunities)')
AND url_host_registered_domain = 'museums.ca'
You get 904 results, which you can download based on the WARC filename, offset and
length. See [4,5] for how to do this and for more examples.
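To illustrate the download step, here is a minimal Python sketch that fetches a single record with an HTTP range request. The base URL is an assumption on my side (check the Common Crawl website for the current data endpoint), and the function names are hypothetical. It relies on every WARC record being compressed as its own gzip member, so the byte slice can be decompressed on its own.

```python
import gzip
from urllib.request import Request, urlopen

# Assumption: public HTTP endpoint serving the crawl data.
DATA_BASE_URL = "https://data.commoncrawl.org/"


def byte_range(offset, length):
    """Return the HTTP Range header value for one record.

    The range is inclusive on both ends, hence the "- 1".
    """
    return "bytes={}-{}".format(offset, offset + length - 1)


def fetch_record(warc_filename, offset, length):
    """Download and decompress one WARC record (performs a network request)."""
    request = Request(
        DATA_BASE_URL + warc_filename,
        headers={"Range": byte_range(offset, length)},
    )
    with urlopen(request) as response:
        # The slice is a complete gzip member holding exactly one record.
        return gzip.decompress(response.read())
```

Call fetch_record with the warc_filename, warc_record_offset and warc_record_length values of one result row. The returned bytes contain the WARC headers, the HTTP headers and the page content.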
As mentioned, you need an AWS account to use Athena, but it's cheap (less than one cent for the
query above).
Best,
Sebastian
[1] https://github.com/webrecorder/pywb/wiki/CDX-Server-API#api-reference
[2] https://github.com/cocrawler/cdx_toolkit
[3] https://aws.amazon.com/athena/
[4] https://github.com/commoncrawl/cc-index-table
[5] http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/