Individual Web Pages


Scott Terry

Jan 23, 2023, 3:38:07 PM
to Common Crawl
Hi.  Noob here.

Situation: I know which pages/URLs I would like to retrieve.

Is it possible to extract individual pages from an archive over HTTPS, without having to download the entire archive?

If so, how?

Thanks in advance!
Scott

Sebastian Nagel

Jan 24, 2023, 9:41:32 AM
to common...@googlegroups.com
Hi Scott,

Yes, it's possible to extract individual pages from the archives
using one of the URL indexes. See
https://commoncrawl.org/access-the-data/
or
https://groups.google.com/g/common-crawl/c/phYQfJh_M0A/m/JsRwH62-BQAJ
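
For a single page, the whole round trip can be scripted: query the CDX index server for the URL, then fetch just that record's bytes with an HTTP range request. A minimal sketch in Python (the crawl ID CC-MAIN-2023-06 and the target URL are only examples):

import gzip
import json
import requests

# Look up a URL in one crawl's CDX index (CC-MAIN-2023-06 is just an example).
resp = requests.get(
    "https://index.commoncrawl.org/CC-MAIN-2023-06-index",
    params={"url": "commoncrawl.org/", "output": "json"},
)
resp.raise_for_status()
record = json.loads(resp.text.splitlines()[0])  # first capture of this URL

# Fetch only that record's bytes with an HTTP range request.
offset, length = int(record["offset"]), int(record["length"])
warc = requests.get(
    "https://data.commoncrawl.org/" + record["filename"],
    headers={"Range": f"bytes={offset}-{offset + length - 1}"},
)
warc.raise_for_status()

# Each WARC record is an independently gzipped member, so the
# requested byte range decompresses on its own.
print(gzip.decompress(warc.content).decode("utf-8", errors="replace")[:1000])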

If it's about many pages, you'll probably prefer to use the columnar index:

https://nbviewer.org/github/commoncrawl/cc-notebooks/blob/main/cc-index-table/bulk-url-lookups-by-table-joins.ipynb
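
The notebook above covers the details; at its core, a bulk lookup is just a join between your URL list and the index table. A rough PySpark sketch (the crawl partition and the urls.txt input file are placeholders, and the exact-match join is a simplification of the notebook's approach):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cc-url-lookup").getOrCreate()

# Columnar URL index (Parquet), partitioned by crawl and subset.
ccindex = spark.read.parquet("s3a://commoncrawl/cc-index/table/cc-main/warc/")

# Hypothetical one-URL-per-line input list.
urls = spark.read.text("urls.txt").withColumnRenamed("value", "url")

matches = (
    ccindex
    .where("crawl = 'CC-MAIN-2023-06' AND subset = 'warc'")  # example crawl
    .join(urls, "url")  # simplified: exact URL match, including scheme
    .select("url", "warc_filename", "warc_record_offset", "warc_record_length")
)
matches.write.csv("warc-records")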

You could also use a big data tool to pick the WARC records or fetch
them from us-east-1 to limit network latency.
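
For example, once an index lookup has given you (filename, offset, length) triples, each record is one ranged GET against the commoncrawl bucket in us-east-1. A sketch using boto3 (the filename, offset, and length values below are placeholders; take them from an index lookup):

import gzip
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# The bucket is a public dataset, so unsigned requests work.
s3 = boto3.client("s3", region_name="us-east-1",
                  config=Config(signature_version=UNSIGNED))

# Placeholder values; use the triple returned by an index lookup.
filename = "crawl-data/CC-MAIN-2023-06/segments/.../warc/....warc.gz"
offset, length = 123456, 7890

obj = s3.get_object(
    Bucket="commoncrawl",
    Key=filename,
    Range=f"bytes={offset}-{offset + length - 1}",
)
record = gzip.decompress(obj["Body"].read())
print(record[:200])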

Best,
Sebastian

Scott Terry

Jan 25, 2023, 3:28:33 PM
to Common Crawl
Thank you so much, Sebastian.  I'll check the URL methods out.

Regarding the big data tool, I don't recall seeing it in any documentation.  Is there a link?

Thanks again!

Scott

Sebastian Nagel

Jan 30, 2023, 5:15:42 AM
to common...@googlegroups.com
Hi Scott,

> Regarding the big data tool, I don't recall seeing it in any
> documentation. Is there a link?

Two Spark jobs:
- (Java) store a list of WARC records (given by file path, offset, and length)
  into new WARC files:
  https://github.com/commoncrawl/cc-index-table#export-subsets-of-the-common-crawl-archives
- (Python) process WARC records:
  https://github.com/commoncrawl/cc-pyspark/blob/main/cc_index_word_count.py

But there are other ways to go, e.g. using AWS Lambda:

https://medium.com/@jaderd/one-click-to-download-exactly-the-web-pages-you-may-want-no-matter-how-many-they-are-d4834265a0a3

Best,
Sebastian


Aliz 'Randomdude'

Feb 9, 2023, 10:16:32 PM
to common...@googlegroups.com
Hi, 

Hope you'll excuse something of a corporate 'plug' here :)

There's some example code I wrote a while back for my company that does just this: a Python script that fetches a named URL from the Common Crawl dataset. It's at https://github.com/watchtowrlabs/common-crawl/blob/main/fetchURL.py, and setup is detailed in the accompanying blog post: https://labs.watchtowr.com/all-around-the-world-the-common-crawl-dataset/

Hope that's useful - it's not my intention to spam the company blog, it just seems genuinely relevant! Let me know if I'm out of order and I'll adjust my posting accordingly.

- Aliz Hammond



Sebastian Nagel

Feb 14, 2023, 8:56:00 AM
to common...@googlegroups.com
Hi Aliz,

> https://labs.watchtowr.com/all-around-the-world-the-common-crawl-dataset/

Thanks for sharing this nice example. It's already on our list of examples:

https://commoncrawl.org/the-data/examples/

Best,
Sebastian
