Retrieving text


Lewis Blackwell

unread,
Jul 19, 2020, 7:06:05 AM7/19/20
to Common Crawl

Hi Everyone,

 

I have been experimenting with the CC URL index by querying various domains and reviewing the results in JSON format. From my research I learned that I can access the archived copy of a specific article by appending the url value found in each JSON object to the URL of the CC URL index.

Is there a better way to retrieve the html from each of the urls for a specific domain? I appreciate your feedback.

Thanks

Sebastian Nagel

unread,
Jul 20, 2020, 4:42:21 AM7/20/20
to common...@googlegroups.com
Hi Lewis,

please have a look at Greg's cdx-toolkit
https://pypi.org/project/cdx-toolkit/
a nice tool which should fit your use case.

Below is an explanation of how you could implement it yourself.
In case you need to scale to millions of pages, there is
the columnar index, which allows you to pick records using Spark.

> Is there a better way to retrieve the html from each of the urls for a specific domain?

Once you have picked the list of URLs and the corresponding WARC records, it's much faster to send a range request directly to S3.

E.g. here for the Common Crawl terms of use fetched back in 2017:
http://index.commoncrawl.org/CC-MAIN-2017-34-index?url=http://commoncrawl.org/terms-of-use/&output=json

{"urlkey": "org,commoncrawl)/terms-of-use",
 "timestamp": "20170824102900",
 "filename": "crawl-data/CC-MAIN-2017-34/segments/1502886133449.19/warc/CC-MAIN-20170824101532-20170824121532-00090.warc.gz",
 "digest": "JXY77DKLYTHWKLKALRFKJQ4LM23FMLEV",
 "url": "http://commoncrawl.org/terms-of-use/",
 "mime": "text/html",
 "length": "7058",
 "mime-detected": "text/html",
 "offset": "91300676",
 "status": "200"}

With the given WARC file name, offset and length it's trivial to fetch a single WARC record from the archives: just send an HTTP range
request from $offset to ($offset+$length-1) to commoncrawl.s3.amazonaws.com and uncompress the response body. Here is an example of how
to fetch the terms-of-use record using curl and gzip:

curl -s -r91300676-$((91300676+7058-1)) \
  "https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-34/segments/1502886133449.19/warc/CC-MAIN-20170824101532-20170824121532-00090.warc.gz" \
  | gzip -dc
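The same range request can also be sketched in Python with only the standard library. This is a minimal illustration, not an official client; the byte-range arithmetic mirrors the curl command above, and the values come from the index record shown earlier:

```python
import gzip
import urllib.request

def byte_range(offset, length):
    """HTTP Range header value covering one WARC record (inclusive byte range)."""
    return "bytes=%d-%d" % (offset, offset + length - 1)

def fetch_warc_record(warc_url, offset, length):
    """Fetch a single gzipped WARC record via an HTTP range request and decompress it."""
    req = urllib.request.Request(warc_url,
                                 headers={"Range": byte_range(offset, length)})
    with urllib.request.urlopen(req) as response:
        return gzip.decompress(response.read())

# Values taken from the CDX index record shown above:
print(byte_range(91300676, 7058))  # bytes=91300676-91307733
# record = fetch_warc_record(
#     "https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-34/"
#     "segments/1502886133449.19/warc/"
#     "CC-MAIN-20170824101532-20170824121532-00090.warc.gz",
#     91300676, 7058)
```

The actual fetch is left commented out so the snippet runs without network access; the decompressed bytes would start with the WARC record header.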

There are other options to send a range request, see
https://groups.google.com/g/common-crawl/c/8vnQnUA-0-0/m/8aT5g-9SFgAJ

Best,
Sebastian
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com
> <mailto:common-crawl...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/common-crawl/ba1951e2-b804-49e8-a136-90f6687f6c75o%40googlegroups.com
> <https://groups.google.com/d/msgid/common-crawl/ba1951e2-b804-49e8-a136-90f6687f6c75o%40googlegroups.com?utm_medium=email&utm_source=footer>.

Lewis Blackwell

unread,
Jul 20, 2020, 11:28:23 AM7/20/20
to common...@googlegroups.com

Hi Sebastian,

Thank you for your excellent feedback! I am currently experimenting with the methods you suggested.

Does selecting JSON metadata with "mime-detected": "text/html" ensure that I am always processing HTML data?

Thanks,

Lewis    


Sebastian Nagel

unread,
Jul 20, 2020, 12:06:20 PM7/20/20
to common...@googlegroups.com
Hi Lewis,

> Does selecting JSON metadata with "mime-detected": "text/html" ensure that I am always
> processing HTML data?

The MIME type is detected by Tika [1]. Of course, there is no guarantee that the detection
is always correct: there are billions of URLs/pages, and I'm sure there are edge cases
where the field "mime-detected" is wrong, even for trivial types such as "text/html".

Note: there is also "application/xhtml+xml" - you'll probably want to include it as well.
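Filtering the CDX API output on this field is straightforward. A minimal sketch (the sample records below are made up for illustration; the real API returns one JSON object per line):

```python
import json

# Hypothetical CDX API output lines (output=json: one JSON object per line)
cdx_lines = [
    '{"url": "http://example.com/page", "mime-detected": "text/html"}',
    '{"url": "http://example.com/doc", "mime-detected": "application/pdf"}',
    '{"url": "http://example.com/x", "mime-detected": "application/xhtml+xml"}',
]

HTML_TYPES = {"text/html", "application/xhtml+xml"}

def is_html(record):
    # "mime-detected" is Tika's detection and may occasionally be wrong
    return record.get("mime-detected") in HTML_TYPES

html_records = [r for r in map(json.loads, cdx_lines) if is_html(r)]
print([r["url"] for r in html_records])
# ['http://example.com/page', 'http://example.com/x']
```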

Best,
Sebastian

[1] https://tika.apache.org/1.24.1/detection.html#Mime_Magic_Detection



Lewis Blackwell

unread,
Jul 20, 2020, 9:14:55 PM7/20/20
to common...@googlegroups.com
Thank you for your feedback Sebastian!

Lewis


Lewis Blackwell

unread,
Jul 22, 2020, 5:17:08 PM7/22/20
to common...@googlegroups.com

Hi Sebastian,

Is there a way to get json objects with unique digest values? Also, does CC capture the publication date of each captured webpage?

Thanks,

Lewis



Sebastian Nagel

unread,
Jul 23, 2020, 10:56:19 AM7/23/20
to common...@googlegroups.com
Hi Lewis,

> Is there a way to get json objects with unique digest values?

The CDX API does not allow this. You might use the columnar index:
https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

Here is an example that searches for duplicates on the Common Crawl website:

SELECT COUNT(*) AS n_captures,
       COUNT(DISTINCT(url)) AS n_urls,
       content_digest,
       array_agg(DISTINCT(url)) AS urls
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2020-29'
  AND subset = 'warc'
  AND contains(ARRAY ['commoncrawl.org'],
               url_host_registered_domain)
GROUP BY content_digest
HAVING (COUNT(*) > 1)

An attempt to run this on an entire monthly crawl failed with "Query exhausted
resources at this scale factor" - which I had expected. But if you can add some
restriction (e.g. to a country-code top-level domain: `WHERE ... AND url_host_tld = 'de'`)
it should work.
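If the number of index records for your domains is small enough to fetch client-side, digests can also be deduplicated in plain Python after querying the CDX API. A minimal sketch that keeps the first capture per digest (the sample records are hypothetical):

```python
def dedupe_by_digest(records):
    """Keep only the first CDX record seen for each content digest."""
    seen = set()
    unique = []
    for rec in records:
        if rec["digest"] not in seen:
            seen.add(rec["digest"])
            unique.append(rec)
    return unique

# Hypothetical records: the first two captures share the same digest
records = [
    {"url": "http://example.com/a", "digest": "AAA"},
    {"url": "http://example.com/a-copy", "digest": "AAA"},
    {"url": "http://example.com/b", "digest": "BBB"},
]
print([r["url"] for r in dedupe_by_digest(records)])
# ['http://example.com/a', 'http://example.com/b']
```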


> Also does CC capture publication dates of each captured webpage?

Only the capture time ("timestamp" in the CDX index, "fetch_time" in the columnar index)
is stored.

Best,
Sebastian



Lewis Blackwell

unread,
Jul 26, 2020, 7:13:53 PM7/26/20
to common...@googlegroups.com
Thank you!


Lewis Blackwell

unread,
Aug 9, 2020, 5:29:05 PM8/9/20
to common...@googlegroups.com

Hi Everyone,

 

I need to extract the HTML for each captured page (URL) directly from WARC files, and I have specific domains from which I need to extract the text. Is there a way to download all the WARC records for a specific domain and then extract the text? Can you provide an example?

 

Thanks,



Sebastian Nagel

unread,
Aug 12, 2020, 11:13:54 AM8/12/20
to common...@googlegroups.com
Hi Lewis,

to download WARC captures for specific domains you may use Greg's cdx-toolkit
https://pypi.org/project/cdx-toolkit/

Once all desired content is in WARC files, you could convert them to WET files,
see
https://groups.google.com/u/1/g/common-crawl/c/imv4hlLob4s/m/c2SR-iXcAwAJ

If you have experience running Spark: there is an example in cc-pyspark which
fetches WARC captures given a SQL query on the columnar index and extracts the
text from the HTML:
https://github.com/commoncrawl/cc-pyspark/blob/master/cc_index_word_count.py
(I assume you do not want to count words, but this part should be easy to adapt.)
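As a lighter-weight alternative to a Spark job, text can be stripped from HTML with the Python standard library alone. This is a rough sketch only; real-world pages usually warrant a robust parser (e.g. BeautifulSoup) or the ready-made WET conversion linked above:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

print(html_to_text("<p>Hello <b>world</b></p><script>var x;</script>"))
# Hello world
```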

Best,
Sebastian



Lewis Blackwell

unread,
Aug 13, 2020, 11:07:02 PM8/13/20
to common...@googlegroups.com
Thank you Sebastian!

Lewis


Lewis Blackwell

unread,
Oct 1, 2020, 8:51:41 PM10/1/20
to common...@googlegroups.com
Hi Sebastian,

Can you give an example of how to send a range request to S3? Previously you provided an example of how to get the terms of use, given the filename, offset and length:

  curl -s -r91300676-$((91300676+7058-1)) \
    "https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2017-34/segments/1502886133449.19/warc/CC-MAIN-20170824101532-20170824121532-00090.warc.gz" \
    | gzip -dc

Is there a way to do the same thing but, instead of requesting one file, request several? I appreciate your feedback.

Thanks,
Lewis


Sebastian Nagel

unread,
Oct 2, 2020, 2:02:40 AM10/2/20
to common...@googlegroups.com
Hi Lewis,

it's not possible to do this in a single request:
- "Amazon S3 doesn't support retrieving multiple ranges of data per GET request." [1]
- you cannot request multiple files (URLs) in one request anyway

Of course, it's more efficient to execute multiple requests in a loop, reusing the HTTP/S3 client and possibly the connection as well. That's
what cdx-toolkit does; alternatively, have a look
at cc-pyspark [2] or cc-index-table [3], which fetch and process WARC records in a Spark job.

In case you'd need more information and/or help: could you share more about your use case and your preferred programming language and platform?

Best,
Sebastian

[1] https://docs.aws.amazon.com/AmazonS3/latest/dev/GettingObjectsUsingAPIs.html
[2] https://github.com/commoncrawl/cc-pyspark
[3] https://github.com/commoncrawl/cc-index-table


Lewis Blackwell

unread,
Oct 2, 2020, 5:05:30 PM10/2/20
to common...@googlegroups.com
Hi Sebastian,
Thank you for your feedback. I need to extract each of the HTML pages from a list of domains as fast as possible. I have been sending requests in a loop as you described, but I need a faster method to meet my deadline. Initially I thought about downloading the WARC files for specific domains, but it looks like the HTML pages for specific domains are not necessarily in unique WARC files. Is that correct?
I am not sure how to proceed. I appreciate any feedback you can provide. Please note that I am using Python.
Thanks,
Lewis



Sebastian Nagel

unread,
Oct 3, 2020, 5:14:35 AM10/3/20
to common...@googlegroups.com
Hi Lewis,

> the html pages for specific domains are not necessarily in unique warc files. Is that correct?

Yes. Every WARC file contains a (pseudo)random sample of URLs/pages.

To increase the throughput:
- fetch the records not remotely but from within AWS (us-east-1 region), where the data is stored.
  This minimizes the overhead of the many small requests; downloading the results in large
  chunks shouldn't be a bottleneck.
- parallelize the fetching: simply run multiple processes of your script
  or cdx-toolkit in parallel, or use a big data framework (e.g. cc-pyspark,
  which can fetch WARC records from CSV files <url,warc_file,record_offset,record_length>
  and process the data).

Colin Dellow wrote a very informative article about optimizing the throughput of per-record fetches:
https://code402.com/blog/s3-scans-vs-index/
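Since per-record fetching is I/O-bound, the parallelization can also be sketched inside a single Python process with a thread pool. This is a hypothetical outline; fetch_one stands in for any function that performs one range request (such as the curl equivalent shown earlier in the thread):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_fetch(records, fetch_one, max_workers=16):
    """Fetch many WARC records concurrently (I/O-bound, so threads suffice).

    records   -- iterable of (warc_url, offset, length) tuples
    fetch_one -- function that performs a single range request
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the input order of the results
        return list(pool.map(lambda rec: fetch_one(*rec), records))

# Demonstration with a stand-in fetch function (no network access):
def fake_fetch(url, offset, length):
    return (url, offset, length)

results = parallel_fetch([("warc-1.gz", 0, 10), ("warc-2.gz", 5, 20)], fake_fetch, 2)
print(results)  # [('warc-1.gz', 0, 10), ('warc-2.gz', 5, 20)]
```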

Sebastian

Lewis Blackwell

unread,
Oct 12, 2020, 1:34:13 AM10/12/20
to common...@googlegroups.com
Thank you for your feedback!

Lewis
