March/April 2020 crawl archive now available


Sebastian Nagel

Apr 15, 2020, 4:40:54 AM
to Common Crawl
Hi all,

the crawl archives of March/April 2020 are now available. The crawl was run from March 28
to April 10. It covers 2.85 billion web pages or 280 TiB of uncompressed content. As usual,
more details about the crawl and information on how to access and use the data can be found
on the Common Crawl blog [1].

Please note that we will also merge the next two monthly crawls as a joint May/June crawl
which is planned to start in the last week of May and to be released between June 10 and 15.

Best,
Sebastian

[1] https://commoncrawl.org/2020/04/march-april-2020-crawl-archive-now-available/

Karlo

May 5, 2020, 4:58:12 PM
to Common Crawl
Hi Sebastian, I was wondering why the March/April archive was merged. I couldn't find an explanation on your website and have actually only just run into this post. I see now that the next May/June crawl will also be merged.

Will two-month crawls become the norm now, or was this more of a one-off exception?

I'm asking because I'm considering a research project that would depend on the historical time difference between multiple crawls.

Cheers,
Karlo

Sebastian Nagel

May 8, 2020, 12:13:01 PM
to Common Crawl
Hi Karlo,

> Will two-month crawls become the norm now, or was this more of a one-off exception?

Hopefully, the latter. The decision to delay the March crawl [1] was made because
of the uncertainty at the start of the Covid-19 pandemic.


> I'm asking because I'm considering a research project that would depend on the historical
> time difference between multiple crawls.

Looking back, we started monthly releases in spring 2014; before that (since 2008), crawls had
been released yearly or biyearly. Even since 2014 there have been gaps. That said, for 2017-2019
there were 12 releases per year at almost regular intervals (3.5 - 5.5 weeks).

Best,
Sebastian


[1] https://groups.google.com/forum/#!topic/common-crawl/mkRXPYJr5yM

Karlo Sarin

May 8, 2020, 1:52:03 PM
to common...@googlegroups.com
Thanks, Sebastian, for the prompt reply - I really appreciate it!

Cheers,
Karlo

--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/common-crawl/9c7fb004-20bf-e9af-d3a6-72270c594595%40commoncrawl.org.

Sarah Masud

Jun 7, 2020, 8:59:51 AM
to Common Crawl
Hey, I am trying to crawl the March data from 1st March to 20th March using the news-please [1] crawler, and I am facing the following issue:

ERROR:newsplease.crawler.commoncrawl_extractor:Unexpected error: <class 'PermissionError'>
2020-06-07 16:52:38 [newsplease.crawler.commoncrawl_extractor] ERROR: Unexpected error: <class 'PermissionError'>
ERROR:newsplease.crawler.commoncrawl_extractor:Unexpected error: <class 'PermissionError'>
2020-06-07 16:52:38 [newsplease.crawler.commoncrawl_extractor] ERROR: Unexpected error: <class 'PermissionError'>
ERROR:newsplease.crawler.commoncrawl_extractor:Unexpected error: <class 'PermissionError'>
2020-06-07 16:52:38 [newsplease.crawler.commoncrawl_extractor] ERROR: Unexpected error: <class 'PermissionError'>

It is trying to download https://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/2020/03/CC-NEWS-20200309003729-01135.warc.gz and a few WARCs around it again and again, but it keeps failing with the above error.

The WARC files downloaded so far are the ones at/after this point:
```
https%3A%2F%2Fcommoncrawl.s3.amazonaws.com%2Fcrawl-data%2FCC-NEWS%2F2020%2F03%2FCC-NEWS-20200309003729-01135.warc.gz
https%3A%2F%2Fcommoncrawl.s3.amazonaws.com%2Fcrawl-data%2FCC-NEWS%2F2020%2F03%2FCC-NEWS-20200309062503-01137.warc.gz
https%3A%2F%2Fcommoncrawl.s3.amazonaws.com%2Fcrawl-data%2FCC-NEWS%2F2020%2F03%2FCC-NEWS-20200309102435-01139.warc.gz
https%3A%2F%2Fcommoncrawl.s3.amazonaws.com%2Fcrawl-data%2FCC-NEWS%2F2020%2F03%2FCC-NEWS-20200309130750-01141.warc.gz
https%3A%2F%2Fcommoncrawl.s3.amazonaws.com%2Fcrawl-data%2FCC-NEWS%2F2020%2F03%2FCC-NEWS-20200309153704-01143.warc.gz

```
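(As an aside, the local filenames in that list are just the S3 URLs percent-encoded; for anyone reading along, they can be turned back into plain URLs with the standard library - a small illustration, not news-please code:)

```python
from urllib.parse import unquote

# A percent-encoded local filename as produced by the crawler.
encoded = "https%3A%2F%2Fcommoncrawl.s3.amazonaws.com%2Fcrawl-data%2FCC-NEWS%2F2020%2F03%2FCC-NEWS-20200309003729-01135.warc.gz"

# Decode it back into the original S3 URL.
url = unquote(encoded)
print(url)
```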

Any help would be appreciated.

Sebastian Nagel

Jun 7, 2020, 11:59:12 AM
to common...@googlegroups.com
Hi Sarah,
I've checked the permissions of this file - I'm able to download it, so that cannot be the reason?

Could it be that permissions are missing on your end - for example, no write permission
in the folder where the files are stored? That would also fit the description of
PermissionError, see https://docs.python.org/3/library/exceptions.html#PermissionError
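A quick way to rule this out is to try an actual write into the download folder. A minimal sketch (the ./cc_download_warc path is just the default from your log - adjust as needed):

```python
import tempfile

def can_write(directory: str) -> bool:
    """Return True if we can create (and auto-remove) a file in `directory`."""
    try:
        # Attempting a real write is more reliable than os.access(),
        # e.g. on network file systems or with ACLs in play.
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
        return True
    except OSError:  # PermissionError and FileNotFoundError are subclasses
        return False

if __name__ == "__main__":
    print("writable:", can_write("./cc_download_warc"))
```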

If in doubt, I would report this as an issue of news-please ... Ok, I see you already did:
https://github.com/fhamborg/news-please/issues/163


Best,
Sebastian

Sarah Masud

Jun 7, 2020, 4:31:54 PM
to Common Crawl
Hey Sebastian,
Thanks for the quick response. Write permission could be an issue, since I am running this on a server. To work around it, I set chmod 777 on the newsplease folder itself. I know that's not an ideal solution, but that way all the folders where the WARCs are downloaded have write access.
After that I reran the crawler, but I am still getting the error. It starts downloading the WARCs before the error comes up, and then it keeps retrying :/
```
INFO:newsplease.crawler.commoncrawl_extractor:downloading https://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/2020/03/CC-NEWS-20200309130750-01141.warc.gz (local: ./cc_download_warc/https%3A%2F%2Fcommoncrawl.s3.amazonaws.com%2Fcrawl-data%2FCC-NEWS%2F2020%2F03%2FCC-NEWS-20200309130750-01141.warc.gz)
INFO:newsplease.crawler.commoncrawl_extractor:downloading https://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/2020/03/CC-NEWS-20200309153704-01143.warc.gz (local: ./cc_download_warc/https%3A%2F%2Fcommoncrawl.s3.amazonaws.com%2Fcrawl-data%2FCC-NEWS%2F2020%2F03%2FCC-NEWS-20200309153704-01143.warc.gz)
INFO:newsplease.crawler.commoncrawl_extractor:download completed, local file: ./cc_download_warc/https%3A%2F%2Fcommoncrawl.s3.amazonaws.com%2Fcrawl-data%2FCC-NEWS%2F2020%2F03%2FCC-NEWS-20200304090923-01069.warc.gz
663M / 1023MERROR:newsplease.crawler.commoncrawl_extractor:Unexpected error: <class 'PermissionError'>
2020-06-08 01:49:02 [newsplease.crawler.commoncrawl_extractor] ERROR: Unexpected error: <class 'PermissionError'>
737M / 1023MERROR:newsplease.crawler.commoncrawl_extractor:Unexpected error: <class 'PermissionError'>
2020-06-08 01:49:02 [newsplease.crawler.commoncrawl_extractor] ERROR: Unexpected error: <class 'PermissionError'>
580M / 1023MERROR:newsplease.crawler.commoncrawl_extractor:Unexpected error: <class 'PermissionError'>
2020-06-08 01:49:02 [newsplease.crawler.commoncrawl_extractor] ERROR: Unexpected error: <class 'PermissionError'>

```
Is there anything else that could cause this error?


Sebastian Nagel

Jun 8, 2020, 5:56:09 AM
to common...@googlegroups.com
Hi Sarah,

could you try to get the full stack trace of the error?
- either by passing `continue_after_error=False` when calling
  `commoncrawl_crawler.crawl_from_commoncrawl(...)`
- or by replacing the line
  ```
  self.__logger.error('Unexpected error: %s', sys.exc_info()[0])
  ```
  with
  ```
  self.__logger.error('Unexpected error: %s (%s)', *sys.exc_info()[0:2])
  self.__logger.error(sys.exc_info()[2], exc_info=True)
  ```
  in newsplease/crawler/commoncrawl_extractor.py

The error happens in a large try-except block, so there are many possible reasons.
It's important to get more context about the problem.
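To illustrate what the patched logging buys you, here is a self-contained sketch (not news-please code) that provokes an error and shows that `exc_info=True` records the full traceback rather than just the exception class:

```python
import io
import logging
import sys

# Send log output to a string buffer so we can inspect it afterwards.
buf = io.StringIO()
logger = logging.getLogger("demo")
logger.addHandler(logging.StreamHandler(buf))
logger.setLevel(logging.ERROR)

try:
    open("/no/such/dir/file.warc.gz", "wb")  # provoke an OSError
except Exception:
    # Original style: only the exception class, no message or traceback.
    logger.error("Unexpected error: %s", sys.exc_info()[0])
    # Patched style: class, message, and the full traceback.
    logger.error("Unexpected error: %s (%s)", *sys.exc_info()[0:2])
    logger.error("Traceback follows:", exc_info=True)

print(buf.getvalue())
```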

Best,
Sebastian

Ala Anvari

Jun 8, 2020, 10:28:02 AM
to common...@googlegroups.com
Hi all,

I'm trying to get an Athena query to word-count the segments in a WET folder. I'm pretty sure my 'database' in Athena is pointing at the right folder. I'm completely new to Presto.

This is what my SQL looks like - what have I done wrong?

select (unnest( split("line", ' ') ) ) as words, COUNT from "default"."tableone" GROUP BY(words)


Sebastian Nagel

Jun 8, 2020, 10:36:57 AM
to common...@googlegroups.com
Hi Ala,

please start a new thread for a new question. Thanks!

Could you also share the column layout of "tableone" and
how you loaded the WET files into the table? WET isn't a format
supported by Athena, and I also doubt that Athena is the right
tool to produce a word count over billions or trillions of words.

You might have a look at
https://github.com/commoncrawl/cc-pyspark
https://github.com/commoncrawl/cc-warc-examples
https://github.com/commoncrawl/cc-mrjob
which include a word count job on WET files using MapReduce or Spark.
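The WET files themselves are just gzip-compressed plain text, so for a single file the core of such a word-count job boils down to something like the following (a local, standard-library sketch for illustration; the Spark/MapReduce examples above are the way to do this at scale):

```python
import gzip
import tempfile
from collections import Counter

def word_count(path: str) -> Counter:
    """Count whitespace-separated tokens in a gzip-compressed text (WET) file."""
    counts = Counter()
    # WET payloads are UTF-8 text; errors='replace' avoids crashes on stray bytes.
    with gzip.open(path, mode="rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            counts.update(line.split())
    return counts

# Tiny demo on a synthetic file; a real one would be a
# crawl-data/CC-MAIN-*/segments/.../wet/*.warc.wet.gz path.
with tempfile.NamedTemporaryFile(suffix=".gz", delete=False) as tmp:
    with gzip.open(tmp.name, "wt", encoding="utf-8") as f:
        f.write("the quick brown fox\nthe lazy dog\n")
print(word_count(tmp.name).most_common(1))  # → [('the', 2)]
```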

Best,
Sebastian


Ala Anvari

Jun 8, 2020, 10:48:39 AM
to common...@googlegroups.com
Hi Sebastian! 

Apologies for misfiling the question.

I do actually have some working sparklyr code for this, but I'm a bit scared of scaling up on EC2, to be honest.

I was hoping Athena would be a more time-efficient way of doing essentially the same thing.

Tableone has one column, 'line', which stores a string for each line in the file.

Athena transparently deals with gzip, so I connected to the dataset using a CSV import function with the delimiter set to something '\n'-ish.


Many thanks,
A

To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/common-crawl/5c34ed3c-ef29-ed3d-41c7-556cc1af609f%40commoncrawl.org.