Exporting filtered URLs


Søren Lindbo

Nov 14, 2021, 10:09:19 AM
to Common Crawl
Hello,

I am looking for a way to export specific URLs from the Common Crawl data. I do not know exactly how one can filter through the data, but ideally I would want to be able to export all URLs from a specific country or in a specific language, e.g. Danish.

I might also want all websites from a specific US state, e.g. Texas - would that be possible?

I have been advised to use the Advertools package in Python to do this. Would that make sense, or does anyone have alternative suggestions?


Best regards,
Soren Lindbo

Sebastian Nagel

Nov 15, 2021, 4:02:59 AM
to common...@googlegroups.com
Hi Søren,

> export all URLs from a specific
> country or in a specific language, e.g. Danish.

That's easily done via the columnar index, see

https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
https://github.com/commoncrawl/cc-index-table
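
For instance, once the table is set up in Athena (as described in the blog post), a query along these lines should do it. This is only a rough, untested sketch run via boto3 - the column names (url_host_tld, content_languages) and partitions (crawl, subset) are taken from the cc-index-table schema, while the crawl ID and the output bucket are placeholders you would replace with your own:

```python
import boto3

# Placeholders: pick a crawl you are interested in and an S3 bucket you own.
QUERY = """
SELECT url, url_host_name, content_languages
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2021-43'            -- one monthly crawl (placeholder)
  AND subset = 'warc'                      -- successfully fetched pages
  AND (url_host_tld = 'dk'                 -- Danish country-code domains
       OR content_languages LIKE '%dan%')  -- or pages detected as Danish
"""

athena = boto3.client('athena', region_name='us-east-1')
response = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={'Database': 'ccindex'},
    # Must be a bucket you own and can write to.
    ResultConfiguration={'OutputLocation': 's3://YOUR-BUCKET/athena-results/'},
)
print('Query started:', response['QueryExecutionId'])
```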

> I might also want all websites from a specific US state, e.g. Texas -
> would that be possible?

Well, that isn't as easy. First, what does it mean: (1) hosted in Texas, (2) published by an entity located in Texas, or (3) content about Texas?

The columnar index does not include IP addresses, so even (1) requires looking into the WARC files; (2) and (3) certainly do, because they are about identifying content.
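
If you do go down that road: the WARC response records carry a WARC-IP-Address header, and the columnar index gives you warc_filename, warc_record_offset and warc_record_length for each capture, so single records can be fetched via HTTP range requests. A rough, untested sketch with warcio - the file name, offset/length and download host are placeholders / assumptions for illustration:

```python
import io
import requests
from warcio.archiveiterator import ArchiveIterator

# Values for one capture, taken from the columnar index
# (warc_filename, warc_record_offset, warc_record_length) - placeholders here.
warc_filename = 'crawl-data/CC-MAIN-2021-43/segments/.../warc/....warc.gz'
offset, length = 12345, 6789

# Fetch just this record from the public data set via a range request.
resp = requests.get(
    'https://data.commoncrawl.org/' + warc_filename,
    headers={'Range': f'bytes={offset}-{offset + length - 1}'},
)

for record in ArchiveIterator(io.BytesIO(resp.content)):
    if record.rec_type == 'response':
        ip = record.rec_headers.get_header('WARC-IP-Address')
        url = record.rec_headers.get_header('WARC-Target-URI')
        print(url, ip)
        # Geolocating the IP (e.g. to a US state) would need an external
        # GeoIP database - that part is not covered by Common Crawl data.
```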

I've never worked with the Advertools package. Do you want to use it
to extract data from the HTML?

Best,
Sebastian

Søren Lindbo

Nov 15, 2021, 10:16:43 AM
to Common Crawl
Hello Sebastian,

Thanks a lot for the reply. I will look into extracting country-specific URLs with your two suggestions.

I would be interested in websites hosted in Texas. Ideally I would simply want access to all URLs from the United States, as with Denmark, but since this will presumably be a very large data set, I was interested in ways to break it down into more manageable subsets.

It was suggested to me to use Advertools to find the URLs, but it seems that Advertools is a tool for scraping / extracting data from websites one already has, not a tool for actually retrieving data / URLs from an index like Common Crawl. Perhaps there was a misunderstanding on that part.

I will get back to you.


Thanks!

Søren Lindbo

Dec 14, 2021, 10:42:34 AM
to Common Crawl
Hello,


I followed the steps outlined here: https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

I ran into an error I cannot figure out how to fix: "Access denied when writing to location: s3://commoncrawl/cc-index/table/cc-main/warc/Unsaved/2021/12/14/eca04922-816a-4b84-8b66-45522894f3f4.txt
This query ran against the "ccindex" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: eca04922-816a-4b84-8b66-45522894f3f4"

A picture of it is also attached. The AWS forums do not seem to be of much help - can anyone assist? It would be greatly appreciated.

Best regards!
[Attachment: AWS fail.jpg]

Sebastian Nagel

Dec 14, 2021, 10:54:09 AM
to common...@googlegroups.com
Hi Søren,

As far as I can see, the database "ccindex" already exists and is already selected in the toolbar on the left side.

> "Access denied when writing to location:
>
s3://commoncrawl/cc-index/table/cc-main/warc/Unsaved/2021/12/14/eca04922-816a-4b84-8b66-45522894f3f4.txt

Well, this is expected, as you should not have write permissions for
the bucket s3://commoncrawl/.

You need to configure the query output location properly and select
a bucket you have write permissions for (one owned by you), see
https://docs.aws.amazon.com/athena/latest/ug/querying.html
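
You can set this once in the Athena console settings, or programmatically. Here is a minimal, untested sketch with boto3, assuming you use the default "primary" workgroup and already have a bucket of your own (the bucket name is a placeholder):

```python
import boto3

athena = boto3.client('athena', region_name='us-east-1')

# Point the default workgroup at a results location in a bucket you own,
# so the console and all queries write their output there.
athena.update_work_group(
    WorkGroup='primary',
    ConfigurationUpdates={
        'ResultConfigurationUpdates': {
            'OutputLocation': 's3://YOUR-BUCKET/athena-results/'
        }
    },
)
```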

Best,
Sebastian


Søren Lindbo

Jan 7, 2022, 9:09:38 AM
to Common Crawl
Hello Sebastian,

I am running the following command in the Athena query editor: MSCK REPAIR TABLE ccindex

I am doing this per the instructions in this post: https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

I am confronted with the following message:

"Access denied when writing output to url: s3://commoncrawl/cc-index/table/cc-main/warc/Unsaved/2022/01/07/098c5e89-b2a4-45c4-b7e1-2f6c0b87ef20.txt. Please ensure you are allowed to access the S3 bucket. If specifying an expected bucket owner, confirm the bucket is owned by the expected account. If you are encrypting query results with KMS key, please ensure you are allowed to access your KMS key.

This query ran against the "ccindex" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 098c5e89-b2a4-45c4-b7e1-2f6c0b87ef20"

How do I get access?

Sebastian Nagel

Jan 7, 2022, 10:51:24 AM
to common...@googlegroups.com
Hi Søren,

> "Access denied when writing output to url:
>
s3://commoncrawl/cc-index/table/cc-main/warc/Unsaved/2022/01/07/098c5e89-b2a4-45c4-b7e1-2f6c0b87ef20.txt

You need to configure the query output location properly and select
a bucket you have write permissions for (one owned by you), see


https://docs.aws.amazon.com/athena/latest/ug/querying.html#query-results-specify-location

This also applies to queries which, strictly speaking, do not produce
a result: Athena still writes the query metadata to the output location.
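
So even MSCK REPAIR TABLE needs a valid output location. If the console setting does not work for you, you can also pass the location per query. Again only an untested sketch, with the bucket name as a placeholder:

```python
import time
import boto3

athena = boto3.client('athena', region_name='us-east-1')

# Load the crawl partitions; the metadata for this statement is written
# to the given output location even though there is no result set.
qid = athena.start_query_execution(
    QueryString='MSCK REPAIR TABLE ccindex',
    QueryExecutionContext={'Database': 'ccindex'},
    ResultConfiguration={'OutputLocation': 's3://YOUR-BUCKET/athena-results/'},
)['QueryExecutionId']

# Poll until the statement has finished (it can take a few minutes).
while True:
    state = athena.get_query_execution(
        QueryExecutionId=qid)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        print('MSCK REPAIR TABLE finished with state:', state)
        break
    time.sleep(10)
```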

Best,
Sebastian


Søren Lindbo

Jan 25, 2022, 11:14:36 AM
to Common Crawl
Hello Sebastian,

I managed to export all the data I needed - thank you so much! I've now completed the first step in a long process of setting up my own business.

I truly appreciate your work.


The very best regards,
Soren Lindbo