Exporting filtered URLs


Søren Lindbo

Nov 14, 2021, 10:09:19 AM
to Common Crawl
Hello,

I am looking for a way to export specific URLs from the Common Crawl data. I do not know exactly how one can filter the data, but ideally I would like to export all URLs from a specific country or in a specific language, e.g. Danish.

I might also want all websites from a specific US state, e.g. Texas. Would that be possible?

I have been advised to use the Advertools package in Python to do this. Would that make sense, or does anyone have alternative suggestions?


Best regards,
Soren Lindbo

Sebastian Nagel

Nov 15, 2021, 4:02:59 AM
to common...@googlegroups.com
Hi Søren,

> export all URL's from a specific
> country or in a specific language, e.g. Danish.

That's easily done via the columnar index, see

https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
https://github.com/commoncrawl/cc-index-table
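
A minimal sketch of what such a query could look like, built as an SQL string in Python. It assumes the table is registered as "ccindex"."ccindex" as in the cc-index-table setup instructions, and that the columns url, url_host_tld, content_languages, crawl and subset are present in the crawl you query (content_languages holds ISO-639-3 codes such as "dan" and is only filled for newer crawls). The helper name build_url_query and the crawl label below are illustrative only; please check the schema in the repository before relying on it.

```python
import re


def build_url_query(crawl, tld=None, language=None, limit=None):
    """Build an SQL string selecting URLs from the columnar index.

    Hypothetical helper: table and column names are assumptions taken
    from the cc-index-table schema and should be verified there.
    """
    # Validate inputs so that only expected tokens end up in the SQL text.
    if not re.fullmatch(r"CC-MAIN-\d{4}-\d{2}", crawl):
        raise ValueError(f"unexpected crawl label: {crawl!r}")
    # Restrict to one crawl and to successful fetches (partition columns).
    conditions = [f"crawl = '{crawl}'", "subset = 'warc'"]

    if tld is not None:
        if not re.fullmatch(r"[a-z0-9-]+", tld):
            raise ValueError(f"unexpected TLD: {tld!r}")
        conditions.append(f"url_host_tld = '{tld}'")

    if language is not None:
        if not re.fullmatch(r"[a-z]{3}", language):
            raise ValueError(f"expected a 3-letter ISO-639-3 code: {language!r}")
        # content_languages is a comma-separated list; a prefix match
        # selects pages where the given language is listed first.
        conditions.append(f"content_languages LIKE '{language}%'")

    query = (
        'SELECT url\nFROM "ccindex"."ccindex"\nWHERE '
        + "\n  AND ".join(conditions)
    )
    if limit is not None:
        query += f"\nLIMIT {int(limit)}"
    return query


if __name__ == "__main__":
    # Danish-language pages under the .dk top-level domain, small sample.
    print(build_url_query("CC-MAIN-2021-43", tld="dk", language="dan", limit=10))
```

The resulting string can be pasted into Athena or passed to Spark SQL. Filtering on the top-level domain and on the detected language are two different notions of "Danish", so it may be worth running them separately and comparing.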

> I might also want all websites from a specific US state, e.g. Texas -
> would that be possible?

Well, that isn't as easy. First, what does it mean: (1) hosted in
Texas, (2) published by an entity located in Texas, or (3) content
about Texas?

The columnar index does not include IP addresses, so even (1) requires
looking into the WARC files. (2) and (3) certainly do, because they are
about identifying content.
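
For case 1, a rough sketch of the first step: reading the server IP address from the header block of a WARC response record. It assumes the record carries the WARC-IP-Address and WARC-Target-URI headers. The function names and the sample record (example.com, 192.0.2.1) are made up for illustration. Mapping an IP address to a US state would additionally need a geolocation database, which is not shown here.

```python
def parse_warc_headers(header_block):
    """Parse the 'Name: value' lines of a WARC record header into a dict."""
    headers = {}
    for line in header_block.splitlines():
        if ":" not in line:
            continue  # skips the "WARC/1.0" version line and blank lines
        # Split on the first colon only, so URLs in the value stay intact.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


def response_ip(header_block):
    """Return (url, ip) for a response record, or None for other records."""
    headers = parse_warc_headers(header_block)
    if headers.get("WARC-Type") != "response":
        return None
    return headers.get("WARC-Target-URI"), headers.get("WARC-IP-Address")


if __name__ == "__main__":
    # Fabricated sample header, using a documentation-only IP address.
    sample = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        "WARC-Target-URI: http://example.com/\r\n"
        "WARC-IP-Address: 192.0.2.1\r\n"
    )
    print(response_ip(sample))
```

In practice a WARC library would handle the record framing and decompression; the point is only that the IP address is stored per record in the WARC file, not in the index.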

I've never worked with the Advertools package. Do you want to use it
to extract data from the HTML?

Best,
Sebastian

Søren Lindbo

Nov 15, 2021, 10:16:43 AM
to Common Crawl
Hello Sebastian,

Thanks a lot for the reply. I will look into extracting country-specific URLs using your two suggestions.

I would be interested in websites hosted in Texas. Ideally, I would simply want access to all URLs from the United States, as with Denmark, but since this would be a very large data set (I presume), I am interested in ways to break it down into more manageable subsets.

It was suggested to me to use Advertools to find the URLs, but it seems that Advertools is a tool for scraping or extracting data from websites one already has, not a tool for retrieving data or URLs from an index like Common Crawl. Perhaps there was a misunderstanding on this point.

I will get back to you.


Thanks!
