Hi Søren,
> export all URL's from a specific
> country or in a specific language, e.g. Danish.
That's easily done via the columnar index, see
https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
https://github.com/commoncrawl/cc-index-table
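As a rough sketch of what such a query could look like: the columnar index has the columns `url`, `url_host_tld`, `content_languages`, `crawl` and `subset`, so a filter on the top-level domain or on the detected language is a plain WHERE clause. The snippet below only builds the SQL string (for Athena or a similar engine). The helper name `build_url_query`, the table name `ccindex.ccindex` (the one used in the cc-index-table README) and the crawl label in the usage line are assumptions for illustration. Also note that a country-code TLD is only an approximation of "from a specific country", and `content_languages` holds ISO-639-3 codes (e.g. `dan` for Danish) and is not filled for older crawls.

```python
import re


def build_url_query(crawl, tld=None, language=None, table="ccindex.ccindex"):
    """Build a SQL query listing URLs from the columnar index.

    Hypothetical helper: it only assembles a query string, it does not run it.
    """
    # Validate inputs so that nothing unexpected ends up inside the SQL string.
    if not re.fullmatch(r"CC-MAIN-\d{4}-\d{2}", crawl):
        raise ValueError(f"unexpected crawl label: {crawl!r}")
    conditions = [f"crawl = '{crawl}'", "subset = 'warc'"]

    if tld is not None:
        if not re.fullmatch(r"[a-z0-9-]+", tld):
            raise ValueError(f"unexpected TLD: {tld!r}")
        conditions.append(f"url_host_tld = '{tld}'")

    if language is not None:
        if not re.fullmatch(r"[a-z]{3}", language):
            raise ValueError(f"expected an ISO-639-3 code, got: {language!r}")
        # content_languages is a comma-separated list of detected languages.
        conditions.append(f"content_languages LIKE '%{language}%'")

    where = "\n  AND ".join(conditions)
    return f"SELECT url\nFROM {table}\nWHERE {where}"


# Example: all URLs under .dk in one crawl (the crawl label is a placeholder).
print(build_url_query("CC-MAIN-2021-04", tld="dk"))
```

Calling it with `language="dan"` instead of `tld="dk"` gives the language-based variant; both filters can be combined.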
> I might also want all websites from a specific US state, e.g. Texas -
> would that be possible?
Well, that's not as easy. First, what does "from Texas" mean: (1) hosted in
Texas, (2) published by an entity located in Texas, or (3) content about Texas?
The columnar index does not include IP addresses, so even (1) requires
looking into the WARC files. (2) and (3) certainly do, because they are
about identifying content.
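To illustrate what "looking into the WARC files" involves: the columnar index gives, for every capture, the columns `warc_filename`, `warc_record_offset` and `warc_record_length`, so a single record can be fetched with an HTTP range request instead of downloading the whole file. The server IP address is then found in the `WARC-IP-Address` header of the response record. The sketch below only computes the URL and the Range header. The helper name `warc_range_request`, the base URL `https://data.commoncrawl.org/` and the file name in the usage line are assumptions for illustration.

```python
def warc_range_request(warc_filename, offset, length,
                       base_url="https://data.commoncrawl.org/"):
    """Return (url, headers) for fetching one WARC record.

    Hypothetical helper: the arguments correspond to the index columns
    warc_filename, warc_record_offset and warc_record_length.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    url = base_url + warc_filename
    # HTTP byte ranges are inclusive on both ends, hence the "- 1".
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    return url, headers


# Example with made-up values (not a real record).
url, headers = warc_range_request("crawl-data/example.warc.gz", 100, 50)
print(url, headers)
```

The returned bytes are a single gzip member holding one WARC record, which can be decompressed and parsed on its own.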
I've never worked with the Advertools package. You want to use it
to extract data from the HTML?
Best,
Sebastian