> export all URLs from a specific
> country or in a specific language, e.g. Danish.
That's easily done via the columnar index, see
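As a rough illustration of such a lookup, here is a minimal sketch that
builds an SQL query over the columnar index. It assumes the table is
registered as "ccindex"."ccindex" (e.g. in Athena), that
content_languages holds comma-separated ISO-639-3 codes with the primary
language first, and that a country-code TLD is an acceptable proxy for
"country". The helper name build_url_export_query and the crawl ID are
only examples.

```python
import re

# Hypothetical helper, not part of any Common Crawl tooling.
_SAFE = re.compile(r"^[A-Za-z0-9-]+$")


def build_url_export_query(crawl, language=None, tld=None,
                           table='"ccindex"."ccindex"'):
    """Build SQL that lists URLs of one crawl, filtered by language and/or TLD."""
    if language is None and tld is None:
        raise ValueError("give a language code, a TLD, or both")
    for value in (crawl, language, tld):
        # Values are pasted into the SQL text, so accept only plain tokens.
        if value is not None and not _SAFE.match(value):
            raise ValueError(f"unexpected characters in {value!r}")

    # Restrict to one crawl and to the 'warc' subset (successful fetches).
    conditions = [f"crawl = '{crawl}'", "subset = 'warc'"]
    if language is not None:
        # Assumption: the primary language is listed first, e.g. 'dan,eng'.
        conditions.append(f"content_languages LIKE '{language}%'")
    if tld is not None:
        # Assumption: the country-code TLD stands in for the country.
        conditions.append(f"url_host_tld = '{tld}'")
    return f"SELECT url FROM {table} WHERE " + " AND ".join(conditions)


# Example: Danish-language pages under .dk in a single crawl.
query = build_url_export_query("CC-MAIN-2023-50", language="dan", tld="dk")
print(query)
```

Dropping the tld argument returns Danish pages on any domain; dropping
language returns everything under the given TLD.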
> I might also want all websites from a specific US state, e.g. Texas -
> would that be possible?
Well, that isn't as easy. First, what does it mean: (1) hosted in
Texas, (2) from an entity located in Texas, or (3) content about Texas?
The columnar index does not include IP addresses, so even (1) requires
looking into the WARC files, and (2) and (3) certainly do, because it's about
I've never worked with the Advertools package. You want to use it
to extract data from the HTML?