Is there an easy way to get a count of URLs for a particular domain?


Aaron Kempf

Nov 2, 2021, 5:28:49 AM
to Common Crawl
Hey;

I am a database guy, so I've got a question of scope. I want to get a page count (number of URLs) for a particular domain, but I don't want to do this for ONE domain... I want to do it for 100k domains.

I'm sorry to be a pain, I'm just targeting some of these companies as clients, and I want to get my head around how LARGE these websites are.

Does this dataset actually have comprehensive data of the whole internet?

Or should I just start counting the URLs in each sitemap for these 100k domains?

And I guess the REALLY important question is this: it's an XML.gz. Are there tools to help automate getting this data dump into Postgres or MySQL (on my end) if I DO decide to download? I mean, I don't know how I'm going to import a 300 GB XML file. I can easily do MAYBE a 10 GB file.

And then, I'm genuinely confused by the large number of different files to download. All I want is a count of URLs by root domain like Ford.com and ESPN.com.

I just want a list of
Ford.com      1234
ESPN.com     1452

I don't really have an extra terabyte of space lying around to do this. Maybe I should just send a subset of this data to a new VPS; I could spin one up. But I don't really want to do that, because it's going to be EXPENSIVE given the size of the data.

Cheers, and thanks for any help you guys can provide. I'm sorry about asking so many questions. I guess the most important one is: "Which file do I need?" Once I find out how large it is, I can make decisions on everything else almost immediately. Thanks.

Aaron Kempf
Microsoft Certified IT Professional

Sebastian Nagel

Nov 2, 2021, 5:57:53 AM
to common...@googlegroups.com
Hi Aaron,

> page count (Number of URLs) for a particular domain, but i don't want
> to do this for ONE domain... I want to do it for 100k Domains.

We do not have a complete database of all links found by the crawler.

> Does this dataset actually have comprehensive data of the whole
> internet?

Definitely not. Our target is to provide sample web data. A
comprehensive crawl would exceed our resources. The crawler
strictly respects the robots.txt rules, which also implies
that all websites that choose not to be crawled are missing
from the data.

Maybe this data could be valuable for your use case:

1. the monthly crawl metrics include per-domain counts (number of page
captures, unique URLs visited):
https://commoncrawl.github.io/cc-crawl-statistics/plots/domains
https://github.com/commoncrawl/cc-crawl-statistics
Domain counts are also available for download, see
https://groups.google.com/g/common-crawl/c/vsD4vBpDdG0

2. the columnar index includes a column "url_host_registered_domain";
see

https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
(a query sketch using this column follows after this list)

3. the webgraph releases - here the latest one:

https://commoncrawl.org/2021/10/host-and-domain-level-web-graphs-jun-jul-sep-2021/
also include domains linked from the crawled data: no page counts, but
the number of known subdomains.
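
For the use case above (a count of URLs per registered domain for a
fixed list of domains), the columnar index is probably the most direct
route. Below is a minimal sketch, not an official Common Crawl tool,
that runs such a count from Python via Amazon Athena. The table name
("ccindex"."ccindex"), the result-output bucket, and the crawl label
are assumptions to adapt to your own setup; the blog post linked under
2. explains how to register the table. The columns url,
url_host_registered_domain, crawl and subset are part of the columnar
index schema.

import time
import boto3

# Athena client; the Common Crawl index data lives in us-east-1.
athena = boto3.client("athena", region_name="us-east-1")

# Count page captures and distinct URLs per registered domain in one
# monthly crawl (crawl label and domain list are placeholders).
QUERY = """
SELECT url_host_registered_domain,
       COUNT(*)            AS page_captures,
       COUNT(DISTINCT url) AS unique_urls
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2021-43'
  AND subset = 'warc'
  AND url_host_registered_domain IN ('ford.com', 'espn.com')
GROUP BY url_host_registered_domain
"""

# Start the query; results land in your own S3 bucket (placeholder name).
execution = athena.start_query_execution(
    QueryString=QUERY,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/commoncrawl/"},
)
query_id = execution["QueryExecutionId"]

# Poll until Athena has finished the query.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

# Print one "domain / captures / urls" row per result (row 0 is the header).
if state == "SUCCEEDED":
    result = athena.get_query_results(QueryExecutionId=query_id)
    for row in result["ResultSet"]["Rows"][1:]:
        print("\t".join(col.get("VarCharValue", "") for col in row["Data"]))

Athena charges by the amount of data scanned, so restricting the query
to a single crawl partition should keep the cost small; there is no
need to download the full index.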

> Or should I just start counting the URLs in each sitemap for these
> 100k Domains?

You could extract sitemap links from the robots.txt datasets. That's
actually done by our crawler every month. However, we already need to
sample at this point: a single sitemap index may list 2.5 billion
URLs (Google Plus, for example, did so). We use
https://github.com/crawler-commons/crawler-commons/
(parsing XML is done by a SAX parser, which scales well).
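
Crawler-commons is a Java library; purely for illustration, here is a
rough Python sketch of the same idea (not crawler-commons itself): it
fetches one sitemap, optionally gunzips it, and counts <loc> entries
while parsing incrementally, so the full XML tree never sits in memory.
The sitemap URL is a placeholder, and a real tool would also follow
<sitemapindex> entries, enforce size limits, and handle malformed XML.

import gzip
import io
import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def count_sitemap_urls(url):
    """Incrementally parse one sitemap (optionally gzipped) and count <loc> elements."""
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    # Transparently decompress gzipped sitemaps (common for large ones).
    if url.endswith(".gz") or data[:2] == b"\x1f\x8b":
        stream = gzip.GzipFile(fileobj=io.BytesIO(data))
    else:
        stream = io.BytesIO(data)
    count = 0
    # iterparse plus elem.clear() discards elements as they are counted,
    # so memory stays flat even for very large sitemaps.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == SITEMAP_NS + "loc":
            count += 1
        elem.clear()
    return count

if __name__ == "__main__":
    # Placeholder URL, for illustration only.
    print(count_sitemap_urls("https://www.example.com/sitemap.xml"))

For 100k domains you would loop something like this over the sitemap
URLs announced in each domain's robots.txt.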

Best,
Sebastian

Aaron Kempf

Nov 3, 2021, 3:42:38 PM
to common...@googlegroups.com
OK, thank you so much. I am curious if there is anything I can do to help you guys on the database side. I'm pretty underutilized. I have a decent amount of experience with Linux, but I can really do many different things well. I just need to find a decent cause to volunteer for. And I think very highly of your organization. I wish I could contribute to your project in some way.

I don't think I could help you guys with Hadoop or anything like that, but I see data everywhere I look.

That said, I'm out of town for the next couple of days. If you want to see a resume, I can email one this weekend.

Aaron Kempf
Microsoft Certified IT Professional