Website classification

Mustafa Kural

Jul 13, 2022, 2:23:21 PM
to Common Crawl
Hi, I just joined the group and I have the following questions.

I am working on a project to classify websites based on the technology stack they are using, e.g. running an Apache server, using the WordPress CMS, having Google Tag Manager code, etc.

I can get much of this information from the HTML source.

So I only need root domain names, not internal pages.

As far as I can see, I can get domain names from the CC crawl data and scrape them for my purposes.

My questions are as follows, as you have tremendous experience with what I plan to do:
  1. I guess not all domains in the world have been crawled and recorded by CC: how many root domains does the CC dataset contain, what types of websites are not crawled, and what is the reason for the missing data? Maybe I can work on completing the missing part.
  2. What would be the best method to scrape the information I aim to fetch from all domains, one by one?
    1. Which AWS infrastructure (or alternative) is the best in price/performance, as such scraping will require a lot of hardware resources and/or time?
    2. What is the best software for creating such scraping bots? I guess Python and the relevant libraries would be a good choice?
    3. Are there any ready-to-use bots/crawlers/scrapers that I can use for my purposes instead of coding them from scratch?
I would appreciate it if you could share your experience so that I can start off on the right foot.

Thank you, and I really appreciate you creating such valuable data for all of us.

Regards
Murat




Sebastian Nagel

Jul 14, 2022, 7:48:43 AM
to common...@googlegroups.com
Hi Mustafa,

> So I only need root domain names, not internal pages.

Our hyperlink graphs contain the most complete list of host and domain
names below the registry suffix:

https://commoncrawl.org/2022/03/host-and-domain-level-web-graphs-oct-nov-jan-2021-2022/

Even hosts/domains that were not crawled but are known from links are included.
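
A small sketch for extracting domain names from one of those graph
releases, assuming (as an illustration) that each line of a downloaded
domain vertices file carries a numeric vertex ID followed by the domain
name in reversed-label order ("org.commoncrawl"); the file name below
is a placeholder:

    import gzip

    # Placeholder name for a locally downloaded domain vertices file;
    # the real files are linked from the announcement above.
    VERTICES = "domain-vertices.txt.gz"

    def reverse_labels(rev_domain):
        # Turn "org.commoncrawl" back into "commoncrawl.org".
        return ".".join(reversed(rev_domain.split(".")))

    with gzip.open(VERTICES, "rt", encoding="utf-8") as f:
        for line in f:
            # Assumed layout: vertex ID first, reversed domain second.
            vertex_id, rev_domain = line.rstrip("\n").split("\t")[:2]
            print(reverse_labels(rev_domain))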


> 1. I guess not all domains in the world have been crawled and recorded
>    by CC: how many root domains does the CC dataset contain, what types
>    of websites are not crawled, and what is the reason for the missing data?

The latest crawl includes data from 35 million registered domains and 44
million hosts. You get higher coverage if multiple crawls are combined.
Reasons why a host/domain isn't included: it wasn't sampled, it was
excluded by robots.txt, DNS did not resolve, the HTTP connection failed,
the domain owner asked not to be included, etc.


> 2. What would be the best method to scrape the information I aim to
>    fetch from all domains, one by one?

It definitely depends on your expectations:


> I can get much of this information from the HTML source.

Some technologies could be hidden in JavaScript or CSS, so the most
reliable results will likely require using a web browser:
- via browser automation (Selenium, Puppeteer, Playwright [1,2,3]) -
  see the sketch below
- possibly utilizing a browser plugin such as Wappalyzer [4,5]
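
A minimal browser-automation sketch with Playwright's Python package;
the fingerprints checked here (the Server header, a "wp-content" path,
a Google Tag Manager reference) are just illustrative examples, not a
complete rule set:

    from playwright.sync_api import sync_playwright

    def detect_stack(url):
        # Load the page in a headless browser so that technologies
        # injected by JavaScript are visible in the final DOM.
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            response = page.goto(url, wait_until="domcontentloaded")
            html = page.content()  # DOM after scripts have run
            result = {
                "server": response.headers.get("server"),
                "wordpress": "wp-content" in html,
                "gtm": "googletagmanager.com" in html,
            }
            browser.close()
            return result

    print(detect_stack("https://example.com/"))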

If looking at the HTTP headers and HTML is sufficient, you need far
fewer resources (maybe 50-100 times less than using a web browser).
Then using Common Crawl data directly is also an option. Note:
Common Crawl only captures the HTML page, not the page dependencies
(JavaScript, CSS, images, web fonts, etc.).
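
If you process the crawl archives directly, here is a sketch using the
warcio library; the file name is a placeholder for any WARC file
downloaded from Common Crawl:

    from warcio.archiveiterator import ArchiveIterator

    # Placeholder name; the real files are listed in the warc.paths
    # file of each crawl.
    with open("example.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            server = record.http_headers.get_header("Server")
            # The first chunk is enough for simple HTML fingerprints.
            html = record.content_stream().read(65536) \
                         .decode("utf-8", "replace")
            print(url, server, "wp-content" in html)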


Notes:
- for HTTP headers, the robots.txt dataset [6] might be useful:
  it's much smaller and also includes responses from servers which
  otherwise disallow crawling in their robots.txt
- some information might be visible in the webgraphs (with some noise),
  e.g. all WordPress sites link to "w.org" [7] - see the sketch below
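
To illustrate the webgraph idea, a sketch that scans the domain-level
graph for edges pointing to w.org. It assumes (placeholder file names)
that the vertices file maps numeric IDs to domain names in
reversed-label order and the edges file holds "<from_id> TAB <to_id>"
pairs; for the full graph you would stream this or use Spark rather
than hold everything in memory:

    import gzip

    VERTICES = "domain-vertices.txt.gz"  # placeholder file names
    EDGES = "domain-edges.txt.gz"

    # Map vertex IDs to reversed domain names; find w.org ("org.w").
    names = {}
    w_org_id = None
    with gzip.open(VERTICES, "rt", encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            vid, rev_domain = int(fields[0]), fields[1]
            names[vid] = rev_domain
            if rev_domain == "org.w":
                w_org_id = vid

    # Every domain with an edge to w.org is likely a WordPress site.
    with gzip.open(EDGES, "rt", encoding="utf-8") as f:
        for line in f:
            src, dst = map(int, line.split("\t")[:2])
            if dst == w_org_id:
                print(names[src])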


Best,
Sebastian

[1] https://www.selenium.dev/
[2] https://pptr.dev/
[3] https://playwright.dev/
[4] https://addons.mozilla.org/en-US/firefox/addon/wappalyzer/
[5] https://chrome.google.com/webstore/detail/wappalyzer-technology-pro/gppongmhjkpfnbhagpmjfkannfbllamg
[6] https://commoncrawl.org/2016/09/robotstxt-and-404-redirect-data-sets/
[7] https://groups.google.com/g/common-crawl/c/IjXp6qylOD8/m/6R3AeZeMBAAJ


Mustafa Kural

Jul 14, 2022, 9:18:45 AM
to Common Crawl
Hi Sebastian,
That's a lot of great, helpful information distilled from your experience; it would have taken me a lot of time to research on my own.

I will study all of this, as it will also be necessary for finding the right developer (maybe you can also make a suggestion).

I appreciate your support; may karma be with you...