Hi Mustafa,
> So I need root domain names, no need for internal pages.
Our hyperlink graphs contain the most complete list of host names
and domains below the registry suffix:
https://commoncrawl.org/2022/03/host-and-domain-level-web-graphs-oct-nov-jan-2021-2022/
Even hosts/domains that were not crawled but are known from links are included.
> 1. I guess not all domains in the world has been crawled and recorded
> by CC: how many root domains CC dataset has and what type of
> websites are not crawled, what is the reason for missing data?
The latest crawl includes data from 35 million registered domains and 44
million hosts. You get higher coverage if multiple crawls are
combined. Reasons why a host/domain isn't included: not sampled,
excluded by robots.txt, DNS does not resolve, HTTP connection failed,
domain owner asked not to be included, etc.
> 2. What would be the best method to scrape the information I aim to
> fetch from all domains one by one:
That definitely depends on your expectations:
> I can get many such info from html source.
Some technologies could be hidden in JavaScript or CSS. Hence, the most
reliable results likely require using a web browser:
- via browser automation (Selenium, Puppeteer, Playwright [1,2,3]);
  a minimal sketch follows below
- possibly utilizing a browser plugin such as Wappalyzer [4,5]
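As a rough sketch (assuming the Playwright Python package and its
Chromium build are installed; the URL and the checked fingerprints are
only placeholders), browser automation could look like this:

from playwright.sync_api import sync_playwright

# Hypothetical sketch: render one page and check a few technology
# fingerprints after JavaScript has run.
url = "https://example.com/"

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url, wait_until="networkidle")
    html = page.content()
    # Google Tag Manager usually defines window.dataLayer
    has_gtm = page.evaluate("() => typeof window.dataLayer !== 'undefined'")
    browser.close()

print("WordPress hint:", "wp-content" in html)
print("Google Tag Manager hint:", has_gtm)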
If looking at HTTP headers and HTML is sufficient, you need far fewer
resources (maybe 50-100 times less than using a web browser).
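For instance, a minimal sketch fetching only the headers and the raw
HTML with Python's requests library (URL, user agent and fingerprint
strings are placeholders):

import requests

# Hypothetical sketch: fetch one page without rendering it and look at
# the HTTP headers and the raw HTML source.
url = "https://example.com/"
resp = requests.get(url, timeout=10,
                    headers={"User-Agent": "tech-survey-bot/0.1"})

print("Server header:", resp.headers.get("Server", ""))
print("X-Powered-By:", resp.headers.get("X-Powered-By", ""))
print("WordPress hint:", "wp-content" in resp.text)
print("Google Tag Manager hint:", "googletagmanager.com" in resp.text)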
In that case, using Common Crawl data directly is also an option. Note:
Common Crawl only captures the HTML page but no page dependencies
(JavaScript, CSS, images, web fonts, etc.).
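As an illustration (assuming the warcio package and one WARC file
downloaded from a recent crawl; the file name below is a placeholder):

from warcio.archiveiterator import ArchiveIterator

# Hypothetical sketch: scan a downloaded Common Crawl WARC file and
# inspect the HTTP response headers and HTML payload of every capture.
warc_path = "CC-MAIN-example.warc.gz"  # placeholder file name

with open(warc_path, "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        server = record.http_headers.get_header("Server")
        html = record.content_stream().read()
        if b"wp-content" in html:
            print(url, server, "looks like WordPress")

The same reading pattern also works for the robots.txt captures
mentioned below.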
Notes:
- for HTTP headers, the robots.txt dataset [6] might be useful:
it's much smaller and also includes responses from servers which
otherwise disallow crawling in their robots.txt
- some information might be visible in the webgraphs (with some noise),
  e.g. all WordPress sites link to "w.org" [7]
Best,
Sebastian
[1] https://www.selenium.dev/
[2] https://pptr.dev/
[3] https://playwright.dev/
[4] https://addons.mozilla.org/en-US/firefox/addon/wappalyzer/
[5] https://chrome.google.com/webstore/detail/wappalyzer-technology-pro/gppongmhjkpfnbhagpmjfkannfbllamg
[6] https://commoncrawl.org/2016/09/robotstxt-and-404-redirect-data-sets/
[7] https://groups.google.com/g/common-crawl/c/IjXp6qylOD8/m/6R3AeZeMBAAJ
On 7/13/22 20:23, Mustafa Kural wrote:
> Hi, I just joined the group and I have following questions.
>
> I am working on a project to classify websites based on the technology
> stack they are using: eg: using Apache server, using Wordpress CMS, has
> google tag manager code, etc...
>
> I can get many such info from html source.
>
> So I need root domain names, no need for internal pages.
>
> As far as I see I can get domain names from CC crawl data and scrape
> them for my purposes.
>
> My questions are as follows as you guys have tremendous experience on
> what I plan to do:
>
> 1. I guess not all domains in the world has been crawled and recorded
> by CC: how many root domains CC dataset has and what type of
> websites are not crawled, what is the reason for missing data? Maybe
> I can work on to complete missing part.
> 2. What would be the best method to scrape the information I aim to
> fetch from all domains one by one:
> 1. Which AWS infrastructure (or else) is the the best
> (price/performance) as such a information scraping will require
> a lot of hardware resources and/or time
> 2. What is the best software to create such scraping bots: I guess
> Python and relevant libraries would be a good choice??
> 3. Is there any ready to use bots/crawlers/scrapers that I can use