Here's the output from the program I posted yesterday, run against the latest index (common-crawl/cc-index/collections/CC-MAIN-2015-14/indexes/):
1.6 billion (1,646,697,495) total URLs, 97.6% of them HTML pages (MIME type text/html).
Top 20 MIME types:
1607516137 | text/html
7879590 | image/jpeg
7735819 | text/xml
3111864 | application/pdf
3037987 | text/plain
2598422 | image/png
2384916 | application/rss+xml
1781801 | application/atom+xml
1748910 | unk
1422284 | text/calendar
1174435 | application/xml
1022890 | application/xhtml+xml
678183 | application/octet-stream
532237 | image/gif
359618 | audio/x-wav
323651 | application/json
197564 | unknown/unknown
139109 | text/HTML
135076 | video/x-ms-asf
134047 | application/vnd.google-earth.kml+xml
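For anyone who didn't see yesterday's post, a tally like this could be produced roughly as follows. This is only a minimal sketch, not the actual program: it assumes each index line looks like `<urlkey> <timestamp> <JSON>` with a "mime" field in the JSON, and the function name, the sample lines, and the "(missing)" label are all made up for illustration.

```python
import json
from collections import Counter


def count_mime_types(lines):
    """Tally the "mime" field across index lines.

    Assumed line format (an assumption, not confirmed by the post):
        <urlkey> <timestamp> <JSON object with a "mime" field>
    """
    counts = Counter()
    for line in lines:
        # Split off the two leading fields; the remainder is the JSON payload.
        parts = line.rstrip("\n").split(" ", 2)
        if len(parts) < 3:
            continue  # skip malformed lines
        try:
            record = json.loads(parts[2])
        except json.JSONDecodeError:
            continue  # skip lines whose payload is not valid JSON
        if not isinstance(record, dict):
            continue
        # "(missing)" is a hypothetical label for records without a MIME type.
        counts[record.get("mime", "(missing)")] += 1
    return counts


# Hypothetical sample lines, only to show the shape of the input.
sample = [
    'com,example)/ 20150327000000 {"url": "http://example.com/", "mime": "text/html"}',
    'com,example)/a.jpg 20150327000001 {"url": "http://example.com/a.jpg", "mime": "image/jpeg"}',
    'com,example)/b 20150327000002 {"url": "http://example.com/b", "mime": "text/html"}',
]

top = count_mime_types(sample).most_common(20)
for mime, n in top:
    print(f"{n} | {mime}")
```

Note that the counter keeps the MIME string exactly as recorded, which is why differently capitalized spellings of the same type show up as separate rows.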
Actually, there are a few more HTML pages than that if you include all these ways to spell "HTML" :-)
1607516137 text/html
1022890 application/xhtml+xml
139109 text/HTML
42645 application/vnd.wap.xhtml+xml
3512 Text/html
1638 download/html
661 text/x-server-parsed-html
598 Text/HTML
385 image/html
377 TEXT/HTML
277 text/html,text/html
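To put a number on "a few more", here is a small sketch that adds up the variants listed above. The counts are copied from this post; the `looks_like_html` helper is just an illustrative, deliberately loose test (anything containing "html", ignoring case), not a proper MIME parser.

```python
# Counts copied from the list above.
html_variants = {
    "text/html": 1607516137,
    "application/xhtml+xml": 1022890,
    "text/HTML": 139109,
    "application/vnd.wap.xhtml+xml": 42645,
    "Text/html": 3512,
    "download/html": 1638,
    "text/x-server-parsed-html": 661,
    "Text/HTML": 598,
    "image/html": 385,
    "TEXT/HTML": 377,
    "text/html,text/html": 277,
}
total_urls = 1646697495  # total URLs reported for the index


def looks_like_html(mime):
    """Loose, case-insensitive check for 'html' anywhere in the MIME string."""
    return "html" in mime.lower()


html_total = sum(n for mime, n in html_variants.items() if looks_like_html(mime))
extra = html_total - html_variants["text/html"]

print(f"all HTML variants: {html_total}")
print(f"beyond text/html:  {extra}")
print(f"share of all URLs: {100 * html_total / total_urls:.2f}%")
```

That works out to about 1.2 million pages on top of plain text/html, which moves the HTML share from about 97.6% to about 97.7%.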