Number of URLs in March 2015 index & MIME type breakdown

Tom Morris

Apr 25, 2015, 12:33:17 PM
to common...@googlegroups.com
Here's the output from the program that I posted yesterday, run against the latest index (common-crawl/cc-index/collections/CC-MAIN-2015-14/indexes/).

1.6 billion (1,646,697,495) total URLs with 97.6% being HTML pages.

Top 20 MIME types:

1607516137 text/html
7879590 image/jpeg
7735819 text/xml
3111864 application/pdf
3037987 text/plain
2598422 image/png
2384916 application/rss+xml
1781801 application/atom+xml
1748910 unk
1422284 text/calendar
1174435 application/xml
1022890 application/xhtml+xml
678183 application/octet-stream
532237 image/gif
359618 audio/x-wav
323651 application/json
197564 unknown/unknown
139109 text/HTML
135076 video/x-ms-asf
134047 application/vnd.google-earth.kml+xml
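
For anyone who wants to reproduce a tally like this, here is a minimal sketch. It is not the program referred to above. It assumes each index line has the form "urlkey timestamp JSON" and that the JSON object carries a "mime" field; the sample lines are made up for illustration, and using "unk" for records with no "mime" field is a choice made here.

```python
import json
from collections import Counter


def count_mime_types(lines):
    """Tally the "mime" field across an iterable of index lines."""
    counts = Counter()
    for line in lines:
        # Assumed layout: "<urlkey> <timestamp> <JSON object>"
        parts = line.rstrip("\n").split(" ", 2)
        if len(parts) < 3:
            continue  # skip lines that do not match the assumed layout
        try:
            record = json.loads(parts[2])
        except ValueError:
            continue  # skip lines whose JSON part does not parse
        if not isinstance(record, dict):
            continue  # skip JSON values that are not objects
        # "unk" for a missing field is an assumption of this sketch
        counts[record.get("mime", "unk")] += 1
    return counts


# Made-up sample lines, for illustration only.
sample_lines = [
    'com,example)/ 20150327000000 {"url": "http://example.com/", "mime": "text/html"}',
    'com,example)/a.pdf 20150327000001 {"url": "http://example.com/a.pdf", "mime": "application/pdf"}',
    'org,example)/ 20150327000002 {"url": "http://example.org/", "mime": "text/html"}',
]

if __name__ == "__main__":
    # Print the most common MIME types, largest count first.
    for mime, n in count_mime_types(sample_lines).most_common(20):
        print(n, mime)
```

To run it over a real index shard, the same function could be fed a file object opened with gzip.open(path, "rt"), assuming the shards are gzip-compressed text.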

Actually, there are a few more HTML pages than that if you include all these ways to spell "HTML" :-)

1607516137 text/html
1022890 application/xhtml+xml
139109 text/HTML
42645 application/vnd.wap.xhtml+xml
 3512 Text/html
 1638 download/html
  661 text/x-server-parsed-html
  598 Text/HTML
  385 image/html
  377 TEXT/HTML
  277 text/html,text/html
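
As a rough sketch of how these variants could be folded together, the snippet below sums the counts listed above using a simple heuristic: lowercase the value and check whether it contains "html". The heuristic and the function name is_html_like are illustrative assumptions, not how the list was actually produced.

```python
# Counts copied from the list above.
html_spellings = {
    "text/html": 1607516137,
    "application/xhtml+xml": 1022890,
    "text/HTML": 139109,
    "application/vnd.wap.xhtml+xml": 42645,
    "Text/html": 3512,
    "download/html": 1638,
    "text/x-server-parsed-html": 661,
    "Text/HTML": 598,
    "image/html": 385,
    "TEXT/HTML": 377,
    "text/html,text/html": 277,
}


def is_html_like(mime):
    """Heuristic (an assumption): any value containing "html", ignoring case."""
    return "html" in mime.lower()


# Sum every variant that the heuristic treats as HTML.
html_total = sum(n for mime, n in html_spellings.items() if is_html_like(mime))

total_urls = 1646697495  # total URL count reported earlier in this post

if __name__ == "__main__":
    print(html_total, f"{html_total / total_urls:.2%}")
```

A substring test like this is deliberately loose, so it would also match any other value that happens to contain "html"; a stricter version could compare against an explicit set of accepted types.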

José González-Brenes

Oct 19, 2016, 4:19:23 PM
to Common Crawl
Hello Tom,

How did you get this? I'm having trouble getting the online index API to return MIME types other than text.

Thanks,
Jose