Hi Dianne,
That's an interesting task and something I've pondered myself. As part of the code I wrote to show how to process the new WARC-based Common Crawl dataset, one of the examples actually generates HTML tag frequency counts from the raw HTML content. I wrote it quickly as an example, so it may still have some bugs, but it should be a reasonable starting point.
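To give you an idea of what it does without digging through the repository, the core of the mapper is roughly along these lines (this is just a sketch from memory rather than the actual code, so the class and field names are illustrative, and it assumes the WARC record's HTML payload arrives as the map value):

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TagCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Matches the name of an opening tag, e.g. "<h1 class=...>" gives "h1"
    private static final Pattern TAG = Pattern.compile("<([0-9A-Za-z]+)[\\s>/]");
    private static final IntWritable ONE = new IntWritable(1);
    private final Text tag = new Text();

    @Override
    protected void map(LongWritable key, Text htmlContent, Context context)
            throws IOException, InterruptedException {
        // Emit (tag, 1) for every opening tag found in the raw HTML
        Matcher m = TAG.matcher(htmlContent.toString());
        while (m.find()) {
            tag.set(m.group(1).toLowerCase());
            context.write(tag, ONE);
        }
    }
}

The reducer is just the standard word-count style sum of the ones emitted for each tag.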
The example assumes you're interested in usage across all web pages on the Internet, not averaged across domains or similar. Normalizing the counts by the size of each domain might be an interesting variation, since it could be seen as "unfair" to weight a website more heavily just because it pushes out more pages.
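If you did want to go down the per-domain route, a rough sketch of that variation could look like the following (again purely illustrative, and it assumes the input format hands the mapper the page's URL as the key, which isn't necessarily the case):

import java.io.IOException;
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PerDomainTagCountMapper extends Mapper<Text, Text, Text, IntWritable> {
    private static final Pattern TAG = Pattern.compile("<([0-9A-Za-z]+)[\\s>/]");
    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(Text url, Text htmlContent, Context context)
            throws IOException, InterruptedException {
        String domain;
        try {
            domain = URI.create(url.toString()).getHost();
        } catch (IllegalArgumentException e) {
            return; // skip records whose URL can't be parsed
        }
        if (domain == null) {
            return;
        }
        // One marker per page so a later step knows how many pages the domain contributed
        outKey.set(domain + "\t!pages");
        context.write(outKey, ONE);
        // Then the usual per-tag counts, keyed by "domain <tab> tag"
        Matcher m = TAG.matcher(htmlContent.toString());
        while (m.find()) {
            outKey.set(domain + "\t" + m.group(1).toLowerCase());
            context.write(outKey, ONE);
        }
    }
}

Dividing each domain's tag counts by its "!pages" count would then give a per-page rate for that domain, which you could average across domains.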
I've unfortunately not had the time to run the MapReduce program over any large amount of data yet, so I don't have any good numbers for you, but...
The two relevant files are:
To pull out just the HTML tags that are composed of letters and numbers (a-z, A-Z, 0-9), you can run the following command on the reduced output:
grep -E "^[0-9A-Za-z]+ [0-9]+" --binary-files=text part-r-00000 > only_text
This is necessary as there's a lot of binary garbage in the output.
Attached is an example of the output that I generated by looking at a single 859MB compressed WARC file.
It's pointing at the frequency for the h1 tag, followed by the frequencies for the rest of the numbered heading tags, all the way from h1 up to h15.
Hopefully that's of help, good luck!