Hi Dianne,
That's an interesting task and something I've pondered myself. As part of the code I wrote to show how to process the new WARC-based Common Crawl dataset, one of the examples actually generates HTML tag frequency counts from the raw HTML content. I wrote it quickly as an example, so it may still have some bugs, but it should be a reasonable starting point.
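To give you an idea of what it does without digging through the repository, the core of the mapper is roughly along these lines (this is just a sketch from memory rather than the actual code, so the class and field names are illustrative, and it assumes the WARC record's HTML payload arrives as the map value):

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TagCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Matches the name of an opening tag, e.g. "<h1 class=...>" gives "h1"
    private static final Pattern TAG = Pattern.compile("<([0-9A-Za-z]+)[\\s>/]");
    private static final IntWritable ONE = new IntWritable(1);
    private final Text tag = new Text();

    @Override
    protected void map(LongWritable key, Text htmlContent, Context context)
            throws IOException, InterruptedException {
        // Emit (tag, 1) for every opening tag found in the raw HTML
        Matcher m = TAG.matcher(htmlContent.toString());
        while (m.find()) {
            tag.set(m.group(1).toLowerCase());
            context.write(tag, ONE);
        }
    }
}

The reducer is just the standard word-count style sum of the ones emitted for each tag.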
The example assumes you're interested in usage across all web pages on the Internet, not averaged across domains or similar. Normalizing the counts by the size of each domain might be an interesting variation, since it could be seen as "unfair" to weight a website more heavily just because it pushes out more pages.
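If you did want to go down the per-domain route, a rough sketch of that variation could look like the following (again purely illustrative, and it assumes the input format hands the mapper the page's URL as the key, which isn't necessarily the case):

import java.io.IOException;
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PerDomainTagCountMapper extends Mapper<Text, Text, Text, IntWritable> {
    private static final Pattern TAG = Pattern.compile("<([0-9A-Za-z]+)[\\s>/]");
    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(Text url, Text htmlContent, Context context)
            throws IOException, InterruptedException {
        String domain;
        try {
            domain = URI.create(url.toString()).getHost();
        } catch (IllegalArgumentException e) {
            return; // skip records whose URL can't be parsed
        }
        if (domain == null) {
            return;
        }
        // One marker per page so a later step knows how many pages the domain contributed
        outKey.set(domain + "\t!pages");
        context.write(outKey, ONE);
        // Then the usual per-tag counts, keyed by "domain <tab> tag"
        Matcher m = TAG.matcher(htmlContent.toString());
        while (m.find()) {
            outKey.set(domain + "\t" + m.group(1).toLowerCase());
            context.write(outKey, ONE);
        }
    }
}

Dividing each domain's tag counts by its "!pages" count would then give a per-page rate for that domain, which you could average across domains.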
I've unfortunately not had the time to run the MapReduce program over any large amount of data yet, so I don't have any good numbers for you, but...
The two relevant files are:
To pull out just the HTML tags that are composed of letters and numbers (a-z, A-Z, 0-9), you can run the following command on the reduced output:
grep -E "^[0-9A-Za-z]+ [0-9]+" --binary-files=text part-r-00000 > only_text
This is necessary as there's a lot of binary garbage in the output.
Attached is an example of the output that I generated by looking at a single 859MB compressed WARC file.
It's pointing at the frequency for the h1 tag, followed by the frequencies for the rest of the numbered heading tags, all the way from h1 up to h15.
Hopefully that's of help, good luck!