when you say logging you don't mean debugging output, do you? either way, you pretty much never want to write to standard out. sometimes it's useful to write trace info, but that's more for the exception cases.
i think what you want to do is basically this code...
but emitting one record per html page (with the term count) instead of one record per token.
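// collapse runs of whitespace and trim the ends so split(" ") gives one entry per term
pageText = pageText.replaceAll("\\s+", " ").trim();
// java arrays have .length (not .size()); emit one (title, term count) record per page
output.collect(new Text(titleOfPage), new LongWritable(pageText.split(" ").length));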
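and because you want to output only 1 record per page you don't need any reduce step (since you have no aggregation going on). you can make the job map-only by setting the number of reduce tasks to 0 (setNumReduceTasks(0) on the job conf), and the map output gets written out directly.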
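if you want to try the counting part on its own before wiring it into a job, here's a rough sketch in plain java (no hadoop needed). the class and method names (TermCount, countTerms) are just made up for illustration, and "term" here simply means anything separated by whitespace, which may not be the tokenisation you actually want.

```java
class TermCount {
    // count whitespace-separated terms in a page's text
    // (hypothetical helper, not part of hadoop or the corpus tools)
    static long countTerms(String pageText) {
        // collapse runs of whitespace to single spaces and drop leading/trailing ones
        String normalized = pageText.replaceAll("\\s+", " ").trim();
        if (normalized.isEmpty()) {
            // "".split(" ") returns one empty string, so treat empty pages explicitly
            return 0;
        }
        return normalized.split(" ").length;
    }

    public static void main(String[] args) {
        System.out.println(countTerms("  hello   big\n world "));
    }
}
```

the empty check matters: without it a page with no text at all would be reported as having 1 term.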
if you only want to deal with the visible text of the page (and not the raw html) you might be better off using the TextData version of the corpus (that has the html stripped off)
it depends on how much you want to do your own content extraction.
the only problem with the TextData is it doesn't have the page title extracted (i'm assuming when you say title you mean the html title tag?) the TextData version in general would be easier, but you'd be missing that :/
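if you do stick with the raw html so you can get at the title, something like the following might be enough as a first pass. it's only a sketch, assuming the title you want is whatever sits between the first <title> and </title>; TitleExtract and extractTitle are names i've made up, and a regex like this is fragile on messy real-world html, so a proper html parser would be the more robust route.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleExtract {
    // case-insensitive so <TITLE> matches, DOTALL so a title spanning lines matches
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // returns the text of the first title tag, or "" if the page has none
    // (hypothetical helper for illustration only)
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        if (!m.find()) {
            return "";
        }
        // tidy up newlines and repeated spaces inside the title
        return m.group(1).replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(extractTitle("<html><head><title>my page</title></head></html>"));
    }
}
```

returning "" for pages without a title keeps the mapper simple, but you may prefer to skip those pages or fall back to the url as the key.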
does this give you a rough idea of how to progress? let me know..
mat