|logging from mapper||Blake Messer||12/29/12 3:14 PM|
I'm modifying the code in the MapReduce for the Masses tutorial. My end goal is to go through each HTML page, count the words on the page, and log that total along with the page's title. My question is: how do I log from the mapper? Just calling System.out.println("string") doesn't get logged anywhere in the /logs output that gets stored on my S3. What's the standard way to log from the mapper or reducer?
|Re: logging from mapper||Mat Kelcey||12/29/12 7:36 PM|
when you say logging you don't mean debugging output do you? in either case you pretty much never want to write to standard out. sometimes it's useful to write trace info but it's more for the exception cases.
i think what you want to do is basically this code...
but emitting one record per html page (with the term count) instead of one record per token.
and because you want to output only 1 record per page you don't need any reduce step (since you have no aggregation going on)
if you only want to deal with the visible text of the page (and not the raw html) you might be better off using the TextData version of the corpus (that has the html stripped off)
it depends on how much you want to do your own content extraction.
the only problem with the TextData is it doesn't have the page title extracted (i'm assuming when you say title you mean the html title tag?) the TextData version in general would be easier, but you'd be missing that :/
does this give you a rough idea of how to progress? let me know..
|Re: logging from mapper||Blake Messer||12/30/12 11:39 PM|
That helps immensely. I have to have the title, but I don't mind stripping out the HTML myself.
This is what I went with, and it seems to be giving me the expected output.
output.collect(new Text(Jsoup.parse(content).title()), new LongWritable(pageText.split(" ").length));
Without a reducer, will all of the output go to a single file? Or will each mapper generate its own output file?
Thanks again for your help!
|Re: logging from mapper||Mat Kelcey||12/31/12 9:38 AM|
as long as you're happy with the result you're good to go. the exact definition of "number of words on a webpage" is open to a lot of debate :) i'd do it manually on a few random samples and see if it looks ok.
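One thing worth checking when doing those manual samples: in Java, split(" ") yields an empty token for each extra consecutive space, and an empty string still splits into one token, so the count can run high on pages with irregular whitespace. A minimal sketch of a whitespace-tolerant count, assuming "word" simply means a run of non-whitespace characters (the class and method names are hypothetical, not from the tutorial):

```java
// Hypothetical helper, not part of the tutorial code.
// Assumption: a "word" is any maximal run of non-whitespace characters.
final class WordCounter {
    private WordCounter() {}

    /** Counts whitespace-separated tokens; null or blank text counts as 0. */
    static long countWords(String text) {
        if (text == null) {
            return 0;
        }
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            // "".split(...) returns one empty token, so handle this case explicitly.
            return 0;
        }
        // \s+ treats runs of spaces, tabs, and newlines as a single separator.
        return trimmed.split("\\s+").length;
    }
}
```

With this, the value passed to output.collect could be something like new LongWritable(WordCounter.countWords(pageText)), though whether that matches the intended definition of a word is still a judgment call.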
yeah, without a reducer each mapper writes its own output file, so each split will result in one output file. in this case since the arc.gz files aren't splittable you'll get one output file per arc file.
some other ideas
1) scale up slowly; make sure everything works with 1 arc file, then 10 arc files, etc, until you're happy to run as big as you can spend.
2) try/catch like a boss. no matter how perfect your code seems, weird things happen after a billion records.
https://github.com/ssalevan/cc-helloworld/blob/master/src/org/commoncrawl/tutorial/WordCountMapper.java#L51 is a good example. you don't want one or two rogue records killing your job.
3) use counters, they help you trust your result. eg in this code https://github.com/matpalm/common-crawl-quick-hacks/blob/master/links_in_metadata/src/com/matpalm/ExtractTldLinks.java#L107 i keep track of a couple of reasonable null check counts. eyeballing these numbers can really help prove things are working as expected. ( just keep the number of distinct counters bounded )
4) related to 2) "mapred.max.map.failures.percent" is super important for these kinds of jobs, unless you actually do care that some records are corrupted... https://github.com/ssalevan/cc-helloworld/blob/master/src/org/commoncrawl/tutorial/HelloWorld.java#L110
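Points 2) and 3) can be sketched together in plain Java. This is only an illustration of the pattern, not the code from either linked repository: the class name, the counter names, and the in-memory map standing in for Hadoop counters are all assumptions. In a real mapper using the old org.apache.hadoop.mapred API, the increments would typically go through reporter.incrCounter(group, counter, 1) instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a mapper: per-record try/catch plus bounded counters.
final class SafeRecordLoop {
    // In-memory substitute for Hadoop counters (assumption for this sketch).
    private final Map<String, Long> counters = new TreeMap<>();
    // Collected "title<TAB>wordCount" records, standing in for output.collect.
    private final List<String> output = new ArrayList<>();

    private void increment(String name) {
        counters.merge(name, 1L, Long::sum);
    }

    /** Processes one page; a bad record is counted and skipped, never rethrown. */
    void map(String title, String pageText) {
        try {
            if (title == null || title.isEmpty()) {
                increment("missing_title"); // null-check style counter
                return;
            }
            String trimmed = pageText.trim(); // throws if pageText is null
            long words = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
            output.add(title + "\t" + words);
            increment("pages_emitted");
        } catch (Exception e) {
            // Keyed by exception class, so the number of distinct counters stays bounded.
            increment("exception_" + e.getClass().getSimpleName());
        }
    }

    long getCounter(String name) {
        return counters.getOrDefault(name, 0L);
    }

    List<String> getOutput() {
        return output;
    }
}
```

Eyeballing the counters after a small run (how many pages were emitted versus skipped, and for which reason) is what gives you confidence before scaling up to more arc files.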
let us know how you get along!