Hi all,
I'm trying to use mrjob to run Hadoop on EMR, and I can't figure out how to set up logging (user-generated logs in map/reduce steps) so that I can access the logs after the cluster is terminated.
I have tried setting up logging with the `logging` module, with `print`, and with `sys.stderr.write()`, but with no luck so far. I also tried attaching my logger to stderr via `mrjob.util.log_to_stream`, but that didn't work either. The only option that works for me is to write the logs to a file, then SSH into the machine and read them, but that's cumbersome. I would like my logs to go to stderr/stdout/syslog and be collected to S3 automatically, so I can view them after the cluster is terminated.
I know unit testing with the local runner is highly recommended, but sometimes unexpected inputs raise exceptions or cause problems that aren't covered by the tests. I would greatly appreciate any help, as my debugging process is very slow due to this limitation.
Here is the word_freq example with logging:
"""The classic MapReduce job: count the frequency of words.
"""
from mrjob.job import MRJob
import re
import logging
import logging.handlers
import sys

WORD_RE = re.compile(r"[\w']+")


class MRWordFreqCount(MRJob):

    def mapper_init(self):
        # Attach every handler I could think of to the root logger,
        # hoping at least one of them shows up in the collected logs.
        self.logger = logging.getLogger()
        self.logger.setLevel(logging.INFO)
        self.logger.addHandler(logging.FileHandler("/tmp/mr.log"))
        self.logger.addHandler(logging.StreamHandler())  # defaults to stderr
        self.logger.addHandler(logging.StreamHandler(sys.stdout))
        self.logger.addHandler(logging.handlers.SysLogHandler())

    def mapper(self, _, line):
        # Log the same message via every channel I'm testing.
        self.logger.info("Test logging: %s", line)
        sys.stderr.write("Test stderr: %s\n" % line)
        print "Test print: %s" % line
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def combiner(self, word, counts):
        yield (word, sum(counts))

    def reducer(self, word, counts):
        yield (word, sum(counts))


if __name__ == '__main__':
    MRWordFreqCount.run()
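And just to be clear that the job logic itself isn't the problem: stripped of mrjob, the mapper/combiner/reducer above boils down to this plain-Python word count (`word_freq` is just an illustrative name), so the logging is the only moving part:

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[\w']+")


def word_freq(lines):
    """Plain-Python equivalent of MRWordFreqCount: the mapper emits
    (word.lower(), 1) for each word, and the combiner/reducer sum the
    counts per word."""
    counts = Counter()
    for line in lines:
        for word in WORD_RE.findall(line):
            counts[word.lower()] += 1
    return dict(counts)
```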
Regards,
Beka