Hi all,
I'm trying to use mrjob to run Hadoop on EMR, and I can't figure out how to set up logging (user-generated logs in map/reduce steps) so that I can access the logs after the cluster is terminated.
I have tried setting up logging with the `logging` module, with `print`, and with `sys.stderr.write()`, but with no luck so far. I also tried attaching my logger to stderr via `mrjob.util.log_to_stream`, but that didn't work either. The only option that works for me is to write the logs to a file, then SSH into the machine and read them, but that's cumbersome. I would like my logs to go to stderr/stdout/syslog and be collected to S3 automatically, so I can view them after the cluster is terminated.
I know unit testing with the local runner is highly recommended, but sometimes unexpected inputs raise exceptions or cause problems that aren't covered by the tests. I would greatly appreciate any help, as my debugging process is very slow due to this limitation.
Here is the word_freq example with logging:
"""The classic MapReduce job: count the frequency of words.
"""
from mrjob.job import MRJob
import re
import logging
import logging.handlers
import sys

WORD_RE = re.compile(r"[\w']+")


class MRWordFreqCount(MRJob):

    def mapper_init(self):
        # Attach every handler I could think of to the root logger,
        # hoping at least one of them shows up in the collected logs.
        self.logger = logging.getLogger()
        self.logger.setLevel(logging.INFO)
        self.logger.addHandler(logging.FileHandler("/tmp/mr.log"))
        self.logger.addHandler(logging.StreamHandler())  # defaults to stderr
        self.logger.addHandler(logging.StreamHandler(sys.stdout))
        self.logger.addHandler(logging.handlers.SysLogHandler())

    def mapper(self, _, line):
        # Log the same message via every channel I'm testing.
        self.logger.info("Test logging: %s", line)
        sys.stderr.write("Test stderr: %s\n" % line)
        print "Test print: %s" % line
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def combiner(self, word, counts):
        yield (word, sum(counts))

    def reducer(self, word, counts):
        yield (word, sum(counts))


if __name__ == '__main__':
    MRWordFreqCount.run()
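And just to be clear that the job logic itself isn't the problem: stripped of mrjob, the mapper/combiner/reducer above boils down to this plain-Python word count (`word_freq` is just an illustrative name), so the logging is the only moving part:

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[\w']+")


def word_freq(lines):
    """Plain-Python equivalent of MRWordFreqCount: the mapper emits
    (word.lower(), 1) for each word, and the combiner/reducer sum the
    counts per word."""
    counts = Counter()
    for line in lines:
        for word in WORD_RE.findall(line):
            counts[word.lower()] += 1
    return dict(counts)
```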
Regards,
Beka