NPE in TaskLogAppender.flush()

485 views
Skip to first unread message

aaron morton

unread,
Aug 7, 2011, 6:14:50 AM8/7/11
to Brisk Users
Running brisk as a single node, on a MBP with OSX Lion, using the bin distro.  My tasks are somehow succeeding with multiple attempts but I'm getting some errors logged.

When I run a Hive query like (on an auto generated table)

select * from TimeSeriesMap where row_key = '20110625015500';

I see a couple of these turn up in the cassandra log

 INFO 21:44:23,771 Error from attempt_201108072143_0001_m_000002_0: java.lang.NullPointerException
at org.apache.hadoop.mapred.TaskLogAppender.flush(TaskLogAppender.java:67)
at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:264)
at org.apache.hadoop.mapred.Child$4.run(Child.java:261)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.Child.main(Child.java:253)


Looking at the Hadoop logs for the task attempts I see this from stderr:

2011-08-07 21:54:30.693 java[43039:1903] Unable to load realm info from SCDynamicStore

This from stdout for the same task attempt: 

 INFO 21:54:31,716 Logging initialized
 INFO 21:54:31,805 Cleaning up job
 INFO 21:54:31,805 Committing job
 INFO 21:54:31,901 Task:attempt_201108072143_0001_m_000001_1 is done. And is in the process of commiting
 INFO 21:54:31,907 Task 'attempt_201108072143_0001_m_000001_1' done.
 WARN 21:54:31,908 Error running child
java.lang.NullPointerException
at org.apache.hadoop.mapred.TaskLogAppender.flush(TaskLogAppender.java:67)
at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:264)
at org.apache.hadoop.mapred.Child$4.run(Child.java:261)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.Child.main(Child.java:253)
 INFO 21:54:31,911 Runnning cleanup for the task

I've got not great insights, and it's probably a Hadoop thing, but wondering if anyone else has encountered this ? 

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton

Nate McCall

unread,
Aug 12, 2011, 2:27:54 PM8/12/11
to brisk...@googlegroups.com
Aaron,
I just got around this by modifying mapred-queue-acls.xml

The default value of blank blocks all jobs (!) and setting this to '*' worked.

This is actually detailed in the comments in the above file, but took
me forever to track down the source. Let me know if this works for you
as well.

aaron morton

unread,
Aug 14, 2011, 9:42:44 PM8/14/11
to brisk...@googlegroups.com
mmm, tried setting that but no joy. Still getting the same error. 

The docs http://hadoop.apache.org/common/docs/r0.20.0/cluster_setup.html say the those settings are ignored unless mapred.acls.enabled (in mapred-site.xml) is set to true. It's false by default, I turned it on but it did not make a difference. 

May try the hadoop list. 

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton

ps

unread,
Aug 15, 2011, 2:49:49 AM8/15/11
to Brisk Users
Am getting similar errors with wordcount example, where map and reduce
jobs finish successfully but the job cleanup throws the NPE. Do let me
know your findings.

Thanks.

aaron morton

unread,
Aug 15, 2011, 3:26:50 AM8/15/11
to brisk...@googlegroups.com
Will do, this is my thread on hadoop common if you want to play along at home 

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton

aaron morton

unread,
Aug 18, 2011, 9:15:57 AM8/18/11
to brisk...@googlegroups.com
Got some where, this is going to be a long email. 

The short version is cassandra and hadoop are all using the same log4j logger repository and they are stamping on each other. IIRC the repository in log4* allows different apps in the same process to have different configs. More links at the bottom of the email.

Here we go….

I applied this patch https://gist.github.com/1153956 to log4j-1.2.16 and dropped the jar into brisk/resources/brisk/lib and  made sure that was the only log4j jar out there.

Modify the resources/hadoop/conf/mapred-site.xml file top pass the log4 debug setting into the task JVM as follows 

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx256m -Dlog4j.debug=true</value>
  <description>
    Tune your mapred jvm arguments for best performance. 
    Also see documentation from jvm vendor.
  </description>
</property>

Then started brisk (bin/brisk cassandra -f -t) and then hive (bin/brisk hive) and did my query that breaks. Remember this is the error
java.lang.NullPointerException
at org.apache.hadoop.mapred.TaskLogAppender.flush(TaskLogAppender.java:67)
at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:264)
at org.apache.hadoop.mapred.Child$4.run(Child.java:261)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.Child.main(Child.java:253)

The map task is actually completing but failing before it can communicate back to the task tracker, if you wait long enough this turns up in the hadoop logs
INFO 00:31:16,465 attempt_201108190021_0001_m_000001_0: Task attempt_201108190021_0001_m_000001_0 failed to report status for 600 seconds. Killing!
 INFO 00:31:16,481 Process Thread Dump: lost task

Once task has errored open the hadoop job logs e.g. brisk/logs/hadoop/userlogs/job_blah/attempt_blah/stdout. 

This TLA appender is configured in resources/hadoop/conf/log4j.properties as an instance of org.apache.hadoop.mapred.TaskLogAppender and is set as the appender for the root logger in o.a.h.mapred.TaskRunner .

This is the TLA logger been configured the first time (from the stdout log above), reset() is called as part of opening the file. Whats happening here is correct…

log4j: Parsing appender named "TLA".
log4j: Parsing layout options for "TLA".
log4j: Setting property [conversionPattern] to [%d{ISO8601} %p %c: %m%n].
log4j: End of parsing for "TLA".
log4j: Setting property [isCleanup] to [false].
log4j: Setting property [totalLogFileSize] to [0].
log4j: Setting property [taskId] to [attempt_201108190021_0001_m_000000_0].
log4j: setFile called for [TLA]: /Users/aaron/frameworks/brisk/brisk-1.0~beta2/logs/hadoop/userlogs/job_201108190021_0001/attempt_201108190021_0001_m_000000_0/syslog, true
log4j: Reset called for appender [TLA].
log4j: java.lang.Thread.getStackTrace(Thread.java:1503)
org.apache.log4j.WriterAppender.reset(WriterAppender.java:350)
org.apache.log4j.FileAppender.reset(FileAppender.java:346)
org.apache.log4j.FileAppender.setFile(FileAppender.java:288)
org.apache.log4j.FileAppender.activateOptions(FileAppender.java:165)
org.apache.hadoop.mapred.TaskLogAppender.activateOptions(TaskLogAppender.java:48)
org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:307)
org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:172)
org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:104)
org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:809)
org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:735)
org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:615)
org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:502)
org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:547)
org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:483)
org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
org.apache.log4j.Logger.getLogger(Logger.java:104)
org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:229)
org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:65)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
java.lang.reflect.Constructor.newInstance(Constructor.java:513)
org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.java:529)
org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:235)
org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:209)
org.apache.commons.logging.LogFactory.getLog(LogFactory.java:351)
org.apache.hadoop.mapred.Child.<clinit>(Child.java:61)
log4j: setQWForFiles called for [TLA]
log4j: setFile ended for [TLA]: qw = [org.apache.log4j.helpers.QuietWriter@584fce71]
log4j: Parsed "TLA" options.


Now the bad stuff. reset() is been called on the appender, and this closes it and sets the qw (for QuietWriter) var to null, giving us the NPE later….

log4j: Reset called for appender [TLA].
log4j: java.lang.Thread.getStackTrace(Thread.java:1503)
org.apache.log4j.WriterAppender.reset(WriterAppender.java:350)
org.apache.log4j.FileAppender.reset(FileAppender.java:346)
org.apache.log4j.WriterAppender.close(WriterAppender.java:207)
org.apache.hadoop.mapred.TaskLogAppender.close(TaskLogAppender.java:77)
org.apache.log4j.helpers.AppenderAttachableImpl.removeAllAppenders(AppenderAttachableImpl.java:141)
org.apache.log4j.Category.removeAllAppenders(Category.java:891)
org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:726)
org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:615)
org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:502)
org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:395)
org.apache.log4j.PropertyWatchdog.doOnChange(PropertyConfigurator.java:922)
org.apache.log4j.helpers.FileWatchdog.checkAndConfigure(FileWatchdog.java:89)
org.apache.log4j.helpers.FileWatchdog.<init>(FileWatchdog.java:58)
org.apache.log4j.PropertyWatchdog.<init>(PropertyConfigurator.java:914)
org.apache.log4j.PropertyConfigurator.configureAndWatch(PropertyConfigurator.java:461)
org.apache.cassandra.service.AbstractCassandraDaemon.<clinit>(AbstractCassandraDaemon.java:100)
org.apache.cassandra.config.DatabaseDescriptor.<clinit>(DatabaseDescriptor.java:356)
org.apache.cassandra.utils.FBUtilities.getLocalAddress(FBUtilities.java:175)
org.apache.cassandra.hadoop.fs.CassandraFileSystemThriftStore.initialize(CassandraFileSystemThriftStore.java:128)
org.apache.cassandra.hadoop.fs.CassandraFileSystem.initialize(CassandraFileSystem.java:59)
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1310)
org.apache.hadoop.fs.FileSystem.access$100(FileSystem.java:65)
org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1328)
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:226)
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:109)
org.apache.hadoop.mapred.Child$4.run(Child.java:258)
java.security.AccessController.doPrivileged(Native Method)
javax.security.auth.Subject.doAs(Subject.java:396)
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
org.apache.hadoop.mapred.Child.main(Child.java:253)

The call stack shows AbstractCassandraDaemon spinning up and settings a log4j property watcher. This is closing all the appenders, you can see TaskLogAppender.close() in the stack .   

The TLA appender has been closed and we are not watching the the property file that contains it's config. It's Deady Mc Dead Dead.

here are the links I've found about repositories, it's 1am now so will need to pick this up later…


Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton

Nate McCall

unread,
Aug 18, 2011, 7:17:48 PM8/18/11
to brisk...@googlegroups.com
Aaron,
Thanks for all the extra details. I'll go over this shortly and see if
I can get any further. Let me know if you find anything else out.

aaron morton

unread,
Aug 18, 2011, 9:14:19 PM8/18/11
to brisk...@googlegroups.com
I'm going to submit a patch into Hadoop to make their TaskLogAppender a little safer. Other Appender sub classes check the appender state and fail silently (with internal log4j logging) when the appender is closed. 

I dont think the first solution here http://articles.qos.ch/sc.html will work for us. 

Doing some reading about creating a RepositorySelector now. I'll try to grab you on IRC before i write anything. 

(I used to work at a company that ported log4j to .net , remembering lots of useless details )

Cheers
 
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton

aaron morton

unread,
Aug 19, 2011, 12:23:46 AM8/19/11
to brisk...@googlegroups.com
Came up with a hacky work around, added an option to stop configuring log4j, creating logger repositories would be tricky for us. See the ticket for more info...


-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton

Nate McCall

unread,
Aug 19, 2011, 11:44:45 AM8/19/11
to brisk...@googlegroups.com
I'll take hacky if it works - particularly when dealing with the
config soup of Java logging between multiple open source projects!

Thanks again for all of your effort here.

Reply all
Reply to author
Forward
0 new messages