Exceeded max jobconf size


Matt Stump

Mar 12, 2012, 1:09:44 PM
to cascal...@googlegroups.com
Has anyone encountered the following exception, and if so, what did they do about it?

java.io.IOException: java.io.IOException: Exceeded max jobconf size

Stack trace:

Exception in thread "main" cascading.flow.FlowException: unhandled exception 
at cascading.flow.Flow.complete(Flow.java:821) 
at cascalog.api$_QMARK__.doInvoke(api.clj:210) 
at clojure.lang.RestFn.invoke(RestFn.java:421) 
at courier.lib.yum.core$_main$fn__5835.invoke(core.clj:384) 
at courier.lib.yum.core$_main.invoke(core.clj:382) 
at clojure.lang.AFn.applyToHelper(AFn.java:185) 
at clojure.lang.AFn.applyTo(AFn.java:151) 
at courier.lib.yum.core.main(Unknown Source) 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) 
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) 
at java.lang.reflect.Method.invoke(Method.java:597) 
at org.apache.hadoop.util.RunJar.main(RunJar.java:156) 
Caused by: org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.io.IOException: Exceeded max jobconf size: 5861149 limit: 5242880 
at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3954) 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) 
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) 
at java.lang.reflect.Method.invoke(Method.java:597) 
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563) 
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388) 
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384) 
at java.security.AccessController.doPrivileged(Native Method) 
at javax.security.auth.Subject.doAs(Subject.java:396) 
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083) 
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382) 
Caused by: java.io.IOException: Exceeded max jobconf size: 5861149 limit: 5242880 
at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:408) 
at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3952)

Sam Ritchie

Mar 12, 2012, 1:12:24 PM
to cascal...@googlegroups.com
Matt,

Are you trying to source from a very large number of paths? Other possibilities might be:

- A memory source tap with a very, very large sequence
- Really large op parameters

Do any of these ring a bell?

Cheers,
Sam

-- 
Sam Ritchie
Sent with Sparrow

Matt Stump

Mar 12, 2012, 1:20:46 PM
to cascal...@googlegroups.com
How big does a memory tap need to be to qualify as very big? Is 300-500 MB too big? It might go as high as 1 GB. If so, that's most likely the culprit.

Sam Ritchie

Mar 12, 2012, 1:39:13 PM
to cascal...@googlegroups.com
Yup, that's it. The memory tap gets serialized into the JobConf.

You can get around this with the lazy-generator in cascalog.ops. This function accepts a sequence, pours it into a SequenceFile, and returns an hfs-seqfile tap. Here's an example for a sequence called "lazy-seq":

(require '[cascalog.io :as io]
         '[cascalog.ops :as c])

(io/with-fs-tmp [_ tmp-path]
  (let [lazy-tap (c/lazy-generator tmp-path lazy-seq)]
    (?<- (stdout)
         [?field1 ?field2 ... etc]
         (lazy-tap ?field1 ?field2)
         ...)))

You're going to run into issues if you try to use a sequence that's bound to a var, like this:

(def my-seq (for [x (range 1000000) y (range 1000000)] [x y]))

(io/with-fs-tmp ,,,using my-seq with lazy-generator,,,)

Since Clojure holds on to the head of any sequence bound to a var, this causes the entire sequence to be realized in memory with no chance of garbage collection. It's better to create the lazy sequence in a let binding, or to pass it in as a function argument, as in the sketch below.
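For instance, here's a minimal sketch of the let-binding approach (run-pairs-query is a made-up name, and the io/c aliases come from the require above):

(defn run-pairs-query []
  (io/with-fs-tmp [_ tmp-path]
    (let [pairs    (for [x (range 1000000)
                         y (range 1000000)]
                     [x y])
          lazy-tap (c/lazy-generator tmp-path pairs)]
      ;; nothing else holds the head of pairs, so it can be GC'd
      ;; as lazy-generator pours it into the SequenceFile
      (?<- (stdout)
           [?x ?y]
           (lazy-tap ?x ?y)))))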

Let me know if that all makes sense!
-- 
Sam Ritchie
Sent with Sparrow

Matt Stump

Mar 13, 2012, 10:52:59 PM
to cascal...@googlegroups.com
Hi Sam, 

Thanks for your response. I modified my code to use lazy-generator as you described, but I'm running into a little problem. For testing purposes I'm outputting to stdout. When I use lazy-generator I get no output, but when I forgo lazy-generator and force evaluation of my sequence by wrapping it in vec, everything works as expected and I do get output on stdout, at least until I hit the "Exceeded max jobconf size" exception. I've created a small gist with both forms of the function; if you have a moment, would you mind taking a look? I'm sure it's something stupid that I'm overlooking. Thank you very much for your assistance.

https://gist.github.com/2033166

Andy Xue

Oct 18, 2012, 5:09:18 PM
to cascal...@googlegroups.com
Hey, I just came across this error as well. I don't have a memory tap or a particularly large path list; what I do have is a lookup map used for a hash-map join, so something like this:

(defn my-query
  [src-query lookup-data-path]
  (let [lookup-map (mk-map lookup-data-path)]
    (<- [!val]
        (src-query :> !key)
        (get lookup-map !key :> !val))))

Is the lookup-map being passed into the mapper function via the JobConf? I wonder if there is a way around this? Thanks

Nathan Marz

Oct 20, 2012, 5:50:59 AM
to cascal...@googlegroups.com
Yes, it is. The only way around it is to use the distributed cache. You'll have to poke around on how to do that; I don't remember the details off the top of my head.
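As a starting point, here's a rough sketch against the raw Hadoop 1.x DistributedCache API (add-lookup! and load-lookup are made-up names, and the two-column tab-separated file format is an assumption; wiring load-lookup into a Cascalog operation's setup phase is left out):

(import '[org.apache.hadoop.filecache DistributedCache]
        '[java.net URI])
(require '[clojure.java.io :as jio]
         '[clojure.string :as string])

;; Job-setup side: register the HDFS file with the cache instead of
;; serializing the map itself into the JobConf.
(defn add-lookup! [job-conf lookup-data-path]
  (DistributedCache/addCacheFile (URI. lookup-data-path) job-conf))

;; Task side: read the node-local copy back into a map,
;; assuming each line is "key<TAB>value".
(defn load-lookup [job-conf]
  (let [local-path (first (DistributedCache/getLocalCacheFiles job-conf))]
    (with-open [rdr (jio/reader (str local-path))]
      (into {} (map #(string/split % #"\t") (line-seq rdr))))))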
--
Twitter: @nathanmarz
http://nathanmarz.com
