Jars uploading


Mateusz Fedoryszak

Aug 1, 2013, 11:50:08 AM
to scoob...@googlegroups.com
Hi guys,

I've been looking through the jar-uploading code (the LibJars.configureJars method).

It seems that every time DistributedCache.addFileToClassPath is called, entries are appended to mapreduce.job.classpath.files and mapreduce.job.cache.files; Hadoop doesn't check whether they are already there. Since LibJars.configureJars is called several times, this leads to many duplicated entries. When you have many dependencies, you'll easily exceed the classpath length limit and Hadoop task setup will fail.
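
For illustration, here is a minimal standalone sketch (not Scoobi's code) of how the duplication shows up when the same jar is added twice; the jar path is made up and the property name differs between Hadoop versions:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.filecache.DistributedCache
import org.apache.hadoop.fs.Path

object DuplicateClasspathDemo {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Hypothetical jar path, just for the demo.
    val jar = new Path("/tmp/libjars/my-dep.jar")

    // Each call appends to the classpath/cache properties; Hadoop does not
    // check whether the entry is already present.
    DistributedCache.addFileToClassPath(jar, conf)
    DistributedCache.addFileToClassPath(jar, conf)

    // The same jar ends up listed twice. The property name depends on the
    // Hadoop version (mapred.job.classpath.files on 1.x,
    // mapreduce.job.classpath.files on 2.x), so print both.
    println(conf.get("mapred.job.classpath.files"))
    println(conf.get("mapreduce.job.classpath.files"))
  }
}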

It looks as if someone were trying to get round that by removing duplicates from mapred.classpath. But it doesn't help: in fact, this property doesn't seem to be used anywhere in Hadoop.

I'd add a check that the file is not already present in the distributed classpath before adding it, and remove the references to mapred.classpath. But I'm not sure mapred.classpath is really unused. Does anyone remember why the jar-uploading code uses it?

Cheers,
Mateusz


Eric Torreborre

Aug 1, 2013, 11:40:23 PM
to scoob...@googlegroups.com
Hi Mateusz,

That's right, this setting is not used.

I just committed a modification where I am:

 - removing the setting
 - checking the mapred.job.classpath.files setting before adding a file to the classpath (a sketch of the idea follows below)
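
A minimal sketch of what such a guard could look like, assuming the entries live in mapred.job.classpath.files as a comma-separated list (the helper name and the separator handling are illustrative, not the actual Scoobi implementation):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.filecache.DistributedCache
import org.apache.hadoop.fs.Path

object LibJarsGuard {
  // Illustrative helper, not Scoobi's actual code: add the jar to the
  // distributed classpath only if it is not already listed.
  // Assumes a comma-separated property; some Hadoop versions use the
  // platform path separator instead, so adjust the split accordingly.
  def addIfMissing(jar: Path, conf: Configuration): Unit = {
    val existing = Option(conf.get("mapred.job.classpath.files"))
      .map(_.split(",").map(_.trim).toSet)
      .getOrElse(Set.empty[String])

    if (!existing.contains(jar.toString))
      DistributedCache.addFileToClassPath(jar, conf)
  }
}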

And my tests are passing ok on the cluster.

Please grab the latest 0.8.0-SNAPSHOT to try this new code.

Thanks,

Eric.

Mateusz Fedoryszak

Aug 8, 2013, 5:37:29 AM
to scoob...@googlegroups.com
Great! Thanks.