
Konstantin Boudnik

Jun 8, 2013, 5:19:47 PM
to spar...@googlegroups.com
Wow,

I wasn't even aware of this email until Matt mentioned it to me offline. Looks
like my email filters are screwed up.

At any rate, I think the hadoop-client approach would work, although it is a pity
that client software needs to go to such lengths just to work with the platform.
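
For reference, the kind of shim Matt mentions below usually boils down to a bit
of reflection over the MapReduce API. A minimal sketch in Scala (the object and
method names here are hypothetical, not Spark's actual code):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapreduce.{TaskAttemptContext, TaskAttemptID}

    // Hadoop 1.x ships TaskAttemptContext as a concrete class; Hadoop 2.x turns
    // it into an interface implemented by TaskAttemptContextImpl. Resolving the
    // class at runtime lets one binary work against either line.
    object HadoopShimSketch {
      private val ctxClass: Class[_] =
        try Class.forName("org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl") // Hadoop 2.x
        catch {
          case _: ClassNotFoundException =>
            Class.forName("org.apache.hadoop.mapreduce.TaskAttemptContext")          // Hadoop 1.x
        }

      def newTaskAttemptContext(conf: Configuration, id: TaskAttemptID): TaskAttemptContext = {
        val ctor = ctxClass.getConstructor(classOf[Configuration], classOf[TaskAttemptID])
        ctor.newInstance(conf, id).asInstanceOf[TaskAttemptContext]
      }
    }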

Essentially, the goal of the proposed exercise is to achieve the following:
- eliminate the need to package Hadoop libraries and their transitive
  dependencies. This alone would cut the size of the dist package roughly in
  half. The startup protocol for Spark would need to change a bit to reflect
  the need to add the Hadoop jar and transitive deps to the classpath (see
  the build sketch after this list).

  I understand that Bigtop deployment isn't the only scenario that Spark is
  interested in, so once the assembly is done it might have to be massaged a
  little bit during packaging by Bigtop.

- Scala redistribution. Currently, all the Scala bits are shaded into the
  same fat-jar. I think for a real system deployment it makes sense to
  simply have the Spark package depend on a Scala package. However,
  considering the somewhat lesser popularity of Scala among Linux distros,
  it might make sense - for Bigtop itself - to package and supply the needed
  version of Scala along with the distribution. But this is a different
  conversation and would be solved elsewhere.

It is damn hot today and my brain is melting, so I will try to put something
together in the next few days and publish a pull request for further
consideration and discussion.

Regards,
Cos

On Tuesday, June 4, 2013 11:07:14 PM UTC-7, Matt Massie wrote:
>Cos
>
>Thanks for the email, Cos. Good to hear from you.
>
>Our plan is to clean up and simplify the Spark build for the 0.8 release. We've
>talked with leaders of other projects that integrate with Hadoop (e.g. Hive,
>Parquet) and the consensus was to use the "hadoop-client" artifact with a
>simple shim (e.g. HadoopShims, ContextUtil) that uses reflection at runtime.
>This approach will allow us to release a single artifact for Spark that is
>binary compatible with all versions of Hadoop.
>
>I think, in general, the community will support this change if it simplifies
>deployment and works seamlessly. I believe it will.
>
>If you're interested in helping with this effort, we'd love your help. Is the
>high-level approach of using hadoop-client with a shim in line with your
>thinking on how to avoid jar hell?
>
>
>>On Monday, June 3, 2013 11:07:14 PM UTC-7, Reynold Xin wrote:
>>
>> Moving the discussion to spark-dev, and copying Matt/Jey as they have
>>looked into the binary packaging for Spark on precisely the Hadoop dependency
>>issue.
>>
>> FYI Cos, at Berkeley this morning we discussed some methods to allow a
>>single binary jar for Spark that would work with both Hadoop1 and Hadoop2.
>>Matt, can you comment on this?
>>
>>
>>> On Mon, Jun 3, 2013 at 10:33 PM, Konstantin Boudnik <c...@apache.org> wrote:
>>>
>>> Guys,
>>>
>>> I am working on BIGTOP-715 to include latest Spark into Bigtop's Hadoop stack
>>> with Hadoop 2.0.5-alpha.
>>>
>>> As a temporary hack I am reusing the fat-jar created by the shader. And, as
>>> always with the shader, there's something that looks like a potential problem.
>>> By default, the repl-bin project will pack all Hadoop dependencies into the
>>> same fat-jar. This essentially allows Spark to be deployed independent of the
>>> presence of Hadoop's binaries, making the Spark deb package pretty much
>>> standalone.
>>>
>>> However, it might create potential jar-hell issues: say Spark got compiled
>>> against Hadoop 2.0.3-alpha and then I want to use it against Hadoop
>>> 2.0.5-alpha. Both of these versions are binary compatible with each other.
>>> Hence, I should be able to reuse the Hadoop binaries that are readily
>>> available from my Hadoop cluster, instead of installing a fresh yet slightly
>>> different set of dependencies.
>>>
>>> Now, my understanding is that Spark doesn't really depend on low-level HDFS
>>> or YARN APIs and only uses what's publicly available to a normal client
>>> application. That makes it potentially possible to run Spark against any
>>> Hadoop2 cluster and use dynamic classpath configuration, unless the concrete
>>> binaries are included in the package.
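
To make the dynamic-classpath idea concrete, a launcher could in principle ask
the node's own Hadoop installation for its classpath; a hypothetical sketch,
not existing Spark code:

    import scala.sys.process._

    // Reuse the Hadoop jars already installed on the node instead of bundling
    // them: `hadoop classpath` prints the installation's classpath, which the
    // Spark launcher could append to its own.
    object HadoopClasspath {
      def fromCluster(): Seq[String] =
        Seq("hadoop", "classpath").!!.trim
          .split(java.io.File.pathSeparator)
          .filter(_.nonEmpty)
          .toSeq
    }
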
>>>
>>> Would the dev community be willing to accept an improvement to the binary
>>> packaging in the form of a proper assembly, instead of or in parallel with
>>> the shaded fat-jar?
>>>
>>> --
>>> Take care,
>>> Cos