Re: Looking into Maven build for spark


Konstantin Boudnik

Jun 8, 2013, 5:21:12 PM
to spar...@googlegroups.com
Wow,

I wasn't even aware of this email until Matt mentioned it to me offline. Looks
like my email filters are screwed up.

At any rate, I think the hadoop-client approach would work, although it is a
pity that client software needs to go to such lengths just to work with the
platform.
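
To illustrate the shim/reflection approach quoted below, here is a minimal,
purely hypothetical sketch (not actual Spark code); it uses the classic example
of TaskAttemptContext, which is a class in Hadoop 1.x but an interface in
Hadoop 2.x, so the concrete type has to be resolved at runtime:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapreduce.{TaskAttemptContext, TaskAttemptID}

    // Hypothetical shim, for illustration only: look up the right concrete
    // class via reflection so one Spark binary works against both Hadoop lines.
    object HadoopShim {
      def newTaskAttemptContext(conf: Configuration,
                                id: TaskAttemptID): TaskAttemptContext = {
        val clazz =
          try {
            // Hadoop 2.x: TaskAttemptContext is an interface, the concrete
            // class lives in the .task subpackage
            Class.forName("org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl")
          } catch {
            case _: ClassNotFoundException =>
              // Hadoop 1.x: TaskAttemptContext itself is the concrete class
              Class.forName("org.apache.hadoop.mapreduce.TaskAttemptContext")
          }
        clazz.getConstructor(classOf[Configuration], classOf[TaskAttemptID])
          .newInstance(conf, id)
          .asInstanceOf[TaskAttemptContext]
      }
    }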

Essentially, the goal of the proposed exercise is to achieve the following:
- eliminate the need to package Hadoop libraries and their transitive
  dependencies. This alone would cut the size of the dist package roughly in
  half. The startup protocol for Spark would need to change a bit to reflect
  the need to add the Hadoop jars and their transitive dependencies to the
  classpath (see the sketch after this list).

  I understand that the Bigtop deployment isn't the only scenario Spark is
  interested in, so once the assembly is done it might have to be massaged a
  little during packaging by Bigtop.

- Scala redistribution. Currently, all the Scala bits are shaded into the
  same fat-jar. For a real system deployment I think it makes more sense to
  have the Spark package simply depend on a Scala package. However,
  considering Scala's somewhat lesser popularity among Linux distros, it
  might make sense - for Bigtop itself - to package and supply the needed
  version of Scala along with the distribution. But that is separate from
  this conversation and would be solved elsewhere.
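
On the first item above, a rough sketch of how a launcher could pick up the
locally installed Hadoop jars instead of bundling them; this is hypothetical
(the object name and its use are assumptions), but it relies only on the
standard `hadoop classpath` command:

    import scala.sys.process._
    import java.io.File

    // Hypothetical helper: `hadoop classpath` prints the path-separated
    // classpath of the locally installed Hadoop distribution, transitive
    // dependencies included, so nothing Hadoop-related has to ship in the jar.
    object HadoopClasspath {
      def resolve(): Seq[String] = {
        val raw = Seq("hadoop", "classpath").!!.trim
        raw.split(File.pathSeparator).filter(_.nonEmpty).toSeq
      }
    }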

It is damn hot today and my brain is melting, so I will try to put something
together in the next few days and publish the pull request for further
consideration and discussion.

Regards,
Cos

On Tuesday, June 4, 2013 11:07:14 PM UTC-7, Matt Massie wrote:
>Cos
>
>Thanks for the email, Cos. Good to hear from you.
>
>Our plan is to clean up and simplify the Spark build for the 0.8 release. We've
>talked with leaders of other projects that integrate with Hadoop (e.g. Hive,
>Parquet) and the consensus was to use the "hadoop-client" artifact with a
>simple shim (e.g. HadoopShims, ContextUtil) that uses reflection at runtime.
>This approach will allow us to release a single artifact for Spark that is
>binary compatible with all versions of Hadoop.
>
>I think, in general, the community will support this change if it simplifies
>deployment and works seamlessly. I believe it will.
>
>If you're interested in helping with this effort, we'd love your help. Is the
>high-level approach of using hadoop-client with a shim in line with your
>thinking on how to avoid jar hell?
>
>
>>On Monday, June 3, 2013 11:07:14 PM UTC-7, Reynold Xin wrote:
>>
>> Moving the discussion to spark-dev, and copying Matt/Jey as they have
>>looked into the binary packaging for Spark on precisely the Hadoop dependency
>>issue.
>>
>> FYI Cos, at Berkeley this morning we discussed some methods to allow a
>>single binary jar for Spark that would work with both Hadoop1 and Hadoop2.
>>Matt, can you comment on this?
>>
>>
>>> On Mon, Jun 3, 2013 at 10:33 PM, Konstantin Boudnik <c...@apache.org> wrote:
>>>
>>> Guys,
>>>
>>> I am working on BIGTOP-715 to include latest Spark into Bigtop's Hadoop stack
>>> with Hadoop 2.0.5-alpha.
>>>
>>> As a temp. hack I am reusing the fat-jar created by the shader. And, as always
>>> with the shader, there's something that looks like a potential problem. By
>>> default, the repl-bin project will pack all Hadoop dependencies into the
>>> same fat-jar. This essentially allows Spark to be deployed independently of
>>> the presence of Hadoop's binaries, making the Spark deb package pretty much
>>> standalone.
>>>
>>> However, it might create potential jar-hell issues: say Spark got compiled
>>> against Hadoop 2.0.3-alpha and then I want to use it against Hadoop
>>> 2.0.5-alpha. Both of these versions are binary compatible with each other.
>>> Hence, I should be able to reuse the Hadoop binaries that are readily
>>> available from my Hadoop cluster, instead of installing a fresh yet slightly
>>> different set of dependencies.
>>>
>>> Now, my understanding is that Spark doesn't really depend on low-level HDFS
>>> or YARN APIs and only uses what's publicly available to a normal client
>>> application. That makes it potentially possible to run Spark against any
>>> Hadoop2 cluster using dynamic classpath configuration, unless the concrete
>>> binaries are included in the package.
>>>
>>> Would the dev. community be willing to accept an improvement to the binary
>>> packaging in the form of a proper assembly, instead of or in parallel with
>>> the shader's fat-jar?
>>>
>>> --
>>> Take care,
>>> Cos

Konstantin Boudnik

Jun 6, 2013, 2:31:25 AM
to spar...@googlegroups.com
Wow, until Matt mentioned this email elsewhere I wasn't even aware that my
filters were eating the spark-dev@ mailings ... Damn. Hopefully, it is fixed
now.

I think it is a very unfortunate situation for downstream that quite complex
and fragile solutions have to be implemented just to keep working with the
underlying platform. But that is certainly not within the frame of this
particular discussion - maybe another time ;)

I think shims/reflection would work for the purpose. The goal of this build
improvement exercise is to achieve a couple of things:
1. avoid packing Hadoop's transitive dependencies: if the installed Hadoop libs
   are being added to the classpath at runtime, then it is quite expected that
   the transitive dependencies of the same libraries get added as well. My
   estimate is that this alone will reduce the size of the final package by
   about 50%, not to mention simplifying the deployment.

2. Scala deployment: the original version of the Bigtop packaging for Spark
   was simply wrapping and redistributing the same Scala version that was
   used for the build. I trust that the current Maven build shades it into the
   fat-jar.

   It appears that Scala doesn't enjoy the same level of popularity among
   Linux platforms that Java or Groovy do, hence some way out is needed.
   Do you think it would make sense to, perhaps, package Scala separately as
   part of the Bigtop stack and simply make it a dependency of Spark's
   package (see the sketch after this list)? I do understand that the Spark
   world isn't limited to the Bigtop realm, and Spark's own packaging
   mechanism would need to solve this differently, perhaps by keeping some
   variation of the current approach.
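
For illustration of 2. above, a hypothetical sketch of how the launcher could
locate an externally packaged Scala (e.g. a Bigtop scala package) instead of a
shaded copy; the SCALA_HOME variable and the lib/ layout here are assumptions
about how such a package could be installed, not an existing Spark convention:

    import java.io.File

    // Hypothetical: a separately installed Scala distribution keeps its jars
    // (scala-library.jar etc.) under $SCALA_HOME/lib, which the launcher could
    // append to the classpath instead of shading Scala into the fat-jar.
    object ScalaRuntimeJars {
      def fromScalaHome(): Seq[File] = {
        val home = sys.env.getOrElse("SCALA_HOME",
          sys.error("SCALA_HOME is not set; is a scala package installed?"))
        val lib = new File(home, "lib")
        Option(lib.listFiles()).getOrElse(Array.empty[File])
          .filter(_.getName.endsWith(".jar"))
          .toSeq
      }
    }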

It seems that 1. above would most likely be solved by a better Maven assembly
in place of the shader construct. Let me try to put together an initial version
of the assembly patch over the weekend and send out a pull request for
developers to see and comment on. Then we can take it from there.

Appreciate the input!
Cos

> On Tuesday, June 4, 2013 11:07:14 PM UTC-7, Matt Massie wrote:
> Cos-
>
> Thanks for the email, Cos. Good to hear from you.
>
> Our plan is to clean up and simplify the Spark build for the 0.8 release. We've
> talked with leaders of other projects that integrate with Hadoop (e.g. Hive,
> Parquet) and the consensus was to use the "hadoop-client" artifact with a
> simple shim (e.g. HadoopShims, ContextUtil) that uses reflection at runtime.
> This approach will allow us to release a single artifact for Spark that is
> binary compatible with all versions of Hadoop.
>
> I think, in general, the community will support this change if it simplifies
> deployment and works seamlessly. I believe it will.
>
> If you're interested in helping with this effort, we'd love your help. Is the
> high-level approach of using hadoop-client with a shim in line with your
> thinking on how to avoid jar hell?
>

Konstantin Boudnik

Jun 9, 2013, 2:52:36 PM
to Artem Smirnov, spar...@googlegroups.com
I'm sorry - you're right, of course. I meant the spark-developers@ group.

Sorry for the spam
Cos

On Sun, Jun 09, 2013 at 11:08AM, Artem Smirnov wrote:
> This message is a little bit off-topic here. Must have posted to the wrong
> group.
>
> On Sunday, June 9, 2013 12:21:12 AM UTC+3, Konstantin Boudnik wrote:
> >