Re: Looking into Maven build for spark

Konstantin Boudnik

Jun 9, 2013, 2:55:40 PM
to spark-de...@googlegroups.com
Wow,

I wasn't even aware of this email until Matt mentioned it to me offline. Looks
like my email filters are screwed up.

At any rate, I think the hadoop-client approach would work, although it is a pity
that client software needs to go to such lengths just to work with the platform.

Essentially, the goal of the proposed exercise is to achieve the following:
- eliminate the need to package Hadoop libraries and their transitive
dependencies. This alone would cut the size of the dist package roughly in
half. The startup protocol for Spark would need to change a bit to add the
Hadoop jars and their transitive dependencies to the classpath (see the
sketch right after this list).

I understand that Bigtop deployment isn't the only scenario Spark is
interested in, so once the assembly is done it might have to be massaged a
little by Bigtop during packaging.

- Scala redistribution. Currently, all the Scala bits are shaded into the
same fat-jar. I think for a real system deployment it makes sense to
simply have the Spark package depend on a Scala package. However,
considering the somewhat lesser popularity of Scala among Linux distros, it
might make sense - for Bigtop itself - to package and supply the needed
version of Scala along with the distribution. But that is a different
conversation and would be solved elsewhere.

It is damn hot today, and my brain is melting, so I will try to put something
together in the next few days and will publish the pull request for further
consideration and discussion.

Regards,
Cos

On Tuesday, June 4, 2013 11:07:14 PM UTC-7, Matt Massie wrote:
>Cos
>
>Thanks for the email, Cos. Good to hear from you.
>
>Our plan is to clean up and simplify the Spark build for the 0.8 release. We've
>talked with leaders of other projects that integrate with Hadoop (e.g. Hive,
>Parquet) and the consensus was to use the "hadoop-client" artifact with a
>simple shim (e.g. HadoopShims, ContextUtil) that uses reflection at runtime.
>This approach will allow us to release a single artifact for Spark that is
>binary compatible with all versions of Hadoop.
>
>I think, in general, the community will support this change if it simplifies
>deployment and works seamlessly. I believe it will.
>
>If you're interested in helping with this effort, we'd love your help. Is the
>high-level approach of using hadoop-client with a shim in line with your
>thinking on how to avoid jar hell?
>
>
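A concrete illustration of the shim idea above (a hypothetical sketch, not
Spark's or Parquet's actual code): in Hadoop 1, JobContext is a concrete class,
while in Hadoop 2 it is an interface, so a directly compiled call to
getConfiguration() is not binary compatible with both; looking the method up
reflectively at runtime works against either version.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapreduce.JobContext

    // Hypothetical shim: look the method up once via reflection, so the
    // compiled call site never hard-links against either Hadoop version.
    object HadoopShim {
      private val getConf = classOf[JobContext].getMethod("getConfiguration")

      def configuration(ctx: JobContext): Configuration =
        getConf.invoke(ctx).asInstanceOf[Configuration]
    }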
>>On Monday, June 3, 2013 11:07:14 PM UTC-7, Reynold Xin wrote:
>>
>> Moving the discussion to spark-dev, and copying Matt/Jey as they have
>>looked into the binary packaging for Spark, precisely on this Hadoop dependency
>>issue.
>>
>> FYI Cos, at Berkeley this morning we discussed some methods to allow a
>>single binary jar for Spark that would work with both Hadoop1 and Hadoop2.
>>Matt, can you comment on this?
>>
>>
>>> On Mon, Jun 3, 2013 at 10:33 PM, Konstantin Boudnik <c...@apache.org> wrote:
>>>
>>> Guys,
>>>
>>> I am working on BIGTOP-715 to include latest Spark into Bigtop's Hadoop stack
>>> with Hadoop 2.0.5-alpha.
>>>
>>> As a temporary hack I am reusing the fat-jar created by the shader. And, as always
>>> with the shader, there's something that looks like a potential problem. By
>>> default, the repl-bin project packs all Hadoop dependencies into the
>>> same fat-jar. This essentially allows deploying Spark independent of the
>>> presence of Hadoop's binaries, making the Spark deb package pretty much
>>> standalone.
>>>
>>> However, it might create potential jar-hell issues: say Spark got compiled
>>> against Hadoop 2.0.3-alpha and then I want to use it against Hadoop
>>> 2.0.5-alpha. Both of these versions are binary compatible with each other.
>>> Hence, I should be able to re-use the Hadoop binaries that are readily available
>>> from my Hadoop cluster, instead of installing a fresh yet slightly different
>>> set of dependencies.
>>>
>>> Now, my understanding is that Spark doesn't really depend on low-level HDFS
>>> or YARN APIs and only uses what's publicly available to a normal client
>>> application. That makes it potentially possible to run Spark against any
>>> Hadoop2 cluster using dynamic classpath configuration, as long as no concrete
>>> binaries are baked into the package.
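To illustrate the "normal client application" point, a hypothetical sketch
(path assumed): this kind of access goes entirely through Hadoop's stable
public client API, so it needs only whatever Hadoop jars the cluster already
provides on the classpath.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Hypothetical client-side access: public, stable Hadoop APIs only,
    // resolved from the Hadoop jars the cluster puts on the classpath.
    object ClientSketch extends App {
      val conf = new Configuration()          // picks up core-site.xml etc.
      val fs   = FileSystem.get(conf)
      println(fs.exists(new Path("/user/spark/input")))
    }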
>>>
>>> Would the dev community be willing to accept an improvement to the binary
>>> packaging in the form of a proper assembly, instead of or in parallel with
>>> the shader's fat-jar?
>>>
>>> --
>>> Take care,
>>> Cos

Konstantin Boudnik

Jun 11, 2013, 1:08:48 AM
to spark-de...@googlegroups.com
[moving to spark-developers, bcc'ing spark-users - the source of my earlier
troubles is found: there's an irrelevant spark-dev@ group ;( ]

Matei,

That would be an ideal way to deal with the situation. If not, we can hack the
assembly in Bigtop during the build (pretty suboptimal though); so I like
the idea of having two different assemblies.

Cos

On Sun, Jun 09, 2013 at 04:03PM, Matei Zaharia wrote:
> No worries, Cos. To comment on your proposal: we can also add separate
> assembly targets for "normal" Spark users and BigTop. I believe the SBT
> assembly tool allows that. For the one for "normal" users, we'll probably
> include both Scala and a default version of hadoop-client, and they'd bring
> in their own version of Hadoop only if they link to it specifically in their
> own project.
>
> Matei
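A hypothetical sbt sketch of what the two targets could look like (project
names, versions, and layout assumed - not the actual Spark build). The
"normal" flavor bundles a default hadoop-client into the fat jar; the Bigtop
flavor marks it "provided" so the platform packages supply it. Handling of the
bundled Scala runtime is left out of the sketch.

    // Two assembly subprojects, one per deployment flavor (hypothetical).
    lazy val assemblyStandalone = (project in file("assembly"))
      .settings(
        // fat jar for "normal" users: bundle a default hadoop-client
        libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "1.0.4"
      )

    lazy val assemblyBigtop = (project in file("assembly-bigtop"))
      .settings(
        // Bigtop flavor: the platform's Hadoop packages provide the jars
        libraryDependencies +=
          "org.apache.hadoop" % "hadoop-client" % "2.0.5-alpha" % "provided"
      )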

Konstantin Boudnik

Jul 4, 2013, 3:23:03 PM
to spark-de...@googlegroups.com
Guys,

I have a working version of the assembly and will publish the pull-request
shortly.

Non-standard classifiers are a lot of pain as usual, though. Here's a question:
the 'hadoop2' profile isn't really a hadoop2 profile - in reality it is a cdh4
profile, and in my opinion it should be named as such to avoid confusion.
Any objections to this?

Also, is there any particular reason to use Avro 1.7.1.cloudera.2 instead of
the standard
http://mvnrepository.com/artifact/org.apache.avro/avro/1.7.1
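If the stock artifact works, a one-line sketch of the fix, assuming an
sbt-style build (hypothetical, not the actual build file):

    // Force the stock ASF Avro instead of the vendor build.
    dependencyOverrides += "org.apache.avro" % "avro" % "1.7.1"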

It is kinda 'frowned upon' to mix ASF and non-ASF-provided artifacts in a
release, and I think we might get called on that once the incubation process
is well underway (please correct me if I am wrong on this, but there were
discussions of the sort in the Hadoop project).

Looking forward to your input,
Cos

P.S.: I can fix it in a separate pull-request as well, while I am at it.

Konstantin Boudnik

Jul 7, 2013, 5:54:32 PM
to spark-de...@googlegroups.com
So, any objections to renaming the profiles per the plan below? Just to
reiterate:
hadoop2 -> cdh4
hadoop2-yarn -> hadoop2
which will make the whole naming more consistent.

I would also suggest changing the way we name the artifacts, because using
classifiers to distinguish the variants is potentially problematic. Maven is
agnostic about how versioning is done, so
spark-project:spark-core:jar:0.8-hadoop2-SNAPSHOT
would reflect that this is a version of 0.8-SNAPSHOT with a different set
of dependencies from hadoop2. Whereas
spark-project:spark-core:jar:hadoop2:0.8-SNAPSHOT
implies that the artifact is built FOR hadoop2 (e.g. with a different set of gcc
options or something, and different content of the project binaries
themselves).
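In dependency-declaration terms, a hypothetical sbt sketch of the two styles
(coordinates copied from above):

    // Variant encoded in the version string: reads as "0.8-SNAPSHOT with
    // hadoop2's set of dependencies".
    libraryDependencies +=
      "spark-project" % "spark-core" % "0.8-hadoop2-SNAPSHOT"

    // Variant encoded as a classifier: suggests the same 0.8-SNAPSHOT
    // built FOR hadoop2.
    libraryDependencies +=
      "spark-project" % "spark-core" % "0.8-SNAPSHOT" classifier "hadoop2"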

The version string can also be used as-is by non-Maven build systems (if
needed), and it is correctly sorted and processed by Maven repositories, etc. I
can go ahead and prepare a patch for it rather quickly if there's no objection
from the community.

Cos

Mark Hamstra

Jul 7, 2013, 11:42:45 PM
to spark-de...@googlegroups.com, c...@apache.org
Another wrinkle is that the artifacts will soon need to be renamed anyway as part of the Apache move. Maybe it is best to combine this effort with that move/renaming.

Konstantin Boudnik

Jul 8, 2013, 12:26:18 AM
to Mark Hamstra, spark-de...@googlegroups.com
Hey Mark.

It certainly makes sense to combine the work. Also, the renaming of the
profiles might make sense to perform at the same time?

I will send out the pull request for all this in the morning.
Thanks,
Cos

Konstantin Boudnik

Jul 9, 2013, 11:01:15 PM
to spark-de...@googlegroups.com
Any chance someone can review
https://github.com/mesos/spark/pull/675 ?

I appreciate how busy you all are, though.
Cos


Konstantin Boudnik

Aug 3, 2013, 2:57:50 AM
to d...@spark.incubator.apache.org
[Bcc: spark-de...@googlegroups.com]

Guys, just wanted to close the loop on this.

I have committed the packaging support for Spark (BIGTOP-715) into Bigtop
master. The packaging is built on top of the Maven assembly and provides
standard Linux services to control master and worker daemons.

Shark is next in the pipeline ;)
Cos