Desired features and fixes?

Ian Eslick

Feb 6, 2011, 10:06:26 AM
to Clojure Hadoop
I'd like to start a message thread regarding the features people would
like to see from clojure-hadoop. Let's split this into immediate
improvements / fixes that might go into a short term release and
longer term features that would extend its capabilities. Here are my
two cents:

Immediate:

1) Make it easier to talk to other platforms, such as HBase and
third-party Hadoop adapters, through the job abstraction.
e.g. my branch lets you supply a configuration function that is
passed the newly created Job object, so you can run arbitrary code
before the Job is submitted. I use it to configure adapters, HBase
scans, the distributed cache, etc.

2) Some support code for working with the distributed cache
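
A minimal sketch of the configuration-function idea in item 1, not the code from the branch mentioned above. It assumes a plain Clojure map in place of the real Hadoop Job object, and submit-job, use-hbase-scan, cache-file and every key shown are hypothetical names chosen for illustration:

```clojure
;; Sketch only: a plain map stands in for the Hadoop Job object.
;; submit-job, use-hbase-scan and cache-file are hypothetical names.

(defn use-hbase-scan
  "Returns a configuration function that selects an HBase table as input."
  [table]
  (fn [job] (assoc job :input-table table)))

(defn cache-file
  "Returns a configuration function that adds a file to the job's cache list."
  [path]
  (fn [job] (update-in job [:cache-files] (fnil conj []) path)))

(defn submit-job
  "Creates the job, hands it to the user's configuration function, and
  returns the configured job (a real version would submit it here)."
  [{:keys [job-name configure] :or {configure identity}}]
  (configure {:name job-name}))

(def configured-job
  (submit-job {:job-name  "scan-users"
               :configure (comp (cache-file "/tmp/lookup.dat")
                                (use-hbase-scan "users"))}))
```

A real Job object is mutable, so an actual configuration function would most likely be called for its side effects rather than for its return value.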

Longer term:

1) Interactive testing
Internally, Compass Labs has some tools that might benefit
clojure-hadoop. We can test map and reduce operators interactively
from the REPL, and we would like to be able to submit jobs from the
REPL as well.

2) Higher level abstractions
We're interested in removing some of the visible complexity of the
job macro and exposing a modestly higher level language along the
lines of Colossal Pipe, where you can define job steps and flows and
sources / sinks and dependencies in a single job so more complex flows
can be cleanly combined into a single submission to a cluster.
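
To illustrate the interactive testing mentioned under the first long-term item, here is a small sketch that runs a mapper and a reducer over in-memory data at the REPL. It is not the Compass Labs tooling. It assumes mappers and reducers are ordinary functions that take a key and a value (or a key and a seq of values) and return a seq of [key value] pairs; run-local, wc-map and wc-reduce are hypothetical names:

```clojure
(require '[clojure.string :as str])

(defn wc-map
  "Mapper: emits a [word 1] pair for every word in the line."
  [_key line]
  (for [w (str/split (str/lower-case line) #"\s+")
        :when (seq w)]
    [w 1]))

(defn wc-reduce
  "Reducer: sums the counts collected for one word."
  [word counts]
  [[word (reduce + counts)]])

(defn run-local
  "Runs mapper and reducer over an in-memory seq of [key value] pairs,
  grouping the mapper output by key as a stand-in for the shuffle phase."
  [mapper reducer input]
  (->> input
       (mapcat (fn [[k v]] (mapper k v)))
       (group-by first)
       (sort-by key)
       (mapcat (fn [[k pairs]] (reducer k (map second pairs))))))

(run-local wc-map wc-reduce [[0 "the quick fox"] [1 "The lazy dog"]])
;; => (["dog" 1] ["fox" 1] ["lazy" 1] ["quick" 1] ["the" 2])
```

Because nothing here touches a cluster, the same functions can be exercised from the REPL before being wired into a real job.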

Ian

Alex Ott

Feb 6, 2011, 10:12:52 AM
to clojure...@googlegroups.com, ro...@burningswell.com
Hello

Ian Eslick at "Sun, 6 Feb 2011 07:06:26 -0800 (PST)" wrote:
IE> I'd like to start a message thread regarding the features people would
IE> like to see from clojure-hadoop. Let's split this into immediate
IE> improvements / fixes that might go into a short term release and
IE> longer term features that would extend its capabilities. Here are my
IE> two cents:

IE> Immediate:

IE> 1) Make it easier to talk to other platforms, such as HBase and
IE> third-party Hadoop adapters, through the job abstraction.
IE> e.g. my branch lets you supply a configuration function that is
IE> passed the newly created Job object, so you can run arbitrary code
IE> before the Job is submitted. I use it to configure adapters, HBase
IE> scans, the distributed cache, etc.

IE> 2) Some support code for working with the distributed cache

IE> Longer term:

IE> 1) Interactive testing
IE> Internally, Compass Labs has some tools that might benefit
IE> clojure-hadoop. We can test map and reduce operators interactively
IE> from the REPL, and we would like to be able to submit jobs from the
IE> REPL as well.

As far as I remember, Roman has already implemented some code that allows
running jobs from the REPL, but I haven't used it.

I also want to investigate whether it is possible to implement
clojure-hadoop without gen-class, etc.

IE> 2) Higher level abstractions
IE> We're interested in removing some of the visible complexity of the
IE> job macro and exposing a modestly higher level language along the
IE> lines of Colossal Pipe, where you can define job steps and flows and
IE> sources / sinks and dependencies in a single job so more complex flows
IE> can be cleanly combined into a single submission to a cluster.

Very interesting idea. If I remember correctly, Yahoo or Cloudera had a
project that allows describing workflows, etc.

--
With best wishes, Alex Ott, MBA
http://alexott.blogspot.com/ http://alexott.net/
http://alexott-ru.blogspot.com/
Skype: alex.ott

Ulises

Feb 6, 2011, 10:31:16 AM
to clojure...@googlegroups.com
> Very interesting idea. If I remember correctly, Yahoo or Cloudera had a
> project that allows describing workflows, etc.

On this note, what is the goal of clojure-hadoop? Is it to interact
and make plain vanilla hadoop stuff accessible from clojure or
something else? My question stems from the fact that the description
of the workflow framework, with sinks, etc. sounds like implementing a
cascading/pig/etc. clone, so I'm wondering what the ultimate goal for
clojure-hadoop is.

On a perhaps tangential note, what are the chances that clojure-hadoop
is released under a different license? The reason behind my question
is that if you all agree that the Eclipse license is the way to go
then I won't be able to contribute anything to the project :(

U

Alex Ott

Feb 6, 2011, 10:39:49 AM
to clojure...@googlegroups.com
Hello

On Sun, Feb 6, 2011 at 4:31 PM, Ulises <ulises....@gmail.com> wrote:
>> Very interesting idea. If I remember correctly, Yahoo or Cloudera had a
>> project that allows describing workflows, etc.
>
> On this note, what is the goal of clojure-hadoop? Is it to interact
> and make plain vanilla hadoop stuff accessible from clojure or
> something else? My question stems from the fact that the description
> of the workflow framework, with sinks, etc. sounds like implementing a
> cascading/pig/etc. clone, so I'm wondering what the ultimate goal for
> clojure-hadoop is.

I think the main idea is to be able to use Hadoop from Clojure, not
to reinvent existing frameworks. That said, we could provide DSLs
built on top of existing libraries.

> On a perhaps tangential note, what are the chances that clojure-hadoop
> is released under a different license? The reason behind my question
> is that if you all agree that the Eclipse license is the way to go
> then I won't be able to contribute anything to the project :(

Which license would work for you? Personally, I can use almost any
license that allows the library to be used in a commercial
environment. But if we try to relicense it, we'll need to get
permission from the original author.

--
With best wishes, Alex Ott, MBA

http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
Skype: alex.ott

Ulises

Feb 6, 2011, 11:05:36 AM
to clojure...@googlegroups.com
> I think the main idea is to be able to use Hadoop from Clojure, not
> to reinvent existing frameworks. That said, we could provide DSLs
> built on top of existing libraries.
>

Oh, indeed, there's nothing stopping one from providing a DSL on top of
clojure-hadoop to ease some tasks. I was just wondering about people's
thoughts regarding the future of clojure-hadoop.

> Which license would work for you? Personally, I can use almost any
> license that allows the library to be used in a commercial
> environment. But if we try to relicense it, we'll need to get
> permission from the original author.

I will compile a list, however I do know right now that the Eclipse
license is no good. Apache would work but I will provide a longer
list. Please keep in mind that even if we decided to release
clojure-hadoop under a different license, I might still not be able to
contribute. However, if the library is to be released under the
current license, then I'm *guaranteed* not to be able to contribute.

U

Ian Eslick

Feb 6, 2011, 11:29:09 AM
to clojure...@googlegroups.com

On Feb 6, 2011, at 8:05 AM, Ulises wrote:

>> I think the main idea is to be able to use Hadoop from Clojure, not
>> to reinvent existing frameworks. That said, we could provide DSLs
>> built on top of existing libraries.
>>
>
> Oh, indeed, there's nothing stopping one from providing a DSL on top of
> clojure-hadoop to ease some tasks. I was just wondering about people's
> thoughts regarding the future of clojure-hadoop.

I'm only suggesting a DSL to make it easy to compose the pieces that usually go into a Job object so it's easy to remember and compose details for different inputs. I'm using HBase and Vertica adapters which require that some code be run at job creation time and that requires that jobs use particular input formats, etc. I have some internal tools which allow me to name common configurations and compose them into a single job.
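
As a rough illustration of naming common configurations and composing them into a single job (not the internal tools mentioned above), the sketch below keeps named fragments in a map and applies them in order. A plain map stands in for the Job object, and named-configs, compose-job and all the keys are hypothetical:

```clojure
;; Sketch only: each named fragment is a function from job to job.
(def named-configs
  {:hbase-input   (fn [job] (assoc job :input-format :hbase-table))
   :lookup-cache  (fn [job] (assoc job :cache-files ["/tmp/lookup.dat"]))
   :few-reducers  (fn [job] (assoc job :reduce-tasks 4))})

(defn compose-job
  "Applies the named configurations to base-job, left to right.
  Throws if a name is not in the registry."
  [base-job & config-names]
  (reduce (fn [job config-name]
            (if-let [f (get named-configs config-name)]
              (f job)
              (throw (ex-info "Unknown configuration" {:name config-name}))))
          base-job
          config-names))

(compose-job {:name "export-users"} :hbase-input :few-reducers)
;; => {:name "export-users", :input-format :hbase-table, :reduce-tasks 4}
```

Keeping the fragments as ordinary functions means a new adapter only needs one more entry in the registry.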

There are workflow managers like Oozie and Azkaban (we're using Azkaban), but sometimes we have algorithms that have 3-4 passes of M-R and you'd like to treat them as single units at the workflow manager rather than have to track dependencies at that higher level. It's a modest layer of functionality on top of the job specification. I would imagine this would be another optional abstraction layer available to users, nothing that fundamentally changes the nature of clojure-hadoop.
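
One way to picture treating several M-R passes as a single unit is a flow that runs its steps in order, feeding each step's output location to the next. This is only a sketch under that assumption. The step and run-flow names are hypothetical, and each step here just builds a path string instead of running a job:

```clojure
(defn step
  "Hypothetical step: a name plus a function from input path to output path.
  A real step would run a MapReduce job and return where it wrote its output."
  [step-name]
  {:name step-name
   :run  (fn [input-path] (str input-path "/" step-name))})

(defn run-flow
  "Runs the steps in order, passing each step's output to the next one,
  and returns the final output path."
  [steps input-path]
  (reduce (fn [path {:keys [run]}] (run path))
          input-path
          steps))

(run-flow [(step "tokenize") (step "count") (step "rank")] "/data/raw")
;; => "/data/raw/tokenize/count/rank"
```

From the workflow manager's point of view the whole flow is then one submission, with the pass-to-pass dependencies kept inside it.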

>> Which license would work for you? Personally, I can use almost any
>> license that allows the library to be used in a commercial
>> environment. But if we try to relicense it, we'll need to get
>> permission from the original author.
>
> I will compile a list, however I do know right now that the Eclipse
> license is no good. Apache would work but I will provide a longer
> list. Please keep in mind that even if we decided to release
> clojure-hadoop under a different license, I might still not be able to
> contribute. However, if the library is to be released under the
> current license, then I'm *guaranteed* not to be able to contribute.
>
> U

Any reason not to go with the Apache 2.0 license? Then clojure-hadoop is compatible with the rest of the Cloudera/Hadoop infrastructure.

Alex Ott

Feb 7, 2011, 3:59:20 AM
to clojure...@googlegroups.com
Hi

On Sun, Feb 6, 2011 at 5:29 PM, Ian Eslick <i...@compasslabs.com> wrote:
>
>>> Which license would work for you? Personally, I can use almost any
>>> license that allows the library to be used in a commercial
>>> environment. But if we try to relicense it, we'll need to get
>>> permission from the original author.
>>
>> I will compile a list, however I do know right now that the Eclipse
>> license is no good. Apache would work but I will provide a longer
>> list. Please keep in mind that even if we decided to release
>> clojure-hadoop under a different license, I might still not be able to
>> contribute. However, if the library is to be released under the
>> current license, then I'm *guaranteed* not to be able to contribute.
>>

> Any reason not to go with the Apache 2.0 license? Then clojure-hadoop is compatible with the rest of the Cloudera/Hadoop infrastructure.

I have no arguments against the Apache license; we only need to get
permission from all authors/contributors.

Ulises

Feb 7, 2011, 4:39:33 AM
to clojure...@googlegroups.com
> I have no arguments against the Apache license; we only need to get
> permission from all authors/contributors.

If that were the case, I think it should be OK, as we are already
contributing to several Apache projects.

PS: thanks for your efforts in accommodating a single person's requirements :)

U
