Roadmap/Feedback


Chris K Wensel

May 5, 2009, 7:57:21 PM
to cascading-user
Hey all

Just wanted to thank everyone for all the feedback and comments on the
list for the last few months.

I'm beginning planning for a minor release that will add some
incremental features while attempting to keep backwards compatibility.

My first question to everyone is how important is backward
compatibility in a 1.x release? That is, API compatibility and
semantic compatibility (changing how a function behaves slightly, for
example).

Clearly maintenance releases (1.x.y) should remain compatible in both
areas (unless the semantic changes are actually bug fixes, like what
happened to our date operations).

Further, what changes or features would you like to see in the next
minor or possibly next major release?

For example, we are thinking of adding Fields.REPLACE. This would
allow field replacement to be inlined directly by a Pipe, vs the
traditional field 'drifting' through renaming.

Please feel free to reply to the list, or directly to me.

cheers,
chris

--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/

mlimotte

May 6, 2009, 11:28:53 AM
to cascading-user
Hey Chris. Thanks for all your work on this platform and your
dedication to supporting users on this list.

Fields.REPLACE would definitely be a welcome addition.

As far as backward compatibility, in general, I think it's important
to maintain it. However, being that you're still in the early 1.x
releases, you can probably get away with a few deviations at this
point. In my case, our code base is not so large that making the
changes would be overwhelming. I'm sure that will change as the
platform matures and our use of it extends. One thing that I didn't
find (hopefully I didn't just miss it), while upgrading from 0.8.2 to
1.0.10 was an upgrade guide. It would be helpful if there were some
migration notes that spell out the differences in the interface one by
one. I had to discover that through trial and error.

Additional feature requests:

* How about a writeDOT() method on Cascade? With only a writeDOT() on
flows, I don't have an automatic way to visualize a graph of flows
that make up a cascade. Each flow would be connected to other flows
by virtue of shared taps for its sources and sinks.

* FlowConnector constructor accepts properties which, as far as I can
tell, overwrite the properties in JobConf. In some cases a merge
might be better. The particular case I'm thinking of is
"mapred.child.java.opts". I have a value set in hadoop-site.xml, but
I would like to add something to it for a particular flow. If I pass
the property in the FlowConnector constructor, it overrides the values
set in hadoop-site.xml.

* DateFormatter - how about an option to specify that the incoming
timestamp is expressed in seconds, rather than milliseconds? I'm
processing a log file that only has second granularity, so I have to
include an extra step to use an ExpressionFunction to multiply the
value by 1000 first.

* Not sure if this is a limitation of Hadoop, but some way to keep
field name information in the sequence file. This way when we use a
tap derived from a sequence file we can immediately start referring to
fields by name.
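
For what it's worth, the merge and seconds-based requests above can be
sketched in plain Java, independent of Cascading and Hadoop. This is only
an illustration under stated assumptions: the class FeedbackSketch and its
methods are hypothetical names, not part of either API, and "merge" here
simply means appending the extra options to whatever value is already set
for mapred.child.java.opts.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.Properties;
import java.util.TimeZone;

// Hypothetical helper; none of these names come from Cascading or Hadoop.
class FeedbackSketch {

    static final String CHILD_OPTS = "mapred.child.java.opts";

    // Appends extra JVM options to an existing value instead of replacing it.
    static String mergeOpts(String existing, String extra) {
        String left = existing == null ? "" : existing.trim();
        String right = extra == null ? "" : extra.trim();
        if (left.isEmpty()) return right;
        if (right.isEmpty()) return left;
        return left + " " + right;
    }

    // Returns a copy of the site-level properties in which the child JVM
    // options carry both the site value and the per-flow additions.
    static Properties withMergedChildOpts(Properties site, String extraOpts) {
        Properties merged = new Properties();
        merged.putAll(site);
        merged.setProperty(CHILD_OPTS, mergeOpts(site.getProperty(CHILD_OPTS), extraOpts));
        return merged;
    }

    // Formats a timestamp given in seconds by scaling it to milliseconds,
    // which is the unit java.util.Date expects. UTC is assumed here.
    static String formatEpochSeconds(long epochSeconds, String pattern) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format.format(new Date(epochSeconds * 1000L));
    }
}
```

With "-Xmx512m" already set site-wide, withMergedChildOpts(site, "-verbose:gc")
leaves the key holding "-Xmx512m -verbose:gc", and the resulting Properties
could then be passed along in place of a plain override.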

Marc
Feeva Technology, Inc.

Esé

May 6, 2009, 3:21:32 PM
to cascading-user


On May 6, 8:28 am, mlimotte <mslimo...@gmail.com> wrote:
> Hey Chris.  Thanks for all your work on this platform and your
> dedication to supporting users on this list.

Seconded. Many thanks!

>
> Fields.REPLACE would definitely be a welcome addition.

Agreed! And I agree with the other points as well.

In addition, cascading looks like it would be a great fit for Amazon
EMR and is sure to attract many users new to both. With that in mind,
I would love a new EMR specific example for cascading - particularly
one that clarifies some of the points made in the adjoining "Using
Taps In HDFS" thread.

Cheers!

E.

Chris K Wensel

May 6, 2009, 6:08:25 PM
to cascadi...@googlegroups.com
>
> Hey Chris. Thanks for all your work on this platform and your
> dedication to supporting users on this list.
>

my pleasure!

all these suggestions are great. let me see what I can do with them..

ckw

Chris K Wensel

May 6, 2009, 6:14:40 PM
to cascadi...@googlegroups.com
> In addition, cascading looks like it would be a great fit for Amazon
> EMR and is sure to attract many users new to both. With that in mind,
> I would love a new EMR specific example for cascading - particularly
> one that clarifies some of the points made in the adjoining "Using
> Taps In HDFS" thread.


Keep your eyes open. Just might have something Real Soon Now.

to your points specifically, you want to use the local HDFS as your
default in all your jobs, and only integrate with S3 to pull/push the
data that needs to live longer than your cluster.

So just use Hfs and relative paths everywhere, except when that data
is in S3 or must go to S3 (new Hfs( "s3n://....." ))

And my recommendation is to use s3n:// not s3://, this way other apps
can get at the data (s3cmd, http://, etc). The drawback is that you
must consider that on input, you can only have one mapper for every
file being read from S3 (in the first MR job in your Flow).

p.s., there is always this too..
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2293&categoryID=263

ckw

Esé

May 12, 2009, 1:10:37 AM
to cascading-user


On May 6, 3:14 pm, Chris K Wensel <ch...@wensel.net> wrote:
> > In addition, cascading looks like it would be a great fit for Amazon
> > EMR and is sure to attract many users new to both. With that in mind,
> > I would love a new EMR specific example for cascading - particularly
> > one that clarifies some of the points made in the adjoining "Using
> > Taps In HDFS" thread.
>
> Keep your eyes open. Just might have something Real Soon Now.

Looking forward to it!

And, BTW, thanks for the tips below. I am getting ready to start
testing my cascading application in EMR this week. Expect to pull
about 20GB from S3 to generate reports and such on a regular basis.
Keeping my fingers crossed - there are lots and lots of aggregations
in there :-)

E.

>
> to your points specifically, you want to use the local HDFS as your  
> default in all your jobs, and only integrate with S3 to pull/push the  
> data that needs to live longer than your cluster.
>
> So just use Hfs and relative paths everywhere, except when that data  
> is in S3 or must go to S3 (new Hfs( "s3n://....." ))
>
> And my recommendation is to use s3n:// not s3://, this way other apps  
> can get at the data (s3cmd, http://, etc). The drawback is that you  
> must consider that on input, you can only have one mapper for every  
> file being read from S3 (in the first MR job in your Flow).
>
> p.s., there is always this too.. http://developer.amazonwebservices.com/connect/entry.jspa?externalID=...

Chris K Wensel

May 12, 2009, 10:14:49 AM
to cascadi...@googlegroups.com
>> Keep your eyes open. Just might have something Real Soon Now.
>
> Looking forward to it!


check out
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2440

A Cascading app written by Amazon for CloudFront.

ckw

--
Chris K Wensel

Ivan Brusic

May 12, 2009, 10:26:56 AM
to cascadi...@googlegroups.com
I downloaded the code out of curiosity (I am not using CloudFront), and I was pleased to see that they were using Cascading.

I had a phone call last week with the EMR group at Amazon as part of market research on their part. They stated that quite a few EMR users were also using Cascading and most users were very pleased. They said they might create some Cascading based AMIs.

Ivan

Chris K Wensel

May 12, 2009, 10:59:01 AM
to cascadi...@googlegroups.com
>
> I had a phone call last week with the EMR group at Amazon as part of
> market research on their part. They stated that quite a few EMR
> users were also using Cascading and most users were very pleased.
> They said they might create some Cascading based AMIs.


That's really cool!

cheers,
chris

Esé

May 19, 2009, 3:16:06 PM
to cascading-user
On May 12, 7:14 am, Chris K Wensel <ch...@wensel.net> wrote:
> >> Keep your eyes open. Just might have something Real Soon Now.
>
> > Looking forward to it!
>
> check out http://developer.amazonwebservices.com/connect/entry.jspa?externalID=...
>
> A Cascading app written by Amazon for CloudFront.

I forgot to thank you for this pointer - this app is worth a thousand
words. It should be required reading for anyone who is new to
Cascading and EMR and wants to work with both.

A quick question on this app, if you don't mind me asking. The line:

Tap source = new FileNameFilteredHfs(startDate, endDate, new
TextLine(), inputPath);

This is a custom hadoop tap filter that picks out the log files
(created as "yyyy-MM-dd-HH") falling between the input startDate and
endDate. Given that the input path is s3n, does the filtering occur
prior to copying the log files from s3 to local hdfs or after? I'd
presume the former, otherwise it'd be a bit of a waste, but wanted to
confirm.

Thanks!

Pavel Kolesnikov

May 19, 2009, 3:33:45 PM
to cascadi...@googlegroups.com
2009/5/19 Esé <opusdp...@gmail.com>:

>
> A quick question on this app, if you don't mind me asking. The line:
>
>    Tap source = new FileNameFilteredHfs(startDate, endDate, new
> TextLine(), inputPath);
>
> This is a custom hadoop tap filter that picks out the log files
> (created as "yyyy-MM-dd-HH") falling between the input startDate and
> endDate. Given that the input path is s3n, does the filtering occur
> prior to copying the log files from s3 to local hdfs or after? I'd
> presume the former, otherwise it'd be a bit of a waste, but wanted to
> confirm.

Yes, it works exactly as you expect. The FileNameFilteredHfs calls
FileInputFormat.setInputPathFilter internally.
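
To make the idea concrete, here is a rough sketch of the kind of check such
a filter might perform. It is not the code from the Amazon app: the class
HourStampRangeFilter is a hypothetical stand-in for the accept(...) logic of
a Hadoop PathFilter, written against plain file names so it runs without
Hadoop on the classpath. It assumes each log file name contains a
"yyyy-MM-dd-HH" stamp and that both ends of the range are inclusive. Since
the filter is consulted when the input files are listed, rejected files
should never be read from S3 at all.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the real FileNameFilteredHfs may differ in detail.
class HourStampRangeFilter {

    // Matches a yyyy-MM-dd-HH stamp anywhere in the file name.
    private static final Pattern STAMP = Pattern.compile("\\d{4}-\\d{2}-\\d{2}-\\d{2}");

    private final String start; // inclusive, formatted as yyyy-MM-dd-HH
    private final String end;   // inclusive, formatted as yyyy-MM-dd-HH

    HourStampRangeFilter(String start, String end) {
        this.start = start;
        this.end = end;
    }

    // Accepts a file only if its stamp falls within [start, end].
    // Zero-padded yyyy-MM-dd-HH strings sort chronologically, so a plain
    // string comparison is enough and no date parsing is needed.
    boolean accept(String fileName) {
        Matcher matcher = STAMP.matcher(fileName);
        if (!matcher.find()) {
            return false; // no stamp in the name: skip the file
        }
        String stamp = matcher.group();
        return stamp.compareTo(start) >= 0 && stamp.compareTo(end) <= 0;
    }
}
```

In a real tap, the same test would sit inside a PathFilter's accept(Path)
method, using path.getName() as the file name.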

Pavel

Esé

May 19, 2009, 3:51:10 PM
to cascading-user
Got it Pavel - thanks!

On May 19, 12:33 pm, Pavel Kolesnikov <pavel.kolesni...@gmail.com>
wrote:
> 2009/5/19 Esé <opusdpeng...@gmail.com>: