Cascading 3.0 wip


Chris K Wensel

Dec 10, 2014, 1:11:08 AM
to cascadi...@googlegroups.com
Hey all

Quick note that we pushed a new 3.0 wip, 3.0-wip-61.

You can see the major changes here:


For the most part, we removed all deprecations from the 2.x line.

We want to make a major push to finalize Cascading 3, and that includes getting feedback on the changes we had to make to have dual MapReduce and Tez DAG support. 

So if you have a framework or project on Cascading 2.x, please give porting to 3.0 wip a shot, and fire off feedback on any improvements or questions.

** Now is the time to make a case for any major API changes. **

Running your tests on Cascading 3 with the MapReduce planner would also be great. Any holes in our planner would be best discovered sooner rather than later.

Testing on Tez would be great too!

But our primary concerns are knowing that the core APIs can be considered stable, and that applications running on MapReduce are as robust on Cascading 3.0 as they are on Cascading 2.x.

ckw


Luis Casillas

Dec 15, 2014, 4:35:10 PM
to cascadi...@googlegroups.com
Our jobs seem to run fine on 3.0-wip-61 with MR2.

I also somehow managed to run them successfully in Tez, but the obstacle there has been setting up an environment with Tez, given that (a) the Tez distribution is source-only, and (b) the installation instructions for Tez are a bit terse.  I went through the Tez 0.5.x installation instructions on the Hadoop cluster on my laptop and got the jobs to run, but it used Tez local mode for all of them, and I experienced problems with threads hanging at the end of the job (each consuming 100% CPU).  Overall I call this a partial success, since I don't expect pre-release software to have polished documentation, but those were the stumbling points.

After that I did see that you have a fork of vagrant-cascading-hadoop-cluster that has Tez built into it, but it's set up to use VMware instead of VirtualBox, and by that point I just didn't have the time to figure out how to edit the Vagrantfile to use VirtualBox.  I hope to give that a shot in the near future.  That said, the thing I'd find most helpful is a set of Elastic MapReduce bootstrap actions to install Tez and Cascading SDK 3.0-wip-xx (maybe you already have the latter).

Keep up the great work!

Luis


Chris K Wensel

Dec 15, 2014, 5:04:11 PM
to cascadi...@googlegroups.com
Thanks Luis!

Here is a gist of my EMR voodoo


the first is a bootstrap action addition, the second I run to exec Load. note the preamble to set up Tez on the cluster. this is run in screen on the master.

feel free to pull Tez from our bucket if it isn’t private. (my transmit is crashing, so I can’t test it without a reboot). I can make it public if it isn’t, just let me know.

as for vagrant, it should be trivial to switch. but, i switched from VirtualBox for a reason.

ckw

-- 
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascading-use...@googlegroups.com.
To post to this group, send email to cascadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/cascading-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/cascading-user/413a4d92-bdf8-4794-a00b-1f4699b2cc53%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Luis Casillas

Dec 15, 2014, 5:48:00 PM
to cascadi...@googlegroups.com
Thanks!  I just checked and the S3 path is publicly accessible.

Luis Casillas

Dec 15, 2014, 6:55:58 PM
to cascadi...@googlegroups.com
Well, it looks like your Tez bootstrap code relies on some other bootstrap actions or environmental factors that differ between your environment and my own (AMI 3.3.1, Amazon Hadoop 2.4.0).   Here's what I got so far (bootstrap action and log output):


It's certainly something dumb, I'll come back and troubleshoot it when I get the chance...



Chris K Wensel

Dec 15, 2014, 7:09:09 PM
to cascadi...@googlegroups.com
Last AMI I used was

AMI_VERSION=3.2.1

looks like I need to change that default now on my next run.

and looking at the logs, looks like the jars are not public (as feared)


just got transmit up, and indeed they weren’t public and indeed transmit doesn’t want to recursively change the perms anyway. so I just manually updated them.

try it now.


Luis Casillas

Dec 15, 2014, 7:24:18 PM
to cascadi...@googlegroups.com
Oh, I was able to list the contents of the bucket, so I guess I jumped from that to the conclusion that they're public.  That's only one of three different error types that I got, so I'm not going to retry right away; I'm planning to start a live cluster, log in through SSH and snoop around the master to figure out exactly what commands to run.  But that'll have to wait until tomorrow...

Thanks!
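
The gap between "I can list the bucket" and "the objects are public" can be checked directly, since S3 grants list permission on a bucket separately from read permission on each object. A minimal sketch, assuming the object is addressed by a plain HTTPS URL; the function name `is_publicly_readable` and the `opener` parameter are hypothetical and exist only for this illustration:

```python
import urllib.error
import urllib.request


def is_publicly_readable(url, opener=urllib.request.urlopen):
    """Return True if an anonymous HEAD request for `url` succeeds.

    No credentials are attached, so the result reflects what an
    unauthenticated client (such as another account's cluster) would see.
    `opener` is a hypothetical hook that allows the network call to be
    replaced, for example in a test.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        # 401/403 mean the object exists but is not readable anonymously.
        if err.code in (401, 403):
            return False
        # Anything else (404, 5xx, ...) is a different problem; surface it.
        raise
```

Running this against one jar URL from the bucket, rather than against the bucket listing, tells you whether that particular object is readable without credentials.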

Chris K Wensel

Dec 15, 2014, 7:34:22 PM
to cascadi...@googlegroups.com
I should point out that the second file is _not_ a bootstrap action; you need to shell in and run it.

ckw


Luis Casillas

Dec 16, 2014, 8:30:16 PM
to cascadi...@googlegroups.com
Well, here's an even stronger statement that I just learned the hard way: the file cannot work as a bootstrap action, because EMR runs those before starting Hadoop, and the copy-to-HDFS command there requires Hadoop to be up.  I've made some progress on this, but it still doesn't work for me.  I've updated the gist to show where I'm at:


Failure message shown there.  Two details that may be relevant:
  1. I'm submitting my jar as a step to the EMR cluster, not running it from the command line on the master as the original example does.
  2. The sources and sinks are going to S3.  Google results vaguely suggest that the class org.apache.hadoop.mapred.DirectFileOutputCommitter that cannot be found is S3-related.
I probably won't pick this up again until next week.
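
The ordering problem described above (the copy-to-HDFS step needs Hadoop up, but bootstrap actions run before Hadoop starts) can be sketched as a readiness poll that runs after bootstrap, for example from a shell on the master. This is a rough illustration and not the script from the gist: the names `wait_until_ready` and `hdfs_is_up` are hypothetical, and treating a successful `hadoop fs -ls /` as "HDFS is up" is an assumption.

```python
import subprocess
import time


def wait_until_ready(probe, attempts=30, delay_seconds=10.0, sleep=time.sleep):
    """Call `probe` until it returns True or `attempts` runs out.

    Returns True as soon as the probe succeeds, False if it never does.
    `sleep` is injectable so the loop can be exercised without waiting.
    """
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False


def hdfs_is_up():
    """Assumed readiness check: listing the HDFS root exits with status 0."""
    try:
        result = subprocess.run(
            ["hadoop", "fs", "-ls", "/"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        # The hadoop command is not on the PATH yet.
        return False
    return result.returncode == 0
```

A script would call `wait_until_ready(hdfs_is_up)` and only run its `hadoop fs -put` commands if that returns True. Putting this loop inside a bootstrap action itself would not help: if EMR finishes the bootstrap actions before starting Hadoop, as described above, the poll could never succeed there.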