Camus direct writes to s3 possible?


Kevin Su

unread,
Apr 22, 2015, 2:26:13 PM
to camu...@googlegroups.com
Hi all,

I have Camus running in EMR with the following properties file snippet:

etl.destination.path=/tmp/camus
etl.execution.base.path=/tmp/camus/exec
etl.execution.history.path=/tmp/camus/camus/exec/history

And then I'm running an s3distcp job to copy the data from the EMR instances into s3.

This part is functioning well.  

What I'd like to do is skip the part where Camus writes stuff to hdfs://tmp and go directly to s3.  

I've tried messing with the properties file but can't seem to get it working.  I've added:
fs.defaultFS=s3*://bucket
fs.s3*.awsAccessKeyId=key
fs.s3*.awsSecretAccessKey=secret
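
For clarity, the s3* above is shorthand for the scheme-specific property names; spelled out for, say, the s3n scheme it would be something along these lines (bucket name and keys are just placeholders):

# hypothetical expansion of the s3* shorthand, assuming the s3n scheme
fs.defaultFS=s3n://my-bucket
fs.s3n.awsAccessKeyId=AKIAEXAMPLEKEY
fs.s3n.awsSecretAccessKey=example-secret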

But I keep getting this error:
Exception in thread "main" java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.

Perhaps caused by this:
2015-04-22 17:36:34,708 INFO org.apache.hadoop.mapreduce.Cluster (main): Failed to use org.apache.hadoop.mapred.YarnClientProtocolProvider due to error: Error in instantiating YarnClient

Normally, I'd mess around with the hadoop conf files like core-site.xml, but since this is on EMR, that's a little cumbersome as I want to keep spinning up new instances.  

My main questions are:
1.  Is there something really obvious I'm missing about configuring camus.properties to write to s3?
2.  Is there some automagic way I can override core-site.xml and the like in EMR without sshing onto the box and changing it myself?  

Cheers,
Kev

laksh...@tokbox.com

unread,
Oct 12, 2015, 5:36:12 PM
to Camus - Kafka ETL for Hadoop
I'm also running into the same issue while trying to write to Amazon s3 from Camus running in EMR.

Exception in thread "main" java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.


Any thoughts on how to get it working?

Kevin Su

unread,
Oct 12, 2015, 7:59:54 PM
to Camus - Kafka ETL for Hadoop
Hi, I ended up forking Camus in order to allow writes to s3.  

The main issue was that the code was getting a FileSystem like so:
FileSystem fs = FileSystem.get(job.getConfiguration());
so it defaulted to hdfs and didn't like it when I tried to write to s3.  If you change this line, and the other lines like it, to:

Path execBasePath = new Path(props.getProperty(ETL_EXECUTION_BASE_PATH));
FileSystem fs = FileSystem.get(execBasePath.toUri(), job.getConfiguration());

it'll give you an fs instance that points to the same kind of file system as the one you specified in your execution base path.
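
Here's a minimal, self-contained sketch of the same idea outside of Camus (the class name, method name and bucket are just for illustration; in practice you'd make the equivalent change inside CamusJob, and running this assumes the s3n FileSystem implementation and your AWS credentials are on the classpath / in the Hadoop config):

import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class PathScopedFsExample {
    // Same property name Camus uses for the execution base path.
    static final String ETL_EXECUTION_BASE_PATH = "etl.execution.base.path";

    // Resolve the FileSystem from the configured path's URI, so an s3n://
    // (or hdfs://, file://, ...) base path picks the matching FileSystem
    // implementation instead of whatever fs.defaultFS happens to be.
    static FileSystem fsFor(Properties props, Job job) throws Exception {
        Path execBasePath = new Path(props.getProperty(ETL_EXECUTION_BASE_PATH));
        return FileSystem.get(execBasePath.toUri(), job.getConfiguration());
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty(ETL_EXECUTION_BASE_PATH, "s3n://my-bucket/camus/exec");

        // Creating the Job object doesn't contact the cluster; the s3n jars
        // and credentials are only needed when FileSystem.get resolves s3n.
        Job job = Job.getInstance(new Configuration());
        FileSystem fs = fsFor(props, job);
        System.out.println(fs.getUri()); // prints s3n://my-bucket
    }
}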

I made a gist of the diff for that particular commit.  It's been a while since I made it, though, and LinkedIn has since stopped supporting Camus and moved to Gobblin, but you can check it out if you like.

laksh...@tokbox.com

unread,
Oct 13, 2015, 5:30:14 PM
to Camus - Kafka ETL for Hadoop
Thanks, Kevin, for the response. I'll look into your gist to see whether I can reuse it.

Thanks,