Camus direct writes to s3 possible?


Kevin Su

unread,
Apr 22, 2015, 2:26:13 PM
to camu...@googlegroups.com
Hi all,

I have Camus running in EMR with the following properties file snippet:

etl.destination.path=/tmp/camus
etl.execution.base.path=/tmp/camus/exec
etl.execution.history.path=/tmp/camus/camus/exec/history

And then I'm running an s3distcp job to copy the data from the EMR instances into s3.

This part is functioning well.  

What I'd like to do is skip the part where Camus writes stuff to hdfs://tmp and go directly to s3.  

I've tried messing with the properties file but can't seem to get it working.  I've added:
fs.defaultFS=s3*://bucket
fs.s3*.awsAccessKeyId=key
fs.s3*.awsSecretAccessKey=secret
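
For clarity, the s3* above is shorthand for the scheme-specific property names; spelled out for, say, the s3n scheme it would be something along these lines (bucket name and keys are just placeholders):

# hypothetical expansion of the s3* shorthand, assuming the s3n scheme
fs.defaultFS=s3n://my-bucket
fs.s3n.awsAccessKeyId=AKIAEXAMPLEKEY
fs.s3n.awsSecretAccessKey=example-secret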

But I keep getting this error:
Exception in thread "main" java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.

Perhaps caused by this:
2015-04-22 17:36:34,708 INFO org.apache.hadoop.mapreduce.Cluster (main): Failed to use org.apache.hadoop.mapred.YarnClientProtocolProvider due to error: Error in instantiating YarnClient

Normally, I'd mess around with the hadoop conf files like core-site.xml, but since this is on EMR, that's a little cumbersome as I want to keep spinning up new instances.  

My main questions are:
1.  Is there something really obvious I'm missing about configuring camus.properties to write to s3?
2.  Is there some automagic way I can override core-site.xml and the like in EMR without sshing onto the box and changing it myself?  

Cheers,
Kev

laksh...@tokbox.com

unread,
Oct 12, 2015, 5:36:12 PM
to Camus - Kafka ETL for Hadoop
I'm also running into the same issue while trying to write to Amazon s3 from Camus running in EMR.

Exception in thread "main" java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.


Any thoughts on how to get it working?

Kevin Su

unread,
Oct 12, 2015, 7:59:54 PM
to Camus - Kafka ETL for Hadoop
Hi, I ended up forking Camus in order to allow writes to s3.  

The main issue was that the code was getting a FileSystem like so:
FileSystem fs = FileSystem.get(job.getConfiguration());
so it defaulted to hdfs and didn't like it when I tried to write to s3.  If you change this line, and the other lines like it, to:

Path execBasePath = new Path(props.getProperty(ETL_EXECUTION_BASE_PATH));
FileSystem fs = FileSystem.get(execBasePath.toUri(), job.getConfiguration());

it'll give you an fs instance that points to the same kind of file system as the one you specified in your execution base path.
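
Here's a minimal, self-contained sketch of the same idea outside of Camus (the class name, method name and bucket are just for illustration; in practice you'd make the equivalent change inside CamusJob, and running this assumes the s3n FileSystem implementation and your AWS credentials are on the classpath / in the Hadoop config):

import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class PathScopedFsExample {
    // Same property name Camus uses for the execution base path.
    static final String ETL_EXECUTION_BASE_PATH = "etl.execution.base.path";

    // Resolve the FileSystem from the configured path's URI, so an s3n://
    // (or hdfs://, file://, ...) base path picks the matching FileSystem
    // implementation instead of whatever fs.defaultFS happens to be.
    static FileSystem fsFor(Properties props, Job job) throws Exception {
        Path execBasePath = new Path(props.getProperty(ETL_EXECUTION_BASE_PATH));
        return FileSystem.get(execBasePath.toUri(), job.getConfiguration());
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty(ETL_EXECUTION_BASE_PATH, "s3n://my-bucket/camus/exec");

        // Creating the Job object doesn't contact the cluster; the s3n jars
        // and credentials are only needed when FileSystem.get resolves s3n.
        Job job = Job.getInstance(new Configuration());
        FileSystem fs = fsFor(props, job);
        System.out.println(fs.getUri()); // prints s3n://my-bucket
    }
}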

I made a gist of the diff for that particular commit.  It's been a while since I made it, though, and LinkedIn has since stopped supporting Camus and moved to Gobblin, but you can check it out if you like.

laksh...@tokbox.com

unread,
Oct 13, 2015, 5:30:14 PM
to Camus - Kafka ETL for Hadoop
Thanks, Kevin, for the response. I'll look into your gist to see whether I can reuse it.

Thanks,