Using a custom data writer with Camus

Using a custom data writer with Camus Ken Goodhope 8/1/13 1:12 PM
Hello Everyone,

I tracked down the commit that added the ability to specify a custom data writer for Camus.  We still need to add documentation on how to make use of this, but those who want to start using it now can look through the patch to see which classes to extend and which configuration params to set.

This commit was only added to the Kafka 0.8 branch.  Internally, we have moved off Kafka 0.7, and soon this branch will become trunk.  If someone needs this functionality with Kafka 0.7, we'll need to backport this patch.

Thanks to Sam Meder for adding this.  We should now be in a position to handle any kind of data by specifying a custom decoder and a custom record writer.
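
Until the docs land, the wiring will look roughly like the fragment below.  Treat the property names as a sketch based on the patch, not verified documentation, and the example class names are placeholders:

```properties
# Sketch of a camus.properties fragment -- property names assumed from the
# patch; the com.example class names are placeholders for your own classes.
camus.message.decoder.class=com.example.camus.MyMessageDecoder
etl.record.writer.provider.class=com.example.camus.MyRecordWriterProvider
```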

Ken
Re: Using a custom data writer with Camus Felix GV 8/1/13 1:20 PM
Wow very cool :) !

I have an off-topic question regarding 0.8: is LinkedIn fully done with the migration? During the Kafka meet up after the Hadoop Summit you guys were saying you still had about 2/3 of your consumers reading from the (mirrored) 0.7 cluster. Also, are you guys planning a 0.8.1 release soon or are things looking super stable as is right now?

--
Felix


--
You received this message because you are subscribed to the Google Groups "Camus - Kafka ETL for Hadoop" group.
Re: Using a custom data writer with Camus Ken Goodhope 8/1/13 1:30 PM
I'm not sure whether we still have 0.7 consumers, or whether a 0.8.1 release is planned; the Kafka team could answer more definitively.  All I know for sure is that all our Camus instances are now pulling from 0.8.
Re: Using a custom data writer with Camus Felix GV 8/1/13 1:32 PM
Ok thanks for the info Ken :)

I'll read/ask on the Kafka list when the time comes for us to migrate to 0.8 :) ...

--
Felix
Re: Using a custom data writer with Camus Jay Kreps 8/1/13 8:26 PM
Current state: consumers are 100% migrated, producers maybe 60%.

-Jay


Re: Using a custom data writer with Camus Andrew Otto 8/14/13 9:45 AM
Thanks Ken!

I'm busy trying to figure out how to make Camus read directly from a text Kafka topic and write the raw text bytes to HDFS.

I think I've figured out the MessageDecoder bit, but am having trouble implementing a working RecordWriterProvider.  

When I try to run that as is, everything looks pretty good.  The job finishes successfully, and I see info about bytes being written.  However, nothing is actually written to my etl.destination.path.

I'm sure I'm just missing an obvious but really important step.  (This is actually my first time writing Hadoop Java stuff).

Below is what I have so far:



package org.wikimedia.analytics.kraken.etl;

import com.linkedin.camus.coders.CamusWrapper;
import com.linkedin.camus.etl.IEtlKey;
import com.linkedin.camus.etl.RecordWriterProvider;
import com.linkedin.camus.etl.kafka.mapred.EtlMultiOutputFormat;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;


public class TextRecordWriterProvider implements RecordWriterProvider {
    public final static String EXT = ".txt";

    @Override
    public String getFilenameExtension() {
        return EXT;
    }

    @Override
    public RecordWriter<IEtlKey, CamusWrapper> getDataRecordWriter(
            TaskAttemptContext context,
            String fileName,
            CamusWrapper data,
            FileOutputCommitter committer) throws IOException, InterruptedException {
        // Path path = committer.getWorkPath();
        // path = new Path(path, EtlMultiOutputFormat.getUniqueFile(context, fileName, EXT));
        // -- I think I need to somehow use this Path to tell the returned RecordWriter
        // where to write, but I'm not sure how to do this. I've tried:
        // outputFormat.setOutputPath(<Job>, path), but I don't know where to get Job from.
        // All I've got is context, and I can't find a way to get the context's Job from that.

        // This doesn't write anything to etl.destination.path. :/
        FileOutputFormat outputFormat = new TextOutputFormat();
        return outputFormat.getRecordWriter(context);
    }
}



Re: Using a custom data writer with Camus Andrew Otto 8/15/13 7:08 AM
Ok, I think I got it!  Or, at least something that works.  I just needed to use FileSystem.create().


package org.wikimedia.analytics.kraken.etl;

import com.linkedin.camus.coders.CamusWrapper;
import com.linkedin.camus.etl.IEtlKey;
import com.linkedin.camus.etl.RecordWriterProvider;
import com.linkedin.camus.etl.kafka.mapred.EtlMultiOutputFormat;

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;

public class TextRecordWriterProvider implements RecordWriterProvider {
    public final static String EXT = ".txt";

    @Override
    public String getFilenameExtension() {
        return EXT;
    }

    @Override
    public RecordWriter<IEtlKey, CamusWrapper> getDataRecordWriter(
            TaskAttemptContext context,
            String fileName,
            CamusWrapper data,
            FileOutputCommitter committer) throws IOException, InterruptedException {

        // Write to a unique file under the committer's work path; the committed
        // output is what ends up under etl.destination.path.
        Path path = new Path(
            committer.getWorkPath(),
            EtlMultiOutputFormat.getUniqueFile(context, fileName, EXT)
        );

        final FSDataOutputStream writer = path.getFileSystem(
            context.getConfiguration()
        ).create(path);

        return new RecordWriter<IEtlKey, CamusWrapper>() {
            @Override
            public void write(IEtlKey ignore, CamusWrapper data) throws IOException {
                // The decoder hands us a String payload; append a newline per record.
                String record = (String) data.getRecord() + "\n";
                writer.write(record.getBytes());
            }

            @Override
            public void close(TaskAttemptContext arg0) throws IOException, InterruptedException {
                writer.close();
            }
        };
    }
}
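
For anyone following along, I then point Camus at this class from my properties file.  I'm assuming etl.record.writer.provider.class is the right key here; double-check it against the Camus source:

```properties
# camus.properties -- key name assumed from the Camus patch, not verified
etl.record.writer.provider.class=org.wikimedia.analytics.kraken.etl.TextRecordWriterProvider
```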