Any examples or test cases for RCFileInputFormat?


Deepak Jain

May 13, 2013, 3:15:50 AM
to elephant...@googlegroups.com
I did not find any sample programs or unit test cases for
https://github.com/kevinweil/elephant-bird/blob/master/rcfile/src/main/java/com/twitter/elephantbird/mapreduce/output/RCFileOutputFormat.java

Could you please point me to any?
Regards,
Deepak

Deepak Jain

May 14, 2013, 3:56:26 AM
to elephant...@googlegroups.com
I started with something like this:
1) Convert data from Text to RCFile format

Driver

public Job constructJob(final String jobName, final Path inputPath, final Path outputPath, Class<?> jarClass,
            String compressionScheme, final Configuration configuration, final long colCount) throws IOException {
        if (!"none".equalsIgnoreCase(compressionScheme)) {
            configuration.set(RCFileOutputFormat.COMPRESSION_CODEC_CONF, compressionScheme);
        }
        RCFileOutputFormat.setColumnNumber(configuration, (int) colCount);
        // configuration.set("mapred.child.java.opts", "-Xmx4096m");
        Job job = new Job(configuration, jobName);
        job.setJarByClass(jarClass);
        job.setNumReduceTasks(0);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(RCFileOutputFormat.class);

        TextInputFormat.addInputPath(job, inputPath);
        RCFileOutputFormat.setOutputPath(job, outputPath);

        job.setMapperClass(RCFileWriteMapper.class);
        return job;
    }

There are no reducers here.

RCFileWriteMapper
    protected void map(LongWritable key, Text value,
            Mapper<LongWritable, Text, Void, BytesRefArrayWritable>.Context context) throws java.io.IOException,
            InterruptedException {
        String row = value.toString();
        StringTokenizer tokenize = new StringTokenizer(row);
        int columnIndex = 0;
        BytesRefArrayWritable dataWrite = new BytesRefArrayWritable(10);
        while (tokenize.hasMoreElements()) {
            Text columnValue = new Text(tokenize.nextToken());
            BytesRefWritable bytesRefWritable = new BytesRefWritable();
            bytesRefWritable.set(columnValue.getBytes(), 0, columnValue.getLength());
            // ensure the capacity is increased if required
            dataWrite.resetValid(columnIndex);
            dataWrite.set(columnIndex, bytesRefWritable);
            columnIndex++;
        }
        context.write(null, dataWrite);
    }

The above code did work correctly, but is this how RCFileOutputFormat should be used? Are we expected to break a given row in map() into columns and store them using BytesRefWritable, as this is what RCFile expects? Or can that be avoided and handled automatically by RCFileOutputFormat?


II) Read specific columns of RCFile and convert them into Text

Driver:
        Job job = new Job(configuration, jobName);
        job.setJarByClass(jarClass);
        job.setNumReduceTasks(0);

        job.setInputFormatClass(RCFileBaseInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        ColumnProjectionUtils.setReadColumnIDs(configuration, columnIds);
        RCFileInputFormat.addInputPath(new JobConf(configuration), inputPath);
        TextOutputFormat.setOutputPath(job, outputPath);

        job.setMapperClass(RCFileReadMapper.class);
        return job;


I had to use org.apache.hadoop.hive.ql.io.RCFileInputFormat, as RCFileBaseInputFormat (elephant-bird) does not have addInputPath(). Will this work?

Mapper:
I do not know what the datatypes of the input and output are here. I tried something like this:
public class RCFileReadMapper extends Mapper<Void, BytesRefArrayWritable, Void, Text> {

    protected void map(Void key, BytesRefArrayWritable value,
            Mapper<Void, BytesRefArrayWritable, Void, Text>.Context context) throws java.io.IOException,
            InterruptedException {
        System.out.println(key);
        System.out.println(value);
    }

}

When running the program with the above driver and mapper, the mapper was never invoked and I got the following exception:

Exception in thread "main" java.lang.RuntimeException: java.lang.InstantiationException
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1021)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1041)
    at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:959)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:912)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:912)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
    at com.benchmark.columnar.rcfile.direct.mr.BenchmarkRCFileMR.performRead(BenchmarkRCFileMR.java:139)
    at com.benchmark.columnar.rcfile.direct.mr.BenchmarkRCFileMR.doReadBenchmark(BenchmarkRCFileMR.java:93)
    at com.benchmark.columnar.rcfile.direct.mr.BenchmarkRCFileMR.main(BenchmarkRCFileMR.java:54)
Caused by: java.lang.InstantiationException
    at sun.reflect.InstantiationExceptionConstructorAccessorImpl.newInstance(InstantiationExceptionConstructorAccessorImpl.java:48)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:113)
    ... 14 more

Regards,
Deepak

ÐΞ€ρ@Ҝ (๏̯͡๏)

May 14, 2013, 12:13:34 PM
to elephant...@googlegroups.com
Any suggestions, please?
--
You received this message because you are subscribed to a topic in the Google Groups "elephantbird-dev" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/elephantbird-dev/FSWyOks2v4U/unsubscribe?hl=en.
To unsubscribe from this group and all its topics, send an email to elephantbird-d...@googlegroups.com.
To post to this group, send email to elephant...@googlegroups.com.
Visit this group at http://groups.google.com/group/elephantbird-dev?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.

--
Deepak


Raghu Angadi

May 14, 2013, 1:02:25 PM
to elephant...@googlegroups.com

replies inline:

I don't think it will take more than the 'colCount'.

 
            dataWrite.resetValid(columnIndex);
            dataWrite.set(columnIndex, bytesRefWritable);
            columnIndex++;
        }
        context.write(null, dataWrite);
    }

The above code did work correctly, but is this how RCFileOutputFormat should be used? Are we expected to break a given row in map() into columns and store them using BytesRefWritable, as this is what RCFile expects? Or can that be avoided and handled automatically by RCFileOutputFormat?

Yes, you need to split the row and provide the content for the columns. See putNext() in RCFilePigStorage for an example.
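[Editor's sketch] The splitting step described above — each whitespace-delimited token of a row becomes one column's raw bytes — can be shown with plain Java stand-ins (byte[] in place of BytesRefWritable; the class name RowSplitter is made up for illustration):

```java
// Hedged sketch of the column-splitting work RCFileOutputFormat expects
// the mapper to do: one text row is broken into per-column byte arrays,
// in column order. Plain Java types are used so only the splitting logic
// is shown; in the real mapper each byte[] would back a BytesRefWritable.
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class RowSplitter {
    // Split one text row into per-column byte arrays, in column order.
    public static List<byte[]> splitRow(String row) {
        List<byte[]> columns = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(row);
        while (tokenizer.hasMoreTokens()) {
            columns.add(tokenizer.nextToken().getBytes());
        }
        return columns;
    }
}
```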

 


II) Read specific columns of RCFile and convert them into Text

Driver:
        Job job = new Job(configuration, jobName);
        job.setJarByClass(jarClass);
        job.setNumReduceTasks(0);

        job.setInputFormatClass(RCFileBaseInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        ColumnProjectionUtils.setReadColumnIDs(configuration, columnIds);
        RCFileInputFormat.addInputPath(new JobConf(configuration), inputPath);
        TextOutputFormat.setOutputPath(job, outputPath);

        job.setMapperClass(RCFileReadMapper.class);
        return job;


I had to use org.apache.hadoop.hive.ql.io.RCFileInputFormat, as RCFileBaseInputFormat (elephant-bird) does not have addInputPath(). Will this work?

Yeah, we use org.apache.hadoop.hive.ql.io.RCFileInputFormat as well, but hive.ql.io.RCFileInputFormat is written for the old 'mapred' interface. You need to use a wrapper for the new "mapreduce" interface:

Try (this is what RCFilePigStorage does in getInputFormat()):

MapReduceInputFormatWrapper.setInputFormat(org.apache.hadoop.hive.ql.io.RCFileInputFormat.class, job);

in place of 'job.setInputFormat(org.apache.hadoop.hive.ql.io.RCFileInputFormat.class)'
 
Raghu.


Deepak Jain

May 15, 2013, 4:44:26 AM
to elephant...@googlegroups.com
Hi,
I tried a couple of options:

I)

        Job job = new Job(configuration, jobName);
        job.setJarByClass(jarClass);
        job.setNumReduceTasks(0);

        job.setMapOutputKeyClass(Void.class);
        job.setMapOutputValueClass(BytesRefArrayWritable.class);
        job.setOutputKeyClass(Void.class);
        job.setOutputValueClass(BytesRefArrayWritable.class);
        MapReduceInputFormatWrapper.setInputFormat(org.apache.hadoop.hive.ql.io.RCFileInputFormat.class, job);
        job.setOutputFormatClass(TextOutputFormat.class);

        ColumnProjectionUtils.setReadColumnIDs(configuration, columnIds);
        configuration.set("mapred.input.dir", inputPath.toString());
        TextOutputFormat.setOutputPath(job, outputPath);

        job.setMapperClass(RCFileReadMapper.class);
        return job;

II) I replaced the bold text above with
      RCFileInputFormat.addInputPath(new JobConf(configuration), inputPath);
RCFileInputFormat is the old API and does not accept a Configuration object as its first parameter; it expects a JobConf instead (unlike TextInputFormat.addInputPath(job, inputPath);).

In both cases I inspected the properties of the Configuration object within the Job and found that mapred.input.dir was never set. On running both versions of the driver I got the exception below:
13/05/15 14:07:04 ERROR security.UserGroupInformation: PriviledgedActionException as:deepakkv cause:java.io.IOException: No input paths specified in job
Exception in thread "main" java.io.IOException: No input paths specified in job
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:156)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
    at com.twitter.elephantbird.mapreduce.input.MapReduceInputFormatWrapper.getSplits(MapReduceInputFormatWrapper.java:97)

Just before firing the job, I noted down the properties below:
mapred.reduce.tasks=0
mapreduce.inputformat.class=com.twitter.elephantbird.mapreduce.input.MapReduceInputFormatWrapper
mapreduce.outputformat.class=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
mapred.input.dir=    (this was missing)
mapred.output.dir=/home/deepakkv/benchmark/BenchmarkRCFileMR/data/textOutput/COLS-1-4-6-8-9
mapreduce.map.class=com.ebay.benchmark.columnar.rcfile.direct.mr.mapper.RCFileReadMapper
elephantbird.class.for.MapReduceInputFormatWrapper=org.apache.hadoop.hive.ql.io.RCFileInputFormat

Questions
1) If I solely use MapReduceInputFormatWrapper, why does it not provide an API to add input paths? Typically you set the input format class and then add input paths through it.
Ex: job.setInputFormatClass(TextInputFormat.class); and TextInputFormat.addInputPath(job, inputPath);
With MapReduceInputFormatWrapper I can do
MapReduceInputFormatWrapper.setInputFormat(org.apache.hadoop.hive.ql.io.RCFileInputFormat.class, job); but where should I add the input path?

2) As MapReduceInputFormatWrapper does not have any API to set the input path, how should I set it?

Regards,
Deepak


Deepak Jain

May 15, 2013, 12:43:30 PM
to elephant...@googlegroups.com
Any suggestions on how to get this working?

Raghu Angadi

May 15, 2013, 12:49:05 PM
to elephant...@googlegroups.com

On Wed, May 15, 2013 at 1:44 AM, Deepak Jain <deep...@gmail.com> wrote:
II) I replaced the bold text above with
      RCFileInputFormat.addInputPath(new JobConf(configuration), inputPath);
RCFileInputFormat is the old API and does not accept a Configuration object as its first parameter; it expects a JobConf instead (unlike TextInputFormat.addInputPath(job, inputPath);).

FileInputFormat.setInputPaths(configuration, inputPath);

Many InputFormats including RCFileInputFormat just use paths from FileInputFormat.
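[Editor's sketch] A likely reason mapred.input.dir never showed up in the job's properties above: as far as I can tell, both new JobConf(configuration) and the new Job(configuration, ...) constructor copy the passed-in properties, so setting the path on one copy never reaches the other. The class and method names below (CopySemantics, inputDirSeenByJob) are made up, and java.util.Properties stands in for Hadoop's Configuration to show the same copy semantics; the fix is to set the path on job.getConfiguration() (or via FileInputFormat against the job itself).

```java
// Hedged illustration (not elephant-bird or Hadoop code) of the
// copy-on-construct pitfall: a property set on a copied configuration
// is never seen by the original the job actually uses.
import java.util.Properties;

public class CopySemantics {
    // Returns what the "job" sees for mapred.input.dir after the path
    // is set on a copied configuration, mimicking the reported bug.
    public static String inputDirSeenByJob(String inputPath) {
        Properties jobConf = new Properties();   // stand-in for the Job's own conf

        Properties copy = new Properties();      // stand-in for new JobConf(configuration)
        copy.putAll(jobConf);
        copy.setProperty("mapred.input.dir", inputPath);  // set on the copy only

        return jobConf.getProperty("mapred.input.dir");   // the job never sees it
    }
}
```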


Kundan Dere

Oct 29, 2013, 2:30:19 AM
to elephant...@googlegroups.com

Hey Deepak,

I was searching for an RCFile input and output format example. I am just curious whether you found any.
Please reply.
