BigDecimal & Date Datatype - How to use with parquet-cascading?

Bhavesh Shah

Jun 26, 2015, 3:35:53 AM6/26/15
to cascadi...@googlegroups.com
Hi,
I am trying to use the BigDecimal and Date datatypes with parquet-cascading. I have created a sample job, but it throws an exception whenever I use BigDecimal or Date in the "message type". When I write the "message type" string for ParquetTupleScheme, I am not able to use BigDecimal or Date. Below is the sample code:

public class ReadWriteParquet {

    static String textInputPath = "inputOutput/input/in1.txt";
    static String parquetOutputPath = "inputOutput/output/parquet-out";
    static String textOutputPath = "inputOutput/output/text-out";

    public static void main(String[] args) throws Exception {
        ReadWriteParquet.write();
        ReadWriteParquet.read();
    }

    private static void read() {
        Scheme parquetinput = new ParquetTupleScheme(new Fields("Name",
                "College", "Branch", "Age", "Doj", "BigDeci"));
        Scheme textoutput = new TextDelimited(true, ",");

        Tap source = new Hfs(parquetinput, parquetOutputPath);
        Tap sink = new Hfs(textoutput, textOutputPath, SinkMode.REPLACE);

        Pipe pipe = new Pipe("Read Parquet");
        pipe = new GroupBy(pipe, new Fields("Branch"));

        Properties hadoopProps = new Properties();
        AppProps.setApplicationJarClass(hadoopProps, ReadWriteParquet.class);
        TupleSerializationProps.addSerialization(hadoopProps,
                BigDecimalSerialization.class.getName());

        FlowDef flowdef = FlowDef.flowDef().addSource(pipe, source)
                .addTailSink(pipe, sink);
        HadoopFlowConnector hd = new HadoopFlowConnector(hadoopProps);
        hd.connect(flowdef).complete();
    }

    private static void write() {
        DateType dateType = new DateType("dd/MM/yyyy");
        Fields fields = new Fields("Name", "College", "Branch", "Age", "Doj",
                "BigDeci").applyTypes(String.class, String.class, String.class,
                Integer.class, dateType, BigDecimal.class);

        Scheme input = new TextDelimited(fields, true, ",");

        Scheme parquetout = new ParquetTupleScheme(fields, fields,
                "message ReadWriteParquet {required Binary Name; required Binary College; "
              + "required Binary Branch; optional int64 Age; required int64 Doj; required Double BigDeci; }");

        Tap source = new Hfs(input, textInputPath);
        Tap sink = new Hfs(parquetout, parquetOutputPath, SinkMode.REPLACE);

        Pipe pipe = new Pipe("Write Parquet");

        FlowDef flowdef = FlowDef.flowDef().addSource(pipe, source)
                .addTailSink(pipe, sink);
        new HadoopFlowConnector().connect(flowdef).complete();
    }
}


In the above code you can see that in the "message type" I have used int64 for the date, since there is no provision for a date datatype, and Double for BigDecimal. With int64 the dates are written as plain long values, but I want the date to be written in a particular format. The same goes for BigDecimal: to get the job to run I have mapped it to Double, but I really want to map it to BigDecimal.

Is there any way to deal with the Date and BigDecimal datatypes in parquet-cascading while writing the data? Please let me know if there is any workaround.
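
One workaround I am thinking about (not tested yet, so treat it only as a sketch) is to coerce the Doj and BigDeci fields to Strings inside the pipe with cascading.pipe.assembly.Coerce and declare those two columns as binary in the message type, so the formatted date and the exact decimal value are stored as UTF-8 text. The writeAsStrings name below is just for illustration; it reuses the static paths and field names from my write() method:

    private static void writeAsStrings() {
        DateType dateType = new DateType("dd/MM/yyyy");
        Fields fields = new Fields("Name", "College", "Branch", "Age", "Doj",
                "BigDeci").applyTypes(String.class, String.class, String.class,
                Integer.class, dateType, BigDecimal.class);

        Scheme input = new TextDelimited(fields, true, ",");

        // Untyped fields for the sink; Doj and BigDeci are declared as binary
        // (UTF-8 strings) instead of int64/double in the Parquet schema.
        Fields parquetFields = new Fields("Name", "College", "Branch", "Age", "Doj", "BigDeci");
        Scheme parquetout = new ParquetTupleScheme(parquetFields, parquetFields,
                "message ReadWriteParquet { required binary Name; required binary College; "
              + "required binary Branch; optional int64 Age; required binary Doj; required binary BigDeci; }");

        Tap source = new Hfs(input, textInputPath);
        Tap sink = new Hfs(parquetout, parquetOutputPath, SinkMode.REPLACE);

        Pipe pipe = new Pipe("Write Parquet");
        // Coerce Doj and BigDeci to Strings; the declared DateType should format
        // Doj using the "dd/MM/yyyy" pattern and BigDeci via BigDecimal.toString()
        // before the tuples reach the Parquet sink.
        pipe = new Coerce(pipe, new Fields("Doj", "BigDeci"), String.class, String.class);

        FlowDef flowdef = FlowDef.flowDef().addSource(pipe, source)
                .addTailSink(pipe, sink);
        new HadoopFlowConnector().connect(flowdef).complete();
    }

On the read side those two columns would then come back as Strings, which could be parsed again with the same DateType pattern and new BigDecimal(value) if needed. But I would prefer a proper date/decimal mapping if parquet-cascading supports one.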


Thanks,
Bhavesh

Andre Kelpe

Jun 29, 2015, 1:55:36 AM6/29/15
to cascadi...@googlegroups.com
I think you might be better off asking on the parquet-user list, since parquet-cascading is part of Apache Parquet: https://parquet.apache.org/community/

- André

shree

Aug 5, 2015, 3:35:41 PM8/5/15
to cascading-user
Hi Bhavesh,

Sorry to pull you into my issue. I am also working on ParquetReadWrite functionality. It works fine in my local IDE, but when I try to run it on a Hadoop cluster as a Hadoop jar I get java.lang.ClassNotFoundException: cascading.scheme.Scheme.
Can you please refer to my post for more details and share your thoughts on it?




Thank You.