Issue while reading timestamp data from parquet file format


santlal gupta

Jul 22, 2015, 4:36:17 AM
to cascading-user
Hi,

I am a beginner with Cascading and Parquet.

I have data in timestamp format, i.e. yyyy-MM-dd h:m:s.f, and I want to create a Parquet file that stores it. However, cascading-parquet does not support the timestamp datatype, so when defining the Parquet message type I used BINARY. With that, the data is written to the Parquet file successfully. I then want to load this data into Hive with field type TIMESTAMP. The file loads into Hive successfully, but reading it throws an exception.
 

Code:

package com.parquet.TimestampTest;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;
import parquet.cascading.ParquetTupleScheme;

public class GenrateTimeStampParquetFile {

    static String inputPath = "target/input/timestampInputFile1";
    static String outputPath = "target/parquetOutput/TimestampOutput";

    public static void main(String[] args) {
        write();
    }

    private static void write() {
        // Read the timestamp strings from a newline-delimited text file.
        Fields field = new Fields("timestampField").applyTypes(String.class);
        Scheme sourceSch = new TextDelimited(field, false, "\n");

        // Write them as an optional BINARY column, since ParquetTupleScheme
        // has no timestamp type.
        Fields outputField = new Fields("timestampField");
        Scheme sinkSch = new ParquetTupleScheme(field, outputField,
                "message TimeStampTest{optional binary timestampField ;}");

        Tap source = new Hfs(sourceSch, inputPath);
        Tap sink = new Hfs(sinkSch, outputPath, SinkMode.REPLACE);

        Pipe pipe = new Pipe("Hive timestamp");
        FlowDef fd = FlowDef.flowDef().addSource(pipe, source).addTailSink(pipe, sink);
        new HadoopFlowConnector().connect(fd).complete();
    }
}

 

Input file (timestampInputFile1):

timestampField
1988-05-25 15:15:15.254
1987-05-06 14:14:25.362

 

After running the code, the following files are generated.

Output:

1. part-00000-m-00000.parquet
2. _SUCCESS
3. _metadata
4. _common_metadata

 

I have created a table in Hive to load the part-00000-m-00000.parquet file, using the following queries.

Query:

hive> create table test3(timestampField timestamp) stored as parquet;
hive> load data local inpath '/home/hduser/parquet_testing/part-00000-m-00000.parquet' into table test3;
hive> select * from test3;


After running the above commands, I got the following output.

Output:

OK
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder"
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.BytesWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable

 

So, is there any way to store timestamp data in the Parquet file format so that, after loading it into Hive, the data can be read successfully?
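Since ParquetTupleScheme only exposes primitive Parquet types, one possible workaround (my own sketch, not something confirmed in this thread; the class and method names below are hypothetical) is to convert the timestamp strings to epoch milliseconds before writing, store them in an int64 column instead of binary, and declare the Hive column as bigint, converting back to a timestamp downstream. The string-to-millis conversion itself can be done with java.sql.Timestamp:

```java
import java.sql.Timestamp;

public class TimestampMillisSketch {

    // Parse a "yyyy-MM-dd HH:mm:ss.fff" string into epoch milliseconds,
    // the value that would be written to an int64 Parquet column.
    static long toMillis(String ts) {
        return Timestamp.valueOf(ts).getTime();
    }

    // Convert epoch milliseconds back to the original string form.
    static String fromMillis(long millis) {
        return new Timestamp(millis).toString();
    }

    public static void main(String[] args) {
        String original = "1988-05-25 15:15:15.254";
        long millis = toMillis(original); // value to store in Parquet
        System.out.println(millis + " -> " + fromMillis(millis));
    }
}
```

Note that Hive's own Parquet writer stores timestamps as int96, which parquet-cascading does not expose, so the bigint route avoids the ClassCastException at the cost of an explicit conversion on the Hive side.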

  

Currently I am using:

Hive 1.1.0-cdh5.4.2
Cascading 2.5.1
parquet-format-2.2.0

 

Thanks

 

Santlal J. Gupta

Andre Kelpe

Jul 22, 2015, 3:25:26 PM
to cascadi...@googlegroups.com
Hi,

I think you will have to ask the Parquet developers how to solve this. Hive seems to be doing the right thing: it encounters a binary blob and reads it as bytes. The right answer is to add support for timestamps in parquet-cascading. Asking on their mailing list should yield better results: http://parquet.apache.org/community/

- André
