Issue while reading timestamp data from parquet file format


santlal gupta

Jul 22, 2015, 4:36:17 AM
to cascading-user
Hi,

I am a beginner with Cascading and Parquet.

I have data in timestamp format, i.e. yyyy-MM-dd h:m:s.f, and I want to create a Parquet file that stores it. However, cascading-parquet does not support the timestamp datatype, so when defining the Parquet message type I used BINARY. With that, the data is written to the Parquet file successfully. I then want to load this data into Hive with field type TIMESTAMP. The file loads into Hive successfully, but reading it throws an exception.
 

Code:

package com.parquet.TimestampTest;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.Scheme;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;
import parquet.cascading.ParquetTupleScheme;

public class GenrateTimeStampParquetFile {

    static String inputPath = "target/input/timestampInputFile1";
    static String outputPath = "target/parquetOutput/TimestampOutput";

    public static void main(String[] args) {
        write();
    }

    private static void write() {
        // Read the timestamp strings from a newline-delimited text file.
        Fields field = new Fields("timestampField").applyTypes(String.class);
        Scheme sourceSch = new TextDelimited(field, false, "\n");

        // Write them as an optional BINARY column, since ParquetTupleScheme
        // has no timestamp type.
        Fields outputField = new Fields("timestampField");
        Scheme sinkSch = new ParquetTupleScheme(field, outputField,
                "message TimeStampTest{optional binary timestampField ;}");

        Tap source = new Hfs(sourceSch, inputPath);
        Tap sink = new Hfs(sinkSch, outputPath, SinkMode.REPLACE);

        Pipe pipe = new Pipe("Hive timestamp");
        FlowDef fd = FlowDef.flowDef().addSource(pipe, source).addTailSink(pipe, sink);
        new HadoopFlowConnector().connect(fd).complete();
    }
}

 

Input file (timestampInputFile1):

timestampField
1988-05-25 15:15:15.254
1987-05-06 14:14:25.362

 

After running the code, the following files are generated.

Output:

1. part-00000-m-00000.parquet
2. _SUCCESS
3. _metadata
4. _common_metadata

 

I have created a table in Hive to load the part-00000-m-00000.parquet file, using the following queries.

Query:

hive> create table test3(timestampField timestamp) stored as parquet;
hive> load data local inpath '/home/hduser/parquet_testing/part-00000-m-00000.parquet' into table test3;
hive> select * from test3;


After running the above commands, I got the following output.

Output:

OK
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder"
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.BytesWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable

 

So, is there any way to store timestamp data in the Parquet file format so that, after loading it into Hive, the data can be read successfully?
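Since ParquetTupleScheme only exposes primitive Parquet types, one possible workaround (my own sketch, not something confirmed in this thread; the class and method names below are hypothetical) is to convert the timestamp strings to epoch milliseconds before writing, store them in an int64 column instead of binary, and declare the Hive column as bigint, converting back to a timestamp downstream. The string-to-millis conversion itself can be done with java.sql.Timestamp:

```java
import java.sql.Timestamp;

public class TimestampMillisSketch {

    // Parse a "yyyy-MM-dd HH:mm:ss.fff" string into epoch milliseconds,
    // the value that would be written to an int64 Parquet column.
    static long toMillis(String ts) {
        return Timestamp.valueOf(ts).getTime();
    }

    // Convert epoch milliseconds back to the original string form.
    static String fromMillis(long millis) {
        return new Timestamp(millis).toString();
    }

    public static void main(String[] args) {
        String original = "1988-05-25 15:15:15.254";
        long millis = toMillis(original); // value to store in Parquet
        System.out.println(millis + " -> " + fromMillis(millis));
    }
}
```

Note that Hive's own Parquet writer stores timestamps as int96, which parquet-cascading does not expose, so the bigint route avoids the ClassCastException at the cost of an explicit conversion on the Hive side.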

  

Currently I am using:

Hive 1.1.0-cdh5.4.2
Cascading 2.5.1
parquet-format-2.2.0

 

Thanks

 

Santlal J. Gupta

Andre Kelpe

Jul 22, 2015, 3:25:26 PM
to cascadi...@googlegroups.com
Hi,

I think you will have to ask the Parquet developers how to solve this. Hive seems to be doing the right thing: it encounters a binary blob and reads it as bytes. The right answer is to add support for timestamps in parquet-cascading. Asking on their mailing list should yield better results: http://parquet.apache.org/community/

- André
