kite-data-core 0.17.1 possible bug

Giovanni G

Feb 5, 2015, 11:45:14 AM
to cdk...@cloudera.org
Hello,

I've recently started using Kite to write and read Parquet datasets, and I've noticed the following behavior, which looks like a bug to me:

I created a Parquet dataset with the following partition strategy:

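import org.kitesdk.data.PartitionStrategy

// Partition by year, then by month, both derived from the "day" field.
val ps: PartitionStrategy = new PartitionStrategy.Builder().
      year("day", "year").
      month("day", "month").
      build()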

("day" is a field containing a timestamp in milliseconds since the epoch)

I noticed that whenever a record contains a "day" value corresponding to the first day of a month,
the record gets written under the partition for the preceding month.

E.g., if "day" = 631148400000 [i.e. 1990-01-01],
the record gets written under ".../year=1989/month=12" instead of ".../year=1990/month=01".

Is this a known issue?

Cheers,

GG 

 

Ryan Blue

Feb 5, 2015, 12:39:24 PM
to Giovanni G, cdk...@cloudera.org
Hi Giovanni,

I think this is happening because the timestamp is converted to a year,
month, and day in UTC. Java's Date may display the value in your
current time zone, but the partition values are always computed in UTC
so that storage stays consistent.
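A quick way to check this is to decode the reported timestamp both in UTC and in a local offset. The sketch below uses only java.time and does not call Kite. The object name PartitionZoneCheck, the helper yearMonth, and the UTC+1 offset are assumptions made for illustration, since the thread does not say which time zone the reporter is in.

```scala
import java.time.{Instant, ZoneOffset, ZonedDateTime}

object PartitionZoneCheck {
  // Timestamp from the report above, in milliseconds since the epoch.
  val dayMillis: Long = 631148400000L

  // Assumption: the reporter's local zone is one hour ahead of UTC.
  val assumedLocalOffset: ZoneOffset = ZoneOffset.ofHours(1)

  // Decode a millisecond timestamp into (year, month) at the given offset.
  def yearMonth(millis: Long, offset: ZoneOffset): (Int, Int) = {
    val t: ZonedDateTime = Instant.ofEpochMilli(millis).atZone(offset)
    (t.getYear, t.getMonthValue)
  }

  def main(args: Array[String]): Unit = {
    // In UTC the instant is 1989-12-31T23:00, i.e. still December 1989.
    println(s"UTC:   ${yearMonth(dayMillis, ZoneOffset.UTC)}")
    // At UTC+1 the same instant is 1990-01-01T00:00, i.e. January 1990.
    println(s"local: ${yearMonth(dayMillis, assumedLocalOffset)}")
  }
}
```

If the UTC+1 assumption holds, the value is local midnight on 1990-01-01 but one hour before midnight in UTC, which would match the year=1989/month=12 partition.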

Does that explain what you're seeing?

rb


--
Ryan Blue
Software Engineer
Cloudera, Inc.