Problem with Spark connector: timestamps are assumed to be in the local time zone


Andy Davidson

Mar 18, 2016, 2:27:49 PM
to DataStax Spark Connector for Apache Cassandra
I am using pyspark 1.6.0 and datastax:spark-cassandra-connector:1.6.0-M1-s_2.10 to analyze time series data.

The data is originally captured by a Spark Streaming app and written to Cassandra. The value of the timestamp comes from Spark:

rdd.foreachRDD(new VoidFunction2<JavaRDD<String>, Time>() {
    // ... write to Cassandra ...
});

I am confident the timestamp is stored correctly in Cassandra and that the clocks on the machines in my cluster are set correctly.

I noticed that if I used Cassandra cqlsh to select a data set between two points in time, the row count did not match the row count I got when I ran the same select in Spark SQL. It appears that Spark SQL assumes all timestamp strings are in the local time zone.
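A quick way to confirm which zone the driver JVM uses when it parses bare timestamp literals is to ask it through the Py4J gateway (a minimal sketch, assuming a running SparkContext named sc, e.g. in the pyspark shell):

# Spark SQL parses bare timestamp literals with java.sql.Timestamp, which uses
# the driver JVM's default time zone; print that zone via the Py4J gateway.
tz = sc._jvm.java.util.TimeZone.getDefault().getID()
print(tz)   # e.g. 'America/Los_Angeles' on a Mac in the Pacific zone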


Here is what I expect (this is what cqlsh returns):
cqlsh> select
   ...     count(row_key) as num_samples, sum(count) as total, max(count) as max
   ... from
   ...     notification.json_timeseries
   ... where
   ...     row_key in ('red', 'blue')
   ...     and created > '2016-03-12 00:30:00+0000'
   ...     and created <= '2016-03-12 04:30:00+0000'
   ... allow filtering;

 num_samples | total | max
-------------+-------+-----
        3242 | 11277 |  17


Here is my pyspark select statement. Notice that the 'created' filter values encode the time zone. I am running this on my local Mac (in the PST time zone) and connecting to my data center (which runs on UTC) over a VPN.

rawDF = sqlContext.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="json_timeseries", keyspace="notification")\
    .load()

rawDF.registerTempTable("rawTable")

stmnt = "select \
        row_key, created, count, unix_timestamp(created) as unixTimeStamp, \
        unix_timestamp(created, 'yyyy-MM-dd HH:mm:ss.z') as hack, \
        to_utc_timestamp(created, 'gmt') as gmt \
    from \
        rawTable \
    where \
        (created > '{0}') and (created <= '{1}') \
        and \
        (row_key = 'red' or row_key = 'blue')".format('2016-03-12 00:30:00+0000', '2016-03-12 04:30:00+0000')

rawDF = sqlContext.sql(stmnt).cache()




I get different values for row count, max, etc.

If I convert the UTC timestamp strings to my local time zone, the row count matches the count returned by cqlsh:

# PST timezone works, matches cassandra cqlsh
# .format('2016-03-11 16:30:00+0000', '2016-03-11 20:30:00+0000')
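For reference, that manual shift can be reproduced in plain Python; a minimal sketch, where utc_to_local_str is a hypothetical helper and the assumption is that Spark parses the bare literal in the driver's local zone:

from datetime import datetime
import calendar, time

def utc_to_local_str(utc_str, fmt="%Y-%m-%d %H:%M:%S"):
    """Hypothetical helper: re-render a UTC timestamp string in the local
    time zone of this process, which is how a bare literal in the WHERE
    clause ends up being interpreted."""
    utc_dt = datetime.strptime(utc_str, fmt)
    epoch = calendar.timegm(utc_dt.timetuple())        # treat the string as UTC
    return time.strftime(fmt, time.localtime(epoch))   # re-render in the local zone

# On a machine in PST (-0800):
#   utc_to_local_str('2016-03-12 00:30:00')  ->  '2016-03-11 16:30:00'
#   utc_to_local_str('2016-03-12 04:30:00')  ->  '2016-03-11 20:30:00'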

Am I doing something wrong in my pyspark code?


Kind regards

Andy




Russell Spitzer

Mar 18, 2016, 2:35:16 PM
to DataStax Spark Connector for Apache Cassandra
Unfortunately this is part of Spark SQL. They have based their timestamp type on java.sql.Timestamp (and their date type on java.sql.Date), which adjust to the client time zone when displaying and storing.
See this discussion:
http://stackoverflow.com/questions/9202857/timezones-in-sql-date-vs-java-sql-date
and the code:
https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
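The effect is easy to see outside of Spark: the same bare literal maps to two different instants depending on the zone it is interpreted in. A small illustrative sketch in plain Python (standard library only):

from datetime import datetime
import calendar, time

literal = "2016-03-12 00:30:00"
tup = datetime.strptime(literal, "%Y-%m-%d %H:%M:%S").timetuple()

# Interpreted as UTC (how Cassandra stores it and how cqlsh compared it):
print(calendar.timegm(tup))     # 1457742600

# Interpreted in the local zone of the process (what java.sql.Timestamp does):
print(int(time.mktime(tup)))    # 1457771400 on a PST (-0800) machine, 8 hours later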



Andy Davidson

Mar 18, 2016, 3:09:38 PM
to spark-conn...@lists.datastax.com
Hi Russell

Nice analysis. Seems like a bug in Spark. This must have been reported before. Should I file a bug with Spark?

Andy

Russell Spitzer

Mar 18, 2016, 5:34:16 PM
to spark-conn...@lists.datastax.com
It's actually a JDBC-standard sort of thing; Spark is just doing what a ton of other JDBC clients do :( But feel free to bring it up with them.
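A common workaround, assuming you control how the Spark JVMs are launched, is to run both the driver and the executors in UTC so that java.sql.Timestamp interprets bare literals the same way Cassandra stores them. A minimal pyspark sketch (names here are illustrative):

from pyspark import SparkConf, SparkContext

# Executor JVMs can be pointed at UTC from the SparkConf. The driver JVM has
# already started by the time this code runs, so set its zone at launch, e.g.
#   spark-submit --driver-java-options "-Duser.timezone=UTC" my_app.py
conf = (SparkConf()
        .setAppName("timeseries")
        .set("spark.executor.extraJavaOptions", "-Duser.timezone=UTC"))
sc = SparkContext(conf=conf)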