Spark SQL generates an invalid timestamp string (2018-05-06 00:01:56.0 vs. 2018-05-06 00:01:56) in the WHERE clause of a ClickHouse SQL query


Yj H

Jun 7, 2018, 7:24:46 AM
to ClickHouse
Background:
We want to load ClickHouse data into a Spark DataFrame with Spark SQL, via the ClickHouse JDBC driver.

=== Spark code START ===
val df = (spark.read.format("ru.yandex.clickhouse.ClickHouseDriver")
      .jdbc(url=s"jdbc:clickhouse://${server}:${8123}/default",
          table="demo",
          properties=readerProperties))
      .select("time", "metric", "domain", "longValue", "doubleValue", "stringValue", "tagStr", "`tags.key`", "`tags.value`")
      .where("time > cast('2018-05-06 00:01:56' as timestamp)")   // If where is omitted, it works.

=== Spark code END ===

Regarding the WHERE clause: time is a DateTime column in the ClickHouse table.

Problem:
We found that Spark generates the following SQL and sends it to the ClickHouse server:

SELECT time,metric,domain,longValue,doubleValue,stringValue,tagStr,tags.key,tags.value FROM demo WHERE (time > '2018-05-06 00:01:56.0')

Note the additional ".0" at the end of the timestamp literal: it is 2018-05-06 00:01:56.0, not the expected 2018-05-06 00:01:56.

As a result, the ClickHouse server responds with an error:
String is too long for DateTime: 2018-05-06 00:01:56.0, e.what() = DB::Exception
....

Root cause:
In Spark's org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD#compileValue, the timestamp literal for the WHERE clause is produced by simply calling Timestamp.toString(); the ClickHouse JDBC driver is never asked to format the value correctly for ClickHouse.
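The behavior is easy to reproduce outside Spark. A minimal sketch (independent of Spark and ClickHouse) showing that java.sql.Timestamp.toString() always emits at least one fractional-second digit, while an explicit format produces the plain literal:

```java
import java.sql.Timestamp;
import java.text.SimpleDateFormat;

public class TimestampLiteral {
    public static void main(String[] args) {
        Timestamp ts = Timestamp.valueOf("2018-05-06 00:01:56");

        // Timestamp.toString() renders yyyy-mm-dd hh:mm:ss.fffffffff and
        // always keeps at least one fractional digit -- this is the string
        // Spark embeds into the WHERE clause:
        System.out.println(ts.toString());   // 2018-05-06 00:01:56.0

        // Formatting explicitly drops the fractional part, matching what a
        // ClickHouse 1.x DateTime column accepts as text:
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        System.out.println(fmt.format(ts));  // 2018-05-06 00:01:56
    }
}
```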

How can we fix this formatting issue?
Is it possible for ClickHouse to accept 2018-05-06 00:01:56.0 as a valid DateTime value?

ClickHouse version: 1.1.54385, Spark version: 2.2.0






Dmitry Berezhnov

Jun 8, 2018, 2:46:33 PM
to ClickHouse
Hi,
Can you show the table structure?
And what happens if you change 

.where("time > cast('2018-05-06 00:01:56' as timestamp)") 
to
.where("time > '2018-05-06 00:01:56'") 
?

Yj H

Jun 9, 2018, 12:15:15 AM
to ClickHouse
If we change it to where("time > '2018-05-06 00:01:56'"),
Spark generates the ClickHouse SQL without this condition in the WHERE clause. The time filter is then applied only inside Spark, not by ClickHouse.

This is because Spark turns the time column into a String column via an implicit CAST. As a result, it cannot push that CAST down as ClickHouse SQL.
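One possible workaround, not proposed in this thread and offered only as a sketch: write the time filter yourself in ClickHouse syntax and pass it inside a subquery used as the JDBC table argument, so Spark never has to render the timestamp literal at all. The helper name filteredTable and the "filtered" alias below are made up for illustration:

```java
// Hypothetical workaround: embed the filter in the JDBC "table" argument as
// a ClickHouse subquery, e.g.
//   spark.read().jdbc(url, ClickHousePushdown.filteredTable("2018-05-06 00:01:56"), props)
// The helper name and alias are illustrative, not from the thread.
public class ClickHousePushdown {
    static String filteredTable(String cutoff) {
        // toDateTime() parses the plain yyyy-MM-dd HH:mm:ss literal on the
        // ClickHouse side, so Spark never generates a ".0" suffix.
        return "(SELECT * FROM demo WHERE time > toDateTime('"
                + cutoff + "')) AS filtered";
    }
}
```

This assumes the Spark JDBC source accepts a parenthesized subquery as its table name (it treats the string as a FROM-clause item), which shifts the filtering entirely to the ClickHouse server.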