We want to load ClickHouse data into a Spark DataFrame using Spark SQL, via the ClickHouse JDBC driver.
=== Spark code START ===
val df = spark.read
  // Note: DataFrameReader.jdbc ignores any preceding .format(...) setting;
  // the driver class (ru.yandex.clickhouse.ClickHouseDriver) should be
  // supplied via the "driver" key in readerProperties instead.
  .jdbc(
    url = s"jdbc:clickhouse://${server}:8123/default",
    table = "demo",
    properties = readerProperties)
  .select("time", "metric", "domain", "longValue", "doubleValue", "stringValue",
    "tagStr", "`tags.key`", "`tags.value`")
  .where("time > cast('2018-05-06 00:01:56' as timestamp)") // If the where clause is omitted, it works.
=== Spark code END ===
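A possible workaround (a sketch, untested against a live cluster) is to render the timestamp literal ourselves and push the filter into a ClickHouse-side subquery passed as the `table` argument, so Spark's JDBC layer never compiles a Timestamp literal at all. `toDateTime(...)` is ClickHouse's DateTime conversion function; `demo_filtered` is just a hypothetical alias.

```scala
// Build the filter into the table expression itself, formatting the
// timestamp as a plain "yyyy-MM-dd HH:mm:ss" string that ClickHouse accepts.
val cutoff = "2018-05-06 00:01:56"
val filteredTable =
  s"(SELECT * FROM demo WHERE time > toDateTime('$cutoff')) AS demo_filtered"

// Then load it as before (spark, server, readerProperties as in the snippet above):
//   val df = spark.read.jdbc(
//     url = s"jdbc:clickhouse://$server:8123/default",
//     table = filteredTable,
//     properties = readerProperties)
```

Since the WHERE clause lives inside the subquery string, Spark has no predicate to compile, and the `.0` suffix never appears.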
Regarding the where clause, time is a DateTime column in the ClickHouse table.
Problem:
We find Spark generates the following SQL and sends it to ClickHouse server:
SELECT time,metric,domain,longValue,doubleValue,stringValue,tagStr,tags.key,tags.value FROM demo WHERE (time > '2018-05-06 00:01:56.0')
Note the additional ".0" at the end of the timestamp literal: it is 2018-05-06 00:01:56.0, not the expected 2018-05-06 00:01:56.
As a result, ClickHouse server responds with an error:
String is too long for DateTime: 2018-05-06 00:01:56.0, e.what() = DB::Exception
....
Root cause:
In Spark's org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD#compileValue, the timestamp literal for the WHERE clause is produced by calling Timestamp.toString() directly, rather than asking the ClickHouse JDBC driver to render the value in a format the server accepts.
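The behavior can be reproduced on a plain JVM without Spark at all, since it comes straight from java.sql.Timestamp.toString, which always emits at least one fractional-second digit:

```scala
// java.sql.Timestamp.toString formats as "yyyy-mm-dd hh:mm:ss.fffffffff",
// so a whole-second value still gets a trailing ".0".
import java.sql.Timestamp

val ts = Timestamp.valueOf("2018-05-06 00:01:56")
println(ts.toString) // prints "2018-05-06 00:01:56.0"
```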
How can we fix this formatting issue?
Alternatively, is it possible for ClickHouse to accept 2018-05-06 00:01:56.0 as a valid DateTime value?
ClickHouse version: 1.1.54385, Spark version: 2.2.0