I was experimenting with generated columns, but it's not working as expected. Apologies for the long description. Below is the code snippet I tried.
Here:
- the timestamp column is in epoch milliseconds
- the ptimestamp column is the partition column; it is also in epoch milliseconds, and its value is derived from the timestamp column by a formula
import io.delta.tables.DeltaTable
import org.apache.spark.sql.types.{LongType, StringType}

DeltaTable.create(spark)
  .tableName("part")
  .addColumn("timestamp", LongType)
  .addColumn("name", StringType)
  .addColumn(
    DeltaTable.columnBuilder(spark, "ptimestamp")
      .dataType(LongType)
      .generatedAlwaysAs("CAST((timestamp - timestamp % 3600000) AS LONG)")
      .build())
  .partitionedBy("ptimestamp")
  .execute()
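For context, the generated expression simply truncates the epoch-millisecond timestamp down to the start of its hour (3600000 ms per hour). A minimal plain-Scala sketch of the same formula (hourBucket is my own name for it, not part of any API):

```scala
// Truncate an epoch-millisecond timestamp to the start of its hour,
// mirroring the generatedAlwaysAs expression above.
def hourBucket(epochMillis: Long): Long =
  epochMillis - epochMillis % 3600000L

// hourBucket(1629165660000L) == 1629165600000L
```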
I was able to create the table, but I am facing two issues.
1. I am not able to insert data without filling in the generated column, i.e. the command below gives an error:
spark.sql("INSERT INTO part (timestamp, name) VALUES (1629010822000, 'ashok1')").show
Executing the above gives the error below. My understanding is that the generated column should be filled in automatically.
org.apache.spark.sql.AnalysisException: Cannot write to 'test.part', not enough data columns; target table has 3 column(s) but the inserted data has 2 column(s)
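As a workaround sketch (untested on my side; assumes the table and the active spark session from the snippet above), writing through the DataFrame API and omitting the generated column entirely should let Delta compute it on write, unlike the SQL INSERT with a partial column list:

```scala
import spark.implicits._

// Omit ptimestamp; Delta should compute it from the
// generatedAlwaysAs expression during the write.
Seq((1629010822000L, "ashok1"))
  .toDF("timestamp", "name")
  .write
  .format("delta")
  .mode("append")
  .saveAsTable("part")
```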
2. I added a few records (supplying a value for the generated column as well). When I query the table on the timestamp column, I would expect the partition column to appear in the plan, as with hidden partitioning. But I don't see the partition column included as a filter. My SQL query is:
explain select * from part where timestamp =1629165660000;
| == Physical Plan ==
*(1) Filter (isnotnull(timestamp#5283L) AND (timestamp#5283L = 1629165660000))
+- *(1) ColumnarToRow
+- FileScan parquet test.part[timestamp#5283L,name#5284,ptimestamp#5285L] Batched: true, DataFilters: [isnotnull(timestamp#5283L), (timestamp#5283L = 1629165660000)], Format: Parquet, Location: TahoeLogFileIndex[**/test.db/part], PartitionFilters: [], PushedFilters: [IsNotNull(timestamp), EqualTo(timestamp,1629165660000)], ReadSchema: struct<timestamp:bigint,name:string>
However, as expected, if I query with the partition column, the partition filter does appear in the plan:
explain select * from part where ptimestamp=1629165600000;
| == Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet test.part[timestamp#5293L,name#5294,ptimestamp#5295L] Batched: true, DataFilters: [], Format: Parquet, Location: TahoeLogFileIndex[****/test.db/part], PartitionFilters: [isnotnull(ptimestamp#5295L), (ptimestamp#5295L = 1629165600000)], PushedFilters: [], ReadSchema: struct<timestamp:bigint,name:string>
Also, I don't see a "generated column" clause as part of the CREATE TABLE syntax. Is there any plan to introduce it in the future?
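For reference, what I have in mind is something like the GENERATED ALWAYS AS clause described in the Databricks documentation; the sketch below is my reading of that documented form, not something I have found working in OSS Delta's CREATE TABLE:

```sql
CREATE TABLE part (
  timestamp BIGINT,
  name STRING,
  ptimestamp BIGINT GENERATED ALWAYS AS (CAST((timestamp - timestamp % 3600000) AS BIGINT))
) USING DELTA
PARTITIONED BY (ptimestamp);
```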