Dataproc Apache Iceberg: iceberg is not a valid Spark SQL Data Source

Punyawee Posri

May 16, 2023, 1:54:55 AM
to Google Cloud Dataproc Discussions
Hi Google Cloud Dataproc Team!

I'm trying to use Apache Iceberg in my NestJS project to speed up database queries for some of our features. To test Iceberg's query speed, I've been running Spark over SSH on a Google Dataproc cluster, but I've run into an issue while using Iceberg with Spark there.

What I've done so far:
  1. Created a Dataproc cluster following Google's official documentation: https://cloud.google.com/dataproc-metastore/docs/apache-iceberg
  2. Connected to the VM instance over SSH.
  3. Configured spark-shell following Google's official documentation, and spark-sql following the official Iceberg documentation: https://iceberg.apache.org/docs/latest/getting-started/
  4. Found that the configuration that worked for me was
    spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.0.0
  5. With the package configuration from step 4, table creation completed without errors.
    Screenshot 2566-05-16 at 12.21.36.png
  6. But when I ran SELECT * FROM {my table created with 'USING iceberg'}, I got this error:
    java.util.concurrent.ExecutionException: org.apache.spark.sql.AnalysisException: iceberg is not a valid Spark SQL Data Source
    Screenshot 2566-05-16 at 12.23.36.png
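
For comparison, the spark-sql example on the Iceberg getting-started page passes several --conf catalog settings together with --packages, while the command in step 4 passes only --packages. The sketch below assembles such a command line in Python so the settings are easy to read and adjust. The warehouse path is a hypothetical placeholder, and whether these settings resolve the error on this particular cluster is an assumption, not something confirmed here.

```python
import shlex

# Package coordinate copied from step 4 above.
ICEBERG_PACKAGE = "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.0.0"


def build_spark_sql_command(warehouse_path):
    """Return a spark-sql argv list with Iceberg catalog settings added."""
    conf = {
        # Enables Iceberg's SQL extensions.
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Wraps Spark's built-in session catalog with Iceberg support.
        "spark.sql.catalog.spark_catalog":
            "org.apache.iceberg.spark.SparkSessionCatalog",
        "spark.sql.catalog.spark_catalog.type": "hive",
        # A separate catalog named "local"; tables in it are addressed
        # as local.<database>.<table>.
        "spark.sql.catalog.local": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.local.type": "hadoop",
        "spark.sql.catalog.local.warehouse": warehouse_path,
    }
    argv = ["spark-sql", "--packages", ICEBERG_PACKAGE]
    for key, value in conf.items():
        argv += ["--conf", f"{key}={value}"]
    return argv


# Hypothetical warehouse location; replace with a real bucket or path.
command = build_spark_sql_command("gs://my-bucket/iceberg-warehouse")
print(shlex.join(command))
```

The printed line can be pasted into the SSH session. A name such as local.default.{myCreatedTable} can only resolve when a catalog called local has been configured at launch, as in the settings above.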

*** My questions are...
  1. What have I missed here?
  2. Is using Apache Iceberg as a new database in place of PostgreSQL the right approach for my NestJS project?
*** PS:
  • I rechecked with SELECT * FROM {another table created without 'USING iceberg'}, and everything works fine.
    Screenshot 2566-05-16 at 12.24.49.png
  • For this error (java.util.concurrent.ExecutionException: org.apache.spark.sql.AnalysisException: iceberg is not a valid Spark SQL Data Source), I tried using a new catalog name as discussed in https://github.com/apache/iceberg/issues/1756, but none of the answers in that thread worked for me.
    For example, I tried:
    - SELECT * FROM local.default.{myCreatedTable}
    - SELECT * FROM default.{myCreatedTable}
    - SELECT * FROM {myCreatedTable}
Thanks