Flink on GKE with Nessie and Iceberg

canopenerda

Sep 16, 2022, 2:28:38 AM
to projectnessie
Hi folks,

I'm experimenting with a data lake solution with Flink on GKE, streaming data from Kafka to Iceberg tables with the Nessie catalog. I'm facing a challenge: the relatively frequent commit interval in Flink results in a lot of small files (both data and manifest files). I found a page about management services at https://projectnessie.org/features/management/, but many of the features it mentions are in progress and not available in a released version. I would like to check on the progress and am more than happy to contribute.

Vikram Roopchand

Sep 16, 2022, 2:37:12 AM
to canopenerda, projectnessie
Hi There,


best regards,
Vikram


Ajantha Bhat

Sep 16, 2022, 2:45:35 AM
to Vikram Roopchand, canopenerda, projectnessie
Hi,
a) https://projectnessie.org/features/management/
That page describes older implementations in Nessie that no longer exist. We will update it once the new implementation is merged (very soon).

b) "I'm facing a challenge that relatively frequent commit interval in Flink results in a lot of small files (both data and manifest files)"
You can use Iceberg's Flink action for compaction. It compacts the small files and creates new manifests:
https://iceberg.apache.org/docs/latest/flink/#rewrite-files-action
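
To illustrate what such a compaction does conceptually, here is a minimal Python sketch that packs small files into groups of at most a target size, where each group would be rewritten as one larger file. It is a toy model only, not Iceberg's implementation: the function name `plan_compaction`, the greedy strategy, and the sizes are assumptions made for illustration.

```python
from typing import List


def plan_compaction(file_sizes: List[int], target_size: int) -> List[List[int]]:
    """Greedily pack file sizes (in input order) into groups of at most target_size.

    Groups that end up with a single file are dropped, because rewriting
    one file on its own would not reduce the number of files.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_total = 0
    for size in file_sizes:
        # Close the current group when the next file would overflow it.
        if current and current_total + size > target_size:
            groups.append(current)
            current, current_total = [], 0
        current.append(size)
        current_total += size
    if current:
        groups.append(current)
    return [group for group in groups if len(group) > 1]


# Hypothetical input: six 5 MB files and a 16 MB target size.
small_files = [5, 5, 5, 5, 5, 5]
groups = plan_compaction(small_files, 16)
print(groups)  # six small files are planned as two rewrite groups
```

In this model, a set of files in which no two fit together under the target size produces no groups at all, so nothing would be rewritten.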

Note that old files are normally cleaned up with Iceberg's expire_snapshots and remove_orphan_files procedures.
However, these are not branch/tag aware, so you must not use them with Nessie.

We are in the process of reviewing and splitting the PR for a "Nessie-based GC CLI tool" (to delete unreferenced files) so that it can be merged.
Once it is merged, you can use this tool to safely remove old/expired files from storage.
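
To show why branch/tag awareness matters for cleanup, here is a minimal Python sketch. It assumes a simplified model in which each branch or tag maps to the set of files it can reach; the reference names, file names, and helper functions are hypothetical, and this is not how the Nessie GC tool is implemented.

```python
from typing import Dict, Set


def live_files(refs: Dict[str, Set[str]]) -> Set[str]:
    """Union of the files reachable from every branch and tag."""
    live: Set[str] = set()
    for files in refs.values():
        live |= files
    return live


def unreferenced_files(all_files: Set[str], refs: Dict[str, Set[str]]) -> Set[str]:
    """Files in storage that no reference can reach."""
    return all_files - live_files(refs)


# Hypothetical state: "main" was compacted, but a tag still points at the old files.
refs = {
    "main": {"compacted-1.parquet"},
    "release-tag": {"small-1.parquet", "small-2.parquet"},
}
storage = {"compacted-1.parquet", "small-1.parquet", "small-2.parquet", "stale.parquet"}

# Looking at "main" alone would treat the tag's files as garbage.
single_branch_view = storage - refs["main"]

# Considering every reference keeps the files the tag still needs.
safe_to_delete = unreferenced_files(storage, refs)
print(sorted(single_branch_view))
print(sorted(safe_to_delete))
```

A cleanup that only looks at one branch would delete `small-1.parquet` and `small-2.parquet` even though the tag still needs them, which is the risk described above.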

c) Your contributions are welcome! 

Thanks,
Ajantha



Malcolm Qian

Sep 16, 2022, 2:47:14 AM
to Vikram Roopchand, projectnessie
Thanks Vikram for the timely response.

I tried the Flink action to compact the files. The logic is as in the code snippet below. If I comment out the filter, the action fails with an OOM error; with the filter on, no compaction is triggered.
[Attachment: image.png (screenshot of the code snippet)]
Regards,
canopenerda

Ajantha Bhat

Sep 16, 2022, 2:54:06 AM
to Malcolm Qian, Vikram Roopchand, projectnessie
> I tried the Flink action to compact the files. The logic is like in the code snippet below. If I comment out the filter, the action will fail with OOM and with the filter on, no compaction is triggered.

This issue is independent of Nessie, since compaction works on the table on a single branch.
I believe the OOM would happen with other catalogs as well.
We should discuss this with the Iceberg community: maybe the resources really are not enough for compaction, or we can check how others configure compaction actions with Flink.

Thanks,
Ajantha

Vikram Roopchand

Sep 16, 2022, 2:59:04 AM
to Ajantha Bhat, canopenerda, projectnessie
Dear Ajantha,

Hope you are doing well.

Apart from the CLI, could you also provide a public API for the same?

Thanks,
Best regards,
Vikram

Malcolm Qian

Sep 16, 2022, 3:00:45 AM
to Ajantha Bhat, Vikram Roopchand, projectnessie
Thanks, Ajantha, for the really exciting news!
Is the new implementation available in the public repository?
The document mentions that GC only supports the DynamoDB backend. In my company, most of the infrastructure is in Google Cloud, so my concern is whether there is a supported alternative in GCP.

Regards,
canopenerda

Ajantha Bhat

Sep 16, 2022, 3:25:30 AM
to Malcolm Qian, Vikram Roopchand, projectnessie
> Is the new implementation available in the public repository?
Yes, of course, it is open source.
This is the WIP PR that I mentioned above:
https://github.com/projectnessie/nessie/pull/4991

> In the document, it mentioned GC only supports dynamo backend. In my company, most of the infras are in Google Cloud, so my concern is if there is a supported alternative in GCP.
The new implementation is based on a Postgres DB. We are also discussing in the PR whether to use the existing Nessie backend DB itself to store GC-related information.

> Apart from the CLI, could you also provide a public API for the same ?
Yes, public APIs will be available. The CLI is built on top of those APIs.
