Updating metadata properties such as Location


Vikram Roopchand

Dec 19, 2022, 7:14:58 AM
to projectnessie
Hi There,

In the PoC setup we have at work, we were using an IP address to access HDFS, and it was consequently configured the same way in core-site.xml:

<property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.1.28:9000</value>
</property>
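(One common mitigation, independent of Nessie, is to avoid raw IPs in fs.defaultFS and reference the NameNode by a stable DNS hostname, or, with HDFS high availability, by a logical nameservice ID. A sketch with hypothetical names — `namenode.example.internal` and `mycluster` are placeholders, not values from this setup:

```xml
<!-- Option 1: stable hostname instead of a raw IP -->
<property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode.example.internal:9000</value>
</property>

<!-- Option 2 (HDFS HA): logical nameservice ID, resolved via dfs.nameservices -->
<property>
        <name>fs.defaultFS</name>
        <value>hdfs://mycluster</value>
</property>
```

Paths recorded in table metadata then embed the hostname or nameservice ID rather than the IP, so an IP change becomes a DNS or HA-config update.)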

Parquet files were accessed with the complete path using the IP address, for example hdfs://192.168.1.28:9000/blah/blah.parquet. This is also what went into the Nessie catalog as the location; we could verify it via the web UI.

As luck would have it, the IP address changed and Nessie could no longer access those files, failing with a "no route to host" exception from the HDFS libraries. We had to delete the catalog (persisted on Postgres) and re-run the migration to get those test tables back.

Could the hostname/IP or base path be provided externally, with Nessie storing only paths relative to it? What happened here could just as easily happen in production. Storing the exact/complete URL also doesn't seem like great practice; it binds you too tightly to the specifics of the environment.
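(The relative-path idea can be sketched independently of Nessie/Iceberg: the catalog stores only the path relative to a warehouse root, and readers resolve it against a base URI supplied by configuration. A minimal illustration — the `resolve` helper and both base URIs are hypothetical, not any Nessie API:

```python
def resolve(base_uri: str, relative_path: str) -> str:
    """Resolve a stored relative path against an externally configured base URI."""
    return base_uri.rstrip("/") + "/" + relative_path.lstrip("/")

# The catalog would persist only the relative part...
stored = "blah/blah.parquet"

# ...so re-pointing the warehouse after an IP/hostname change is a config edit,
# not a metadata rewrite:
old = resolve("hdfs://192.168.1.28:9000", stored)
new = resolve("hdfs://192.168.1.77:9000", stored)
print(old)  # hdfs://192.168.1.28:9000/blah/blah.parquet
print(new)  # hdfs://192.168.1.77:9000/blah/blah.parquet
```

The point is that only one externally held value changes when the host moves.)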

best regards,
Vikram

Ajantha Bhat

Dec 19, 2022, 7:37:12 AM
to Vikram Roopchand, projectnessie
Hi Vikram, 

What was the warehouse location configured for Nessie?
Was just updating that configuration to the new IP address not enough?


> We had to delete the catalog (persisted on Postgres) and re-run the migration to get those test tables back.

Maybe I am not understanding this. Do you mean Nessie's back-end store (Postgres) was purged?
And how did you achieve the migration? Can you please share more details?

Thanks, 
Ajantha



Ajantha Bhat

Dec 19, 2022, 8:13:35 AM
to Vikram Roopchand, projectnessie
Thinking more about this. 
 
> Parquet files were accessed with the complete path using the IP address, for example hdfs://192.168.1.28:9000/blah/blah.parquet. This is also what went into the Nessie catalog as the location; we could verify it via the web UI.

I think this is Iceberg's design, not specific to Nessie; the behaviour should be the same for all catalogs.
Iceberg stores absolute paths in its metadata files.
So Iceberg tables cannot be read if the directory is moved or the path changes (as in this IP-address case).

There was a proposal to store relative paths in Iceberg metadata (https://github.com/apache/iceberg/issues/1617), but that discussion went stale more than a year ago. Maybe we can reopen it in the Iceberg community.
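(For context, absolute paths are baked in at several levels of an Iceberg table-metadata file. A heavily trimmed, illustrative metadata.json fragment — the table path and snapshot ID are hypothetical, and required fields like the schema and partition spec are omitted:

```json
{
  "format-version": 1,
  "location": "hdfs://192.168.1.28:9000/warehouse/db/tbl",
  "snapshots": [
    {
      "snapshot-id": 1,
      "manifest-list": "hdfs://192.168.1.28:9000/warehouse/db/tbl/metadata/snap-1.avro"
    }
  ]
}
```

The manifest list and manifests in turn record absolute data-file paths, which is why a host change invalidates the whole chain rather than a single setting.)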

Thanks, 
Ajantha


Vikram Roopchand

Dec 19, 2022, 8:19:55 AM
to Ajantha Bhat, projectnessie
Hi Ajantha,

Thanks for replying.
 
> What was the warehouse location configured for Nessie?

The warehouse was on HDFS, configured as HDFS_PATH/DATALAKE_WAREHOUSE_LOCATION ... for example hdfs://192.168.1.28:9000/warehouse
 
> Was just updating that configuration to the new IP address not enough?

Unfortunately, not.
 

>> We had to delete the catalog (persisted on Postgres) and re-run the migration to get those test tables back.

> Maybe I am not understanding this. Do you mean Nessie's back-end store (Postgres) was purged?

Yes.
 
> And how did you achieve the migration? Can you please share more details?

The migration reads the Parquet files and recreates them as Iceberg tables, much like "ds.writeTo(DICatalogUtil.getTableName2(null, tableName)).createOrReplace();"
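(In Spark SQL, the same step could be sketched roughly as follows — the catalog name, table name, and new host are all hypothetical placeholders:

```sql
-- Re-read the raw Parquet files from their (new) location and
-- recreate the Iceberg table through the Nessie catalog
CREATE OR REPLACE TABLE nessie.db.my_table USING iceberg AS
SELECT * FROM parquet.`hdfs://new-host:9000/blah/blah.parquet`;
```

This regenerates all Iceberg metadata, and with it the absolute paths, under the current fs.defaultFS.)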

best regards,
Vikram

Vikram Roopchand

Dec 19, 2022, 8:24:31 AM
to Ajantha Bhat, projectnessie
Thank you, I will check this and get back to you. 

best regards,
Vikram

Vikram Roopchand

Dec 20, 2022, 1:09:14 AM
to Ajantha Bhat, projectnessie
Hi,

Iceberg has the concept of LocationProviders (core/src/main/java/org/apache/iceberg/LocationProviders.java). I will have to dig deeper; maybe this helps?

best regards,
Vikram

Ajantha Bhat

Dec 20, 2022, 2:15:39 AM
to Vikram Roopchand, projectnessie
> Iceberg has the concept of LocationProviders (core/src/main/java/org/apache/iceberg/LocationProviders.java). I will have to dig deeper; maybe this helps?

I am aware of LocationProviders. They mainly serve three table properties (write.data.path, write.metadata.path, write.object-storage.enabled).
They provide the location for subsequent write operations. They don't help readers if the files are moved or the path changes (our scenario).
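(For reference, those properties are set per table, e.g. in Spark SQL — the table name and paths below are hypothetical, and the property names are as in recent Iceberg releases. Note this only redirects future writes; existing metadata still points at the old paths:

```sql
ALTER TABLE nessie.db.my_table SET TBLPROPERTIES (
  'write.data.path'     = 'hdfs://namenode:9000/warehouse/db/my_table/data',
  'write.metadata.path' = 'hdfs://namenode:9000/warehouse/db/my_table/metadata'
);
```

)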

So we do need a relative-path feature, as I mentioned in the previous email. Otherwise we need to regenerate the Iceberg metadata (as you did with your migration tool).

Thanks, 
Ajantha 
