Managed vs Unmanaged (External) Table for very large data

Gaurav Aggarwal

Feb 21, 2024, 12:29:14 PM
to Delta Lake Users and Developers
Hi everyone,

I am a newbie and have a few questions. I need to write a Spark DataFrame (using Azure HDInsight, a Spark solution) to ADLS Gen2 in Delta table format. The write volume will be in the low double-digit terabytes per day, and the table will be partitioned by date. I am trying to decide whether to use external or managed tables. My use case is that a customer will mount this ADLS Gen2 storage in their own solution and only read the data (they will not write or modify it). Most answers I found on the internet are based on Databricks solutions. Can someone please help me with these questions?

1. Will there be any difference in storage costs on ADLS Gen2? I am guessing no, but I am not sure whether an external table takes more space because of metadata. I understand that for an external table the delete command only deletes the metadata, but I am not worried about the delete command for now.
2. For performance, people mention that external tables are faster if you load data with non-Databricks solutions, and that managed tables are faster in the Databricks scenario. My understanding is that performance should not differ. Which is correct?
3. ACID: My understanding is that ACID guarantees are available in both. Some websites say that ACID is not available for external tables. Is that correct?
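
To make the comparison concrete, here is a minimal PySpark sketch of the two options as I understand them. The container, account, table, and column names are placeholders I made up, and it assumes the Delta Lake package is already configured on the cluster. Please correct me if this does not match how external and managed tables actually work.

```python
def abfss_path(container, account, directory):
    """Build an ADLS Gen2 URI: abfss://<container>@<account>.dfs.core.windows.net/<dir>."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{directory.strip('/')}"


def write_external(df, table_name, path, partition_col="date"):
    # External (unmanaged): the data files and the _delta_log directory live at
    # the explicit path; dropping the table should remove only the metastore entry.
    (df.write.format("delta")
        .mode("append")
        .partitionBy(partition_col)
        .option("path", path)
        .saveAsTable(table_name))


def write_managed(df, table_name, partition_col="date"):
    # Managed: no path is given, so the files go under the metastore's
    # warehouse directory; dropping the table should remove the data as well.
    (df.write.format("delta")
        .mode("append")
        .partitionBy(partition_col)
        .saveAsTable(table_name))
```

For example, `write_external(df, "events", abfss_path("data", "myaccount", "delta/events"))` would be the external variant. My assumption is that the customer could then read it by path with `spark.read.format("delta").load(...)` without needing access to my metastore.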

Thanks,
Gaurav
