External tables are familiar to most big data practitioners; they are used in Apache Hive, Google BigQuery and elsewhere.
"The Unity Catalog UI makes it easy to discover, audit and govern data assets in one place. Data lineage, role-based security policies, table or column level tags, and central auditing capabilities make it easy for data stewards to confidently manage and secure data access to meet compliance and privacy needs, directly on the lakehouse."
So it is about fine-grained access control, plus other methods (see below) that most data warehouses already offer. For example, Google BigQuery offers this at the relational level.
-- New table
CREATE TABLE iot_events;

or

-- Existing data
CREATE EXTERNAL TABLE iot_events
LOCATION 'gs://etcbucket/iot_events'
WITH CREDENTIAL iot_iam_role;

GRANT SELECT ON iot_events TO developers;              -- all columns
GRANT SELECT(date, country) ON iot_events TO analysts; -- restricted columns
So the major difference is that for the EXTERNAL TABLE iot_events, the data already exists on gs://etcbucket. All that is happening here is that metadata for the external table is created in the catalog, pointing to the table data in the bucket.
This obviously provides fine-grained access control within Delta, while still allowing other systems to share that external data.
When the data is stored in a bucket, say gs://etcbucket, one can query it directly even though it is not stored in Delta itself. For example, the same data can be shared with Google BigQuery. An external process loads the data (whatever that may be, say the Google Storage Transfer Service from on-premises). In short, instead of loading or streaming the data directly, you create a table in Delta that references the external data source. If I am correct, dropping that external table in Delta removes the table's reference from the catalog, but crucially the underlying data is left intact (not deleted).
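As a minimal sketch of that drop behaviour (assuming Databricks SQL syntax and the iot_events external table from the example above):

```sql
-- Drop the external table registered in the catalog
DROP TABLE iot_events;
-- The catalog entry is gone, but the files under
-- gs://etcbucket/iot_events are left untouched; they can still be
-- read by other engines or re-registered as an external table later.
```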
Worth mentioning that if the external data is also used by another artefact like BigQuery, the effective permissions may differ, since grants given in Delta do not carry over to the other system.
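For instance, BigQuery enforces its own grants on a dataset over the same bucket, independently of anything granted in Delta. A hedged sketch using BigQuery's DCL syntax (mydataset and the group name are hypothetical):

```sql
-- BigQuery-side grant; unrelated to any GRANT issued in Delta/Unity Catalog
GRANT `roles/bigquery.dataViewer`
ON SCHEMA mydataset
TO "group:analysts@example.com";
```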
Use cases for external data sources include:
- Loading and cleaning your data in one pass by querying an external data source (a location outside Delta) and writing the cleaned result into Delta storage.
- Keeping a small amount of frequently changing data that you join with other tables. As an external data source, the frequently changing data does not need to be reloaded every time it is updated. For example, this is very common in batch and streaming ETL pipelines.
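The first use case above (load and clean in one pass) can be sketched as follows. This is only an illustration: raw_iot_events, the column names and the cleaning steps are all assumptions, with CTAS syntax as in Databricks SQL:

```sql
-- Query the external data and write the cleaned result into Delta
-- storage in a single pass. raw_iot_events is assumed to be an
-- external table over gs://etcbucket/iot_events.
CREATE TABLE clean_iot_events AS
SELECT date,
       country,
       lower(trim(device_id)) AS device_id   -- example cleaning step
FROM   raw_iot_events
WHERE  device_id IS NOT NULL;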
So I guess all Unity Catalog is doing here is adding additional access control layers (ACLs).
HTH
Mich
view my LinkedIn profile
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction
of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction.