What are external tables?

47 views
Skip to first unread message

Jacek Laskowski

unread,
May 30, 2021, 9:38:40 AM5/30/21
to Delta Lake Users and Developers
Hi,

I found this concept of external tables in the article about Databricks Unity Catalog [1] and wondered its purpose.

It turns out that Delta supports the following query but no idea what this EXTERNAL qualifier means.

CREATE EXTERNAL TABLE t1
USING delta
AS SELECT * FROM VALUES(1,2,3)

How to know whether it's an external table or not? What happens with and without it? Why is this important? Can you help me with this? Thank you.

Mich Talebzadeh

unread,
May 30, 2021, 4:43:50 PM5/30/21
to Jacek Laskowski, Delta Lake Users and Developers
External tables are pretty familiar to most big Data practitioners, used in Apache Hive, Google BigQuery etc.

First from this link

"The Unity Catalog UI makes it easy to discover, audit and govern data assets in one place. Data lineage, role-based security policies, table or column level tags, and central auditing capabilities make it easy for data stewards to confidently manage and secure data access to meet compliance and privacy needs, directly on the lakehouse."

So it is about fine grain access method plus others (see below) that most Data warehouses offer. For example Google BigQuery offers this at the relational level.

From Databrick's link above

New Table

CREATE TABLE iot_events

or
-- Existing data

CREATE EXTERNAL TABLE iot_events LOCATION  gs://etcbucket/iot_events
WITH CREDENTIAL iot_iam_role

GRANT SELECT ON iot_events to developers -- all columns
GRANT SELECT(date, country) on iot_events  to analysts -- restricted columns

So the major difference is that data already exists on gs//etcbucket for the EXISTING TABLE iot_events. All is happening here that the metadata is created for that external table in the catalog to points to that table data in the bucket

This obviously creates fine grain access within delta but allows others to share  that external data.

When data is stored on the bucket say gs://etcbucket, one can query that data directly even though the data is not stored in delta itself. For example that data can be shared with Google BigQuery. An external process loads that data (whatever that can be, say Google Storage Transfer Service from on-premise), So in short Instead of loading or streaming the data directly, you create a table in delta that references the external data source. If I am correct if you drop that external table in delta, the reference to table in the catalog will be lost but crucially data will be left intact (not dropped)

Worth mentioning that if that external data is used with another artefact like BigQuery, then the permission may be different with regard to additional grants given by delta

Use cases for external data sources include:

  • Loading and cleaning your data in one pass by querying the data from an external data source (a location external to delta ) and writing the cleaned result into delta storage.
  • Having a small amount of frequently changing data that you join with other tables. As an external data source, the frequently changing data does not need to be reloaded every time it is updated. For example this is very common for batch and streaming ETL pipelines
So I guess all this Unity Catalog is doing is adding additional access control layers (ACL).


HTH

Mich


   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



--
You received this message because you are subscribed to the Google Groups "Delta Lake Users and Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to delta-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/delta-users/CAB_G8ZtE%2B8kdqVr0saEKuVtdwfRtFN6MbdT8q-xNTySw7NMEpA%40mail.gmail.com.

Mich Talebzadeh

unread,
May 31, 2021, 5:16:35 AM5/31/21
to Delta Lake Users and Developers
Minor correction

New Table

CREATE TABLE iot_events

or
-- Existing data

CREATE EXTERNAL TABLE iot_events LOCATION  gs://etcbucket/iot_events
WITH CREDENTIAL iot_iam_role

GRANT SELECT ON iot_events to developers -- all columns
GRANT SELECT(date, country) on iot_events  to analysts -- restricted columns

So the major difference is that data already exists on gs//etcbucket for the EXISTING TABLE iot_events. All is happening here that the metadata is created for that external table in the catalog to points to that table data in the bucket

should read

So the major difference between a new table iot_events and the external table iot_events is that data pertaining to iot_events already exists on gs//etcbucket external storage.  What is happening is that the metadata pertaining to that external data is created in delta. Hence the keyword EXTERNAL TABLE.

Jacek Laskowski

unread,
Jun 1, 2021, 7:45:28 AM6/1/21
to Mich Talebzadeh, Delta Lake Users and Developers
Hi Mich,

Thank you for such a detailed answer! Your answer and this comment [1] helped a lot!

BTW, when I first saw this EXTERNAL keyword I got confused as it was clearly UNMANAGED that I think I'd been seeing here and there (being opposite to MANAGED keyword). This official Databricks docs [2] don't help making it clearer name-wise either :(

Reply all
Reply to author
Forward
0 new messages