Announcing Delta Lake 0.2.0


Liwen Sun

Jun 19, 2019, 3:04:08 PM
to delta...@googlegroups.com, us...@spark.apache.org
We are delighted to announce the availability of Delta Lake 0.2.0!

To try out Delta Lake 0.2.0, please follow the Delta Lake Quickstart:
https://docs.delta.io/0.2.0/quick-start.html

To view the release notes:
https://github.com/delta-io/delta/releases/tag/v0.2.0

This release introduces two main features:

Cloud storage support
In addition to HDFS, you can now configure Delta Lake to read and write data on cloud storage services such as Amazon S3 and Azure Blob Storage. For configuration instructions, please see: https://docs.delta.io/0.2.0/delta-storage.html
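
For example, a minimal Scala sketch for S3 might look like the following (the bucket path is a placeholder, and the logStore class name is the one given in the storage docs above):

import org.apache.spark.sql.SparkSession

// Minimal sketch: Delta Lake 0.2.0 on S3. The logStore setting tells Delta
// how to coordinate commits on S3, which lacks the primitives HDFS provides.
val spark = SparkSession.builder()
  .appName("delta-s3-quickstart")
  .config("spark.delta.logStore.class",
    "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
  .getOrCreate()

val path = "s3a://my-bucket/delta/events"  // placeholder bucket and path
spark.range(0, 100).write.format("delta").mode("overwrite").save(path)
spark.read.format("delta").load(path).show()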

Improved concurrency
Delta Lake now allows concurrent append-only writes while still ensuring serializability. For concurrency control in Delta Lake, please see: https://docs.delta.io/0.2.0/delta-concurrency.html
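
As a rough illustration, a sketch like this (placeholder path, assuming a Delta-enabled SparkSession `spark` as in the quickstart) shows two append-only writers committing to the same table concurrently:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Sketch: two concurrent append-only writers on one Delta table. With 0.2.0
// both commits succeed and the result is equivalent to some serial order.
val path = "/tmp/delta/concurrency-demo"  // placeholder path
val writers = (1 to 2).map { _ =>
  Future {
    spark.range(0, 100).write.format("delta").mode("append").save(path)
  }
}
writers.foreach(Await.result(_, 5.minutes))
assert(spark.read.format("delta").load(path).count() == 200)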

We have also greatly expanded the test coverage as part of this release.

We would like to acknowledge all community members for contributing to this release.

Best regards,
Liwen Sun

Gourav Sengupta

Jun 19, 2019, 3:06:21 PM
to Liwen Sun, delta...@googlegroups.com, user
Hi,

this is fantastic :)

Regards,
Gourav Sengupta


Gourav Sengupta

Jun 19, 2019, 3:11:45 PM
to Liwen Sun, delta...@googlegroups.com, user
Hi,

does Delta support external tables? I think most users will need this.


Regards,
Gourav


Liwen Sun

Jun 19, 2019, 7:52:09 PM
to Gourav Sengupta, delta...@googlegroups.com, user
Hi Gourav,

Thanks for the suggestion. Please open a Github issue at https://github.com/delta-io/delta/issues to describe your use case and requirements for "external tables" so we can better track this feature and also get feedback from the community. 

Regards,
Liwen

Gourav Sengupta

Jun 20, 2019, 2:06:21 AM
to Liwen Sun, delta...@googlegroups.com, user
Hi Liwen,


Please let me know if the description looks fine. I can also contribute test cases if required.


Regards,
Gourav

Gourav Sengupta

Jun 20, 2019, 2:08:46 AM
to ayan guha, Liwen Sun, delta...@googlegroups.com, user
Hi Ayan,

Delta is obviously well thought through; it has been available in Databricks for about a year and a half now, I think, and besides that it is from some of the best minds at work :)

But what may not be well tested in Delta is its availability as a storage class for Hive.

How about your testing? Are you doing it on S3? What kind of volume are you testing it with, if I may ask?


Regards,
Gourav Sengupta

On Thu, Jun 20, 2019 at 12:58 AM ayan guha <guha...@gmail.com> wrote:
Hi

We are using Delta features. The only problem we have faced so far is that Hive cannot read Delta outputs by itself (even if the Hive metastore is shared). However, if we create a Hive external table pointing to the folder (and run Vacuum), it can read the data.

Other than that, the feature looks good and well thought out. We are doing volume testing now...

Best
Ayan
--
Best Regards,
Ayan Guha

Liwen Sun

Jun 20, 2019, 9:05:39 PM
to James Cotrotsios, delta...@googlegroups.com, user
Hi James,

Right now we don't have plans for a catalog component as part of Delta Lake, but we are looking to support the Hive metastore and DDL commands in the near future.

Thanks,
Liwen

On Thu, Jun 20, 2019 at 4:46 AM James Cotrotsios <jamesco...@gmail.com> wrote:
Is there a plan to have a business catalog component for the Data Lake? If not, how would someone make a proposal to create an open-source project related to that? I would be interested in building out an open-source data catalog that would use the Hive metastore as a baseline for technical metadata.

Gourav Sengupta

Jun 21, 2019, 12:21:42 AM
to Liwen Sun, James Cotrotsios, delta...@googlegroups.com, user
Hi Liwen,

thanks a ton. I think that there is a difference between a storage class and a metastore, just as there is a difference between a database and a file system, or coffee and a cup.

It will be wonderful to keep the focus on the fantastic opportunity that Delta creates for us :)

Regards,
Gourav Sengupta


Gourav Sengupta

Jun 21, 2019, 3:48:31 AM
to ayan guha, Liwen Sun, James Cotrotsios, delta...@googlegroups.com, user
Hi Ayan,

I may be wrong about this, but I think that Delta files are in Parquet format. But I am sure that you have already checked this. Am I missing something?

Regards,
Gourav Sengupta

On Fri, Jun 21, 2019 at 6:39 AM ayan guha <guha...@gmail.com> wrote:
Hi
We used spark.sql to create a table using DELTA. We also have a Hive metastore attached to the Spark session; hence, a table gets created in the Hive metastore. We then tried to query the table from Hive. We faced the following issues:
  1. The SERDE is SequenceFile; it should have been Parquet.
  2. Schema fields are not passed.
Essentially the Hive DDL looks like:

CREATE TABLE `TABLE NAME` (
  `col` array<string> COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'path'='WASB PATH')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION
  'WASB PATH'
TBLPROPERTIES (
  'spark.sql.create.version'='2.4.0',
  'spark.sql.sources.provider'='DELTA',
  'spark.sql.sources.schema.numParts'='1',
  'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[]}',
  'transient_lastDdlTime'='1556544657')

Is this expected? And will the use case be supported in future releases? 


We are now experimenting 

Best

Ayan 

Joao Pedro Afonso Cerqueira

Jun 21, 2019, 3:50:27 AM
to Delta Lake Users and Developers

Hi Liwen Sun,

I am exploring plugins for metadata management over data lakes, and consolidation on the Hive Metastore is a must. With Delta Lake, a big ACID problem for our ETL is being resolved, but new ones are being opened: management of the Parquet files, as a whole, by a 3rd-party tool.

To give you an example, I use Dremio-OSS (https://github.com/dremio/dremio-oss) for catalogue curation and maintenance of my cluster. Dremio by default tries to monitor Parquet files in blob storage (S3, ADLS) and Hadoop.

When will the Hive Metastore plugin be ready for the Delta format, so I can rely on the standard Hive Metastore plugin and connect my data lakes via Hive into a tool like dremio-oss?

  Thank You

      Joao Cerqueira - https://FuelBigData.com



Michael Armbrust

Jun 21, 2019, 3:42:53 PM
to ayan guha, Tathagata Das, Gourav Sengupta, Liwen Sun, James Cotrotsios, Delta Lake Users and Developers, user
> Thanks for the confirmation. We are using the workaround of creating a separate Hive external table STORED AS PARQUET with the exact location of the Delta table. Our use case is batch-driven and we are running VACUUM with 0 retention after every batch completes. Do you see any potential problem with this workaround, other than that the table can provide some wrong information while a batch is running?

This is a reasonable workaround to allow other systems to read Delta tables. Another consideration is that if you are running on S3, eventual consistency may increase the amount of time before external readers see a consistent view. Also note that this prevents you from using time travel.
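
Concretely, the workaround could be sketched as follows (table name, schema and path are placeholders, and a Hive-enabled SparkSession `spark` is assumed; note that the vacuum API in the final comment comes from a later Delta release, not 0.2.0):

// 1) Register a plain Parquet external table over the Delta table's
//    directory, so Hive/Presto read the data files directly, bypassing
//    the Delta transaction log:
spark.sql("""
  CREATE EXTERNAL TABLE events_for_hive (id BIGINT, ts TIMESTAMP)
  STORED AS PARQUET
  LOCATION '/data/delta/events'
""")

// 2) After each batch, remove stale data files so external readers see only
//    the latest version. A 0-hour retention is only safe when no other
//    writer or reader is active, and (as noted above) it forfeits time
//    travel. The DeltaTable API below exists in Delta 0.3.0+, not 0.2.0:
// io.delta.tables.DeltaTable.forPath(spark, "/data/delta/events").vacuum(0)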

In the near future, I think we should also support generating manifest files that list the data files in the most recent version of the Delta table (see #76 for details). This will give support for Presto, though Hive would require some additional modifications on the Hive side (if there are any Hive contributors / committers on this list let me know!).

In the longer term, we are talking with authors of other engines to build native support for reading the Delta transaction log (e.g. this announcement from Starburst).  Please contact me if you are interested in contributing here!