Delta Lake for unstructured and semi-structured data

Pavel Kazlou

Jan 21, 2021, 12:39:05 PM
to Delta Lake Users and Developers
Can I use Delta Lake to store and process unstructured and semi-structured data (JSON, XML, images, videos, text, etc.)? Or should I use another solution?

Chris Hoshino-Fish

Jan 21, 2021, 3:33:19 PM
to Pavel Kazlou, Delta Lake Users and Developers
Delta Lake is a great option for storing unstructured data because of the data types it supports. Delta currently stores all your data in the Parquet file format, so your data will be stored in a columnar layout. Even so, Parquet supports unstructured/semi-structured types like:
  • StructType
  • MapType
  • ArrayType
  • BinaryType
This means you can do things like parse some of your JSON into top-level columns, but leave other parts in a Struct with the same key-value structure. Or you can store videos/images as Binary types and do your own decoding.
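
As a rough illustration of the JSON case above, here is a minimal sketch in plain Python (standard library only, not Spark or Delta code) of the shaping step: a few fields are promoted to top-level values and the rest stays nested with its original key-value structure. The function name `split_record`, the `payload` key, and the sample fields are all hypothetical and chosen only for this example.

```python
import json

def split_record(raw_json, promote):
    """Split one JSON document into promoted top-level fields plus a nested remainder."""
    doc = json.loads(raw_json)
    # Promoted keys become top-level values (missing keys become None).
    row = {key: doc.get(key) for key in promote}
    # Everything else stays together, keeping its original key-value structure.
    row["payload"] = {k: v for k, v in doc.items() if k not in promote}
    return row

# Hypothetical sample record, for illustration only.
raw = '{"id": 7, "type": "click", "geo": {"lat": 1.5, "lon": 2.5}, "tags": ["a", "b"]}'
row = split_record(raw, promote=("id", "type"))
print(row)
```

In a table, the promoted fields would map to ordinary columns and `payload` to a StructType or MapType column, while raw image or video bytes would go into a BinaryType column.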

--

Chris Hoshino-Fish

Sr. Solutions Architect

Databricks Inc.

fi...@databricks.com

(415) 610-8520

a...@databricks.com

Jan 22, 2021, 11:27:31 AM
to Delta Lake Users and Developers
Also, have a look at this blog, which stores images in Delta to make ML ETL and inference much faster (processing millions of tiny images on S3 can be very slow and costly; it is much more efficient in Delta).

Denny Lee

Jan 28, 2021, 9:44:46 PM
to a...@databricks.com, Delta Lake Users and Developers
Quick note, there was a fun Data + AI Online Meetup on this topic on Tuesday: 

You can watch the video at: 

HTH!


Chidananda Unchi

Jan 29, 2021, 10:29:20 PM
to Delta Lake Users and Developers
Hi Team,

I want to tokenize data after reading it. Can you please suggest a UDF or pattern for tokenizing data?

Regards,
Chidananda

Shixiong(Ryan) Zhu

Feb 2, 2021, 1:11:03 PM
to Chidananda Unchi, Delta Lake Users and Developers
Hey Chidananda,

What kind of tokenization are you looking for? Spark has a SQL expression, split, that can be used to do simple tokenization. For complex cases, I usually write my own Scala UDF to manipulate strings directly.

Best Regards,

Ryan
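
For readers wondering what "simple tokenization" by splitting might look like, here is a minimal sketch in plain Python (standard library only). It is not Ryan's code and not a Spark API: `tokenize` is a hypothetical helper that splits on runs of non-word characters, lowercases the pieces, and drops empty strings. Logic of this kind could be wrapped in a UDF when a plain regex split is not enough.

```python
import re

# Split on one or more non-word characters (anything other than letters, digits, underscore).
_TOKEN_SPLIT = re.compile(r"\W+")

def tokenize(text):
    """Return lowercase tokens from text; None or empty input yields an empty list."""
    if text is None:
        return []
    # re.split can produce empty strings at the edges, so filter them out.
    return [tok.lower() for tok in _TOKEN_SPLIT.split(text) if tok]

tokens = tokenize("Delta Lake stores data in Parquet.")
print(tokens)
```

Handling None up front matters for column data, where null values are common.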

