Using Data Lake on S3


Hakan Ilter

May 22, 2019, 5:52:39 AM
to Delta Lake Users and Developers
Hi everyone,

I was planning to use Delta Lake (0.1.0) on S3 but ran into some issues. I know Delta Lake doesn't support transactions on S3 yet, but all I need for now is to store the data on S3. I managed to get it running, but here are the issues:

1. It doesn't support the s3 scheme, but the s3a scheme works fine.

2. I'm using Spark 2.4.3, which ships with hadoop-aws-2.7.3.jar, but Delta Lake requires "spark.hadoop.fs.AbstractFileSystem.s3a.impl=org.apache.hadoop.fs.s3a.S3A" in the config, and this class exists in hadoop-aws-2.8.5.jar. So it works if you put two versions of the same jar on the classpath. Crazy, but it works.
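
A minimal sketch of the two points above, for illustration only. It rewrites an s3:// table path to s3a:// and turns the quoted setting into spark-submit "--conf" flags. The helper names (to_s3a, conf_flags) and the bucket path are hypothetical; only the config key and class name come from this post. It does not address the jar-version problem.

```python
from urllib.parse import urlsplit, urlunsplit

# The setting quoted in the post. Spark passes "spark.hadoop.*" keys
# through to the Hadoop configuration.
S3A_CONF = {
    "spark.hadoop.fs.AbstractFileSystem.s3a.impl": "org.apache.hadoop.fs.s3a.S3A",
}


def to_s3a(uri: str) -> str:
    """Rewrite an s3:// URI to s3a://; return any other URI unchanged."""
    parts = urlsplit(uri)
    if parts.scheme == "s3":
        return urlunsplit(("s3a",) + tuple(parts[1:]))
    return uri


def conf_flags(conf: dict) -> list:
    """Turn a config mapping into a flat list of spark-submit --conf flags."""
    flags = []
    for key, value in sorted(conf.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


if __name__ == "__main__":
    # Hypothetical bucket and path, used only to show the rewrite.
    print(to_s3a("s3://my-bucket/delta/events"))
    print(" ".join(conf_flags(S3A_CONF)))
```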

Do you think it is a good idea to use it with these settings, or should I wait until it fully supports S3?

Any comments will be appreciated.

Kind regards.

Hakan.

Michael Armbrust

May 22, 2019, 2:52:42 PM
to Hakan Ilter, Delta Lake Users and Developers
I would not recommend running with multiple versions of the same library on the classpath. It is too hard to reason about what is actually happening.

Other users on the Slack channel have reported success getting this working by following the directions here: https://hadoop.apache.org/docs/r2.7.3/hadoop-aws/tools/hadoop-aws/index.html

Otherwise we'll publish instructions when we add official S3 support as part of the 0.2.0 release (targeting the beginning of June).
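
One way to spot the multiple-versions situation described above is to scan jar file names for artifacts that appear with more than one version. This is a rough sketch, not an official tool: it assumes jars follow the common "artifact-version.jar" naming pattern, and the function names are made up for illustration.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumes names like "hadoop-aws-2.7.3.jar": artifact, hyphen, then a
# version that starts with a digit.
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def duplicate_artifacts(jar_names):
    """Map each artifact that appears with several versions to those versions."""
    versions = defaultdict(set)
    for name in jar_names:
        match = JAR_NAME.match(name)
        if match:
            versions[match["artifact"]].add(match["version"])
    return {a: sorted(v) for a, v in versions.items() if len(v) > 1}


def scan_directory(jars_dir):
    """Check every *.jar directly inside jars_dir (for example a Spark jars folder)."""
    return duplicate_artifacts(p.name for p in Path(jars_dir).glob("*.jar"))


if __name__ == "__main__":
    names = [
        "hadoop-aws-2.7.3.jar",
        "hadoop-aws-2.8.5.jar",
        "hadoop-common-2.7.3.jar",
    ]
    print(duplicate_artifacts(names))
```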

--
You received this message because you are subscribed to the Google Groups "Delta Lake Users and Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to delta-users...@googlegroups.com.
To post to this group, send email to delta...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/delta-users/05bee706-e1fa-49bf-a547-819ec21898bd%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Hakan Ilter

May 23, 2019, 4:12:47 AM
to Delta Lake Users and Developers
Thanks for your reply Michael. 



Gourav Sengupta

Jun 17, 2019, 4:29:31 AM
to Delta Lake Users and Developers
On another note, I think it would be wise to refrain from using s3a.

Regards,
Gourav

Gourav Sengupta

Jun 17, 2019, 4:35:49 AM
to Delta Lake Users and Developers
Hi,

My apologies, I should have explained why I suggest avoiding s3a: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html.

EMRFS (s3) supports quite a few performance improvements, so I find it a bit confusing to read this: https://hadoop.apache.org/docs/r2.7.3/hadoop-aws/tools/hadoop-aws/index.html

According to the link from AWS:
1. Previously, Amazon EMR used the S3 Native FileSystem with the URI scheme, s3n. While this still works, we recommend that you use the s3 URI scheme for the best performance, security, and reliability.
2. The s3a protocol is not supported. We suggest you use s3 in place of s3a.

Besides this, I have seen quite a few issues when using anything other than s3://.

Regards,
Gourav


Steven Moy

Jun 17, 2019, 12:34:26 PM
to Delta Lake Users and Developers
There are many cases where you need something other than EMRFS. For example, anyone who deploys Spark on AWS on bare EC2, because EMR carries a hefty premium compared to bare EC2. s3a has been integrated with Spark for a while now.

https://stackoverflow.com/a/40872184 <-- this is a good summary from Steve Loughran (an s3a developer)