HDFS to S3 Data movement issue


pras.sr...@glassdoor.com

Mar 30, 2017, 6:58:21 PM
to reair
Hello ReAir community, 

I am running into a replication issue when attempting to use the Large HDFS Copy tool documented here: https://github.com/airbnb/reair/blob/master/docs/hdfs_copy.md

I used the tool to successfully move data within HDFS (source: HDFS, target: HDFS). However, when I attempt to move data from HDFS to S3, the data lands in the wrong location in S3. The tool copied the entire directory structure of the temp directory (the argument passed as -temp) into the root of the S3 bucket, not into the target S3 directory within the bucket. Using s3n instead of s3a results in the same behavior.

Here is the command that I executed:

hadoop jar airbnb-reair-main-1.0.0-all.jar com.airbnb.reair.batch.hdfs.ReplicationJob -Dmapreduce.job.reduces=10 -Dmapreduce.map.memory.mb=8000 -Dmapreduce.map.java.opts="-Djava.net.preferIPv4Stack=true -Xmx7000m" -source hdfs://<hdfs_dir_path>/ -destination s3a://<s3_key>:<s3_secret>@<s3_bucket>/<s3_dir_name>/ -log hdfs://<hdfs_log_path>/$JOB_START_TIME -temp hdfs://<hdfs_tmp_dir_path>/$JOB_START_TIME$ -blacklist ".*/tmp/.*" -operations a,u,d

Expected:
[s3_bucket] > [s3_dir_name] > [File 1] [File 2] [File 3]

Actual:
[s3_bucket] > [s3_dir_name] > (empty)
[s3_bucket] > tmp > reair > 1490203731429 > [__tmp_copy__file_attempt_1490...] [__tmp_copy__file_attempt_1490...] [__tmp_copy__file_attempt_1490...] [__tmp_copy__file_attempt_1490...]


Paul Yang

Mar 30, 2017, 7:01:15 PM
to pras.sr...@glassdoor.com, reair
Unfortunately, the tool makes use of rename operations on the destination filesystem, so it's not a suitable tool for copying data to S3.
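
For context on why rename-heavy tools fit S3 poorly: S3 has no native rename, and Hadoop's S3 connectors emulate it by copying each object and then deleting the original, so a "directory rename" is neither atomic nor cheap. The sketch below is a toy in-memory model only, not ReAir's code and not an S3 client; `ToyObjectStore`, `rename_prefix`, and `fail_after` are hypothetical names invented for illustration, and it does not claim to reproduce the exact behavior reported above.

```python
class ToyObjectStore:
    """In-memory stand-in for a flat key/value object store (hypothetical)."""

    def __init__(self):
        self.objects = {}  # full key -> bytes; there are no real directories

    def put(self, key, data):
        self.objects[key] = data

    def list_prefix(self, prefix):
        # A "directory listing" is just a scan for keys sharing a prefix.
        return sorted(k for k in self.objects if k.startswith(prefix))

    def rename_prefix(self, src, dst, fail_after=None):
        """Emulate a directory rename as per-object copy + delete.

        fail_after simulates the job dying after that many objects moved.
        """
        for moved, key in enumerate(self.list_prefix(src)):
            if fail_after is not None and moved >= fail_after:
                raise RuntimeError("simulated failure mid-rename")
            self.objects[dst + key[len(src):]] = self.objects[key]  # copy
            del self.objects[key]  # delete the original


store = ToyObjectStore()
for i in range(3):
    store.put(f"tmp/reair/attempt/file{i}", b"data")

try:
    # Die after one object: the "rename" is left half done.
    store.rename_prefix("tmp/reair/attempt/", "s3_dir_name/", fail_after=1)
except RuntimeError:
    pass

print(store.list_prefix("s3_dir_name/"))  # one file made it across
print(store.list_prefix("tmp/"))          # the rest are stranded under tmp/
```

On HDFS the same rename is a single atomic metadata operation, which is why a temp-then-rename pattern is safe there but can leave data under the temp prefix on an object store.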


Paul Yang

Mar 30, 2017, 7:01:48 PM
to pras.sr...@glassdoor.com, reair, Jingwei Lu
CC'ing Jingwei, who is more familiar with the tool.

pras.sr...@glassdoor.com

Mar 31, 2017, 8:26:42 PM
to reair
Thank you very much for the fast response, Paul. Would the batch or incremental replication services offer out-of-the-box S3 connectivity support, if our org chooses to pursue that in the future?


Paul Yang

Apr 3, 2017, 12:06:56 AM
to pras.sr...@glassdoor.com, reair
As it is now, ReAir doesn't support S3 connectivity well. Unfortunately, we don't have plans to build out that feature in the near future.
