Ability to resume progress / avoid copying identical files in Batch Replication


Zheng Shao

Nov 2, 2016, 8:52:42 PM
to reair
In Batch Replication (MetastoreReplicationJob), it would be great not to recopy identical files, i.e. to skip a file when the copy in the destination directory already has the same size and timestamp.

It seems that the current code is not able to do that, because Stage2DirectoryCopyMapper always calls hdfsCleanDirectory. This happens before Stage2DirectoryCopyReducer kicks in and calls BatchUtils.doCopyFileAction, which tries to avoid copying where possible, so by the time that check runs the destination files are already gone.

Is that expected behavior or an inconsistency in the code? Or maybe I missed something in the logic.
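
To make the ordering problem described above concrete, here is a minimal Python sketch against the local filesystem. It is not ReAir code: is_identical, sync_dir and the clean_first flag are hypothetical names, and clean_first=True only stands in for the cleanup step that the mapper runs before the copy check.

```python
import os
import shutil
import tempfile


def is_identical(src: str, dst: str) -> bool:
    """True if dst exists and matches src in size and modification time."""
    if not os.path.isfile(dst):
        return False
    s, d = os.stat(src), os.stat(dst)
    return s.st_size == d.st_size and int(s.st_mtime) == int(d.st_mtime)


def sync_dir(src_dir: str, dst_dir: str, clean_first: bool) -> list:
    """Copy files from src_dir to dst_dir; return the names actually copied."""
    if clean_first and os.path.isdir(dst_dir):
        # Hypothetical stand-in for the cleanup step: wiping the destination
        # up front means the identity check below can never succeed.
        shutil.rmtree(dst_dir)
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        dst = os.path.join(dst_dir, name)
        if not os.path.isfile(src):
            continue  # this sketch handles flat directories only
        if is_identical(src, dst):
            continue  # same size and timestamp: skip the copy
        shutil.copy2(src, dst)  # copy2 preserves the modification time
        copied.append(name)
    return copied


if __name__ == "__main__":
    src = tempfile.mkdtemp()
    dst = os.path.join(tempfile.mkdtemp(), "dst")
    with open(os.path.join(src, "part-00000"), "w") as f:
        f.write("some data")
    print(sync_dir(src, dst, clean_first=False))  # first run copies the file
    print(sync_dir(src, dst, clean_first=False))  # rerun can skip it
    print(sync_dir(src, dst, clean_first=True))   # cleaning first forces a recopy
```

With clean_first=False a rerun can skip unchanged files. With clean_first=True every rerun recopies everything, which is the behavior I am seeing. The sketch ignores files that exist only in the destination, and removing those is presumably the reason for the cleanup step.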

--
Zheng

Zheng Shao

Nov 2, 2016, 10:09:25 PM
to reair
By the way, the logic in TaskEstimator for checking a whole (unpartitioned) table and for checking a single partition of a table is correct: in both cases it can avoid copying identical directories.

--
Zheng
