Bulk Replication: Infinite Loop: WARN [main] com.airbnb.reair.batch.BatchUtils: Not renaming


Scott Wallace

Nov 7, 2016, 3:12:45 PM
to reair
In batch replication, we reach a certain point and then hit endless warnings:

2016-11-07 18:59:35,071 WARN [main] com.airbnb.reair.batch.BatchUtils: Not renaming tmpDstPath to dstPath since checksums do not match between srcPath and tmpDstPath.

From looking at the source, it appears to be retrying this operation over and over in an infinite loop.

In our case, both tmpDstPath and srcPath exist (and their file sizes match), but dstPath never gets created.

Any idea what could be causing this issue?

Scott Wallace

Nov 7, 2016, 3:23:06 PM
to reair
Also worth noting: each time the copy is retried, another file is added to the /tmp/hive_replication directory on the destination. A couple of times we ran out of disk space on the destination cluster after leaving the job running overnight. It may be a good idea to clean up the temporary file before retrying.
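
A minimal sketch of the clean-up-before-retry idea, written in Python with local files standing in for HDFS paths. The function names, the MD5 comparison, and the `max_attempts` limit are assumptions made for illustration; this is not Reair's code.

```python
import hashlib
import os
import shutil
import tempfile


def file_md5(path):
    """Return the MD5 hex digest of a local file (a stand-in for an HDFS checksum)."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_with_bounded_retries(src_path, tmp_dir, dst_path, max_attempts=3):
    """Copy src_path to dst_path via a temp file, verifying checksums.

    Every failed attempt deletes its temp file, and the loop gives up after
    max_attempts instead of retrying forever. Returns the attempt number
    that succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        fd, tmp_dst_path = tempfile.mkstemp(dir=tmp_dir)
        os.close(fd)
        try:
            shutil.copyfile(src_path, tmp_dst_path)
            if file_md5(src_path) == file_md5(tmp_dst_path):
                os.replace(tmp_dst_path, dst_path)  # rename into place
                return attempt
        finally:
            # After a successful rename the temp path no longer exists,
            # so this only removes leftovers from failed attempts.
            if os.path.exists(tmp_dst_path):
                os.remove(tmp_dst_path)
    raise RuntimeError(
        f"checksum mismatch after {max_attempts} attempts: {src_path}"
    )
```

With this shape, a persistent checksum mismatch surfaces as one error after a fixed number of attempts, and the temporary directory does not grow with each retry.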

Paul Yang

Nov 7, 2016, 5:23:58 PM
to Scott Wallace, reair
If it's retrying infinitely, it sounds like there's a bug. Are you running the same version of HDFS on the source and destination?


Scott Wallace

Nov 7, 2016, 5:30:40 PM
to reair, wall...@gmail.com
Our source is:

Hadoop 2.6.0-cdh5.7.0

Our destination is:

Hadoop 2.6.0-cdh5.7.3

Scott Wallace

Nov 7, 2016, 5:57:30 PM
to reair, wall...@gmail.com
Two example checksums of files that appear to match:

MD5-of-262144MD5-of-512CRC32C 000002000000000000040000adc8da7645203b7c31f3c8e4a6aae184

MD5-of-0MD5-of-512CRC32C 000002000000000000000000246b5509dae3115af86cd50a803ff1f8
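
The two checksum lines above can be decoded to see what parameters they carry. The sketch below is in Python and assumes a 28-byte layout of a 4-byte bytes-per-CRC field, an 8-byte CRCs-per-block field, and a 16-byte MD5 digest. That layout is an assumption, although the first two fields of both samples do agree with the numbers in their algorithm names. `decode_checksum` and `DecodedChecksum` are hypothetical names.

```python
import re
from typing import NamedTuple


class DecodedChecksum(NamedTuple):
    crc_per_block: int   # CRC chunks per block, as carried in the checksum
    bytes_per_crc: int   # bytes covered by each CRC
    crc_type: str        # e.g. "CRC32C"
    md5_hex: str         # the trailing 16-byte MD5 digest, as hex


def decode_checksum(algorithm, hex_bytes):
    """Decode one line of `hadoop fs -checksum` output (assumed layout)."""
    match = re.fullmatch(r"MD5-of-(\d+)MD5-of-(\d+)([A-Z]\w*)", algorithm)
    if match is None:
        raise ValueError(f"unrecognized algorithm name: {algorithm}")
    raw = bytes.fromhex(hex_bytes)
    if len(raw) != 28:
        raise ValueError(f"expected 28 bytes, got {len(raw)}")
    # Assumed layout: 4-byte bytesPerCRC, 8-byte crcPerBlock, 16-byte MD5.
    bytes_per_crc = int.from_bytes(raw[0:4], "big")
    crc_per_block = int.from_bytes(raw[4:12], "big")
    named = (int(match.group(1)), int(match.group(2)))
    if (crc_per_block, bytes_per_crc) != named:
        raise ValueError("header fields disagree with the algorithm name")
    return DecodedChecksum(
        crc_per_block, bytes_per_crc, match.group(3), raw[12:].hex()
    )


first = decode_checksum(
    "MD5-of-262144MD5-of-512CRC32C",
    "000002000000000000040000adc8da7645203b7c31f3c8e4a6aae184",
)
second = decode_checksum(
    "MD5-of-0MD5-of-512CRC32C",
    "000002000000000000000000246b5509dae3115af86cd50a803ff1f8",
)
print(first)
print(second)
# Bytes covered by one block's worth of CRCs in the first sample.
print(first.crc_per_block * first.bytes_per_crc)
```

The first sample carries 262144 CRCs per block at 512 bytes each, which covers 128 MiB; the second carries 0 CRCs per block. If these two lines are the source and destination copies of one file (the post does not say so explicitly), the checksums would differ in these header fields alone, whatever the file contents.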

Paul Yang

Nov 7, 2016, 6:37:52 PM
to Scott Wallace, reair
Reair does not calculate checksums itself but instead uses the checksum supplied via the Hadoop filesystem:


Ideally, you would want to track down why HDFS is reporting two different checksums for seemingly identical files. This can happen when the source and destination use two different filesystems, but that doesn't seem to be the case for you.

If you want to turn off checksum verification, can you pull the latest source, rebuild, and then try setting

airbnb.reair.batch.copy.checksum.verify

to false in the configuration file? I've also pushed out a fix to resolve the infinite loop you're seeing.

Scott Wallace

Nov 8, 2016, 1:35:46 PM
to reair, wall...@gmail.com
Thanks, Paul. It's working fine with airbnb.reair.batch.copy.checksum.verify set to false. We were wondering whether block size could affect the checksum. The checksums I posted were the output of hadoop fs -checksum. Presumably that's similar to the Java function you posted.
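
On the block-size question, a simplified model suggests why it could matter. The "MD5-of-...MD5-of-...CRC32C" algorithm name points to a nested scheme: a CRC per fixed-size chunk, an MD5 over each block's CRCs, and an MD5 over the block MD5s. The Python sketch below models that nesting only; it is not the HDFS implementation, `composite_checksum` is a hypothetical name, and `zlib.crc32` (plain CRC-32) stands in for CRC32C, which is not in the standard library.

```python
import hashlib
import zlib


def composite_checksum(data, block_size, bytes_per_crc=512):
    """Simplified MD5-of-MD5-of-CRC checksum over `data`.

    zlib.crc32 (CRC-32) is a stand-in for CRC32C here.
    """
    block_digests = b""
    for block_start in range(0, len(data), block_size):
        block = data[block_start:block_start + block_size]
        # One 4-byte CRC per chunk of bytes_per_crc bytes within the block.
        crcs = b"".join(
            zlib.crc32(block[i:i + bytes_per_crc]).to_bytes(4, "big")
            for i in range(0, len(block), bytes_per_crc)
        )
        # One MD5 per block, computed over that block's CRCs.
        block_digests += hashlib.md5(crcs).digest()
    # The file-level checksum is an MD5 over the per-block MD5s.
    return hashlib.md5(block_digests).hexdigest()


data = bytes(range(256)) * 64  # 16 KiB of identical content on "both clusters"
small_blocks = composite_checksum(data, block_size=4096)
same_again = composite_checksum(data, block_size=4096)
large_blocks = composite_checksum(data, block_size=8192)

print(small_blocks == same_again)    # same block size
print(small_blocks == large_blocks)  # different block size
```

In this model the block boundaries decide which CRCs are hashed together, so byte-identical data written with different block sizes produces different file-level checksums once the file spans more than one block.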

zh...@uber.com

Jan 3, 2017, 5:11:39 PM
to reair, wall...@gmail.com
Hi Scott,

I believe this pull request (already merged by Jingwei) would solve your problem: https://github.com/airbnb/reair/pull/55
To use it, we need to start the copy from scratch so that the newly copied files have the same block size as on the source cluster.

Zheng