ReAir is not able to delete partitions on the destination side that were removed on the source side.


Vishwanath Sharma

unread,
Oct 8, 2017, 8:44:07 AM10/8/17
to reair
Is it true that if we remove some partitions on the source side, ReAir will not remove them on the destination side? I am facing this issue. I added 2 partitions and ran a batch load; everything was fine. Then I dropped one partition and ran it again, but I can still see 2 partitions on the source side.
Is this a bug in ReAir, or is it caused by misconfigured properties?

My config.xml is given below:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
        <name>airbnb.reair.clusters.src.name</name>
        <value>DDH CLUSTER</value>
        <comment> Name of the source cluster. It can be an arbitrary string and is used in logs, tags, etc. </comment>
    </property>

    <property>
        <name>airbnb.reair.clusters.src.metastore.url</name>
        <value>thrift://ip-addr:9083</value>
        <comment>Source metastore Thrift URL.</comment>
    </property>

    <property>
        <name>airbnb.reair.clusters.src.hdfs.root</name>
        <value>hdfs://table_location</value>
        <comment>Source cluster HDFS root. Note trailing slash.</comment>
    </property>

    <property>
        <name>airbnb.reair.clusters.src.hdfs.tmp</name>
        <value>dest-path_tmp</value>
        <comment>Directory for temporary files on the src cluster. However, it will be kept at the destination location.</comment>
    </property>

    <property>
        <name>airbnb.reair.clusters.dest.name</name>
        <value>GOOGLE CLOUD</value>
        <comment>Name of the destination cluster. It can be an arbitrary string and is used in logs, tags, etc.</comment>
    </property>

    <property>
        <name>airbnb.reair.clusters.dest.metastore.url</name>
        <value>thrift://ip-addr:9083</value>
        <comment>Destination metastore Thrift URL.</comment>
    </property>

    <property>
        <name>airbnb.reair.clusters.dest.hdfs.root</name>
        <value>dest-path</value>
        <comment>Destination cluster HDFS root. Note trailing slash.</comment>
    </property>

    <property>
        <name>airbnb.reair.clusters.dest.hdfs.tmp</name>
        <value>dest-path_tmp</value>
        <comment>TEMP TABLE DETAILS</comment>
    </property>

    <property>
        <name>airbnb.reair.clusters.batch.output.dir</name>
        <value>dest-path_output</value>
        <comment>This configuration must be provided. It gives location to store each stage MR job output.</comment>
    </property>

    <property>
        <name>airbnb.reair.clusters.batch.metastore.blacklist</name>
        <value></value>
        <comment>Comma separated regex blacklist. dbname_regex:tablename_regex</comment>
    </property>

    <property>
        <name>airbnb.reair.batch.metastore.parallelism</name>
        <value>20</value>
        <comment> The parallelism to use for jobs requiring metastore calls. This translates to the number of mappers
            or reducers in the relevant jobs. </comment>
    </property>

    <property>
        <name>airbnb.reair.batch.copy.parallelism</name>
        <value>40</value>
        <comment>The parallelism to use for jobs that copy files. This translates to the number of reducers in the relevant jobs.</comment>
    </property>

    <property>
        <name>airbnb.reair.batch.overwrite.newer</name>
        <value>true</value>
        <comment>Whether the batch job will overwrite newer tables/partitions on the destination. Default is true.</comment>
    </property>

    <property>
        <name>mapreduce.map.speculative</name>
        <value>false</value>
        <comment>Speculative execution is currently not supported for batch replication.</comment>
    </property>

    <property>
        <name>mapreduce.reduce.speculative</name>
        <value>false</value>
        <comment>Speculative execution is currently not supported for batch replication.</comment>
    </property>
</configuration>


Any help would be appreciated.

Thanks

Paul Yang

unread,
Oct 9, 2017, 2:02:38 PM10/9/17
to Vishwanath Sharma, reair
Then I dropped one partition and ran it again, but I can still see 2 partitions on the source side.

Can you clarify what you mean here? Dropped a partition on the source or destination? 

--
You received this message because you are subscribed to the Google Groups "reair" group.
To unsubscribe from this group and stop receiving emails from it, send an email to airbnb-reair+unsubscribe@googlegroups.com.
To post to this group, send email to airbnb...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/airbnb-reair/46ed952e-a631-4972-9184-eb3e696bd5dd%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Zheng Shao

unread,
Oct 9, 2017, 2:12:51 PM10/9/17
to Paul Yang, Taikun Liu, Vishwanath Sharma, reair
I believe you are running Batch Replication only, right?  As you can imagine, Batch Replication is not capable of deleting tables in the destination cluster that were deleted from the source, because it simply scans the source to find all the tables to be replicated.

To solve this problem, Taikun (cc'ed) implemented a Hive Metastore hook that records the timestamp of all table deletions; Batch Replication then uses that to determine whether a table and/or a partition should be dropped in the destination cluster.  That part of the code is not open-sourced yet (due to configuration complexity), but we may consider it if there is enough interest.






--
Zheng

Vishwanath Sharma

unread,
Oct 9, 2017, 3:44:51 PM10/9/17
to Zheng Shao, Paul Yang, Taikun Liu, reair
Hi Zheng,

I did some POCs with ReAir and observed the following:

If the table is an internal (managed) table, ReAir works fine; it drops the partitions. In the case of an EXTERNAL table, the partitions will be dropped, but the data at the physical location will not be deleted, so if a user later runs the MSCK REPAIR command, the dropped partitions will be added back.
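For external tables, that re-add behavior can be reproduced and worked around roughly as follows. These are illustrative commands only; the database, table name, partition spec, and HDFS path are placeholders:

```shell
# Drop the partition from the metastore. For an EXTERNAL table,
# the data directory on HDFS is left in place.
hive -e "ALTER TABLE mydb.events DROP IF EXISTS PARTITION (ds='2017-10-01');"

# Without this step, a later MSCK REPAIR TABLE would re-discover the
# directory and re-add the partition. Delete the data directory manually:
hdfs dfs -rm -r -skipTrash /warehouse/mydb.db/events/ds=2017-10-01

# Now MSCK REPAIR no longer finds a directory for the dropped partition:
hive -e "MSCK REPAIR TABLE mydb.events;"
```

This matches Hive's ownership model: for external tables the metastore only tracks metadata, so the directory cleanup is the user's responsibility.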

Thanks

Zheng Shao

unread,
Oct 9, 2017, 7:04:47 PM10/9/17
to Vishwanath Sharma, Paul Yang, Taikun Liu, reair
Oh I see.  That's expected behavior from Hive.  For external tables, Hive doesn't own the data directory, so users should delete it themselves.

--
Zheng

Vishwanath Sharma

unread,
Oct 24, 2017, 7:37:29 AM10/24/17
to Zheng Shao, Taikun Liu, reair, Paul Yang
Hi Zheng,

This time I am facing another issue when using ReAir. I ran ReAir for one table and got fewer records on the destination side.
The DistCp command should copy the complete file. I ran ReAir multiple times but was not able to get the complete data. Is there something wrong with the data that is not being copied by DistCp? I couldn't find anything in the DistCp documentation.
Please suggest some ideas.

Thanks & Regards
Vishwanath

Paul Yang

unread,
Oct 25, 2017, 9:25:35 PM10/25/17
to Vishwanath Sharma, Zheng Shao, Taikun Liu, reair
This is unexpected. Before creating the metadata in Hive, there are verification steps that check that the data was copied correctly. Are you using batch or incremental replication?
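To narrow down where the records are being lost, it may help to compare file counts/bytes and Hive row counts on both clusters. These are illustrative commands; the NameNode addresses, database, and table name are placeholders:

```shell
# Compare the number of directories/files and total bytes under the
# table location on the source and destination clusters:
hdfs dfs -count hdfs://src-namenode:8020/warehouse/mydb.db/mytable
hdfs dfs -count hdfs://dest-namenode:8020/warehouse/mydb.db/mytable

# Compare the row count as seen by Hive on each cluster:
hive -e "SELECT COUNT(*) FROM mydb.mytable;"
```

If the byte and file counts match but the row counts differ, the gap is more likely missing partition metadata on the destination than a DistCp copy error.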