ReAir batch job: syncing a Hive DWH between two CDH clusters


rei sivan

Jun 7, 2017, 5:24:48 AM
to reair
Hi guys,

I have two CDH clusters (CDH Community Edition 5.10.1 on AWS EC2), each with its own Hive metastore (embedded DB).
I'm running the ReAir batch tool with the following configuration (batch_replication_configuration_template.xml):
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>airbnb.reair.clusters.src.name</name>
    <value>Cluster2</value>
    <comment>
      Name of the source cluster. It can be an arbitrary string and is used in
      logs, tags, etc.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.src.metastore.url</name>
    <value>thrift://internal_ip:10000</value>
    <comment>Source metastore Thrift URL.</comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.src.hdfs.root</name>
    <value>hdfs:///user</value>
    <comment>Source cluster HDFS root. Note trailing slash.</comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.src.hdfs.tmp</name>
    <value>hdfs:///tmp/replication</value>
    <comment>
      Directory for temporary files on the source cluster.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.dest.name</name>
    <value>Cluster1</value>
    <comment>
      Name of the destination cluster. It can be an arbitrary string and is used in
      logs, tags, etc.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.dest.metastore.url</name>
    <value>thrift://internal_ip:10000</value>
    <comment>Destination metastore Thrift URL.</comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.dest.hdfs.root</name>
    <value>hdfs:///user</value>
    <comment>Destination cluster HDFS root. Note trailing slash.</comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.dest.hdfs.tmp</name>
    <value>hdfs:///tmp/hive_replication</value>
    <comment>
      Directory for temporary files on the destination cluster. Table / partition
      data is copied to this location before it is moved to the final location,
      so it should be on the same filesystem as the final location.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.batch.output.dir</name>
    <value>hdfs:///user/batchOutput/output1</value>
    <comment>
      This configuration must be provided. It gives location to store each stage
      MR job output.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.batch.metastore.blacklist</name>
    <value>testdb:test.*,tmp_.*:.*</value>
    <comment>
      Comma separated regex blacklist. dbname_regex:tablename_regex,...
    </comment>
  </property>

  <property>
    <name>airbnb.reair.batch.metastore.parallelism</name>
    <value>150</value>
    <comment>
      The parallelism to use for jobs requiring metastore calls. This translates to the number of
      mappers or reducers in the relevant jobs.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.batch.copy.parallelism</name>
    <value>150</value>
    <comment>
      The parallelism to use for jobs that copy files. This translates to the number of reducers
      in the relevant jobs.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.batch.overwrite.newer</name>
    <value>true</value>
    <comment>
      Whether the batch job will overwrite newer tables/partitions on the destination. Default is true.
    </comment>
  </property>

  <property>
    <name>mapreduce.map.speculative</name>
    <value>false</value>
    <comment>
      Speculative execution is currently not supported for batch replication.
    </comment>
  </property>

  <property>
    <name>mapreduce.reduce.speculative</name>
    <value>false</value>
    <comment>
      Speculative execution is currently not supported for batch replication.
    </comment>
  </property>

</configuration>

The tool finished successfully, but I don't see the data or the metadata in the destination cluster.
I'm running the job on the destination cluster, as instructed in the docs.
Any ideas what I'm missing here?

Thanks,

Paul Yang

Jun 7, 2017, 2:07:00 PM
to rei sivan, reair
There are several things off with your configuration. For example, airbnb.reair.clusters.src.metastore.url and airbnb.reair.clusters.dest.metastore.url have the same value of thrift://internal_ip:10000.


rei sivan

Jun 8, 2017, 8:21:17 AM
to reair, rsiva...@gmail.com
Thanks for your quick reply. My bad regarding internal_ip, I should have been clearer: I meant the source and destination internal IPs in the respective properties.
Anyway, I've already solved my issue. I made two mistakes. First, the port in the Thrift URLs should be 9083 instead of 10000. Second, the properties airbnb.reair.clusters.(src|dest).hdfs.root require the full path of the root directory, for example hdfs://internal_dns:8020/hdfs_root, because of the directory validation in step 3.
Moving on to the incremental tool.
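
For reference, the two fixes described above might look like the fragment below. This is only a sketch: the hostnames src_internal_ip, dest_internal_ip, src_internal_dns and dest_internal_dns, as well as the path /hdfs_root, are placeholders and not the real values from these clusters. Port 9083 is the usual default for the Hive metastore Thrift service (10000 is normally HiveServer2), and 8020 is the NameNode port taken from the example above.

```xml
<configuration>
  <!-- Metastore URLs: distinct hosts per cluster, metastore port 9083 (placeholder hosts). -->
  <property>
    <name>airbnb.reair.clusters.src.metastore.url</name>
    <value>thrift://src_internal_ip:9083</value>
  </property>
  <property>
    <name>airbnb.reair.clusters.dest.metastore.url</name>
    <value>thrift://dest_internal_ip:9083</value>
  </property>

  <!-- HDFS roots: fully qualified with NameNode host and port (placeholder hosts and path). -->
  <property>
    <name>airbnb.reair.clusters.src.hdfs.root</name>
    <value>hdfs://src_internal_dns:8020/hdfs_root</value>
  </property>
  <property>
    <name>airbnb.reair.clusters.dest.hdfs.root</name>
    <value>hdfs://dest_internal_dns:8020/hdfs_root</value>
  </property>
</configuration>
```

The remaining properties from the original configuration would stay as posted; the hdfs.tmp and batch.output.dir values may also need fully qualified paths, but that was not confirmed in this thread.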

Thanks again.