reair - batch


loran...@gmail.com

Jun 10, 2018, 6:43:57 AM
to reair
Hi guys,
 I'm running the batch tool with the following configuration:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>airbnb.reair.clusters.src.name</name>
    <value>cluster</value>
    <comment>
      Name of the source cluster. It can be an arbitrary string and is used in
      logs, tags, etc.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.src.metastore.url</name>
    <value>thrift://host:9083</value>
    <comment>Source metastore Thrift URL.</comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.src.hdfs.root</name>
    <value>hdfs:///host:8020/</value>
    <comment>Source cluster HDFS root. Note trailing slash.</comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.src.hdfs.tmp</name>
    <value>hdfs:///tmp/replication</value>
    <comment>
      Directory for temporary files on the source cluster.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.dest.name</name>
    <value>cluster</value>
    <comment>
      Name of the destination cluster. It can be an arbitrary string and is
      used in logs, tags, etc.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.dest.metastore.url</name>
    <value>thrift://host:9083</value>
    <comment>Destination metastore Thrift URL.</comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.dest.hdfs.root</name>
    <value>hdfs:///host:8020/</value>
    <comment>Destination cluster HDFS root. Note trailing slash.</comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.dest.hdfs.tmp</name>
    <value>hdfs:///tmp/hive_replication</value>
    <comment>
      Directory for temporary files on the destination cluster. Table / partition
      data is copied to this location before it is moved to the final location,
      so it should be on the same filesystem as the final location.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.batch.output.dir</name>
    <value>hdfs:///user/batchOutput/output1</value>
    <comment>
      This configuration must be provided. It specifies where the output of
      each stage's MR job is stored.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.clusters.batch.metastore.blacklist</name>
    <value>testdb:test.*,tmp_.*:.*</value>
    <comment>
      Comma separated regex blacklist. dbname_regex:tablename_regex,...
    </comment>
  </property>

  <property>
    <name>airbnb.reair.batch.metastore.parallelism</name>
    <value>150</value>
    <comment>
      The parallelism to use for jobs requiring metastore calls. This translates to the number of
      mappers or reducers in the relevant jobs.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.batch.copy.parallelism</name>
    <value>150</value>
    <comment>
      The parallelism to use for jobs that copy files. This translates to the number of reducers
      in the relevant jobs.
    </comment>
  </property>

  <property>
    <name>airbnb.reair.batch.overwrite.newer</name>
    <value>true</value>
    <comment>
      Whether the batch job will overwrite newer tables/partitions on the destination. Default is true.
    </comment>
  </property>

  <property>
    <name>mapreduce.map.speculative</name>
    <value>false</value>
    <comment>
      Speculative execution is currently not supported for batch replication.
    </comment>
  </property>

  <property>
    <name>mapreduce.reduce.speculative</name>
    <value>false</value>
    <comment>
      Speculative execution is currently not supported for batch replication.
    </comment>
  </property>

</configuration>


The tool finished successfully and the metadata was created, but I don't see the data in the destination cluster.
I'm running the job on the destination cluster as instructed in the docs.
Any ideas what I'm missing here?

P.S.
I don't know if it's related, but I'm using a single Oracle DB for the metastore.

Thanks,

Paul Yang

Jun 15, 2018, 8:27:47 PM
to loran...@gmail.com, reair
Can you take a look at the logs in the mappers to see where it copied the data to?


loran...@gmail.com

Jun 20, 2018, 11:14:43 AM
to reair
Yes, I get a warning message: srcPath "hdfs://nameservice1/..../db/table" doesn't start with my configured root "hdfs:///host:8020/.../db/table".
I assume it's something in my configuration, but I can't say what exactly.
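For context, that warning typically comes from a plain string-prefix check: the table location reported by the metastore must start with the configured source HDFS root, or the copy task has nothing to do. A minimal sketch of such a check (class and method names are illustrative, not ReAir's actual source):

```java
// Hypothetical sketch of the prefix check behind the warning; the class and
// method names are illustrative, not ReAir's actual source.
public class PathPrefixCheck {
    // Returns true only when the metastore-reported path lies under the
    // configured HDFS root (trailing slash enforced to avoid partial matches).
    static boolean underRoot(String srcPath, String srcFsRoot) {
        String rootWithSlash = srcFsRoot.endsWith("/") ? srcFsRoot : srcFsRoot + "/";
        return srcPath.startsWith(rootWithSlash);
    }

    public static void main(String[] args) {
        // The metastore reports the HA nameservice URI...
        String metastorePath = "hdfs://nameservice1/warehouse/db/table";
        // ...but the configured root uses an explicit host:port, so the
        // prefix check fails and the copy task is skipped.
        System.out.println(underRoot(metastorePath, "hdfs://host:8020/")); // prints "false"
    }
}
```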

Paul Yang

Jun 20, 2018, 1:58:06 PM
to loran...@gmail.com, reair
It might have something to do with your *.hdfs.root directory configuration.

loran...@gmail.com

Jun 24, 2018, 10:53:09 AM
to reair
OK, so in both clusters we define the namenode nameservice with the same name for HA (probably not best practice, but that's what we have).
Now, if I set hdfs.root in the batch configuration to hdfs://nameservice/ for both src and dest, the logs (and the extra prints I added in the map/reduce steps) show that it should copy the data (and the metadata, if it's a new table). However, I don't see any prints from the step 2 reducer, and the data is not copied to the destination (the metadata is copied just fine).
And if I set hdfs.root to hdfs://host:port/ for src and dest respectively, I get task_type=NO_OP with the warning I described above. It looks like it gets the srcPath as hdfs://nameservice/srcPath...

Can you explain how ReAir loads the configuration (hdfs-site/core-site), if any, for each cluster (src/dest)?
If ReAir doesn't load hdfs-site, how does it distinguish between the clusters when both of them have the same namenode nameservice?

Thanks for your help!
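One way to see why the shared nameservice name is a problem: the authority component of an HDFS URI is the only part that identifies a cluster, so two clusters that both call themselves "nameservice" yield byte-for-byte identical fully-qualified paths. A dependency-free illustration with plain java.net.URI (hypothetical paths):

```java
import java.net.URI;

// Illustration (plain java.net.URI, no Hadoop dependency) of why identical
// nameservice names are ambiguous: the URI authority is all a client has to
// tell clusters apart.
public class NameserviceAmbiguity {
    static String authority(String hdfsUri) {
        return URI.create(hdfsUri).getAuthority();
    }

    public static void main(String[] args) {
        String srcTable  = "hdfs://nameservice/warehouse/db/table";  // source cluster
        String destTable = "hdfs://nameservice/warehouse/db/table";  // destination cluster
        // Same authority: the JVM resolves both through whichever
        // dfs.nameservices mapping it loaded, so reads and writes can land
        // on the same physical cluster.
        System.out.println(authority(srcTable).equals(authority(destTable))); // prints "true"
    }
}
```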

Paul Yang

Jun 25, 2018, 3:04:28 AM
to loran...@gmail.com, reair
> in both clusters we define the namenode nameservice with the same name for HA

That's likely your issue. You should use a different name for each.

loran...@gmail.com

Jun 25, 2018, 11:22:38 AM
to reair
OK, so I have a problem with my namenode nameservices. If I understand correctly, in HA mode I need to give the destination cluster the properties of the source cluster, like with distcp. But why doesn't ReAir copy the data when I give the exact host:port for both src and dest? When I run ReAir with the exact host:port, it replaces the srcHdfsRoot with the nameservice; why? With distcp, hdfs://srchost:8020/ hdfs://desthost:8020/ would work.
Does ReAir understand that HDFS is in HA mode?

Paul Yang

Jun 25, 2018, 8:13:24 PM
to loran...@gmail.com, reair
ReAir doesn't have any HDFS-HA-specific logic, so it's not clear why that behavior would occur.

loran...@gmail.com

Jul 10, 2018, 6:45:37 AM
to reair
Hi, just an update: I managed to figure it out. The job running from the destination connects to the source metastore and gets the full path (with the nameservice) of the tables it needs to copy, but because both my src and dest nameservices are identical, it tries to copy the data from itself. I managed to overcome this by adding a check in DestinationObjectFactory:

srcPath = (!srcPath.toString().startsWith(srcFsRootWithSlash) ? new Path(srcPath.toString().replaceAll("\\bhdfs://.*?\\/\\b", srcFsRootWithSlash)) : srcPath)

Is there a place to add this check for both src and dest in the project? (I've already mapped all the places.)
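The posted workaround can be exercised on its own. Below is a runnable sketch (with the `startWith` typo corrected and hypothetical paths; it operates on plain Strings rather than Hadoop's Path, which the real code would use): the replaceAll swaps whatever scheme+authority prefix the metastore reported for the configured source root, but only when the path isn't already under that root.

```java
// Standalone sketch of the posted workaround (startsWith typo corrected;
// paths are hypothetical, and plain Strings stand in for Hadoop's Path).
public class SrcPathRewrite {
    static String rewrite(String srcPath, String srcFsRootWithSlash) {
        // Already under the configured root: leave untouched.
        // Otherwise: replace the reported "hdfs://<authority>/" prefix
        // with the configured root.
        return srcPath.startsWith(srcFsRootWithSlash)
                ? srcPath
                : srcPath.replaceAll("\\bhdfs://.*?\\/\\b", srcFsRootWithSlash);
    }

    public static void main(String[] args) {
        // Metastore reports the shared nameservice; the config says host:port.
        String rewritten = rewrite("hdfs://nameservice/warehouse/db/table",
                                   "hdfs://host:8020/");
        System.out.println(rewritten); // prints "hdfs://host:8020/warehouse/db/table"
    }
}
```

Note the reluctant `.*?` in the regex stops at the first "/" after the authority, so only the scheme+authority portion is replaced.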

Paul Yang

Jul 10, 2018, 2:39:36 PM
to loran...@gmail.com, reair
Can you elaborate on why your source / destination nameservices are identical, and what the replacement `new Path(srcPath.toString().replaceAll("\\bhdfs://.*?\\/\\b", srcFsRootWithSlash))` intends to accomplish?
