ReAir Batch job to sync the two warehouses for a specific window period


adityak...@exadatum.com

May 14, 2017, 6:47:00 AM
to reair
Hi All, 

Here is the challenge I am facing with ReAir. My remote cluster holds over 10 years of data, but I want to sync only the last 3 years to an Amazon EC2 instance, and then keep that window in sync incrementally. In other words, I want both batch and incremental sync between my remote cluster and the EC2 instance, limited to the last 3 years and nothing older. Can someone help me understand how to implement this? If the feature is already available, how do I use it, and if it has to be coded, what specific changes need to be made in the code to achieve this?

I once again heartily thank you all for the support and awesomeness. :) 

Paul Yang

May 15, 2017, 3:47:10 AM
to adityak...@exadatum.com, reair
Can you elaborate on how your data is organized? Is it in partitioned tables, and if so, how are they partitioned?

--
You received this message because you are subscribed to the Google Groups "reair" group.
To unsubscribe from this group and stop receiving emails from it, send an email to airbnb-reair+unsubscribe@googlegroups.com.
To post to this group, send email to airbnb...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/airbnb-reair/cd2de646-00a9-446e-8789-cae844beae18%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

adityak...@exadatum.com

May 15, 2017, 7:12:16 AM
to reair
Hi Paul, my data is organized in a partitioned table with the following partition columns: current_year (e.g. 2016), current_month (e.g. 201612, format yyyyMM), current week, current_day, etc. I want to sync only the data for, say, the last 2 years, which means that any data in partitions such as current_year=2015 and current_month=201504 should be ignored. I hope that gives you an understanding of the data. Is there a way to address this particular issue?

Paul Yang

May 15, 2017, 6:52:55 PM
to adityak...@exadatum.com, reair
In that case, you can list all the partitions that should be copied over and use the batch replication feature to do a one-time copy. The easiest way to generate the list is to query the Hive metastore. Once that's done, you can set up incremental replication with a regex filter so that only changes to partitions whose year matches your criteria are replicated. See: 
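
As an illustration of the two steps above (this is not taken from the ReAir documentation), here is a minimal Python sketch. It assumes the partition names have already been pulled from the Hive metastore in Hive's usual `key=value/key=value` form, and that the year lives in the `current_year` column mentioned earlier in the thread. The helper names `partition_year`, `select_partitions`, and `year_regex` are hypothetical. How ReAir expects the partition list and the regex filter to be supplied is not shown here, so check the project's own configuration docs for that.

```python
import re

# Assumption: the year partition column is named as described in the thread.
YEAR_KEY = "current_year"


def partition_year(partition_name):
    """Return the year from a name like 'current_year=2016/current_month=201612',
    or None if the name has no usable year component."""
    for part in partition_name.split("/"):
        key, _, value = part.partition("=")
        if key == YEAR_KEY and value.isdigit():
            return int(value)
    return None


def select_partitions(partition_names, min_year):
    """Keep only partitions whose year is >= min_year (input for a one-time batch copy)."""
    selected = []
    for name in partition_names:
        year = partition_year(name)
        if year is not None and year >= min_year:
            selected.append(name)
    return selected


def year_regex(min_year, max_year):
    """Build a regex that matches partition names whose year is in [min_year, max_year]."""
    years = "|".join(str(y) for y in range(min_year, max_year + 1))
    return rf"{YEAR_KEY}=({years})(/|$)"


if __name__ == "__main__":
    names = [
        "current_year=2014/current_month=201412",
        "current_year=2015/current_month=201504",
        "current_year=2016/current_month=201612",
    ]
    for name in select_partitions(names, min_year=2015):
        print(name)
    print(year_regex(2015, 2017))
```

The partition names can be obtained with `SHOW PARTITIONS <table>` in Hive, which prints them in this `key=value/...` form. The regex would need to be regenerated (or written more loosely) as the window moves forward each year.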

