I'm trying to take the result set from a Hive query run on an Amazon EMR cluster and load it into Redshift. I think the best route would be to export the result set to S3 as a single gzipped TSV file by running the query against Hive and setting the task's output to an S3Target. That looks something like:
    import luigi
    from datetime import date, timedelta
    from luigi import configuration, format, hive
    from luigi.s3 import S3Client, S3Target

    class CommercialNodeQueryTask(hive.HiveQueryTask):
        # defaults to yesterday's date
        date = luigi.DateParameter(default=date.today() - timedelta(1))
        client = S3Client(
            aws_access_key_id=configuration.get_config().get('aws', 'aws_access_key_id'),
            aws_secret_access_key=configuration.get_config().get('aws', 'aws_secret_access_key'))

        def output(self):
            # gzipped TSV keyed by the run date's year/month/day
            return S3Target('s3://.../../../%s/%s/%s.tsv.gz' % (
                self.date.strftime('%Y'), self.date.strftime('%m'),
                self.date.strftime('%Y-%m-%d')), client=self.client, format=format.Gzip)

        def query(self):
            return "select ... ;"
... but while the root directory gets created in S3, nothing is ever written to the output location, and no error is raised. If I run the same HQL from the command line and add shell redirection, I get a nice tab-separated file.
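For what it's worth, here's a rough sketch of that command-line workaround wrapped in a Luigi task, just to show what I mean. The task name and S3 key are placeholders, it assumes the hive CLI is on the PATH of the box running Luigi, and I haven't confirmed that S3Target's gzip-on-write behaves this way in 1.3.0:

    import subprocess
    import luigi
    from luigi import format
    from luigi.s3 import S3Client, S3Target

    class CommercialNodeCliExport(luigi.Task):
        # Hypothetical fallback: shell out to the hive CLI and stream its
        # tab-separated stdout into the same kind of gzipped S3Target.

        def output(self):
            return S3Target('s3://.../commercial_nodes.tsv.gz',  # placeholder key
                            client=S3Client(), format=format.Gzip)

        def run(self):
            hql = "select ... ;"  # same query as above
            proc = subprocess.Popen(['hive', '-e', hql], stdout=subprocess.PIPE)
            with self.output().open('w') as out:
                for line in proc.stdout:
                    out.write(line)
            if proc.wait() != 0:
                raise RuntimeError('hive -e exited with %d' % proc.returncode)

The obvious downside is that everything funnels through the one machine running Luigi instead of letting the cluster write to S3 directly, which is why I'd rather get the HiveQueryTask approach working.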
Anyone have a good strategy for making the results of a Hive query available to a Redshift COPY load?
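In case it clarifies the question, the load I ultimately want to drive on the Redshift side is roughly the following (cluster endpoint, database, table, and credentials are placeholders; psycopg2 is just what I happen to use for Redshift connections):

    import psycopg2

    def copy_into_redshift(s3_path, access_key, secret_key):
        # Hypothetical helper: COPY a gzipped TSV from S3 into a Redshift table.
        conn = psycopg2.connect(host='my-cluster.xxxx.redshift.amazonaws.com',  # placeholder
                                port=5439, dbname='analytics', user='...', password='...')
        copy_sql = (
            "COPY commercial_nodes "  # placeholder table name
            "FROM '%s' "
            "CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s' "
            "DELIMITER '\\t' GZIP"
        ) % (s3_path, access_key, secret_key)
        with conn:
            with conn.cursor() as cur:
                cur.execute(copy_sql)
        conn.close()

So all I really need from the Hive side is a gzipped, tab-delimited file (or a set of them under a common prefix) sitting at a predictable S3 key.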
I'm using Luigi 1.3.0.
Thanks for your help!
Dan