> It's not horrible at all, a lot of people have done this.
Good, I'm glad to hear that.
> Actually I'm not sure what recover partitions does -- does it just
> mean "parse the folder name to figure out the partition ID"?
Yeah--it looks like it's actually an Amazon extension:
http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-additional-features.html#emr-hive-recovering-partitions
Before this, our Hive scripts had to have a separate line for each
partition (alter table add partition X) they wanted to read.
The good thing was that, since our partition hints were year/month/day,
if our query included a date in its where clause (which was very
common), Hive could avoid pulling the other partitions from S3.
I don't think Spark works entirely the same way (?), because if I do
sc.textFile("s3n://bucket/path/") and then a .filter { _.year >= 2012 },
Spark would have to pull all of the data from S3 just to evaluate
"year >= 2012" (since it's unaware of the S3 partitioning scheme
encoded in the S3 path) and end up throwing most of the data out.
AFAIK, anyway. Since Spark's RDDs know about partitions, can .filter
evaluate certain conditions without even loading the non-matching
partitions?
> s3n://bucket/path/year=2012,s3n://bucket/path/year=2011
Ah, cool, okay, I think this is what I want for now.
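So something like this, I guess -- building the comma-separated path list
myself before calling sc.textFile (bucket, path, and years here are just
made up for illustration):

```scala
// Pick the partitions we want up front, so sc.textFile only ever sees
// the year=2012+ prefixes and never touches the other years on S3.
// (Bucket/path and the year range are made-up examples.)
val years = 2011 to 2013
val wanted = years.filter(_ >= 2012).map(y => "s3n://bucket/path/year=" + y)
val pathSpec = wanted.mkString(",")
// sc.textFile(pathSpec) would then read only year=2012 and year=2013
```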
> If that's so, you can split the strings on '\0001', etc.
Ha, oh right, that would be a lot easier. Thanks!
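For my own reference, Hive's default field delimiter is Ctrl-A (U+0001),
which is '\u0001' in Scala, so the split is just (sample line made up):

```scala
// Split a Hive text-format line on the default Ctrl-A field delimiter.
// The sample line is made up: year, month, day as three fields.
val line = "2012\u000107\u000115"
val fields = line.split('\u0001')
// fields: Array("2012", "07", "15")
```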
Thanks for your help, Matei, I appreciate it.
- Stephen