How do I get mrjob to work this way? I overrode def load_options(self, args) in my derived MRJob class, and though the inputs are correct, I get an error when running on EMR:
The following code in util.py fails:
# resolve globs
paths = glob.glob(path)
if not paths:
    raise IOError(2, 'No such file or directory: %r' % path)
elif len(paths) > 1:
    for path in paths:
        for line in read_input(path, stdin=stdin):
            yield line
    return
else:
    path = paths[0]
I get the 'No such file or directory' error from the glob resolver, since glob.glob only matches local filesystem paths and cannot resolve s3:// URIs.
What is the right way to generate input paths for a job?
Shiv
This is fine (and in fact excellent) for now. However, at some point I assume S3 testing will be possible from local jobs or local Hadoop runs, in which case one might need to override def load_options() to support custom input paths.
Regards, Shiv
One thing you might have already thought of is allowing people to
specify the path to their log file as a string to pass to strftime()
(something like 's3://machine1/activitylogs/%Y-%m-%d/*').
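A pattern like that could be expanded with nothing beyond the standard library; a quick sketch (the bucket path is just the example from above, and the date is arbitrary):

```python
from datetime import date

# Expand a strftime-style log-path pattern for one day.
pattern = 's3://machine1/activitylogs/%Y-%m-%d/*'
print(date(2011, 3, 7).strftime(pattern))
# s3://machine1/activitylogs/2011-03-07/*
```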
Also, an issue we've run into is if we want to pass a year of logs to
a job, it creates a command line that's too long for EMR. One solution
is to use * for the day when we want all logs for a month; the
calendar module might be helpful for determining when we can do this.
(Potentially we could use * for the month as well when we want all
days in a year, though in the case of most of our logs, that would be
a pricey, pricey job.)
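One way that collapsing idea might look, using only the stdlib (daterange_globs and the pattern layout are hypothetical; it assumes the day is the last strftime code in the pattern, so swapping %d for * globs a whole month):

```python
import calendar
from datetime import date, timedelta

def daterange_globs(pattern, start, end):
    """Yield expanded paths for [start, end], collapsing whole months
    into a single day-glob to keep the command line short."""
    day = start
    while day <= end:
        # last day of the current month
        last = calendar.monthrange(day.year, day.month)[1]
        month_end = date(day.year, day.month, last)
        if day.day == 1 and month_end <= end:
            # the whole month is in range: one glob covers it
            yield day.strftime(pattern.replace('%d', '*'))
            day = month_end + timedelta(days=1)
        else:
            # partial month: emit each day individually
            yield day.strftime(pattern)
            day += timedelta(days=1)
```

For a range like 2011-01-15 through 2011-03-10 this yields 17 daily paths for January, a single glob 'logs/2011-02-*/*' for February, and 10 daily paths for March.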
-Dave
Perhaps we would then only need --start-date and --end-date as options and could leave the rest to regex magic, which seems like a good solution.
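A minimal sketch of what those two options could look like in a wrapper script (argparse-based; the option names are just the ones proposed above, and the wrapper would expand the range into input paths before invoking the job):

```python
import argparse
from datetime import datetime

def parse_date(s):
    # dates given as YYYY-MM-DD on the command line
    return datetime.strptime(s, '%Y-%m-%d').date()

parser = argparse.ArgumentParser()
parser.add_argument('--start-date', type=parse_date, required=True)
parser.add_argument('--end-date', type=parse_date, required=True)

args = parser.parse_args(['--start-date', '2011-01-01',
                          '--end-date', '2011-12-31'])
print(args.start_date, args.end_date)
```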
I might be able to come up with a better attempt after Tuesday.
Regarding your second comment, it makes sense to use the calendar module, with practical optimizations to ensure the list of files doesn't get too long.
Shiv